Journal of Software Engineering Research and Development, 2023, 11:10, DOI: 10.5753/jserd.2023.3082. This work is licensed under a Creative Commons Attribution 4.0 International License.

Identifying and Addressing Problems in the Estimation Process: A Case Study Applying Action Research

Ana M. Debiasi Duarte [Universidade do Oeste de Santa Catarina | ana.duarte@unoesc.edu.br]
Ieda Margarete Oro [Universidade do Oeste de Santa Catarina | ieda.oro@unoesc.edu.br]
Karine Vidor [Universidade do Oeste de Santa Catarina | karine.vidor@unoesc.edu.br]
Denio Duarte [Universidade Federal da Fronteira Sul | duarte@uffs.edu.br]

Abstract

The literature shows that a large share of software projects exceeds the estimated effort and duration, even though we currently witness an evolution of the software project management discipline. Through its best practices, software engineering tries to reduce the flaws in software development, and several techniques and resources have been proposed to mitigate this problem. This paper proposes an approach based on action research to improve the estimation process in software development tasks by identifying its problems. A case study is carried out to show the effectiveness of our approach. The results show an improvement of 50% in accuracy over the baseline estimation process.

Keywords: software estimation, process improvement, agile methodologies, action research

1 Introduction

Agile software development (ASD) is usually adopted as an alternative to more traditional approaches, e.g., waterfall or evolutionary. The key elements of these traditional approaches are extensive planning, rigorous reuse, and codified processes; ASD, in contrast, is based on iterative and incremental development models (Larman and Basili, 2003; Hohl et al., 2018). Although ASD intends to make software development easier compared to traditional approaches, it still suffers from the size and effort estimation problem. Effort estimation can be defined as the process by which effort is evaluated and estimation is carried out in terms of the number of resources required to complete a project activity and deliver a product or service that meets the given functional and non-functional requirements of a customer (Trendowicz and Jeffery, 2014). Several methods (metrics) have been proposed to estimate effort, e.g., planning poker, expert judgment, and Wideband Delphi. However, the accuracy of software effort estimation models for ASD remains inconsistent (Pillai et al., 2017).

The Standish Group CHAOS Report (2018) showed that many software companies struggle to develop their products within strict schedules and budget constraints: in 2018, companies either finished their projects behind schedule and over budget (48% to 65%) or failed to complete them (48% to 56%). The findings show that, for most projects, the actual effort and schedule overran the estimations. It is well known that cost underestimation brings inefficiencies to the project (Nhung et al., 2019). Gupta et al. (2019) list the factors whose absence most commonly causes flaws in software projects: (i) top management's commitment and involvement/support; (ii) allocation of scarce resources; (iii) communication among the various stakeholders; (iv) team configuration and structure; and (v) social cohesion in the team, together with the complexity of the project and the organizational culture. In this paper, we focus on software development effort estimation.
We intend to offer an approach that minimizes the error in one of the recurring software project problems: effort estimation. The action research method (McKay and Marshall, 2001) allows us to involve researchers and developers in finding an approach to solve the target problem. Based on the steps performed in action research (see Figure 1), we propose an approach to improve effort estimation and evaluate it in a case study. An ASD team from a software development company and the researchers participate in all phases of our approach to arrive at a suitable process for estimating effort. Using historical data, we identify the problems that degrade the effort estimation process, and we develop our approach around them. The results show that our proposal improves the accuracy of the effort estimation process by a factor of roughly 1.5. We believe these promising results can help companies that use ASD to minimize flaws in software projects.

The rest of this paper is organized as follows: Section 2 briefly presents software development effort estimation and management, and Section 3 presents works related to ours. Next, we introduce our methods. Section 5 presents our approach and its application in a case study. Finally, Section 6 concludes this paper.

2 Background

Software development effort estimation plays a crucial role in software development projects. Building reliable software processes that allow projects to deliver on time, within budget, and in a cost-effective manner is challenging (Sommerville, 2015). Developers have struggled with software development effort estimation since the 1960s (Gautam and Singh, 2018). Effort estimation is crucial for finishing a project on time and within budget. It can be stated as the process by which effort is assessed and estimation is performed as to the number of resources required to complete a project activity and deliver a product or service that meets the given functional and non-functional requirements of a customer (Trendowicz and Jeffery, 2014). Accurate effort estimations can contribute to the success of software development projects, while incorrect estimations can negatively affect product development, leading to monetary losses (Altaleb and Gravell, 2018).

Software project estimation involves estimating the effort, size, staffing, schedule (time), and cost involved in creating a unit of the software product (Jorgensen and Shepperd, 2006; Pillai et al., 2017). The ratio between the size of the developed software and the amount of work spent on its development is called productivity (Fenton and Bieman, 2014). It can be measured in several ways, but function point analysis (FPA) is the most common. FPA can be applied before the program is written, based on the system requirements, so it is possible to estimate the effort and the schedule of the development activities.
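As a minimal worked example of how such an estimate is derived (the numbers below are illustrative and are not taken from the case study): if a team's historical productivity is 0.5 function points per person-hour and a new demand is sized at 30 function points, the expected effort is

$$\text{effort} = \frac{\text{size}}{\text{productivity}} = \frac{30~\text{FP}}{0.5~\text{FP/person-hour}} = 60~\text{person-hours}.$$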
Many variables can impact a software development team's productivity; one of them is time management. Inadequate time management usually stems from a lack of daily planning, unmanaged commitments, and accepting more tasks than can be handled, among other causes (Sá et al., 2017). However, some techniques help to manage time better. One of them is the Pomodoro Technique, created by Cirillo (2022), which aims to control the time spent on activities and eliminate internal and external distractions.

Planning and supervising the project are needed to check the development team's productivity and the software quality; those tasks are essential in the software development process. According to the Project Management Body of Knowledge (PMBOK) (PMI, 2021), a project is a temporary effort to progressively create a product, service, or single result. Managing a project means applying knowledge, abilities, and tools to meet the scheduled requirements. According to Pressman (2014), successful project management begins with an accurate estimation of the development effort; however, estimation is still imprecise, contributing to failed software projects. Usually, estimation is made by applying techniques to historical project bases. However, Maxwell (2001) argues that, more than simply recording productivity data, analysis is needed to improve the estimation process, since it is important to understand the influences on projects and their productivity contexts. According to Kirmani and Wahid (2015), efficiency, on-time product delivery, and the desired quality level are features that influence the software development process. Therefore, collecting data through measurements taken during project execution, usually based on qualitative and quantitative information, is crucial.

Software projects are complicated in any context and are especially prone to failure (Bannerman, 2008). There is no fail-proof project, but it is possible to be ready for unforeseen problems. The agile methods, including Scrum, were created precisely to deal with project uncertainties, in contrast to traditional methods that try to plan everything before development starts. Scrum is a lightweight framework that helps people, teams, and organizations generate value through adaptive solutions to complex problems. In each iteration, the team analyzes the requirements, technology, and abilities and then organizes itself to create and deliver the best software it can, adapting daily as complexities and surprises arise (Schwaber and Sutherland, 2020). Scrum employs an iterative, incremental approach to optimize predictability and engages groups of people who collectively have all the skills and expertise to do the work, sharing or acquiring such skills as needed.

3 Related Work

Action research (AR) has been applied in several case studies, from software development to healthcare (Elg et al., 2020; Cordeiro and Soares, 2018). Its basic principle is that the researchers change their role from external observers to participants in solving concrete problems (Bradbury-Huang, 2010). Regarding software engineering (SE), there are several proposals that apply AR to SE problems. In 2006, Dingsøyr et al. (2006) used an action research study to apply the Scrum software development process in a small cross-organizational development project. More recently, Hoda et al. (2014) combined action research as the overall research framework, elements of user-centered and participatory design (for evaluation by end-users) as the design frameworks, and Scrum as the software development framework. Marinho et al. (2015) presented an uncertainty management guide designed through action research; the guide was applied in a software development company aiming to reduce the uncertainties in software projects. Conversely, Choraś et al. (2020)
proposed a set of metrics that measure the agile software development process in small and medium-sized companies; the metrics were built as part of an action-research collaboration involving a team of researchers. Action research has also been applied to developing students' competencies during the learning and teaching process in software engineering using thinking-based learning (Flores and de Alencar, 2020). The cited works show that the action research method is widely used as a support tool to help the industry improve its processes. In this work, we intend to contribute to the industry by using action research to address problems in the software development estimation process.

4 Methods

This paper applies qualitative research using a case study approach to evaluate our proposal (Gil et al., 2002; Godoy, 1995). According to Creswell (2010), in qualitative studies the researcher uses a particular language to describe what is expected to be understood, mainly through findings or theories; besides, they must survey a minimum amount of literature, enough to discuss the issue. The researcher uses this language to describe what they expect to understand, discover, or develop as a theory.

The research was developed using the action research method. Thiollent (2011) defines action research as a theoretical and methodological approach responsible for an essential contribution to the methodology of investigations of social phenomena, becoming known as a research line directed at collective actions. The method is based on joining research and action in a process in which the implicated actors and researchers interactively seek to shed light on the reality in which they are embedded and to identify common issues by searching for and experimenting with solutions in real situations. The knowledge produced by the research is treated as a collective construction (Peruzzo, 2016). In our case study, researchers and software engineers work together in all project phases; the collaboration intends to solve a given problem in a software project. We use an action research (AR) process adapted from McKay and Marshall (2001) and pictorially shown in Figure 1. Note that the AR process is composed of 8 steps. In the following, we present and discuss each step with respect to our proposal.

5 Case Study Planning, Execution, and Results

Our primary goal in this work is to apply action research in the context of software effort estimation in an ASD setting. To accomplish that, we carried out a case study involving our proposal. In this section, we present the target company and development team and the implementation of our proposal. Figure 1 guides the presentation of how AR is applied to the effort estimation problem.

5.1 Characterization of the Company and Team

The case study was conducted in a software development company that provided its previous software effort estimations for analysis. As described in Gil (2008), a case study is an analysis of situations that occur in real life; it is applied to obtain detailed knowledge from which to draw conclusions. The available estimations comprise 31 sprints, and the historical data covers 100 different functionalities, 302 stories, and 568 programming tasks. The target company uses a Scrum-like process for its software development, so the team is familiar with Scrum and its good practices. Besides, points are used to estimate the sprint size.
For every sprint task, a size is calculated, and the sum of all task sizes gives the sprint size in points. The points and the corresponding task complexities are calculated from the development effort previously applied, i.e., the company's historical data. The participants in this case study (i.e., the development team) are 7 software engineers. To build our estimation methodology, we first studied the company's current process; this allowed us to propose a new estimation support method based on AR. The project office defines all the product phases following the estimation phase. In the estimation phase, demands are presented to the development team, and the team estimates the size of the demands for every sprint. Then the project development phase starts: the project team prioritizes demands inside the sprint, and the development team starts working. After 15 days, all deliverables for a given sprint are produced.

Step 1: Problem identification

The first step of the AR process was supported by a bibliographic survey. We searched seven academic databases with the following search strings: "agile project management", "risks analysis", "software engineering", "software estimation", "agile management", "agile methods", "scrum", and "software metrics". The search retrieved 1,006 papers. To reduce the number of working papers, we applied the following inclusion criteria: (i) papers written in English, (ii) abstracts showing that effort estimation and Scrum are used in the approach, and (iii) the reputation of the publication vehicle (using the h-index and number of citations as a guide). Using those criteria, we selected 23 papers. The team and the researchers read and discussed the papers to make everyone involved aware of the literature on software estimation in ASD. During the discussions, we identified several classical software development problems, such as imprecise schedules, unplanned costs, and delays that might influence the negotiation with the customer.

Based on the discussions, the team built a sheet containing variables about the 31 sprints used in our case study. To adequately understand the development process, we identified the variables that help explain the company's historical productivity database. Table 1 presents the sheet built from the collected data, where (i) Sprint is the sprint number (identifier), (ii) Story is the number of stories, (iii) Task is the number of tasks, (iv) Avail. time (h) is the available time (in hours) to accomplish the sprint, (v) Est. points is the number of estimated points (effort), and (vi) Del. points is the number of delivered points. Table 1 presents the historical data since the company started using Scrum. Note that the effort estimation is not very accurate and that, in the beginning, the team did not even register the delivered points (first eight sprints). We decided to use the last eight sprints to measure the variance between planned and executed time; this choice was based on the team's maturity in planning the sprint. Table 2 shows that the mean variation between planned and executed time is around 35%. Note that the standard deviation is also high, meaning that the variation typically ranged from 15% to 55%. For example, in sprint #30 the variation was 41%, whereas in sprint #29 it was 6%. Those numbers show that the team's estimations could have been more accurate. To calculate the variation, we used Equation 1, which is also used by Bilgaiyan et al. (2017) and de Souza (2013):
$$Var = \frac{ET - PT}{PT} \qquad (1)$$

where $Var$ is the estimation variation, $ET$ is the executed time, and $PT$ is the planned time.
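For concreteness, the following is a minimal sketch (our own illustration, not part of the company's tooling) of how Equation 1 and the summary statistics of Table 2 can be computed; only three sprints from Table 2 are listed.

```java
import java.util.List;

// Computes Equation 1 for each sprint, plus the mean and standard
// deviation of the variation, from planned vs. executed hours (Table 2).
public class VariationReport {

    record Sprint(int id, double plannedHours, double executedHours) {
        // Equation 1: Var = (ET - PT) / PT
        double variation() {
            return (executedHours - plannedHours) / plannedHours;
        }
    }

    public static void main(String[] args) {
        List<Sprint> sprints = List.of(
                new Sprint(24, 141, 192),
                new Sprint(29, 84, 89),
                new Sprint(30, 153, 215));

        double mean = sprints.stream()
                .mapToDouble(Sprint::variation)
                .average()
                .orElse(0.0);

        double std = Math.sqrt(sprints.stream()
                .mapToDouble(s -> Math.pow(s.variation() - mean, 2))
                .average()
                .orElse(0.0));

        sprints.forEach(s ->
                System.out.printf("#%d: %.0f%%%n", s.id(), 100 * s.variation()));
        System.out.printf("mean: %.1f%% (± %.1f%%)%n", 100 * mean, 100 * std);
    }
}
```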
The main problems revealed by the data analysis of the effort and size estimates were:

• lack of precision in the effort and size estimations because of recording failures or lack of information in the historical bases;
• new demands (e.g., corrections or crucial new requirements) are not formally specified and sometimes lack further details.

Figure 1. Steps performed in action research.

Table 1. Historical data from 31 finished sprints.

Sprint  Story  Task  Avail. time (h)  Est. points  Del. points
#1      15     31    274              6            –
#2      9      16    183              72           –
#3      9      17    204              109          –
#4      12     30    134              93           –
#5      4      13    134              59           –
#6      4      18    204              40           –
#7      9      21    204              136          –
#8      2      4     204              33           –
#9      9      20    204              131          118
#10     6      17    183              78           99
#11     4      6     183              28           129
#12     3      4     134              24           –
#13     2      2     183              48           48
#14     9      12    204              143          81
#15     2      11    183              15           62
#16     3      9     183              31           65
#17     9      18    274              77           130
#18     3      16    274              70           67
#19     3      10    218              44           58
#20     6      20    274              49           138
#21     10     27    204              133          190
#22     6      10    204              95           81
#23     13     7     204              130          62
#24     12     19    204              130          65
#25     28     37    183              127          122
#26     36     39    274              122          166
#27     13     28    183              110          74
#28     13     31    274              131          79
#29     11     11    309              12           145
#30     10     20    344              92           98
#31     27     44    274              128          131
Total   302    568

Step 2: Recognizing facts about the problem

To collect data on the problems in the estimation process, we applied a survey containing 48 objective questions divided into four categories: general, specification and estimation, sprint, and effort estimation (see Table 3). We used the Likert scale (Likert, 1932) to evaluate the estimation problems; respondents could choose among the alternatives "always", "usually", "sometimes", "rarely", and "never". Answering the survey took 30 minutes on average. Even though the respondents remained anonymous, we noticed that some were uncomfortable criticizing the company's estimation process. We tried to mitigate this by asking them to answer the survey in separate rooms and to use the same type of pen. Even though some criticisms might still have been omitted, we believe the results coherently reflect the reality of the development team.

Table 3 shows the proposed categories along with their problems; we assigned a code to each problem to make it easier to present our solutions. In the following, we briefly discuss each category. In the category general, the results show that there are situations in which the workers stop planned tasks to perform unplanned ones that were not expected when the schedule was created (code RSC01 in Table 3). Code RSC02 (category specification and estimation) indicates situations in which size estimates are not carried out; this compromises delivery in several ways, such as imprecise schedules and unplanned costs. The category sprint captures the fact that participants did not use the burndown chart as a stimulus to reach the sprint goals (see RSC03 in Table 3); without the chart, the development team does not follow the sprint performance, so there is no way to know whether it is on schedule. Another identified problem is that the tasks are only rarely, or just sometimes, delivered ready to use without any faults, pointing to the fact that some tasks will need to go through unscheduled corrections (see RSC04 in Table 3). As the burndown chart is not regularly maintained in the daily meetings, the team does not commit to the day-to-day delivery of goals.

Table 2. Time variation: planned versus executed.

Sprint  Planned time (h)  Executed time (h)  Difference  Variation
#24     141               192                +51         36%
#25     260               304                +44         17%
#26     328               459                +131        40%
#27     147               263                +116        79%
#28     161               205                +44         27%
#29     84                89                 +5          6%
#30     153               215                +62         41%
#31     274               366                +92         34%
Variation mean: 35% (± 20.01%)

Table 3. Problems influencing estimations.

Category                      Code   Description
General                       RSC01  Performing non-sprint tasks
Specification and estimation  RSC02  Size estimation process not performed
Sprint                        RSC03  Burndown chart not used
                              RSC04  Lack of commitment in the delivery of tasks
Effort estimation             RSC05  Effort estimation process not performed
                              RSC06  Estimates of urgent tasks not performed
                              RSC07  Lack of technical knowledge

The category effort estimation comprises three further problems. RSC05 states that the effort estimation process does not occur consistently, which means there are situations where estimates are not made. Another problem is that some urgent tasks are added during the sprint without effort estimation, which may delay task completion and distort the estimation measurements (see RSC06 in Table 3). Finally, RSC07 states that developers need to be aware of all the pre-existing code in the application, yet that code is rarely consulted during the estimation process. This lack of awareness causes uncertainty in the estimates, meaning the code will likely have to be reworked during development.

From this analysis, Table 3 summarizes the problems that must be examined and discussed in order to propose a method that minimizes their effects as much as possible. The study and discussion of the relevant literature, in addition to the survey results, let us conclude the following: (i) the team does not have much experience in measuring effort, (ii) there is not much historical data about productivity, and (iii) the team is not very confident about the effort measurements. The team usually estimates backlog stories; however, a new story inserted into a running sprint is rarely estimated. Based on the two previous steps, we plan how to solve or minimize the problems faced by the team; this is the third step of action research.

Step 3: Activity planning

In Step 2, we identified the estimation process problems. The analysis confirmed problems in the estimation process, which served as a process improvement opportunity. We analyzed the current process used by the company and proposed an approach to improve it according to the problems identified in Table 3.

RSC01 – The proposed solution was to implement a Kanban (Stellman and Greene, 2014; dos Santos et al., 2018), so that unplanned tasks may be executed without interfering with the progress of the current sprint. This process may also be used to handle urgent corrections; the Kanban runs alongside the sprint.

RSC02 and RSC05 – We proposed changes to the way of estimating. The estimation order was inverted in the proposed model: before the sprint meeting starts, size estimation is done and the backlog is prioritized. Then the effort estimation process, which happens during sprint planning, can begin.

RSC03 – The daily meetings should update the burndown chart.
Developers must answer three questions: "What have you done today?", "What will you do tomorrow?", and "Which problems have you faced?". These three questions were inspired by the Scrum Guide 2017 (Schwaber and Sutherland, 2017), currently used by the company.

RSC04 – The chart should be regularly updated in each daily meeting, and the team should justify internally (among the developers) the daily results with respect to the goal.

RSC06 – This problem is minimized by the Kanban proposed for RSC01: in this process, at least one developer is ready to quickly handle unplanned incoming tasks.

RSC07 – To address this problem, the company must provide the developers with specialized training on the subjects in which they face more problems.

After the discussion of the proposed approach, its implementation had to be defined. First, the project office plans the definition phase, from the requirements to the implementation. The planning feeds a project management tool to better control the outputs; in this case study, Redmine (www.redmine.org) was the chosen tool. Later, the planning and product project phase starts: the project team estimates the size of the demands and then prioritizes the backlog (RSC07). In case of a correction or an urgency, the demand is sent to the Kanban (RSC01 and RSC06); otherwise, it goes to the sprint, and a meeting with the developers is called. In this phase, the demands, requirements, and related interfaces are presented to the development team. The team debates what was presented and estimates the effort of those demands (RSC05). The project development phase starts once the sprint opening is done: the project team selects the prioritized stories to develop in the sprint, the team starts working, and the stories are finished by the end of 15 days and presented in the sprint meeting.

Step 4: Implementation

This step was the implementation of what was planned for the sprint; the team compares the planned estimation to the actual size. Two sprints were used as pilots: sprints #36 and #37.

RSC01 and RSC06 – A Kanban was implemented to reduce these factors. A developer is now ready to solve any unforeseeable issues that may occur during the sprint and to make corrections; in the Kanban, the task is developed, tested, and integrated directly into the main branch.

RSC02 and RSC05 – Before a sprint opening meeting, the planning team decides which tasks will be in the sprint and estimates their sizes. The team may then analyze whether it is necessary to add or remove tasks to fit the upcoming sprint. Lastly, the development team estimates the effort, and the opening meeting is held.

RSC03 and RSC04 – In the current model, the developers hold the daily meeting at the end of the afternoon and answer the three proposed questions: "What have you done today?", "What will you do tomorrow?", and "Which drawbacks have you faced?". Besides the meeting, the development team fills in the burndown chart, making it possible to analyze the chart and explain the daily results in relation to the goal.

RSC07 – The impact of this factor was reduced by offering the development team opportunities to improve their technical knowledge. According to Singh et al. (2019), the people involved in a working process should be trained to guarantee their tasks are executed in the best possible way to meet the company's goal.
To allow this and reduce the impact of the identified problems, the company provided online training to the employees, besides intensifying knowledge-sharing practices.

Step 5: Monitoring

This step consisted of the active participation of the researchers in implementing the process change measurements, making suggestions, and helping to validate the results of the actions. The critical point of this step was to check the project's evolution and ensure that the schedule was adequate to reach the initial goals.

Step 6: Assessment of the results

During the case study, meetings were held to evaluate the results and discuss problems. These meetings raised issues about interruptions affecting the team's efficiency, which the team could not control. The interruptions were treated as a new problem so that an improvement could be implemented, and the action plan was then refined as described in the next step.

Step 7: Improving the action plans

After implementing the technical improvements and evaluating their effects, there were still many interruptions in the development environment. An interruption can be internal (by a team member) or external (by someone outside the sprint), and such interruptions reduce productivity; see the new problems in Table 4.

Table 4. New problems that may influence the estimates.

Category  Code   Description
General   RSC08  External interruptions
          RSC09  Internal interruptions

We suggested the Pomodoro Technique to address this issue. During the Pomodoro time, nobody may interrupt a colleague, except for very urgent issues. An online timer (www.tomatotimers.com) is used to control each Pomodoro, and a sign was created to inform co-workers that a developer is in a Pomodoro; it is visible to everyone, with one side reading "Pomodoro" and the other "Clear". To use the technique, the worker picks a task and counts 25 minutes of Pomodoro; then, for each Pomodoro, the time worked must be registered in Redmine. Each worker must turn the Pomodoro sign according to their status and pause for 5 minutes at most. After every four Pomodoros, a longer pause (around 15 minutes) can be taken.
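As an illustration only (the team used an off-the-shelf online timer; the class below is our sketch of the agreed cycle, and all names in it are ours):

```java
import java.time.Duration;

// Sketch of the Pomodoro cycle adopted by the team: 25 minutes of
// uninterrupted work, a pause of at most 5 minutes, and a longer pause
// (around 15 minutes) after every fourth Pomodoro.
public class PomodoroCycle {

    static final Duration WORK = Duration.ofMinutes(25);
    static final Duration SHORT_PAUSE = Duration.ofMinutes(5);
    static final Duration LONG_PAUSE = Duration.ofMinutes(15);

    public static void main(String[] args) throws InterruptedException {
        for (int pomodoro = 1; pomodoro <= 8; pomodoro++) {
            turnSign("Pomodoro");              // interruptions are not allowed
            Thread.sleep(WORK.toMillis());
            registerWorkedTime(WORK);          // e.g., log the time in Redmine
            turnSign("Clear");                 // interruptions are allowed again
            Duration pause = (pomodoro % 4 == 0) ? LONG_PAUSE : SHORT_PAUSE;
            Thread.sleep(pause.toMillis());
        }
    }

    static void turnSign(String side) {
        System.out.println("Sign now shows: " + side);
    }

    static void registerWorkedTime(Duration worked) {
        System.out.println("Worked " + worked.toMinutes() + " minutes");
    }
}
```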
Step 8: Action-research cycle conclusion

Our approach was tested in eight new sprints to assess its performance in addressing the problems of task effort estimation. In total, nine problems had to be treated to improve the company's estimation process. The improvement brought by our approach is shown in Table 5: the average variation between planned and executed time is 14.5% (standard deviation of 6.8%). Compared to Table 2 (35% ± 20.01%), the accuracy improvement is of approximately 1.5 times. The results indicate that the action research method, which involves cooperation between the researchers and the study participants, is helpful in reducing software effort estimation errors.

Table 5. Time variation: planned versus executed after the improvements.

Sprint  Planned time (h)  Executed time (h)  Difference  Variation
#36     134               119                -15         11% ↓
#37     80                102                +22         28% ↑
#38     106               115                +9          8% ↑
#39     69                83                 +13         19% ↑
#40     94                110                +16         17% ↑
#41     81                87                 +6          7% ↑
#42     205               187                -18         9% ↓
#43     167               195                +28         17% ↑
Variation mean: 14.5% (± 6.8%)

6 Conclusion

This paper presented a case study investigating how action research can help developers address problems in the estimation process. We first studied the target company's estimation process and analyzed its historical data; then, we surveyed the development team to find the reasons for the effort estimation errors. Using the action research method, involving both the researchers and the developers, we proposed an approach to help the development team better estimate task effort. We accomplished our goal by identifying the problems and implementing changes in the current software estimation process. After implementing the suggested procedures, the results indicated that we reached the main goal: addressing the problems in the estimation process. Comparing the estimated time with and without our method, estimation accuracy improved 1.5 times with respect to the historical data. The action research method guided the whole process of our proposal and proved very effective in our case study. There are some threats to the validity of our approach; however, using 31 sprints as historical data and eight sprints to compare the results validates our findings satisfactorily. Recommendations for future work are: (i) increase the number of case studies to compare the results; (ii) apply statistical methods to the historical productivity data for short- and long-term estimates; and (iii) evaluate known estimation problems in the software development process by analyzing the techniques for solving them.

Acknowledgments

The authors thank FAPESC for the financial support for the paper proofreading. Project approved under grant term n. 2021TR001877.

References

Altaleb, A. and Gravell, A. (2018). Effort estimation across mobile app platforms using agile processes: a systematic literature review. Journal of Software, 13(4):242.

Bannerman, P. L. (2008). Risk and risk management in software projects: a reassessment. J. Syst. Softw., 81(12):2118–2133.

Bilgaiyan, S., Sagnika, S., Mishra, S., and Das, M. (2017). A systematic review on software cost estimation in agile software development. Journal of Engineering Science & Technology Review, 10(4).

Bradbury-Huang, H. (2010). What is good action research? Why the resurgent interest? Action Research, 8(1):93–109.

Choraś, M., Springer, T., Kozik, R., López, L., Martínez-Fernández, S., Ram, P., Rodriguez, P., and Franch, X. (2020). Measuring and improving agile processes in a small-size software development company. IEEE Access, 8:78452–78466.

Cirillo, F. (2022). Pomodoro Technique. [Online; accessed 10-Dec-2022].

Cordeiro, L. and Soares, C. B. (2018). Action research in the healthcare field: a scoping review. JBI Evidence Synthesis, 16(4):1003–1047.

Creswell, J. W. (2010). Projeto de pesquisa: métodos qualitativo, quantitativo e misto. Penso Editora.

de Souza, L. L. C. (2013). Suporte ao processo de monitoramento e controle de projetos de software: uma abordagem inteligente com base na teoria do valor agregado. Dissertação de mestrado, Universidade Estadual do Ceará.

Dingsøyr, T., Hanssen, G. K., Dybå, T., Anker, G., and Nygaard, J. O. (2006). Developing software with Scrum in a small cross-organizational project. In European Conference on Software Process Improvement, pages 5–15. Springer.

dos Santos, P. S. M., Beltrão, A. C., de Souza, B. P., and Travassos, G. H. (2018). On the benefits and challenges of using Kanban in software engineering: a structured synthesis study. Journal of Software Engineering Research and Development, 6(1):1–29.

Elg, M., Gremyr, I., Halldorsson, Á., and Wallo, A. (2020). Service action research: review and guidelines. Journal of Services Marketing.
Fenton, N. and Bieman, J. (2014). Software Metrics: A Rigorous and Practical Approach. CRC Press, Inc., USA, 3rd edition.

Flores, A. P. M. and de Alencar, F. M. R. (2020). Competencies development based on thinking-based learning in software engineering: an action-research. In Proceedings of the 34th Brazilian Symposium on Software Engineering, pages 680–689.

Gautam, S. S. and Singh, V. (2018). The state-of-the-art in software development effort estimation. Journal of Software: Evolution and Process, 30(12):e1983.

Gil, A. C. (2008). Métodos e técnicas de pesquisa social. 6. ed. Editora Atlas SA.

Gil, A. C. et al. (2002). Como elaborar projetos de pesquisa, volume 4. Atlas, São Paulo.

Godoy, A. S. (1995). Pesquisa qualitativa: tipos fundamentais. Revista de Administração de Empresas, pages 20–29.

Gupta, S. K., Gunasekaran, A., Antony, J., Gupta, S., Bag, S., and Roubaud, D. (2019). Systematic literature review of project failures: current trends and scope for future research. Computers & Industrial Engineering, 127:274–285.

Hoda, R., Henderson, A., Lee, S., Beh, B., and Greenwood, J. (2014). Aligning technological and pedagogical considerations: harnessing touch-technology to enhance opportunities for collaborative gameplay and reciprocal teaching in NZ early education. International Journal of Child-Computer Interaction, 2(1):48–59.

Hohl, P., Klünder, J., van Bennekum, A., Lockard, R., Gifford, J., Münch, J., Stupperich, M., and Schneider, K. (2018). Back to the future: origins and directions of the "Agile Manifesto" – views of the originators. Journal of Software Engineering Research and Development, 6(1):1–27.

Jorgensen, M. and Shepperd, M. (2006). A systematic review of software development cost estimation studies. IEEE Transactions on Software Engineering, 33(1):33–53.

Kirmani, M. M. and Wahid, A. (2015). Use case point method of software effort estimation: a review. International Journal of Computer Applications, 116(15):43–47.

Larman, C. and Basili, V. R. (2003). Iterative and incremental developments: a brief history. Computer, 36(6):47–56.

Likert, R. (1932). A technique for the measurement of attitudes. No. 136-165. Publisher not identified.

Marinho, M., Lima, T., Sampaio, S., and Moura, H. (2015). Uncertainty management in software projects: an action research. In Experimental Software Engineering Track – XVIII CIbSE Ibero-American Conference on Software Engineering. CIbSE.

Maxwell, K. D. (2001). Collecting data for comparability: benchmarking software development productivity. IEEE Software, 18(5):22–25.

McKay, J. and Marshall, P. (2001). The dual imperatives of action research. Information Technology & People.

Nhung, H. L. T. K., Hoc, H. T., and Hai, V. V. (2019). A review of use case-based development effort estimation methods in the system development context. In Proceedings of the Computational Methods in Systems and Software. Springer.

Peruzzo, C. (2016). Epistemologia e método da pesquisa-ação: uma aproximação aos movimentos sociais e à comunicação. Anais XXV Encontro Anual da Compós, pages 1–22.

Pillai, S. P., Madhukumar, S., and Radharamanan, T. (2017). Consolidating evidence based studies in software cost/effort estimation – a tertiary study. In TENCON 2017 – 2017 IEEE Region 10 Conference, pages 833–838.
PMI (2021). A Guide to the Project Management Body of Knowledge (PMBOK© Guide). Project Management Institute (PMI), USA, 7th edition.

Pressman, R. (2014). Software Engineering: A Practitioner's Approach. McGraw-Hill, Inc., USA, 8th edition.

Schwaber, K. and Sutherland, J. (2017). The Scrum Guide. The definitive guide to Scrum: the rules of the game. ScrumGuides.

Schwaber, K. and Sutherland, J. (2020). The Scrum Guide. The definitive guide to Scrum: the rules of the game.

Singh, S. K., Gupta, S., Busso, D., and Kamboj, S. (2019). Top management knowledge value, knowledge sharing practices, open innovation and organizational performance. Journal of Business Research.

Sommerville, I. (2015). Software Engineering. Pearson Education Limited, 10th edition.

Stellman, A. and Greene, J. (2014). Learning Agile: Understanding Scrum, XP, Lean, and Kanban. O'Reilly Media, Inc.

Sá, M., Silva, A., Oliveira, G., and Silveira, J. (2017). O método Getting Things Done (GTD) e as ferramentas de gerenciamento de tempo e produtividade. Navus – Revista de Gestão e Tecnologia, 8(1):72–87.

The Standish Group CHAOS Report (2018). Decision latency theory: it's all about the interval. Technical report, The Standish Group International.

Thiollent, M. (2011). Metodologia da pesquisa-ação. 18. ed. São Paulo: Cortez.

Trendowicz, A. and Jeffery, R. (2014). Software Project Effort Estimation: Foundations and Best Practice Guidelines for Success. Springer.

Journal of Software Engineering Research and Development, 2019, 6:1, DOI: 10.5753/jserd.2019.17. This work is licensed under a Creative Commons Attribution 4.0 International License.

Improving Energy Efficiency Through Automatic Refactoring

Luis Cruz [INESC-ID, University of Porto | luiscruz@fe.up.pt]
Rui Abreu [INESC-ID, IST, University of Lisbon | rui@computer.org]

Abstract

The ever-growing popularity of mobile phones has brought additional challenges to the software development lifecycle. Mobile applications ought to provide the same set of features as conventional software with limited resources, such as limited processing capabilities, storage, screen size and, not less important, power source. Although energy efficiency is a valuable requirement, developers often lack knowledge of best practices. In this paper, we propose a tool to improve the energy efficiency of Android applications using automatic refactoring: Leafactor. The tool features five energy-related code smells that tend to go unnoticed. In addition, to evaluate the effectiveness of our approach, we run an experiment over a dataset of 140 free and open source apps. As a result, we detected and fixed code smells in 45 Android apps, from which 40% have successfully merged our changes into the official repository.

Keywords: automatic refactoring, mobile computing, energy efficiency, software engineering

1 Introduction

In the past decade, the advent of mobile devices has brought new challenges and paradigms to existing computing models. One of the major challenges is the fact that mobile phones have limited battery life; as a consequence, users need to frequently charge their devices to prevent their inoperability. Hence, energy efficiency is an important non-functional requirement in mobile software, with a valuable impact on usability.
A study in 2013 reported that 18% of apps have user feedback related to energy consumption (Wilke et al., 2013). Other studies have found that most developers lack knowledge about best practices for energy efficiency in mobile applications (apps) (Pang et al., 2015; Sahin et al., 2014). Hence, it is important to provide developers with actionable documentation and toolsets that help deliver energy-efficient apps.

Previously, we identified five code smells with a significant impact on the energy consumption of Android apps (Cruz and Abreu, 2017); we refer to them as energy-related smells. We used a hardware-based approach to assess the energy efficiency improvement of fixing eight performance-based code smells described in the official Android documentation. The impact on energy efficiency was evaluated by manually refactoring the codebases of five open-source Android applications, and the energy consumption was measured for every pair of versions: before and after the refactoring. The measurements were performed by mimicking real use-case scenarios while collecting power data with ODROID, a single-board computer that runs Android and features power sensors for energy measurements. Of those eight refactorings, five were found to yield a significant improvement in the energy consumption of mobile apps. However, certifying that code complies with these optimizations is time-consuming and prone to errors. Thus, in this paper we study how automatic refactoring can help developers write code that follows energy best practices.

There are state-of-the-art tools that provide automatic refactoring for Android and Java apps, for instance AutoRefactor (http://autorefactor.org), Walkmod (http://walkmod.com), Facebook pfff (https://github.com/facebookarchive/pfff), and Kadabra (http://specs.fe.up.pt/tools/kadabra/). Although these tools help developers create better code, they do not feature energy-related refactorings for Android. Thus, we implement five energy optimizations in an automatic refactoring tool, Leafactor, which is publicly available under an open source license. In addition, the toolset has the potential to serve as an educational tool, helping developers understand which practices can improve energy efficiency. On top of that, we analyze how Android developers are addressing energy-related smells and how an automatic refactoring tool can help ship more energy-efficient mobile software. We have used the results of our tool to contribute to real Android app projects, validating the value of adopting an automatic refactoring tool in the development stack of mobile apps. In a dataset of 140 free and open source software (FOSS) Android apps, we found that a considerable share (32%) is released with energy inefficiencies. We fixed 222 energy-related smells in 45 apps, of which 18 have successfully merged our changes into the official branch. Results show that automatic refactoring tools can be very helpful in improving the energy footprint of apps.

This paper is an extension of our previous work, in which we introduced the automatic refactoring tool Leafactor (Cruz et al., 2017; Cruz and Abreu, 2018) for the first time. We provide a self-contained report of our work on improving the energy efficiency of mobile apps via automatic refactoring, adding details of the architecture of the toolset and the available set of refactorings. Moreover, we give a more comprehensive description of the dataset used in the empirical study, including complexity metrics.
Combined, our work makes the following contributions:

• An automated refactoring tool, Leafactor, to improve the energy efficiency of Android applications.
• An empirical study of the prevalence of five energy-related code smells in FOSS Android applications.
• The submission of 59 pull requests to the official codebases of 45 FOSS Android applications, comprehending 222 energy efficiency refactorings.

The remainder of this paper is organized as follows: Section 2 details the energy refactorings and their impact on energy consumption; Section 3 presents the automatic refactoring toolset that was implemented; Section 4 describes the experimental methodology used to validate our tool, followed by Sections 5 and 6 with results and discussion; Section 7 presents the related work in this field; and finally Section 8 summarizes our findings and discusses future work.

2 Energy Refactorings

We use static code analysis and automatic refactoring to apply Android-specific energy efficiency optimizations. In this section, we describe refactorings that are known to improve the energy consumption of Android apps. Each of them comes with an indication of the energy efficiency improvement, as assessed in previous work (Cruz and Abreu, 2017), and the fix priority given in the official Lint documentation (Lint is a tool provided with the Android SDK that detects problems related to the structural quality of code: https://developer.android.com/studio/write/lint). The priority reflects the impact of the refactoring in terms of performance and is given on a scale of 1 to 10, with 10 being the most effective; it is not necessarily correlated with energy efficiency. In addition, we provide examples of how the refactorings are applied. All refactorings are in Java, with the exception of ObsoleteLayoutParam, which is in XML, the markup language used in Android to define the user interface (UI).

2.1 ViewHolder: Add ViewHolder to scrolling lists

Energy efficiency improvement: 4.5%. Lint priority: ■■■■■□□□□□ 5/10.

This refactoring is used to make scrolling in list views smoother, with no lags. In a list view, the system has to draw each item separately. To make this process more efficient, data from the previously drawn item should be reused; this technique decreases the number of calls to the method findViewById(), which is known for being a very inefficient method (Linares-Vásquez et al., 2014). The following code snippet provides an example of how to apply ViewHolder:

```java
// ...
@Override
public View getView(final int position, View convertView, ViewGroup parent) {
    convertView = LayoutInflater.from(getContext()).inflate( // ¶
        R.layout.subforsublist, parent, false
    );
    final TextView t = ((TextView) convertView.findViewById(R.id.name)); // ·
    // ...
```

Optimized version:
```java
// ...
private static class ViewHolderItem { // ¸
    private TextView t;
}

@Override
public View getView(final int position, View convertView, ViewGroup parent) {
    ViewHolderItem viewHolderItem;
    if (convertView == null) { // ¹
        convertView = LayoutInflater.from(getContext()).inflate(
            R.layout.subforsublist, parent, false
        );
        viewHolderItem = new ViewHolderItem();
        viewHolderItem.t = ((TextView) convertView.findViewById(R.id.name));
        convertView.setTag(viewHolderItem);
    } else {
        viewHolderItem = (ViewHolderItem) convertView.getTag();
    }
    final TextView t = viewHolderItem.t; // º
    // ...
```

¶ In every invocation of the method getView, a new LayoutInflater object is instantiated, overwriting the method's parameter convertView.
· Each item in the list has a view to display text, a TextView object. This view is fetched in every invocation, using the method findViewById().
¸ A new class is created to cache common data between list items. It is used to store the TextView object and prevent it from being fetched in every invocation.
¹ This block runs only for the first item of the list; subsequent invocations receive the convertView from the parameters.
º It is no longer necessary to call findViewById() to retrieve the TextView object.

One might argue that the version of the code after refactoring is considerably less intuitive. This is in fact true, which might be a reason for developers to skip the optimization. However, regardless of whether this concern should be addressed by the system, it is the recommended approach, as stated in the official Android documentation (https://developer.android.com/guide/topics/ui/layout/recyclerview). See more on this discussion in Section 6.

2.2 DrawAllocation: Remove allocations within drawing code

Energy efficiency improvement: 1.5%. Lint priority: ■■■■■■■■■□ 9/10.

Draw operations are very sensitive to performance. It is a bad practice to allocate objects during such operations, since it can create noticeable lags. The recommended fix is to allocate objects upfront and reuse them in each drawing operation, as shown in the following example:
the following example shows an activity that uses a wake lock: extends activity { private wakelock wl; @override protected void oncreate(bundle savedinstancestate) { super.oncreate(savedinstancestate); powermanager pm = (powermanager) this. getsystemservice( context.power_service ); wl = pm.newwakelock( powermanager.screen_dim_wake_lock | powermanager. on_after_release, "wakelocksample" ); wl.acquire();¶ } } ¶ using the method acquire() the app asks the device to stay on. until further instruction, the device will be deprived of sleep. since no instruction is stopping this behavior, the device will not be able to enter a sleep mode. although in exceptional cases this might be intentional, it should be fixed to prevent battery drain. the recommended fix is to override the method onpause() in the activity: //... @override protected void onpause(){ super.onpause(); if (wl != null && !wl.isheld()) { wl.release(); } } //... with this solution, the lock is released before the app switches to background. 2.4 recycle: fix missing recycle() calls 0.7%. lint priority: ■■■■■■■□□□ 7/10. there are collections such as typedarray that are implemented using singleton resources. hence, they should be released so that calls to different typedarray objects can efficiently use these same resources. the same applies to other classes (e.g., database cursors, motion events, etc.). the following snippet shows an object of typedarray that is not being recycled after use: public void wrong1(attributeset attrs, int defstyle) { final typedarray a = getcontext(). obtainstyledattributes( attrs, new int[] { 0 }, defstyle, 0 ); string example = a.getstring(0); } solution: public void wrong1(attributeset attrs, int defstyle) { final typedarray a = getcontext(). obtainstyledattributes( attrs, new int[] { 0 }, defstyle, 0 ); string example = a.getstring(0); if (a != null) { a.recycle();¶ } } ¶ calling the method recycle() when the object is no longer needed, fixes the issue. the call is encapsulated in a conditional block for safety reasons. besides typedarray instances, this refactoring is also applied to instances of following classes: cursor, velocitytracker, motionevent, parcel, and contentproviderclient. 2.5 obsoletelayoutparam (olp): remove obsolete layout parameters 0.7%. lint priority: ■■■■■■□□□□ 6/10. during development, ui views might be refactored several times. in this process, some parameters might be left unchanged even when they have no effect in the view. this is a code smell that needs to be fixed since it causes useless attribute processing at runtime. the refactoring is applied by removing the obsolete parameters from the ui specification. as an example, consider the following code snippet (xml): /* deleteme */ ¶ ¶ the property android:layout_alignparentbottom is used for views inside a relativelayout to align the bottom edge of a view (i.e., the textview, in this example) with the bottom edge of the relativelayout. on contrary, linearlayout is not compatible with this property, having no effect in this example. it is safe to remove the property cruz et al. 2019 table 1. layout-related parameters that only have a visual effect when defined inside specific layouts. 
Table 1. Layout-related parameters that only have a visual effect when defined inside specific layouts.

Layout parameter               Allowed parent layout
layout_x                       AbsoluteLayout
layout_y                       AbsoluteLayout
layout_weight                  LinearLayout, ActionMenuView, ListRowHoverCardView, ListRowView, NumberPicker, RadioGroup, SearchView, TabWidget, TableLayout, TableRow, TextInputLayout, ZoomControls
layout_column                  GridLayout, TableLayout, TableRow
layout_columnSpan              GridLayout
layout_row                     GridLayout
layout_rowSpan                 GridLayout
layout_alignLeft               RelativeLayout
layout_alignStart              RelativeLayout
layout_alignRight              RelativeLayout
layout_alignEnd                RelativeLayout
layout_alignTop                RelativeLayout
layout_alignBottom             RelativeLayout
layout_alignParentTop          RelativeLayout
layout_alignParentBottom       RelativeLayout
layout_alignParentLeft         RelativeLayout
layout_alignParentStart        RelativeLayout
layout_alignParentRight        RelativeLayout
layout_alignParentEnd          RelativeLayout
layout_alignWithParentMissing  RelativeLayout
layout_alignBaseline           RelativeLayout
layout_centerInParent          RelativeLayout
layout_centerVertical          RelativeLayout
layout_centerHorizontal        RelativeLayout
layout_toRightOf               RelativeLayout
layout_toEndOf                 RelativeLayout
layout_toLeftOf                RelativeLayout
layout_toStartOf               RelativeLayout
layout_below                   RelativeLayout
layout_above                   RelativeLayout

Figure 1. Architecture diagram of the automatic refactoring toolset.

3 Automatic Refactoring Tool

In the scope of our study, we developed a tool to statically analyze and transform code, implementing Android-specific energy efficiency refactorings: Leafactor. The toolset receives a single file, a package, or a whole Android project as input and looks for eligible files, i.e., Java or XML source files. It automatically analyzes those files and generates a new, compilable, and optimized version. The architecture of Leafactor is depicted in Figure 1. There are two separate engines: one to handle Java files and another to handle XML files. The refactoring engine for Java is implemented as part of the open-source project AutoRefactor, an Eclipse plugin to automatically refactor Java codebases.

Figure 2. Developers can apply refactorings by selecting the "Automatic refactoring" option or by using the corresponding key combination.

3.1 AutoRefactor

AutoRefactor is an Eclipse plugin that delivers automatic refactoring of Java codebases. It was created as a complement to existing static analyzers such as SonarQube, FindBugs, Checkstyle, and PMD: although these provide insightful warnings to developers, they do little to help developers fix all the issues lying in legacy codebases. AutoRefactor provides a comprehensive set of 103 common code cleanups to help deliver "smaller, more maintainable and more expressive code bases" (as described on the official website: http://autorefactor.org). The list goes from simple rules, such as enforcing the use of the method isEmpty() to check whether a collection is empty instead of checking its size (rule IsEmptyRatherThanSize), to more complex ones, such as SetRatherThanList, which chooses a more adequate collection type for specific use cases. In addition, AutoRefactor also supports cleanups of code comments, such as removing auto-generated or empty Javadocs from the codebase (named by AutoRefactor as rule Comments). The Eclipse Marketplace (https://marketplace.eclipse.org) reported 4459 successful installs of AutoRefactor. A common use case is presented in the screenshot of Figure 2: developers can apply refactorings to single files, packages, or entire projects.
Under the hood, AutoRefactor integrates a handy and concise API to manipulate Java abstract syntax trees (ASTs). We contributed to the project by implementing the Java refactorings described in Section 2.

8 As described on the official website: http://autorefactor.org, visited on August 17, 2019.
9 The Eclipse Marketplace is an interface for browsing and installing plugins for the Java IDE Eclipse: https://marketplace.eclipse.org, visited on August 17, 2019.

3.2 XML refactorings

Since XML refactorings are not supported by AutoRefactor, a separate refactoring engine was developed and integrated into Leafactor. The engine features a command-line interface that can be integrated into continuous integration environments. Optionally, the tool can be set to simply flag warnings, without performing any refactoring transformation. As detailed in the previous section, only a single XML refactoring is offered — ObsoleteLayoutParam.

Figure 3. Experiment's procedure for a single app: (1) collect metadata from F-Droid; (2) fork repository; (3) select optimization; (4) create branch; (5) apply Leafactor; (6) validate changes; (7) commit & push changes; (8) submit PR.

4 Empirical evaluation

We designed an experiment with the following goals:

• study the benefits of using an automatic refactoring tool within the Android development community;
• study how FOSS Android apps are adopting energy efficiency optimizations; and
• improve the energy efficiency of FOSS Android apps.

We adopted the procedure summarized in Figure 3. Starting with step 1, we collected data from the F-Droid app store10 — a catalog of free and open-source software (FOSS) applications for the Android platform. For each mobile application, we collected the Git repository location, which was used in step 2 to fork the repository and prepare it for a potential contribution to the project's official code repository. Next, in step 3, we selected one refactoring to be applied, initiating a process that was repeated for all refactorings (steps 4–8): the project was analyzed and, if any transformation was applied, a new pull request (PR) was submitted to be considered by the project's integrator. Since we wanted to engage the community and get feedback about the refactorings, we manually created each PR with a personalized message, including a brief explanation of the committed code changes.

We analyzed 140 free and open-source Android apps collected from F-Droid11. Apps were selected by publish date (i.e., priority was given to newly released apps), considering exclusively Java projects (e.g., Kotlin projects were filtered out) with a GitHub repository. We selected only one Git service for the sake of simplicity. Apps in the dataset are spread across 17 different categories, as depicted in Figure 4. Table 2 presents descriptive statistics for the source code and repositories of the mobile applications in the dataset: number of lines of code (LOC), McCabe's cyclomatic complexity (CC), mean weighted methods per class12 (WMC), lack of cohesion of methods13 (LCOM) (Etzkorn et al., 1998), number of Java files, number of XML files, number of GitHub forks, GitHub stars, and contributors.

10 The F-Droid repository is available at https://f-droid.org, visited on August 17, 2019.
11 Data was collected on Nov 27, 2016, and is available at https://doi.org/10.6084/m9.figshare.7637402.
12 Weighted methods per class (WMC) is the sum of the complexity of the methods in a class.
13 Lack of cohesion of methods (LCOM) is a software code metric that measures the correlation between class members and methods. Values fall between 0, indicating perfect cohesion, and 1, indicating a complete lack of cohesion.

Figure 4. Number of apps per category in the dataset.

These metrics were collected using the static analysis tool Designite14 and the GitHub API v315. The dataset comprises very diverse mobile applications. It goes from very simple apps, such as Storage-USB16, with 13 LOC and a CC of 2, to large apps, such as Slide17, with almost 400k LOC and a CC of 14631, or OsmAnd18, with over 300k LOC and a CC of 77889. The largest project in terms of Java files is TinyTravelTracker (1878), while NewsBlur is the largest in terms of XML files (2109). Most apps in the dataset have reasonable cohesion, with LCOM below 0.34 for 75% of the apps; apps with low/moderate cohesion were also analyzed, having LCOM values of up to 0.67. In total, we analyzed 2.8M lines of Java code in 6.79GB of Android projects in 4.5 hours — 15103 XML files and 15308 Java files.

5 Results

Our experiment yielded a total of 222 refactorings, which were submitted to the original repositories as PRs. Multiple refactorings of the same type were grouped in a single PR to avoid creating too many PRs for a single app. This resulted in 59 PRs spread across 45 apps. This is a demanding process, since each project has different contributing guidelines. Nevertheless, at the time of writing, 18 apps had successfully merged our contributions for deployment. An example of the PRs submitted to the projects is illustrated in Figure 5. Leafactor performed the ViewHolder refactoring in the app Slide19, and the developers successfully merged our PR. The full thread can be found in the GitHub project ccrama/slide under reference #234620.

14 Designite's website: http://www.designite-tools.com, visited on August 17, 2019.
15 GitHub API v3's website: https://developer.github.com/v3/, visited on August 17, 2019.
16 Storage-USB basically launches storage settings directly from the apps drawer. GitHub repository: https://github.com/enricocid/storage-usb, visited on August 17, 2019.
17 Slide is a browser for the social news forum Reddit. GitHub repository: https://github.com/ccrama/slide, visited on August 17, 2019.
18 OsmAnd is a navigation app. GitHub repository: https://github.com/osmandapp/osmand, visited on August 17, 2019.
19 Slide's website: http://trikita.co/slide/, visited on August 17, 2019.
20 PR with the ViewHolder refactoring for the app Slide: https://github.com/ccrama/slide/pull/2346, visited on August 17, 2019.

Table 2. Descriptive statistics of projects in the dataset.
          LOC       CC      WMC    LCOM   Java files  XML files  GitHub forks  GitHub stars  Contributors
mean      20350     3532    17.41  0.29   103         102        65            179           15
min       13        2       1.00   0.00   0           4          0             0             1
25%       1444      271     11.14  0.23   13          23         3.75          7.75          2
median    4641      946     15.20  0.27   38          48         9             24            3
75%       14795     3007    21.50  0.34   106         97         39            111           10
max       388853    77889   82.82  0.67   1678        2109       1483          4488          323
total     2869394   –       –      –      15308       15103      9547          26484         2162

Table 3. Summary of refactoring results.

Refactoring              ViewHolder  DrawAllocation  WakeLock  Recycle  OLP*   Total
Total refactorings       7           0               1         58       156    222
Total projects           5           0               1         23       30     45
Percentage of projects   4%          0%              1%        16%      21%    32%
Incidence per project    1.4×        –               1.0×      2.5×     5.2×   4.8×
* OLP — ObsoleteLayoutParam

Figure 5. An example of a pull request submitted to the Android project Slide.

Table 3 presents the results for each refactoring. It shows the total number of applied refactorings, the total number of affected projects, the percentage of affected projects, and the average number of refactorings per affected project. In addition, the table presents the combined results for the occurrence of any type of refactoring (Total). ObsoleteLayoutParam was the most frequent refactoring: it was applied 156 times in a total of 30 projects out of the 140 in our dataset (21%). On average, each affected project had 5 occurrences of this refactoring. Recycle comes next, occurring in 23 projects (16%) with 58 refactorings. DrawAllocation and WakeLock showed only marginal impact. In addition, Figure 6 presents a bar plot summarizing the number of projects affected across all the studied refactorings.

The mobile application with the highest incidence of refactorings was the Android application for the cloud platform Nextcloud21: Leafactor refactored two occurrences of Recycle, two of ViewHolder, and six of ObsoleteLayoutParam. In terms of the total number of refactorings, QR Scanner22 was the app with the highest count, with 35 occurrences of ObsoleteLayoutParam.

21 Nextcloud's website: https://nextcloud.com, visited on August 17, 2019.
22 QR Scanner's entry on Google Play: https://play.google.com/store/apps/details?id=com.secuso.privacyfriendlycodescanner, visited on August 17, 2019.

Figure 6. Number of apps affected per refactoring.

For reproducibility and clarity of results, all the data collected in this study is publicly available23. In addition, all the PRs are public and can be accessed through the official repositories of the apps.

6 Discussion

The results show that an automatic refactoring tool can help developers ship more energy-efficient apps. A considerable part of the apps in this study (32%) had at least one energy inefficiency. Since these inefficiencies only become visible after long periods of app activity, they can easily go unnoticed. From the feedback developers provided in the PRs, we noticed that developers are open to recommendations from an automated tool. Only in a few exceptional cases did developers express being unhappy with our contributions. Reasons varied between seeing our PR as a critique of the developers' programming skills and simply not wanting to make changes in the components of the app affected by the refactoring. Nevertheless, most developers were curious about the refactorings, and they recognized being unaware of their impact on energy efficiency.
This is consistent with previous work (Pang et al., 2015; Sahin et al., 2014).

23 Spreadsheet with all experimental results: https://doi.org/10.6084/m9.figshare.7637402.

A positive outcome of our experimentation was that we were able to improve energy efficiency in the official releases of 18 Android apps. In a few cases, code smells were found in code that does not affect the energy consumption of the app itself (e.g., test code). In those cases, our PRs were not merged24. Nevertheless, we recommend consistently complying with these optimizations in all types of code, since new developers often use tests to understand how to contribute to a project.

Leafactor, akin to AutoRefactor, applies the refactorings without prompting developers for confirmation. This is a common approach for simple refactorings. Nevertheless, in the case of energy code smells, a single refactoring may entail changing several lines of code that the developer may not be able to interpret. In our experiments, this issue was mitigated because we submitted each PR with a brief explanation of the code smell and the applied refactoring. It would be interesting to consider alternative approaches in which developers are informed or prompted while having their code refactored.

The code smell related to ObsoleteLayoutParam was found in a considerable fraction of projects (21%). This relates to the fact that app views are often created in an iterative process with several rounds of trial and error. Since some parameters have no effect in specific contexts, useless UI specification statements can go unnoticed by developers.

Recycle is frequent too, being observed in 16% of projects. This smell involves Android API objects that are present in most projects (e.g., database cursors). Although a clean fix is to use the Java try-with-resources statement25, it requires API level 19 or later of the Android SDK (introduced with Android 4.4 KitKat). Hence, developers resort to a more verbose approach for backward compatibility, which requires explicitly closing resources and is therefore prone to mistakes (both forms are contrasted in the first sketch below).

Our DrawAllocation checker did not yield any results. It was expected that developers were already aware of DrawAllocation. Still, we were able to manually spot allocations happening inside drawing routines. However, those allocations use dynamic values to initialize the object, whereas our implementation scopes only allocations that will not change between iterations (the second sketch below illustrates the targeted pattern). Covering those missed cases would require updating the allocated object in every iteration. While spotting these cases is relatively easy, refactoring them would require better knowledge of the class being instantiated. Similarly, wake locks are very complex mechanisms, and fixing all their misuses still needs further work.
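As an illustration of the Recycle trade-off discussed above, the following sketch contrasts the try-with-resources form with the verbose, backward-compatible form; it is our own example, with a hypothetical table name, not code from the studied apps:

import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

class CursorExamples {
    // Clean fix (requires API level 19+): the cursor is closed
    // automatically, even when an exception is thrown.
    void withResources(SQLiteDatabase db) {
        try (Cursor c = db.query("apps", null, null, null, null, null, null)) {
            while (c.moveToNext()) {
                String name = c.getString(0);
            }
        }
    }

    // Backward-compatible form: the explicit close() in finally is
    // easy to forget, which is exactly what the Recycle smell catches.
    void explicitClose(SQLiteDatabase db) {
        Cursor c = db.query("apps", null, null, null, null, null, null);
        try {
            while (c.moveToNext()) {
                String name = c.getString(0);
            }
        } finally {
            c.close();
        }
    }
}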
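Likewise, here is a minimal sketch of the allocation pattern targeted by DrawAllocation; the custom view is our own hypothetical example, and the point is that the Paint object is hoisted out of the per-frame drawing routine:

import android.content.Context;
import android.graphics.Canvas;
import android.graphics.Paint;
import android.view.View;

class LineView extends View {
    // Allocated once and reused across frames; allocating the Paint
    // inside onDraw() would be flagged as a DrawAllocation smell,
    // because onDraw() may run on every frame.
    private final Paint paint = new Paint(Paint.ANTI_ALIAS_FLAG);

    LineView(Context context) {
        super(context);
    }

    @Override
    protected void onDraw(Canvas canvas) {
        super.onDraw(canvas);
        canvas.drawLine(0, 0, getWidth(), getHeight(), paint);
    }
}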
In the case of ViewHolder, although it only impacted 4% of the projects, we believe this has to do with the fact that (1) some developers already know this refactoring due to its performance impact, and (2) many projects do not implement dynamic list views. ViewHolder is the most complex refactoring we feature in terms of lines of code (LOC) — a simple case can require changes to roughly 35 LOC. Although the changes are easily understandable by developers, writing code that complies with ViewHolder is not intuitive.

24 Example of a PR with refactorings on test code: https://github.com/hidroh/materialistic/pull/828, visited on August 17, 2019.
25 Documentation on the Java try-with-resources statement: https://docs.oracle.com/javase/tutorial/essential/exceptions/tryresourceclose.html, visited on August 17, 2019.

Gains in energy efficiency may vary depending on the application and the use cases in which the refactorings occur. Measuring the effective impact on energy consumption is not trivial, as it requires a complicated setup. Previous work has found these refactorings to improve energy efficiency by up to 5% in real use case scenarios (Cruz and Abreu, 2017). Nonetheless, these refactorings are recommended by the official Android documentation26 as best practices for performance.

A visible side effect of the refactorings featured by Leafactor is the questionable maintainability of the introduced code. Although the refactorings are implemented based on the official Android documentation, the resulting code is considerably longer and less intuitive for refactorings such as ViewHolder and Recycle. This is a threat to the adoption of energy-efficient practices in Android applications. Mobile frameworks should feature coding mechanisms that improve energy efficiency without hindering code maintainability.

7 Related work

The energy efficiency of mobile apps has been addressed with many different approaches. Some works opt for simplifying the process of measuring the energy consumption of mobile apps (Zhang et al., 2010; Pathak et al., 2012, 2011; Hao et al., 2013; Di Nucci et al., 2017; Couto et al., 2014). Alternatively, other works study the energy footprint of software design choices and code patterns to prevent developers from writing code with poor energy efficiency (Li et al., 2014; Li and Halfond, 2014, 2015; Linares-Vásquez et al., 2017; Malavolta et al., 2017; Pereira et al., 2017).

Automatic detection of code smells for Android has been studied before. Fixing code smells in Android has shown gains of up to 5% in energy efficiency (Cruz and Abreu, 2017): the code of six real apps was manually refactored, and energy consumption was measured using a hardware-based power monitor. Our work extends this research by providing automatic refactoring for the resulting energy code smells.

The frequency of code smells in Android apps was studied in previous work (Hecht et al., 2015). Code smells were automatically detected in 15 apps using the tool Paprika, developed to perform static analysis on the bytecode of apps. Although Paprika provides developers with valuable feedback on how to fix their code, they need to apply the refactorings manually. Our study differs by focusing on energy-related code smells and by applying automatic refactoring to resolve potential issues. Previous work has also studied the importance of providing a catalog of bad smells that negatively influence the quality of Android applications (Reimann et al., 2014; Reimann and Aßmann, 2013).
Although the authors motivate the importance of using automatic refactoring, their approach lacks an extensive implementation of their catalog. Related work has implemented 15 code smells from the catalog proposed by Reimann and Aßmann (2013) in an automatic refactoring tool, aDoctor (Palomba et al., 2017). In our work, we use this approach to improve the energy efficiency of Android applications.

26 ViewHolder is documented at https://developer.android.com/training/improving-layouts/smooth-scrolling, visited on August 17, 2019.

Another work has focused exclusively on design patterns to improve the energy efficiency of iOS and Android mobile applications (Cruz and Abreu, 2019). However, no efforts were made regarding the automatic refactoring of the cataloged energy patterns. In our work, we implement automatic refactoring for five energy patterns. In addition, we validate our refactorings by applying Leafactor to a large dataset of real Android apps. Moreover, we assess how automatic refactoring tools for energy can positively impact the Android FOSS community.

Other works have detected energy-related code smells by analyzing source code as TGraphs (Gottschalk et al., 2012; Ebert et al., 2008). Eight different code smell detectors were implemented and validated with a navigation app. Fixing the code with automatic refactoring was discussed but not implemented. Besides, although the studied code smells are likely to have an impact on energy consumption, no evidence was presented. Previous work has used the event flow graph of the app to optimize resource usage (e.g., GPS, Bluetooth) (Banerjee and Roychoudhury, 2016). Results show significant gains in energy efficiency; nevertheless, although this process provides details on how to fix the code, it is not fully automated yet. Other works have studied and applied automatic refactorings in Android applications (Sahin et al., 2014, 2016); however, those refactorings were not mobile-specific.

Besides refactoring source code, other works have focused on studying the impact of UI design decisions on energy consumption (Linares-Vásquez et al., 2017). Agolli et al. (2017) proposed a methodology that suggests changes in the UI colors of apps: the new UI colors, despite being different, are almost imperceptible to users and lead to savings in the energy consumption of mobile phones' displays. In our work, we strictly focus on changes that do not alter the appearance of the app.

8 Conclusion

Our work presents Leafactor, an automatic refactoring tool to improve the energy efficiency of Android application codebases. In an empirical study with 140 FOSS Android apps, we show the potential of using automatic refactoring tools to improve the energy efficiency of mobile applications. We fixed 222 energy-related code smells, improving the energy footprint of 45 Android applications. The results show that automatic refactoring can help developers improve the energy efficiency of a considerable number of FOSS Android applications.
As future work, we plan to study and support more energy efficiency refactorings. In particular, some of the energy patterns studied in previous work (Cruz and Abreu, 2019; Reimann et al., 2014; Reimann and Aßmann, 2013) could help increase the usefulness of Leafactor. Besides, it would be interesting to explore the detection of energy-related smells using dynamic analysis. Moreover, it would be interesting to integrate automatic refactoring in a continuous integration context. The integration would require two distinct steps: one for the detection and another for the code refactoring, which would only be applied upon a granting action by a developer. One could also use this idea for educational purposes: a detailed explanation of the code transformation, along with its impact on energy efficiency, could be provided whenever a developer pushes new changes to the repository.

Acknowledgements

This work is financed by the ERDF – European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation – COMPETE 2020 Programme, and by national funds through the Portuguese funding agency, FCT – Fundação para a Ciência e a Tecnologia, within project POCI-01-0145-FEDER-016718. Luis Cruz is sponsored by an FCT scholarship, grant number PD/BD/52237/2013.

References

Agolli, T., Pollock, L., and Clause, J. (2017). Investigating decreasing energy usage in mobile apps via indistinguishable color changes. In Proceedings of the 4th International Conference on Mobile Software Engineering and Systems, pages 30–34. IEEE Press.
Banerjee, A. and Roychoudhury, A. (2016). Automated refactoring of Android apps to enhance energy-efficiency. In Proceedings of the International Workshop on Mobile Software Engineering and Systems, pages 139–150. ACM.
Couto, M., Carção, T., Cunha, J., Fernandes, J. P., and Saraiva, J. (2014). Detecting anomalous energy consumption in Android applications. In Brazilian Symposium on Programming Languages, pages 77–91. Springer.
Cruz, L. and Abreu, R. (2017). Performance-based guidelines for energy efficient mobile applications. In Proceedings of the 4th International Conference on Mobile Software Engineering and Systems, pages 46–57. IEEE Press.
Cruz, L. and Abreu, R. (2018). Using automatic refactoring to improve energy efficiency of Android apps. In CIbSE XXI Ibero-American Conference on Software Engineering.
Cruz, L. and Abreu, R. (2019). Catalog of energy patterns for mobile applications. Empirical Software Engineering.
Cruz, L., Abreu, R., and Rouvignac, J.-N. (2017). Leafactor: Improving energy efficiency of Android apps via automatic refactoring. In Proceedings of the 4th International Conference on Mobile Software Engineering and Systems, MOBILESoft '17, pages 205–206. IEEE Press.
Di Nucci, D., Palomba, F., Prota, A., Panichella, A., Zaidman, A., and De Lucia, A. (2017). PETrA: A software-based tool for estimating the energy profile of Android applications. In Proceedings of the 39th International Conference on Software Engineering Companion, pages 3–6. IEEE Press.
Ebert, J., Riediger, V., and Winter, A. (2008). Graph technology in reverse engineering – the TGraph approach. In Proc. 10th Workshop Software Reengineering. GI Lecture Notes in Informatics. CiteSeer.
Etzkorn, L., Davis, C., and Li, W. (1998). A practical look at the lack of cohesion in methods metric. In Journal of Object-Oriented Programming. CiteSeer.
Gottschalk, M., Josefiok, M., Jelschen, J., and Winter, A. (2012). Removing energy code smells with reengineering services.
GI-Jahrestagung, 208:441–455.
Hao, S., Li, D., Halfond, W. G., and Govindan, R. (2013). Estimating mobile application energy consumption using program analysis. In Software Engineering (ICSE), 2013 35th International Conference on, pages 92–101. IEEE.
Hecht, G., Rouvoy, R., Moha, N., and Duchien, L. (2015). Detecting antipatterns in Android apps. In Proceedings of the Second ACM International Conference on Mobile Software Engineering and Systems, pages 148–149. IEEE Press.
Li, D. and Halfond, W. G. (2014). An investigation into energy-saving programming practices for Android smartphone app development. In Proceedings of the 3rd International Workshop on Green and Sustainable Software, pages 46–53. ACM.
Li, D. and Halfond, W. G. (2015). Optimizing energy of HTTP requests in Android applications. In Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile, pages 25–28. ACM.
Li, D., Hao, S., Gui, J., and Halfond, W. G. (2014). An empirical study of the energy consumption of Android applications. In Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on, pages 121–130. IEEE.
Linares-Vásquez, M., Bavota, G., Bernal-Cárdenas, C., Oliveto, R., Di Penta, M., and Poshyvanyk, D. (2014). Mining energy-greedy API usage patterns in Android apps: An empirical study. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 2–11. ACM.
Linares-Vásquez, M., Bernal-Cárdenas, C., Bavota, G., Oliveto, R., Di Penta, M., and Poshyvanyk, D. (2017). GEMMA: Multi-objective optimization of energy consumption of GUIs in Android apps. In Proceedings of the 39th International Conference on Software Engineering Companion, pages 11–14. IEEE Press.
Malavolta, I., Procaccianti, G., Noorland, P., and Vukmirović, P. (2017). Assessing the impact of service workers on the energy efficiency of progressive web apps. In Proceedings of the 4th International Conference on Mobile Software Engineering and Systems, pages 35–45. IEEE Press.
Palomba, F., Di Nucci, D., Panichella, A., Zaidman, A., and De Lucia, A. (2017). Lightweight detection of Android-specific code smells: The aDoctor project. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 487–491. IEEE.
Pang, C., Hindle, A., Adams, B., and Hassan, A. E. (2015). What do programmers know about the energy consumption of software? PeerJ PrePrints, 3:e886v1.
Pathak, A., Hu, Y. C., and Zhang, M. (2012). Where is the energy spent inside my app? Fine grained energy accounting on smartphones with eprof. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 29–42. ACM.
Pathak, A., Hu, Y. C., Zhang, M., Bahl, P., and Wang, Y.-M. (2011). Fine-grained power modeling for smartphones using system call tracing. In Proceedings of the Sixth Conference on Computer Systems, pages 153–168. ACM.
Pereira, R., Carção, T., Couto, M., Cunha, J., Fernandes, J. P., and Saraiva, J. (2017). Helping programmers improve the energy efficiency of source code. In Proceedings of the 39th International Conference on Software Engineering Companion, pages 238–240. IEEE Press.
Reimann, J. and Aßmann, U. (2013). Quality-aware refactoring for early detection and resolution of energy deficiencies. In Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing, pages 321–326. IEEE Computer Society.
Reimann, J., Brylski, M., and Aßmann, U. (2014). A tool-supported quality smell catalogue for Android developers. In Proc.
of the Conference Modellierung 2014, in the Workshop Modellbasierte und Modellgetriebene Softwaremodernisierung – MMSM, volume 2014.
Sahin, C., Pollock, L., and Clause, J. (2014). How do code refactorings affect energy usage? In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, page 36. ACM.
Sahin, C., Pollock, L., and Clause, J. (2016). From benchmarks to real apps: Exploring the energy impacts of performance-directed changes. Journal of Systems and Software, 117:307–316.
Wilke, C., Richly, S., Götz, S., Piechnick, C., and Aßmann, U. (2013). Energy consumption and efficiency in mobile applications: A user feedback study. In Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing, pages 134–141. IEEE.
Zhang, L., Tiwana, B., Qian, Z., Wang, Z., Dick, R. P., Mao, Z. M., and Yang, L. (2010). Accurate online power estimation and automatic battery behavior based power model generation for smartphones. In Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 105–114. ACM.

Journal of Software Engineering Research and Development, 2022, 10:7, doi: 10.5753/jserd.2021.1978. This work is licensed under a Creative Commons Attribution 4.0 International License.

Using evidence from systematic studies to guide a PhD research in requirements engineering: an experience report

Taciana Novo Kudo [Universidade Federal de Goiás | taciana@ufg.br]
Renato F. Bulcão-Neto [Universidade Federal de Goiás | rbulcao@ufg.br]
Auri Marcelo Rizzo Vincenzi [Universidade Federal de São Carlos | auri@ufscar.br]
Érica Ferreira de Souza [Universidade Tecnológica Federal do Paraná | ericasouza@utfpr.edu.br]
Katia Romero Felizardo [Universidade Tecnológica Federal do Paraná | katiascannavino@utfpr.edu.br]

Abstract

Conducting systematic studies during a postgraduate program — such as a systematic review, a systematic mapping, or a tertiary review — can benefit the project's success. They provide an overview of the literature considering the currently available research findings, establish baselines for other research activities, and support decisions made throughout the research project. However, there is a shortage of research reporting experiences with systematic studies in support of academic projects. This paper's main contribution is to report our experience of how the evidence found in tertiary and secondary studies positively influenced a PhD project's decisions. Initially, a tertiary study was conducted, followed by a systematic mapping. The evidence returned by the tertiary study led to the definition of the PhD research proposal in the requirements engineering field. Moreover, the systematic mapping contributed to the definition of the PhD research problem. From this experience in undertaking systematic studies to support a PhD project, the paper also presents lessons learned and recommendations to guide PhD students' decisions.
Keywords: evidence-based software engineering, graduate education, tertiary study, secondary study

1 Introduction

A systematic study1 aims to identify, select, evaluate, interpret, and summarize the available studies considered relevant to a topic or phenomenon of interest. The individual studies that contribute to a systematic study (a systematic literature review – SLR – or a systematic mapping – SM) are primary studies, while the systematic study itself is considered secondary. Historically, systematic studies, especially SLRs, have been employed in the medical area and are recognised as critical components in support of evidence-based medicine (Clarke and Chalmers, 2018). Inspired by this success in the medical field, evidence-based software engineering (EBSE) was proposed to advance and improve the discipline of software engineering (SE) (Kitchenham et al., 2015). Currently, a large community is formed around EBSE, composed of researchers who have conducted systematic studies in SE.

1 Throughout this work, the term "systematic study" encompasses the systematic literature review (SLR), the systematic mapping (SM), as a more open form of SLR, and tertiary studies, as SLRs of SLRs. Details on the functional similarities and differences between SLR and SM are found elsewhere (Napoleão et al., 2017).

Informal literature reviews are relevant for research initiatives, especially when they follow good practices. However, they lack scientific rigour and are subject to investigation bias. Reviews based on a rigorous process ensure auditable, reproducible, and unbiased results for all stakeholders. One of the reasons systematic studies have been preferred over informal reviews in SE is their advantages, including the reduction of bias in the results and the possibility of identifying and combining the main differences between the data from the various studies selected in the review (Egger et al., 1997). Another advantage is identifying gaps in current research, which may suggest new research themes and provide a suitable way to position these themes in the context of existing research. Other benefits include (Kitchenham and Brereton, 2013):

• a well-planned systematic study avoids bias in the analysis of primary studies;
• a systematic study allows researchers to answer research questions that cannot be answered based on a single primary study;
• a systematic study can help researchers test theoretical hypotheses that otherwise could not be tested based on primary studies; and
• the results of a systematic study can be used to understand the efficacy and efficiency of a method or technology; alternatively, they can point out the strengths and weaknesses of methods and technologies under certain circumstances.

In that context, Felizardo et al. (2020) affirm that systematic studies are valuable to graduate students. Regarding the main benefits of conducting systematic studies during a PhD research project, the most significant ones are providing an overview of the literature, finding research opportunities, learning from studies, and providing baselines to assist new research efforts. In particular, SMs can significantly benefit researchers in establishing baselines for further research activities, such as choosing a dissertation topic for a PhD degree considering research trends that can be tracked over time (research gaps) (Souza et al., 2015). Another advantage is using the reviews' findings to support decisions made in the research project.
One expects PhD students to produce a compelling literature review. This review is a critical doctoral component, since it allows students to thoroughly understand the topic they will work on and to become familiar with the results obtained by other researchers. Therefore, secondary studies are the proper methodology for writing a compelling literature review. Moreover, during the conduction of the review, students are trained in searching for and selecting relevant literature, assessing the quality of the selected literature, and summarising/presenting the achievements. These are skills that every PhD candidate must acquire during the doctorate.

There are numerous motivations for conducting a secondary study, such as those reported in Felizardo et al. (2020):

• systematic studies' results may identify suitable areas for future research – i.e., the original topic of investigation and the research questions to be answered during a PhD project – aiming at advancing the state of the art in the research topic;
• those studies can replace traditional narrative literature reviews, providing the currently available research findings;
• the results of the primary studies selected in a systematic study can be used as a baseline for comparison with ongoing, recent research results;
• the findings of systematic studies guide PhD research efforts, e.g., researchers could consider the systematic studies' findings for choosing appropriate research methods; and
• the systematic study may be published, externalising the acquired knowledge and contributing to the EBSE field.

Because of these advantages, several SE researchers advocate that PhD students use systematic studies (Clear, 2015; Pejcinovic, 2015; Kuhrmann, 2017; Kaijanaho, 2017). For example, Souza et al. (2015) describe a case of such a successful application of secondary studies to guide the decisions of a doctorate.

This article reflects upon our experience using systematic studies in the development of a PhD project. Therefore, this study aims to present how systematic studies' findings impact an academic project. Specifically, the main goals of this research are to:

• present a successful case in which systematic studies were of great importance in the conduction of a PhD project;
• exemplify how the best available evidence provided by systematic studies can ground a project's decisions;
• reinforce the importance of systematic studies in conducting a research project;
• report our experiences conducting secondary and tertiary studies as part of a PhD research project (Kudo, 2021); and
• inspire graduate students with our lessons learned and recommendations for undertaking systematic studies in their research projects.

In summary, one tertiary review and one secondary study were conducted to support a PhD project's decisions in the requirements engineering (RE) domain. Our main conclusion is that systematic studies have many advantages and, therefore, graduate students should consider doing at least one review during the doctorate.

The remainder of the paper is organized as follows. Section 2 introduces the software requirements patterns theme.
Section 3 presents a PhD research project, showing how systematic studies' results guided its conduction. Sections 4 and 5 discuss the lessons learned and the threats to this work's validity, respectively. Section 6 addresses related work, focusing on the use of systematic studies to guide a PhD research. Finally, Section 7 presents our concluding remarks.

2 Software requirement pattern

Incorrect, omitted, misinterpreted, or conflicting requirements usually result from poorly executed RE activities (Franch, 2015). As a result, software projects in such a scenario often struggle with software that does not meet quality requirements, with cost and time overruns, and with unsatisfied users. Requirements reuse is a practical approach to mitigate those issues (Irshad et al., 2018): the core idea is to reuse the knowledge acquired in previous projects to make RE activities more prescriptive and systematic.

A widely discussed reuse approach is the software requirement pattern (SRP) abstraction, which aggregates behaviours and services observed in multiple similar applications (Withall, 2007). Usually, an SRP guides requirements elicitation and specification through well-defined templates that can be reused in later specifications (Costal et al., 2019). For instance, one can create an SRP representing a user authentication feature, commonly found in several applications, and make appropriate adaptations if necessary. An SRP's anatomy defines its structure and content, not the requirements that might result from it. However, to be helpful as a guide to writing software requirements, the SRP needs to consider the situations likely to be encountered in the type of requirement built upon it. Thus, an SRP is more substantial than a requirement, and its specification is quite a demanding task (Withall, 2007).

There are SRP proposals for multiple sorts of systems, such as embedded (Konrad and Cheng, 2002), cloud computing (Beckers et al., 2014), and call-for-tender (Costal et al., 2019) systems. These studies demonstrate that SRP can promote greater efficiency in requirements elicitation, improved quality and consistency in the requirements specification, gains in the development team's productivity, and better support for requirements management.

3 From systematic studies to a PhD research project

This section's goal is threefold: first, it introduces the research method types that helped ground the doctoral project; second, it describes two systematic studies performed from planning to the analysis of results; and third, it demonstrates how these studies' results contributed to the definition of the PhD research proposal (Kudo, 2021).

Figure 1. An overview of our experience with PhD research decisions based on evidence provided by systematic studies.
the protocol aims to reduce likely bias and ensure researchers can reproduce the review, adopting the same criteria and procedures. according to the protocol, primary studies are retrieved, selected, and evaluated during the conduction phase. then, in the reporting phase, studies that meet the review purpose are summarised, together with data extraction and synthesis that can be descriptive, complemented with a quantitative summary obtained through a statistical calculation. sms and tertiary studies are other types of reviews that complement slrs. sm is a more open form of slr, providing an overview of a research area to assess the quantity of evidence existing on a topic of interest (petersen et al., 2015). a tertiary study is considered a review that focuses only on secondary studies (slr/sm). the conduction of a tertiary study is proper in domains where some high-quality slrs or sms exist. the process used to conduct a tertiary study is the same as slrs’ (kitchenham et al., 2015). as depicted in figure 1, we conducted a pilot search for slrs/sms on the srp topic performed by third parties. we then conducted a tertiary review on the state of the art and practice in srp (kudo et al., 2020a) as we found some highquality secondary studies on the same topic. in the tertiary review, we mapped the main topics covered and research gaps on srp (the tertiary study’s main contribution in figure 1). we elaborated on a seven-item research agenda with lines of investigation (details in the next section) to approximate academics’ and professionals’ interests regarding improving requirements quality through srp to lessen these gaps. remarkably, we noticed that secondary studies reported srp only in the re phase (item 1 in figure 1). as software requirements influence the remaining phases of the development process, we have identified a potential research gap about the benefits of using srp in other development phases besides re. this finding motivated us to conduct an sm to identify primary studies reporting the use of srp in software design, construction, testing, and maintenance. the sm results pointed out eight primary studies in srp applied to design, one to construction, one to testing, and none to maintenance. these results revealed a research problem to investigate: the lack of evidence on the srp benefits for other development phases (the sm’s main contribution in figure 1). as re activities significantly impact other development phases, such as testing, we contributed to a novel approach to aligning re and testing in which reuse through srp and software test patterns (stp) are core elements (phd research proposal in figure 1). an stp is an abstraction for generic testing solutions to recurrent behaviours from different scenarios. unfortunately, recent literature reports that most companies still face adverse effects (cost, rework, and delay) from a weak alignment between requirements and testing (bjarnason and borg, 2017; ebert and ray, 2021). further details about how the findings of the tertiary review and the sm drove our efforts throughout the doctoral research are presented next. 3.2 tertiary study on requirement patterns recognised the importance of systematic studies for powerfully grounding a phd research proposal, a question arose: are there already systematic literature studies on srp? to respond to this question, a tertiary study was performed, as described follows. the tertiary review employed the methodology used in classic tertiary studies in se (kitchenham et al., 2010). 
Besides, it took advantage of the StArt tool's support (Fabbri et al., 2016) throughout the study protocol, from planning to reporting. The tertiary review protocol included three general research questions (RQ) defined in the planning phase:

RQ1 – What is the state of the art in requirement patterns?
RQ2 – What are the most searched topics on requirement patterns?
RQ3 – What are the current gaps in requirement patterns research?

Activities performed in this tertiary review include automatic search, elimination of duplicates, selection of secondary studies on SRP, snowballing (Wohlin, 2014), quality assessment (Zhou et al., 2015), and data extraction and synthesis. The following is the final search string used in the automatic search activity:

("requirement pattern" OR "requirement template") AND (survey OR "systematic review" OR "systematic literature review" OR "systematic mapping" OR "systematic literature mapping")

This process identified 40 secondary studies, distributed as follows: ACM DL (4), Engineering Village (13), IEEE Xplore (2), Science Direct (11), and Scopus (10). After excluding duplicate papers and applying the selection criteria, four secondary studies remained. Concerning the snowballing technique, we examined the bibliographic references of each of these four papers to identify further relevant studies; however, we found no relevant paper. Next, we assessed the quality of each secondary study using four criteria: the description level of the inclusion and exclusion criteria, search coverage, the quality evaluation of primary studies, and the description level of primary studies. As no paper was removed after data extraction, four secondary studies on SRP (Irshad et al., 2018; Palomares et al., 2017; da Silva and Benitti, 2011; Justo et al., 2018) contributed to formulating answers to the review's research questions. The conclusions are as follows:

• The number and publication dates of secondary studies on SRP (representing 44 non-duplicate primary studies) confirm that SRP is not a stagnant research topic, with contributions throughout the decade – (RQ1).
• The most searched topics regarding SRP are representation format, availability, scope, and purpose – (RQ2).
• The research gaps found include professionals' unfamiliarity with SRP, few validations in industry, the need for metrics and tools to enable the effective use of SRP in industry, and the lack of secondary studies on how SRP benefits the software life cycle – (RQ3).

The analysis of those four secondary studies resulted in a research agenda to cover the gaps found between the states of art and practice in SRP. A research agenda is a formal plan of action that summarises specific activities to guide the PhD conduction and the time to execute them. As depicted in Figure 1, the tertiary review's main contribution is a research agenda composed of the following items:

1. The demonstration of the benefits of SRP in other phases of the software development process in industry software projects – none of the analysed secondary studies explicitly identified this gap;
2. Traceability mechanisms between requirements represented as patterns and the artefacts produced in other development phases – another research topic not reported in any of the analysed secondary studies, complementary to item 1;
3. The joint use of SRP and existing, well-established methodologies in the software industry, such as agile approaches;
4. The development of tools that effectively support professionals' practices in the use of SRP;
5. The dissemination of current and future SRP catalogues in a systematised manner;
6. The definition of objective metrics to help professionals measure the impact of the use of SRP as described in items 1 to 3;
7. The collection of evidence of the effective use of SRP, particularly in the RE process of industry software projects.

3.3 SM on requirement patterns and the software life cycle

According to Brereton et al. (2009), summarising the results of primary studies through secondary studies is a valuable research mechanism for providing knowledge of a given topic and supporting the identification of topics for future research. Therefore, influenced by items 1 and 2 of the tertiary review's research agenda (see Figure 1), an SM was planned and conducted (Kudo et al., 2019a,b) to investigate SRP usage in other phases of the software development life cycle (SDLC) and the traceability between SRP and the specifications produced in these phases. Based on this goal, the SM included three research questions:

RQ1 – At what SDLC phases are SRP used: design, construction, testing, and/or maintenance?
RQ2 – Is there evidence of SRP usage in practice at those SDLC phases?
RQ3 – Are there reported benefits of using requirement patterns at those phases? If so, what metrics are used to measure these benefits?

A trade-off analysis between the coverage and the relevance of the results of a pilot search preceded the definition of the final search string, presented next:

("requirement pattern" OR "requirement patterns" OR "requirements pattern" OR "requirements patterns") AND ("software development" OR "development process" OR "life cycle" OR design OR construction OR coding OR implementation OR test OR integration OR maintenance)

Activities performed in this SM include automatic search, elimination of duplicates, application of the selection criteria, snowballing, quality assessment, and data extraction and synthesis. Target studies in this SM are, thus, primary studies on SRP not employed in RE.
• there is only one primary study that demonstrates, through metrics and experimentation, that srp integrated with software design artefacts implies significant development time savings; the corresponding metrics are drr (degree of requirement realisation) and dpr (degree of pattern realisation) – (rq3). figure 2. mapping of the types of requirements and validation on srps for softwaredesign,construction, testing,andmaintenance(kudoetal.,2019b). then, we drew two conclusions from the analysis of these data: 1. there is an open field for research on srp adoption in other sdlc stages (only 10); in contrast, most research efforts still focused on re (76 primary studies). 2. there was little empirical evidence of the benefits of srp beyond re as we found only one case study and one experiment report. those sm results contributed to the phd research problem definition, as shown in figure 1: the lack of research on srp in other sdlc stages. 3.4 project’s decisions based on the best available evidence this section recaps the evidence found in our secondary studies that guided a phd research in re. moreover, it associates each primary study developed by that graduate student with the pieces of evidence resulting from the tertiary review and the systematic mapping. additional information on each thesis product is available in kudo (2021) and kudo et al. (2019c, 2020b,c, 2022). items 1 and 2 of the research agenda (kudo et al., 2020a) inspired the conduction of the sm as none of the target studies described the benefits of srps in other sdlc phases, nor how to trace such support upon the development process. the evidence found in the sm (kudo et al., 2019a,b), in turn, led to the definition of the phd research problem: the lack of research on srp beyond re, motivated by the potential benefits that srp can bring to the sdlc (e.g., better quality specifications, reduced development time, and improved team productivity). moreover, the tertiary review’s research agenda items combined with the strong influence of re activities on software testing contributed to the definition of the phd research proposal (see figure 1): the alignment of re and testing phases through patterns, i.e., srp and stp. except for item 7, every research agenda item guided this phd work experience, as illustrated in figure 1. following item 2, the phd proposal endeavoured a novel srp approach, called software pattern metamodel (sopamm) (kudo et al., 2019c, 2020b). sopamm is a metamodel that represents, relates, and classifies software patterns in general and srp and stp in particular. influenced by item 3, sopamm borrows concepts and practices from the behaviour-driven development (bdd) agile methodology (chelimsky et al., 2010). in sopamm, functional requirement patterns (frp) are described as user stories associated with behaviours and test data using the gerkhin language. frp’s behaviours, in turn, are linked to acceptance test patterns (atp) through test cases. inspired by item 4, the terminal model editor (tmed) tool was developed to help with the elaboration of sopammbased pattern catalogues. a catalogue is a means of systematically gathering patterns, usually addressing the most common problems for a particular application domain. what differentiates tmed from related tools (palomares et al., 2011; barcelos and penteado, 2017) is that it handles other software patterns instead of srp only. with the support of the tmed tool, four pattern catalogues with srp and stp aligned (the research agenda’s item 5) were developed. 
One supports the certification of electronic health record systems (Kudo et al., 2019c, 2020b; Martins et al., 2021), another represents behaviour-driven requirements of Internet of Things (IoT) systems, and two catalogues describe common functionalities and behaviours for user authentication and registration.

Finally, as the quality of the SoPaMM metamodel may impact the quality of pattern catalogues, which in turn may influence the quality of software specifications, the Metamodel Quality Requirements and Evaluation (MQuaRE) framework was devised (Kudo et al., 2020c). MQuaRE comprises metamodel quality requirements and metrics, a metamodel quality model, and an integrated evaluation process. Using MQuaRE, SoPaMM's levels of compliance, conceptual suitability, usability, maintainability, and portability were recently evaluated in a controlled experiment (Kudo et al., 2022). Thus, MQuaRE is the first effort toward addressing the research agenda's item 6, providing objective metrics to evaluate metamodel characteristics that may affect the quality of the software artefacts that rely upon it. Finally, the research agenda's item 7 is future work of the reported PhD thesis: it demands empirical work in collaboration with the software industry, a later effort of our research group.

4 Discussion

This section presents our lessons learned from undertaking systematic studies in an academic context. We believe these lessons can help PhD candidates perform systematic studies in their own research.

1. Choose the correct systematic study type – PhD students can conduct three types of reviews: an SLR, an SM, or a tertiary review. In the example of this paper, the PhD student conducted two different systematic literature studies, one tertiary and one secondary (an SM). The choice of systematic study must consider, for example, the amount of evidence available. An SM may be more appropriate than an SLR in domains with very little evidence related to a research topic, or where the topic is vast. On the other hand, in domains where several SLRs already exist, it may be possible to conduct a tertiary review (an SLR of SLRs) to answer broader research questions. SMs may also be helpful to PhD students who need to draw an overview of the existing evidence concerning a research topic; despite that, it is essential to consider that mapping study results may be more limited than an SLR's. An SLR would be inappropriate if the research question is too vague or broad, but also if the question is too narrow: the first case would yield hundreds of studies, and the second would yield too few studies to be helpful. The conduction of a tertiary review is potentially less resource-intensive than conducting a new SLR; however, it depends on sufficient high-quality SLRs being available. In our experience, the quality of the existing systematic studies on SRP geared us towards a tertiary review on that topic. Moreover, the tertiary review's results were determinant for the conduction of an SM.
2. use systematic studies conducted by third parties, when appropriate – phd students should consider three critical points when deciding to use systematic studies conducted by third parties:

• whether the published slr meets the phd's goal, i.e., the slr research questions are related to the subject the student wants to investigate;
• whether the published slr uses valid methods and was well conducted, to ensure its credibility; and
• whether the slr is up to date, to avoid an outdated understanding of the state of the art. in this context, mendes et al. (2020) recommend using a third-party decision framework to decide whether slrs in se need updating.

we have recently noticed an increasing number of slrs published in the se area. however, occasionally we see independent research teams conducting slrs on the same topic, leading to duplication of reviews and potentially wasted effort. therefore, before undertaking a systematic study, phd students should ensure that a review is necessary, i.e., they should identify and review any existing study related to their research focus. in addition, when phd students decide to conduct a systematic investigation, they must be aware that the findings may be helpful for future students. moreover, conducting an slr that does not benefit only a specific research project can yield further benefits: avoiding duplicate work by other students, increasing confidence in findings, and catalysing new collaborations among students and other researchers. in our experience, we conducted a novel sm on the srp topic as the existing secondary studies focused on srp solely applied to the re phase. research collaborations have arisen from the findings reported in this phd experience (martins et al., 2021; kudo et al., 2022).

3. the need for a previous pilot search step – a pilot search is a reasonable first step before conducting systematic studies on the same or a closely related target topic. a pilot search may reveal high-quality systematic studies on the topic of interest, motivating the conduction of a tertiary review (as we did) rather than a new secondary study.

4. experience reduces effort – establishing the first review protocol was a complex task and consumed considerable time and effort. however, it was essential for assuring the quality of the tertiary review. the knowledge acquired from the first review facilitated the elaboration of the sm protocol, since procedures and forms were reused and adapted. moreover, we could establish the quality level of candidate studies more quickly by comparing them to studies previously read. access to information was also faster, as we already knew its organisation (i.e., the paper structure in the srp context). finally, an experienced researcher familiar with the review subject must compose the review team. in our experience, she supported the definition of keywords and synonyms for the search string's main terms and the synthesis of results, among other important decisions.

5. attention to open research issues in secondary studies – when conducting a tertiary review, identify the open research issues described in each secondary study. under the assessment of an experienced researcher, e.g., the phd advisor, these open issues may result in candidate research gaps. in this phd experience, we found three open issues from the analysis of the secondary studies: the lack of
professionals' knowledge about srp, the low number of evaluation research on srp, and the need for tools for the effective use of srp in the industry. these were essential to derive the seven lines of action of the tertiary review's research agenda.

6. sm results may identify suitable areas for future research – sm results are usually synthesised in a bubble chart, as depicted in figure 2. when synthesising the findings of an sm, a phd student should choose and group three relevant pieces of information, assign the most important one according to the study's objective to the ordinate axis, and distribute the remaining two to the positive and negative abscissa axes. in our sm, sdlc phases, software requirement types, and research validation types are the ordinate axis and the negative and positive abscissa axes, respectively. then, the phd student should count the number of primary studies addressing two information axes simultaneously — for instance, crossing information from the ordinate and negative abscissa axes. the smaller the number of primary studies crossing two axes, the smaller the bubble size (a rough sketch of this counting-and-plotting step is given right after this lesson). in our experience, we identified suitable areas for future research in figure 2: little research on srp in construction (one study), testing (one), and maintenance (none), the predominance of studies on non-functional requirement patterns (8 of 10), and the need for more mature research on srp in the sdlc (1 of 10).
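as an illustration of this step — a hedged sketch with hand-encoded tuples following the narrative above, not the study's raw data or tooling — the bubble sizes can be derived from simple pair counts, e.g., with matplotlib:

```python
import matplotlib.pyplot as plt
from collections import Counter

# illustrative (sdlc phase, requirement type, validation type) tuples
studies = [
    ("design", "adaptability", "no validation"), ("design", "adaptability", "no validation"),
    ("design", "adaptability", "no validation"), ("design", "security", "proof of concept"),
    ("design", "security", "proof of concept"), ("design", "security", "proof of concept"),
    ("design", "security", "case study"), ("design", "general", "experiment"),
    ("construction", "general", "no validation"), ("testing", "adaptability", "no validation"),
]

phases = sorted({p for p, _, _ in studies})      # ordinate axis
left = Counter((r, p) for p, r, _ in studies)    # requirement type x phase
right = Counter((v, p) for p, _, v in studies)   # validation type x phase

fig, axes = plt.subplots(1, 2, sharey=True, figsize=(9, 3))
for ax, counts, label in ((axes[0], left, "requirement type"),
                          (axes[1], right, "validation type")):
    xs = sorted({x for x, _ in counts})
    for (x, p), n in counts.items():
        # bubble area proportional to the number of crossing studies
        ax.scatter(xs.index(x), phases.index(p), s=400 * n, alpha=0.4)
        ax.annotate(str(n), (xs.index(x), phases.index(p)), ha="center", va="center")
    ax.set_xticks(range(len(xs)))
    ax.set_xticklabels(xs, rotation=30)
    ax.set_xlabel(label)
axes[0].set_yticks(range(len(phases)))
axes[0].set_yticklabels(phases)
plt.tight_layout()
plt.show()
```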
5 threats to validity

finding all relevant papers on a particular theme is challenging. for this reason, both systematic studies included a previous pilot search under the supervision of an srp expert, and a standard vocabulary for se helped the search string definition process. furthermore, both search strategies comprise automatic search – in at least five relevant sources for se – and the snowballing technique. we also assessed the quality of target papers to reduce a likely bias in the analysis and synthesis steps. the tertiary review protocol includes widely accepted quality criteria (centre for reviews and dissemination, 2002; cruzes and dybå, 2011), and the sm protocol describes nine criteria regarding general (jamshidi et al., 2013) and specific aspects of primary studies on srp. thus, both the quality criteria and the scores for each study analysed allowed the value of individual studies to be better weighed when synthesising results, improving the evidence's reliability. besides, three researchers conceived the protocols of both systematic studies:

• researcher a has expertise in re and conducted the identification, selection, quality assessment, extraction, and synthesis of relevant secondary and primary studies;
• researcher b is an expert in se and, to mitigate the possibility of biases throughout the process, verified the results of all phases; and
• researcher c is the team leader, with vast experience in se, consulted in the case of divergences not solved between researchers a and b.

finally, we summarise our recommendations for common threats that phd students can face during the planning and conduction of a systematic study. these general recommendations include:

• a previous pilot search for systematic studies on the graduate student's topic of interest;
• the aid of both an expert and a standard glossary in the search string definition process;
• a hybrid search strategy to expand the search coverage; and
• the quality assessment of target studies and a well-coordinated team, both to mitigate research bias.

6 related work

felizardo et al. (2020) highlight the relevance of using secondary studies as a research methodology for conducting se research projects. this study aimed to explore se researchers' perceptions, mainly those of msc and phd students and their supervisors, about the value of secondary studies and how these perceptions impact decisions on conducting their research. the authors applied two empirical research methods. first, they performed an sm to identify primary studies that used secondary ones as a research methodology for conducting se research projects. second, the authors surveyed se researchers to determine their perception of the value of performing secondary studies to support their research projects. in summary, felizardo et al. (2020) showed the main benefits of using secondary studies as a research methodology: identifying relevant research, finding reasons to explain why a research project should be approved, and supporting decisions made. the study reflected upon the value of secondary studies in developing academic projects. in agreement with other authors (dybå et al., 2005; kitchenham et al., 2011; zhang and babar, 2011), felizardo et al. (2020) highlight that a systematic secondary study is a valuable research mechanism for providing knowledge of a given topic and identifying gaps for future research. however, what is not yet clear is how this knowledge helps to conduct msc/phd research projects.

one of the categories investigated in the sm shown by felizardo et al. (2020) was the application of secondary studies. this classification summarises how the findings of such analyses can guide efforts in research projects. to the best of the authors' knowledge, only souza et al. (2015)'s work fits this category. souza et al. (2015) show how the findings of an sm drove their research efforts in conducting a project on knowledge management (km) in software testing. among the sm results, the following stand out: (i) the central problem in software organisations related to software testing is the low knowledge reuse rate and barriers to knowledge transfer; and (ii) reuse of test cases is the perspective that has received the most attention. from the sm results, the authors decided to conduct two slrs, developed a testing ontology, and performed a survey to define a scenario in which to apply km in software testing. the survey aimed to identify the testing activities in which km is more valuable or appropriate for reuse. from the survey results, the most suitable scenario in the software testing domain was established for applying km. finally, considering the survey results and the ontology, a km system was developed to manage testing knowledge repositories, supporting, e.g., test case reuse.

comparatively, our work followed a similar approach: the results of the secondary studies served as a basis for follow-on research activities. before accomplishing a secondary study, both we and souza et al. (2015) performed a tertiary review looking for secondary studies investigating the same topic. likewise, based on the results of the tertiary review, an sm was planned and conducted in both studies, directing the project's decisions or defining other empirical approaches later used.

7 conclusion

especially for phd research projects, originality is mandatory. moreover, once students research the advanced state of the art, it is essential to do it correctly. this work reports our experience conducting a phd research project guided by systematic studies.
we also highlight our lessons learned and recommendations that other researchers can use to guide their doctoral process. we explained the criteria phd candidates should apply to choose the correct type of systematic study and to reuse systematic studies conducted by third parties. we also showed that a previous pilot search is desirable before conducting a secondary study on any topic. in addition, the experience acquired in performing systematic studies reduces the effort in similar works. moreover, a deep analysis of the open research issues found in secondary studies may be valuable to delimit gaps that can gear other investigations on the same topic, e.g., including a new secondary study with a more profound view of that theme. we also explained how the results of a systematic mapping help identify future research. finally, we provided phd students with recommendations to mitigate common threats they can face during a systematic study. we believe any phd candidate can adapt or reuse the lessons and recommendations outlined in our research experience in any area of study.

acknowledgements

this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior – brasil (capes) – finance code 001.

references

barcelos, l. and penteado, r. (2017). elaboration of software requirements documents by means of patterns instantiation. j softw eng res dev, 5(3):1–23.
beckers, k., côté, i., and goeke, l. (2014). a catalog of security requirements patterns for the domain of cloud computing systems. in proceedings of the acm symposium on applied computing, pages 337–342.
bjarnason, e. and borg, m. (2017). aligning requirements and testing: working together toward the same goal. ieee software, 34(1):20–23.
brereton, p. o., turner, m., and kaur, r. (2009). pair programming as a teaching tool: a student review of empirical studies. in 22nd conference on software engineering education and training (csee&t '09).
centre for reviews and dissemination (2002). the database of abstracts of reviews of effects (dare). effectiveness matters, 6(2):1–4.
chelimsky, d., astels, d., helmkamp, b., north, d., dennis, z., and hellesoy, a. (2010). the rspec book: behaviour driven development with rspec, cucumber, and friends. pragmatic bookshelf, raleigh, nc, 1st edition.
clarke, m. and chalmers, i. (2018). reflections on the history of systematic reviews. bmj evidence-based medicine, 23:121–122.
clear, t. (2015). 'follow the moon' development: writing a systematic literature review on global software engineering education. in 15th koli calling conference on computing education research, koli calling '15, pages 1–4. acm.
costal, d., franch, x., lópez, l., palomares, c., and quer, c. (2019). on the use of requirement patterns to analyse request for proposal documents. in laender, a. h. f., pernici, b., lim, e., and de oliveira, j. p. m., editors, conceptual modeling – 38th international conference, er 2019, salvador, brazil, november 4-7, 2019, proceedings, volume 11788 of lecture notes in computer science, pages 549–557. springer.
cruzes, d. and dybå, t. (2011). research synthesis in software engineering: a tertiary study. information and software technology, 53(5):440–455.
da silva, r. c. and benitti, f. b. v. (2011). writing standards requirements: a systematic literature mapping. in proceedings of the 14th workshop on requirements engineering, pages 259–272, rio de janeiro, rj, brazil.
dybå, t., kitchenham, b. a., and jørgensen, m. (2005). evidence-based software engineering for practitioners. ieee software, 22(1):58–65.
ebert, c. and ray, r. (2021). test-driven requirements engineering. ieee software, 38(01):16–24.
egger, m., smith, g., and philips, a. (1997). meta-analysis: principles and procedures. bmj, 315:1533–1537.
fabbri, s. c. p. f., silva, c., hernandes, e. m., octaviano, f., di thommazo, a., and belgamo, a. (2016). improvements in the start tool to better support the systematic review process. in 20th international conference on evaluation and assessment in software engineering (ease '16), pages 21:1–21:5.
felizardo, k. r., de souza, e. f., napoleão, b. m., vijaykumar, n. l., and baldassarre, m. t. (2020). secondary studies in the academic context: a systematic mapping and survey. journal of systems and software, 170:110734.
franch, x. (2015). software requirements patterns: a state of the art and the practice. in proceedings of the 37th international conference on software engineering – volume 2, icse '15, pages 943–944, piscataway, nj, usa. ieee press.
irshad, m., petersen, k., and poulding, s. (2018). a systematic literature review of software requirements reuse approaches. information and software technology, 93(c):223–245.
jamshidi, p., ghafari, m., ahmad, a., and pahl, c. (2013). a framework for classifying and comparing architecture-centric software evolution research. in 2013 17th european conference on software maintenance and reengineering, pages 305–314.
justo, j. l. b., benitti, f. b. v., and leal, a. c. (2018). software patterns and requirements engineering activities in real-world settings: a systematic mapping study. computer standards & interfaces, 58:23–42.
kaijanaho, a.-j. (2017). teaching master's degree students to read research literature: experience in a programming languages course 2002–2017. in 17th koli calling int. conference on computing education research (koli calling '17), pages 143–147, new york, ny, usa. acm.
kitchenham, b. and brereton, o. (2013). a systematic review of systematic review process research in software engineering. information and software technology, 55(12):2049–2075.
kitchenham, b., budgen, d., and brereton, o. (2011). using mapping studies as the basis for further research – a participant-observer case study. information and software technology, 53(6):638–651.
kitchenham, b., budgen, d., and brereton, p. (2015). evidence-based software engineering and systematic reviews. chapman & hall/crc innovations in software engineering and software development series. chapman & hall/crc.
kitchenham, b. a., pretorius, r., budgen, d., brereton, p. o., turner, m., niazi, m., and linkman, s. g. (2010). systematic literature reviews in software engineering – a tertiary study. information and software technology, 52(8):792–805.
konrad, s. and cheng, b. h. (2002). requirements patterns for embedded systems. in proceedings of the ieee joint international conference on requirements engineering, pages 127–136, essen, germany. ieee.
kudo, t. n. (2021). a metamodel for the alignment of requirement patterns and test patterns and a metamodel evaluation framework. phd thesis, federal university of são carlos, são carlos-sp, brazil. (in portuguese).
kudo, t. n., bulcão-neto, r. f., macedo, a. a., and vincenzi, a. m. r. (2019a). padrão de requisitos no ciclo de vida de software: um mapeamento sistemático. in proceedings of the xxii iberoamerican conference on software engineering (cibse '19), pages 420–433.
kudo, t. n., bulcão-neto, r. f., macedo, a. a., and vincenzi, a. m. r. (2019b). a revisited systematic literature mapping on the support of requirement patterns for the software development life cycle. journal of software engineering research and development, 7:9:1–9:11.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2019c). a conceptual metamodel to bridging requirement patterns to test patterns. in proceedings of the xxxiii brazilian symposium on software engineering, pages 155–160, new york, ny, usa. acm.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2020a). requirement patterns: a tertiary study and a research agenda. iet software, 14(1):18–26.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2020b). uma ferramenta para construção de catálogos de padrões de requisitos com comportamento. in anais do wer20 – workshop em engenharia de requisitos, são josé dos campos, sp, brasil, august 24-28, 2020. editora puc-rio.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2020c). toward a metamodel quality evaluation framework: requirements, model, measures, and process. in proceedings of the xxxiv brazilian symposium on software engineering, sbes 2020, pages 102–107.
kudo, t. n., bulcão-neto, r. f., graciano neto, v. v., and vincenzi, a. m. r. (2022). aligning requirements and testing through metamodeling and patterns: design and evaluation. requirements engineering journal, pages 1–25. (to be published).
kuhrmann, m. (2017). teaching empirical software engineering using expert teams. in seuh, pages 20–31.
martins, m. c., kudo, t. n., and bulcão-neto, r. f. (2021). padrões de requisitos para sistemas de registro eletrônico de saúde. in anais do wer21 – workshop em engenharia de requisitos, brasília, df, brasil, august 23-27, 2021. editora puc-rio.
mendes, e., wohlin, c., felizardo, k. r., and kalinowski, m. (2020). when to update systematic literature reviews in software engineering. journal of systems and software, 167:110607.
napoleão, b., felizardo, k. r., souza, e. f., and vijaykumar, n. l. (2017). practical similarities and differences between systematic literature reviews and systematic mappings: a tertiary study. in 29th international conference on software engineering and knowledge engineering (seke '17), pages 1–10.
palomares, c., quer, c., and franch, x. (2011). pabre-man: management of a requirement patterns catalogue. in 2011 ieee 19th international requirements engineering conference, pages 341–342.
palomares, c., quer, c., and franch, x. (2017). requirements reuse and requirement patterns: a state of the practice survey. empirical software engineering, 22(6):2719–2762.
pejcinovic, b. (2015). development and uses of iterative systematic literature reviews in electrical engineering education. electrical and computer engineering faculty publications and presentations, 327(1):1–10.
petersen, k., vakkalanka, s., and kuzniarz, l. (2015). guidelines for conducting systematic mapping studies in software engineering: an update. information and software technology, 64:1–18.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2015). using the findings of a mapping study to conduct a research project: a case in knowledge management in software testing. in 41st euromicro conference on software engineering and advanced applications (seaa '15), pages 208–215.
withall, s. (2007). software requirement patterns. best practices. microsoft press, redmond, washington.
wohlin, c. (2014). guidelines for snowballing in systematic literature studies and a replication in software engineering. in 18th international conference on evaluation and assessment in software engineering, ease '14, london, england, united kingdom, may 13-14, 2014, pages 38:1–38:10.
zhang, h. and babar, m. (2011). an empirical investigation of systematic reviews in software engineering. in 5th international symposium on empirical software engineering and measurement (esem '11), pages 1–10.
zhou, y., zhang, h., huang, x., yang, s., babar, m. a., and tang, h. (2015). quality assessment of systematic reviews in software engineering: a tertiary study. in proceedings of the 19th international conference on evaluation and assessment in software engineering, ease 2015, nanjing, china, april 27-29, 2015, pages 14:1–14:14.

journal of software engineering research and development, 2019, 7:9, doi: 10.5753/jserd.2019.458 this work is licensed under a creative commons attribution 4.0 international license.

a revisited systematic literature mapping on the support of requirement patterns for the software development life cycle

taciana n. kudo [ dc-ufscar, são carlos-sp, brazil | taciana@dc.ufscar.br ]
renato f. bulcão-neto [ inf-ufg, goiânia-go, brazil | rbulcao@ufg.br ]
alessandra a. macedo [ ffclrp-usp, ribeirão preto-sp, brazil | ale.alaniz@usp.br ]
auri m. r. vincenzi [ dc-ufscar, são carlos-sp, brazil | auri@dc.ufscar.br ]

abstract

in the past few years, the literature has shown that the practice of reuse through requirement patterns is an effective alternative to address specification quality issues, with the additional benefit of time savings. due to the interactions between requirements engineering and other phases of the software development life cycle (sdlc), these benefits may extend to the entire development process. this paper describes a revisited systematic literature mapping (slm) that identifies and analyzes research that demonstrates those benefits from the use of requirement patterns for software design, construction, testing, and maintenance. in this extended version, the slm protocol includes automatic search over two additional search sources, the application of the snowballing technique, and the quality assessment of the relevant ten-study group for data analysis and synthesis. in comparison to previous work, results still show a small number of studies on requirement patterns in the sdlc (excluding requirements engineering). results indicate that there is still an open field for research that demonstrates, through empirical evaluation and usage in practice, the pertinence of requirement patterns to software design, construction, testing, and maintenance.

keywords: requirement pattern, software development life cycle, systematic literature mapping

1 introduction

requirements engineering is a critical development phase in which software functionalities and constraints must be well identified and understood. however, a high percentage of software projects do not meet deadlines and budget due to incomplete, misinterpreted, conflicting, or omitted requirements (tockey, 2015; palomares et al., 2017).
to deal with this issue of quality of requirements specifications, software requirement patterns (srp) have been given special attention in recent years (palomares et al., 2017; irshad et al., 2018). an srp is an abstraction that groups both behaviors and services of applications with similar characteristics. it works as a template for new requirements specifications, and it can also be replicated in future requirements documentation (withall, 2007). for instance, to write a user authentication functional requirement, one can use an srp for this purpose and make appropriate adaptations to the requirement, if necessary. several proposals for srps are found in the literature, such as for embedded (konrad and cheng, 2002), content management (palomares et al., 2013), and cloud computing systems (beckers et al., 2014). among the benefits obtained with the adoption of srps are: (i) greater efficiency in requirements elicitation, since these are not identified from scratch; (ii) quality and consistency improvement in the requirements specification document; and (iii) improved requirements management (withall, 2007).

because of the inherent interaction between requirements engineering and other phases of the software development life cycle (sdlc), it is assumed that the benefits of using srps can reach other development activities. although there are secondary studies on software engineering (kitchenham and brereton, 2013), requirements engineering (curcio et al., 2018), and requirement patterns (barros-justo et al., 2018), there is no evidence of secondary studies that analyze the use of srps at other sdlc phases. in short, existing secondary studies are restricted to analyzing the adoption of srps exclusively in the requirements engineering phase.

in recent work, we performed a systematic literature mapping (slm) that identifies and analyses primary studies that put in evidence the usage of srps at the software design, construction, testing, and maintenance phases (kudo et al., 2019a); we adopt the terminology of the software engineering body of knowledge (swebok) for the sdlc phases (bourque and fairley, 2014). the underlying protocol included an automatic search over four sources of information and the definition and application of inclusion and exclusion criteria over the 117 non-duplicate studies found. only nine primary studies were considered relevant, given the research aim (kudo et al., 2019a). results indicated that most of the relevant studies apply srps in software design, but none in software maintenance. moreover, only one study was featured as validation research, while the remaining studies were solution proposals. thus, we concluded that the benefits from srp usage in practice at other sdlc phases are still at an early stage.

in this paper, we revisit the slm described in kudo et al. (2019a) and improve the identification and selection methods of primary studies. besides the inclusion of two additional sources of information in the automatic search process, we also apply the snowballing technique (wohlin, 2014), which identifies relevant studies through the scanning of the list of bibliographic references or citations of a paper. the inclusion of two sources of studies resulted in 32 extra non-duplicate papers, from which one novel relevant study arose. considering the 9 relevant primary studies found in our previous work, we obtained a ten-primary-study group in this research. to check whether other essential studies related to
this research exist, we also analyzed the list of bibliographic references as well as the citing papers of each one of these 10 studies. the snowballing technique resulted in 202 non-duplicate papers, from which none was assessed as relevant after the re-application of the inclusion and exclusion criteria. we read the full text of the 10 studies to extract the answers to the slm research questions and, in parallel, to assess the quality of each relevant paper. finally, we synthesized in a bubble graph a map with the remarkable characteristics of this ten-study group. in comparison with our previous work, results continue to point out a lack of research on srps for software design, construction, testing, and maintenance.

the organization of this paper is as follows. section 2 details the protocol of this slm. section 3 reports the data extraction and the quality assessment activities regarding the relevant studies. the answers to the research questions in this study and the research gaps are summarized in section 4. finally, section 5 describes the validation threats of this slm, whereas section 6 presents our final remarks.

2 the systematic mapping protocol

in general, a systematic study process can be divided into three distinct phases (fabbri et al., 2013): planning, conduction, and publishing of results. first, a protocol is planned in such a way that one can reproduce it later. this systematic mapping protocol includes the definition of the main goal, research questions, search strategy, search string, sources of studies, and inclusion and exclusion criteria. in the conduction phase, studies gathered from search engines and bibliographic databases are identified and selected using the inclusion and exclusion criteria previously defined. a set of useful information is extracted from these selected studies that, in turn, can still be excluded from the slm. snowballing is performed over these included papers by first checking their reference lists. the selection of studies from this backward analysis is also based on the previous reading of each paper's title and abstract. the same process is also carried out with the citation list of the papers examined in the data extraction step. forward and backward analyses finish when no new study is included. following the slm goal, the remaining studies constitute the set of relevant papers from which answers to the research questions of the protocol are analyzed and synthesized. a quality assessment activity is also conducted to assist data synthesis from these relevant papers, as suggested by kitchenham et al. (2010). in the publishing phase, the entire protocol and the results of each previous stage are documented as scientific papers or technical reports.

the slm presented in this paper is an extension of kudo et al. (2019a)'s work and follows those three phases, as depicted in figure 1.

figure 1. phases and activities of this slm, adapted from (fabbri et al., 2013; wohlin, 2014).
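the forward/backward analysis just described is essentially an iterate-until-fixpoint procedure; a minimal sketch, assuming hypothetical callables that stand in for the manual scanning of reference/citation lists and the application of the selection criteria:

```python
# a minimal sketch of the iterative snowballing step described above;
# fetch_references, fetch_citations, and is_relevant are stand-ins for
# the manual work (scanning lists and applying the selection criteria)
def snowball(seed_studies, fetch_references, fetch_citations, is_relevant):
    included = set(seed_studies)
    frontier = set(seed_studies)
    while frontier:  # stop when a round includes no new study
        candidates = set()
        for paper in frontier:
            candidates |= set(fetch_references(paper))  # backward snowballing
            candidates |= set(fetch_citations(paper))   # forward snowballing
        new = {p for p in candidates - included if is_relevant(p)}
        included |= new
        frontier = new
    return included
```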
2.1 research questions and keywords

the main goal of this slm is to identify studies that explore the benefits of requirement patterns for every sdlc phase, except for the requirements engineering process. based on this goal, the set of research questions (rq) this slm should answer, and the respective justifications, are presented next:

rq1. at what sdlc phases are requirement patterns used: design, construction, testing, and/or maintenance? this question is essential to find out whether there is research on requirement patterns covering other sdlc phases, beyond requirements engineering.

rq2. is there evidence of requirement patterns usage in practice at those sdlc phases? this question is relevant to discover empirical evidence on requirement patterns usage at other sdlc phases, i.e., not only solution proposals.

rq3. are there reported benefits of using requirement patterns at those phases? if so, what metrics are used to measure these benefits? this question is useful to find out whether the benefits of requirement patterns (e.g., development time savings, better quality specifications, etc.) have been exploited at other sdlc phases. if so, we want to know how these benefits have been measured.

to support the definition of standardized terms in software engineering, the search terms are borrowed from sevocab (software and systems engineering vocabulary), an iso/ieee initiative to standardize the terms used in software engineering (iso/iec/ieee, 2017). the following is the set of keywords used for the definition of the search string: requirement pattern, development process, software development, life cycle, design, construction, coding, implementation, test, integration, and maintenance. a search strategy should find relevant studies to answer the research questions. next, we present the search strategy performed in this slm, which includes automatic search and the snowballing technique.

2.2 automatic search

after evaluating the trade-off between coverage and relevance of the search results in a pilot search, we opted for the following combination of keywords as search string (plural variations of the term "requirement pattern" are necessary due to the capabilities of the search engines of each source of studies):

("requirement pattern" or "requirement patterns" or "requirements pattern" or "requirements patterns") and (("software development" or "development process") or ("life cycle" or design or construction or coding or implementation or test or integration or maintenance))

besides acm dl, engineering village, ieee xplore, and scopus, we also performed searches at the sciencedirect and web of science websites; for acm, we chose the acm guide to computing literature because it is the most comprehensive bibliographic database on computing, including the full-text collection of all acm publications. as in the original search, searches were based on study metadata, at least over the abstracts, because of their richer content. table 1 details the number of studies returned per source, both in the original search, carried out from april 24 to may 5, 2018 (kudo et al., 2019a), and in this revisited version, whose additional search was performed on june 3 and 4, 2019. therefore, 85 more studies were identified (including duplicate papers) after the inclusion of two new bibliographic databases (sciencedirect and web of science) and the update of search results over the four original sources of studies.

table 1. number of studies returned per source.

source                original    extension    difference
acm dl                24          26           2
engineering village   100         106          6
ieee xplore           23          25           2
scopus                71          76           5
sciencedirect         –           9            9
web of science        –           61           61
total                 218         303          85
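as an aside, the search string above is easier to keep reproducible when generated rather than hand-typed; a small illustrative sketch (the helper names are ours, not part of the protocol), whose output matches the string shown in this subsection:

```python
# build the boolean search string from keyword groups; quoting and
# plural variants mirror the string shown above
def quoted(term: str) -> str:
    return f'"{term}"' if " " in term else term

def any_of(terms) -> str:
    return "(" + " or ".join(quoted(t) for t in terms) + ")"

pattern_terms = ["requirement pattern", "requirement patterns",
                 "requirements pattern", "requirements patterns"]
process_terms = ["software development", "development process"]
phase_terms = ["life cycle", "design", "construction", "coding",
               "implementation", "test", "integration", "maintenance"]

search_string = (any_of(pattern_terms) + " and ("
                 + any_of(process_terms) + " or " + any_of(phase_terms) + ")")
print(search_string)
```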
2.3 selection of primary studies

this section describes the selection method of relevant studies to answer the research questions of this slm. the same original selection criteria were applied to the 303 papers returned by the automatic search process. the exclusion criteria (ec) are:

ec1 it is not a primary study.
ec2 it is not a paper (e.g., preface or summary of journals or conference proceedings).
ec3 the research is not about srp.
ec4 the research addresses srp in requirements engineering only.
ec5 the full study text is not in english.
ec6 the full study text is not accessible.
ec7 it is a preliminary or short version of another study.

a paper is removed from this slm whenever it meets at least one of the exclusion criteria; otherwise, the study is categorized based on the following inclusion criteria (ic):

ic1 it addresses srp in software design.
ic2 it addresses srp in software construction.
ic3 it addresses srp in software testing.
ic4 it addresses srp in software maintenance.

figure 2 depicts the entire selection process with the respective number of primary studies chosen and removed in each activity of the conduction phase. after the automatic search process, 155 duplicate papers were identified and removed (from the 303-study group) with the support of the start tool (fabbri et al., 2016). next, we proceeded with the reading of the title, abstract, and keywords of each of the 148 remaining studies, upon which we applied the exclusion and inclusion criteria. as a result, we selected 41 possibly relevant studies, because this selection relies on the reading and interpretation of papers' metadata only. in the data extraction activity, we read the full text of these 41 studies, from which we excluded 31 papers by the ec4 criterion, i.e., their research focus is on srp in the requirements engineering phase. we describe the process of data extraction of the 10 remaining studies in section 3. these studies are identified throughout this paper as s1 to s10, as follows:

s1 adaptive requirement-driven architecture for integrated healthcare systems (yang et al., 2010)
s2 analysing security requirements patterns based on problems decomposition and composition (wen et al., 2011)
s3 an architectural framework of the integrated transportation information service system (chang and gan, 2009)
s4 application of ontologies in identifying requirements patterns in use cases (couto et al., 2014)
s5 effective security impact analysis with patterns for software enhancement (okubo et al., 2011)
s6 from requirement to design patterns for ubiquitous computing applications (knote et al., 2016)
s7 modeling design patterns with description logics: a case study (asnar et al., 2011)
s8 mutation patterns for temporal requirements of reactive systems (trakhtenbrot, 2017)
s9 sacs: a pattern language for safe adaptive control software (hauge and stølen, 2011)
s10 re-engineering legacy web applications into rias by aligning modernization requirements, patterns and ria features (conejero et al., 2013)

table 2. the total number of studies removed per exclusion criterion throughout the conduction phase.

activity           ec1   ec2   ec3   ec4   ec5   ec6   ec7   total
automatic search   7     13    40    47    0     0     0     107
data extraction    0     1     10    15    0     2     3     31
snowballing        2     0     186   14    0     0     0     202
total              9     14    236   76    0     2     3     340
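the two-stage application of the criteria described above can be read as a small filter — a hedged sketch, where meets() is a hypothetical predicate standing in for the reviewers' judgment:

```python
# any matching exclusion criterion removes a paper; otherwise the paper
# is kept and categorized by the inclusion criteria it satisfies
EXCLUSION = ["ec1", "ec2", "ec3", "ec4", "ec5", "ec6", "ec7"]
INCLUSION = ["ic1", "ic2", "ic3", "ic4"]

def select(paper, meets):
    if any(meets(paper, ec) for ec in EXCLUSION):
        return None                      # removed from the slm
    return [ic for ic in INCLUSION if meets(paper, ic)]
```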
figure 2. a detailed view of the conduction phase: automatic search, duplicate study exclusion, study selection, data extraction, snowballing, and data synthesis.

2.4 snowballing

besides automatic search, our search strategy includes snowballing, as an attempt to obtain other relevant studies using the papers s1 to s10 as input. regarding backward snowballing, we collected the reference list of each paper from the scopus database, resulting in 216 documents whose metadata (title, abstract, and keywords) we stored in the start tool. after the removal of 49 duplicate studies, we read the metadata of the 167 remaining documents to decide for the exclusion or the tentative inclusion of a paper for further analysis. as no new paper was found in the first round of backward snowballing, we finished this analysis early. in sequence, we searched the citation list of s1 to s10 on the scopus website, resulting in 44 papers also registered in the start tool. similarly, no new paper was retrieved in the first round of this forward snowballing step, which resulted from the removal of 9 duplicate studies and the reading of the metadata of the 35 remaining documents. both snowballing procedures end the process of selection of relevant studies of this slm. figure 2 depicts the total number of studies identified (260), excluded (58), and selected (0) in the overall snowballing process. as a result, the data extraction and synthesis activities include only the studies s1 to s10 previously presented.

finally, table 2 summarizes the removal of studies in the conduction phase. most of the papers removed in the automatic search (87 of 107) are due to the ec3 and ec4 criteria, i.e., they do not address srp, or they do so in the requirements engineering phase only, respectively. studies were excluded at a similar rate (25 of 31) in the data extraction activity. these exclusion rates of around 80% are expected because of the trade-off analysis between coverage and relevance of the search string. differently, most of the studies removed during both snowballing procedures (186 of 202) are due to the ec3 criterion. two related reasons explain this 92% exclusion rate: first, in general, the size of the reference list of a paper is far more extensive than the number of studies citing that paper; second, the papers in a reference list often address other research topics. besides, only 7% of the studies referenced by or citing the relevant papers represent research on srps (14 of 202). even so, none of these explores srps at sdlc stages other than requirements engineering.

3 data extraction

this section describes the data extraction process from the full-text reading of the 10 relevant studies (s1 to s10) of this slm. besides presenting a comparative analysis of the contribution types of each paper, we also extract:

1. the quality score of each primary study;
2. the type of research carried out;
3. the type of requirement addressed by srp;
4. the sdlc phase supported by srp;
5. and the contribution type.

3.1 quality assessment

quality assessment may be useful in an slm to assure that sufficient information is available to be extracted. however, we concur with petersen et al. (2015) that quality assessment should not pose high requirements on the primary studies, because the main objective of an slm is to give a broad overview of a research topic.
rather than using quality criteria for the exclusion of papers, our quality assessment approach assists data analysis and synthesis, e.g., to investigate whether different quality scores are associated with varying outcomes of the primary studies (kitchenham et al., 2010; petersen et al., 2015). multiple checklists are available in the literature to help the process of assessing the quality of primary studies. here, we evaluated the quality of primary studies through nine quality criteria, of which six are general factors (g1 to g6), as described in jamshidi et al. (2013), and three are particular factors (p1 to p3) that we defined based on the subject of this slm. following is the full description of every general and specific quality criterion, including the respective predefined responses and scores (in parentheses). observe that g2 is the only criterion whose score ranges from 0 to 1, indicating a lower weight in the quality score of each study.

g1 problem definition of the study.
(2): there is an explicit problem description.
(1): there is a general problem description.
(0): there is no problem description.

g2 environment in which the study is carried out.
(1): there is an explicit description of the environment in which the research is performed (e.g., lab setting, as part of a project, in collaboration with industry, etc.).
(0.5): there are some general words about the environment in which the research is performed.
(0): there is no description of the environment.

g3 research design of the study.
(2): there is an explicit description of the plan (different steps, timing, etc.) used to perform the research, or of the way the research is organized.
(1): there are some general words about the research plan or the way the research is organized.
(0): there is no description of the research design.

g4 contributions of the study.
(2): there is an explicit list of the contributions/results.
(1): there are some general words about the study results.
(0): there is no description of the study results.

g5 insights derived from the study.
(2): there is an explicit list of insights/lessons learned from the study.
(1): there are general words about insights/lessons learned from the study.
(0): there is no description of the insights derived from the study.

g6 limitations of the study.
(2): there is an explicit list of the limitations of the study.
(1): there are general words about the limitations of the study.
(0): there is no description of the limitations of the study.

p1 the srp structure.
(2): there is an explicit description of the srp structure.
(1): there is some general information about the srp structure.
(0): there is no description of the srp structure.

p2 the integrated use of srps with the sdlc phases.
(2): there is an explicit description of which sdlc phase benefits from srps usage.
(1): there are some general words about which sdlc phase benefits from srps usage.
(0): there is no description of which sdlc phase benefits from srps usage.

p3 empirical investigation of srps usage in the sdlc phases.
(2): there is an explicit description of empirical investigation.
(1): there is some general information about the empirical investigation.
(0): there is no description of empirical investigation.

the relevance of the particular quality criteria (p1 to p3) is presented next. as stated by franch et al. (2010), the reuse of an srp heavily depends on a detailed description of its structure (p1). the p2 criterion is important to identify the adherence of each study to the research question rq1, i.e., the sdlc phase supported by srps. finally, the p3 criterion allows distinguishing studies with empirical evidence.

once the general and particular quality criteria are presented, the following is the final quality score (qs) formula, which provides us with a numerical quantification as a means of ranking the relevant primary studies:

\[ qs = \frac{\sum_{g=1}^{6} g_g}{11} + \left( \frac{\sum_{p=1}^{3} p_p}{6} \right) \times 3 \qquad (1) \]

where \(g_g\) and \(p_p\) denote the scores assigned to criteria g1–g6 and p1–p3; the sums of g1 to g6 and of p1 to p3 are normalized so that they may reach maximum scores of 1 and 3, respectively (the raw maximum sums are 11 and 6). that is, the specific quality criteria carry a 75% weight in the final quality score because of their higher importance in comparison with the general items.
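read as code, equation (1) is just two normalized sums; the sketch below reproduces table 3's values, assuming — as our reproduction of the table suggests, though the paper does not state it — that each normalized sum is rounded to one decimal before being combined:

```python
def quality_score(g_scores, p_scores):
    """final quality score per equation (1): the sums over the general
    criteria g1-g6 and the particular criteria p1-p3 are normalized to
    at most 1 and 3, respectively, with one-decimal intermediate rounding."""
    sgc = round(sum(g_scores) / 11, 1)   # max raw sum is 11 (g2 tops out at 1)
    spc = round(sum(p_scores) / 6, 1)    # max raw sum is 6
    return sgc + spc * 3

# s10's scores from table 3: g = (2, 1, 2, 2, 2, 1), p = (1, 2, 2)
print(round(quality_score([2, 1, 2, 2, 2, 1], [1, 2, 2]), 1))  # 3.3, as in table 3
```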
table 3 presents the full quality assessment of the ten primary studies, in descending order of the final quality score (qs, at the rightmost column). the values assigned to the general and particular quality criteria of every primary study are also available in table 3, as well as the partial total scores for the general and particular quality criteria (sgc and spc, respectively). observe that the p3 criterion contributes to a subclassification of the ten-study group: research with no empirical investigation (s1 to s6, s8, and s9) got a quality score below 3.0, while the studies whose quality score is above 3.0 (s7 and s10) have more empirical evidence and make their lessons learned explicit (g5 criterion). however, s7 and s10 obtained a lower grade for the p1 criterion because they do not describe the structure of their srp proposals. finally, the lower quality scores are mainly due to the grades of the p2 and p3 criteria; consider the case of the studies s3 and s4, which both have no empirical evidence and only partially describe how to employ srps in an sdlc phase.

table 3. a detailed view of the quality assessment results.

study   g1   g2    g3   g4   g5   g6   sgc   p1   p2   p3   spc   qs
s10     2    1     2    2    2    1    0.9   1    2    2    0.8   3.3
s7      2    1     2    2    2    0    0.8   1    2    2    0.8   3.2
s6      2    1     2    2    1    0    0.7   2    2    0    0.7   2.8
s2      1    1     2    2    0    2    0.7   2    2    0    0.7   2.8
s9      2    1     2    2    1    0    0.7   2    2    0    0.7   2.8
s5      2    0.5   2    2    0    0    0.6   2    2    0    0.7   2.7
s8      2    0     2    2    0    0    0.5   2    2    0    0.7   2.6
s1      2    0     2    1    0    0    0.5   2    2    0    0.7   2.6
s4      1    0.5   2    2    0    1    0.6   2    1    0    0.5   2.1
s3      2    0     2    1    0    0    0.5   2    1    0    0.5   2.0

3.2 research type

we classified the ten-study group using petersen et al. (2015)'s criteria, in which a set of conditions determines the type of research developed. for instance, opinion research solely reports the author's point of view about a subject; in this case, there is no usage in practice, empirical evaluation, author's experience report, or proposal of a conceptual framework or a novel solution. table 4 shows that, according to petersen et al. (2015)'s taxonomy, most of the studies (8 of 10) are solution proposals because there is no empirical evaluation: three studies are validated by a proof of concept, whereas the five remaining do not validate their proposals at all. furthermore, only two of the ten studies are validation research: s7 presents a case study, and s10 describes an experiment with controlled conditions.

table 4. types of research and validation of relevant studies.

type of research      type of validation
solution proposal     proof of concept: s2, s5, s9
                      no validation: s1, s3, s4, s6, s8
validation research   case study: s7
                      experiment: s10

3.3 type of software requirement
next, we analyzed the particular type of software requirement covered by each srp, as presented in table 5. four of the relevant studies define srp for the adaptability requirement, and another four papers for the security one. the srp proposals described in the remaining two studies do not address a specific type of software requirement.

table 5. type of requirement covered by an srp.

type of requirement   studies
adaptability          s1 s3 s6 s8
security              s2 s5 s7 s9
general purpose       s4 s10

3.4 a comparative analysis

next, we describe a detailed comparative analysis of the contributions proposed in s1 to s10, from which we perceived some similarities and identified the sdlc phase supported by their srp solutions.

studies s1 and s3 propose a similar conceptual architecture for systems developed from srps, as illustrated in figure 3. the dashed lines a, b, c, and d show the similarities between the architectures proposed in s1 (left-hand side) and s3 (right-hand side). the requirements layer (a) identifies, analyzes, and models requirements as user requirement patterns (urp). the service layer (b) interacts with the requirements layer and provides services to satisfy the urp. the security and information sharing mechanism (c) establishes a process of reliable information exchange between systems of the same domain. the knowledge base (d) combines standards, norms, and ontologies of the system domain. the motivation of both research efforts is the need to share information between systems of the same area: medical systems (in s1) and transport systems (in s3).

figure 3. a comparative analysis of the srp-based conceptual architectures discussed in s1 (left-hand side) and s3 (right-hand side).

regarding s1 and s3 again, these studies make use of srp to support the software design phase. in both studies, a urp in the requirements layer leads to the efficient selection of services in the service layer. a urp is a crucial element not only because it represents user requirements but also because it guides the operation of the entire system.

we also observed commonalities in how s2 and s5 represent security requirements as an srp, as depicted in figure 4. both studies specify security requirement patterns with a similar structure and security concepts (context, assets, and threats), as well as protection measures as design patterns. illustrated as dashed lines in figure 4, the steps outlined in s2 (left-hand side) — the identification of stakeholders and objectives, essential information assets, and threat sources using standards — match the following items of the security requirement pattern in s5 (right-hand side), respectively: the pattern definition format (context, problem, solution, and structure), asset, and threat. finally, the step "adding protection measures in the system design" in s2 matches the countermeasure concept described as security design patterns in s5. from this analysis, we concluded that s2 and s5 also make use of srp to benefit the software design phase, because they define security requirement patterns and relate them to design-pattern-based protection measures.

figure 4. a comparative analysis of the srp-based security approaches discussed in s2 (left-hand side) and s5 (right-hand side).

as a result of the analysis of s8 and s9, we identified that both studies present proposals of requirement pattern representation formats.
in s8, each natural-language-written requirement binds to a formula written in a linear temporal language, in which mutations soften the likely issues in this association. each type of requirement pattern attaches its potential failures and the respective appropriate variations. the formulas associated with mutants have multiple purposes, such as test generation, the adequacy analysis of test sets, or the automatic construction of monitors for verifying the system's behavior at run-time. thus, the mutations included in the transformation of the requirement patterns contribute to the software testing phase. in the case of s9, a composite pattern integrates three types of software patterns (i.e., requirement, design, and security). based on problem frames theory, this composite pattern uses parameters extracted from an inner requirement pattern, from which a set of functions corresponds both to solutions in a design pattern and to contextual elements in a security pattern. thus, this applicability of srp is at software design.

the studies s4 and s7 model requirement patterns using ontologies based on formal description logic. as ontology-based srps allow the automatic generation of source code in s4, this srp contribution is to the software construction phase. in study s7, the authors implement a mechanism that automatically binds an ontology-based security requirement pattern to a corresponding design pattern solution. thus, the srp's main contribution in s7 is to the design phase of the sdlc.

in the context of ubiquitous computing (ubicomp) applications, s6 aims to map dependencies between design patterns and requirement patterns. this software pattern-integration approach bridges the gaps of the early software development phases, where recurring requirements demand similar design solutions, as in the case of the adaptability requirement for ubicomp applications. consequently, the main contribution of s6 is to the software design phase.

regarding the study s10, it presents a systematic process to modernize legacy web applications into rich internet applications (ria). the core of that process is a set of traceability matrices that relate modernization requirements, ria features, and patterns. a final traceability matrix suggests the most suitable ria patterns for each new requirement based on the values of two different metrics: the degree of requirement realization (drr) and the degree of pattern realization (dpr). once selected, the ria patterns are weaved into the legacy models so that those pattern-based ria functionalities are incorporated into the system. the reusability of ria patterns is very clear because the pattern traceability matrix is built once and used in any modernization process that, in turn, takes less design time. thus, in this approach, srps cover the gap between requirements elicitation and architectural design along the ria development process.
3.5 summary

table 6 summarizes the analysis of the ten-study group by the types of contributions identified: conceptual architectures for srp-based systems, processes for discovery and use of srp, representation formats for srp, and catalogs of srp. the final quality score (qs) of each study is in the rightmost column.

table 6. data extraction from the 10 relevant studies.

type of contribution                      sdlc phase     type of requirement   studies      qs
conceptual architectures for              design         adaptability          s1, s3       2.6, 2.0
srp-based systems
representation formats for srp            design         security              s2, s5, s9   2.8, 2.7, 2.8
                                          testing        adaptability          s8           2.6
processes for discovery and use of srp    design         security              s7           3.2
                                          design         general purpose       s10          3.3
                                          construction   general purpose       s4           2.1
catalog of srp                            design         adaptability          s6           2.8

4 data synthesis

this section presents a synthesis of the data extracted from the relevant studies to answer the research questions.

4.1 about research question 1

to answer the research question "at what sdlc phases are requirement patterns used: design, construction, testing, and/or maintenance?", eight studies use srps at the design phase, one at construction, one at software testing, and none at software maintenance. among the eight studies that address srps at software design (s1 to s3, s5 to s7, s9, and s10), there are no repeating authors, nor do the studies converge to one or more research groups. two hypotheses can explain the high concentration of studies related to the design phase: the fact that it comes right after requirements engineering, and the increasing usage of design patterns in software development. even though the studies s3 and s4 do not clearly state the sdlc phase supported by srps, we consider that their srp proposals bring benefits to the software design and construction phases, respectively (see the p2 criterion).

the significant difference between the number of relevant studies (10) and the number of papers excluded (77) is because the latter investigate srps exclusively for requirements engineering. this imbalance makes it clear that there is still an open field for research on the benefits of srps for the other sdlc phases, such as testing (1) and maintenance (0). further evidence is the lack of research on the use of srps along the entire sdlc, from requirements engineering to software maintenance. an example of a challenging study could be the evaluation of the improvements to the sdlc resulting from the adoption of srps, beyond the well-known benefits of time savings and better quality specifications.

4.2 about research question 2

regarding the research question "is there evidence of requirement patterns usage in practice at those sdlc phases?", no study reports evidence of srp usage in the software industry. eight of the ten relevant studies are solution proposals with no empirical validation, and only two papers (s7 and s10) are validation research, with the highest quality scores according to our quality assessment. this analysis suggests that future work should focus more on the use of srps along the sdlc in the software industry.

4.3 about research question 3

to answer the research question "are there reported benefits of using requirement patterns at those phases? if so, what metrics are used to measure these benefits?", s10 is the only study that defines srp-related metrics. we believe that this lack of concern with metrics is because most articles are solution proposals, thus without use in practice. in s10, the metrics drr (degree of requirement realization) and dpr (degree of pattern realization) select candidate ria patterns in the process of re-engineering legacy web applications. a value of 1 in drr indicates that a pattern fully supports all the ria features demanded by the requirement, whereas a value of 0 means that the requirement and the pattern do not share any feature. similarly, a value of 1 in dpr denotes that the requirement demands all the ria features supported by the pattern, whereas a value close to 0 implies that the requirement needs an insignificant share of the ria features supported by the pattern.
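one plausible reading of these definitions — our interpretation for illustration, not a formula stated in s10 — treats drr and dpr as shared-feature counts normalized by the requirement's and the pattern's feature sets, respectively; the feature names below are invented:

```python
def drr(required: set, supported: set) -> float:
    """fraction of the ria features demanded by the requirement
    that the pattern supports (1 = full support, 0 = no overlap)."""
    return len(required & supported) / len(required)

def dpr(required: set, supported: set) -> float:
    """fraction of the ria features supported by the pattern
    that the requirement actually demands."""
    return len(required & supported) / len(supported)

requirement = {"async update", "client-side validation"}
pattern = {"async update", "client-side validation", "drag and drop"}
print(drr(requirement, pattern))  # 1.0: pattern covers every demanded feature
print(dpr(requirement, pattern))  # ~0.67: requirement uses 2 of 3 supported features
```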
the experiment results in s10 show that, in the worst case, more than half of the patterns would have been automatically suggested by the authors' method. furthermore, the synchronization patterns indicated by the approach and those used by developers are the same in all systems tested in the experiment. both results support the conclusion that srp usage in s10 implies significant development time savings.

figure 5. mapping of the types of requirements and validation on srps for software design, construction, testing, and maintenance.

4.4 discussion

figure 5 illustrates a bubble graph that synthesizes the information we extracted and analyzed from each relevant paper. observe that four studies (s2, s5, s7, and s9) propose security requirement patterns with contributions to the software design phase. we conclude that this is because security is a recurrent requirement of many software systems and is supported by well-established international standards (iso/iec, 2018). however, these studies still require more significant validation with empirical assessments and use in the software industry. four studies (s1, s3, s6, and s8) explore srps for the adaptability non-functional requirement: one in software testing (s8) and the others in software design. besides, none of these studies presents any validation of the proposal. s4, in turn, investigates srps for general-purpose requirements used in software construction, but also with no validation.

still regarding figure 5, as important as mapping the research endeavors is the analysis of the existing gaps:

1. there is a general lack of investigation on the adoption of srps at other sdlc stages (10), while many research endeavors still focus on requirements engineering (77);
2. adaptability and security are the non-functional requirements most addressed as srps at the software design and testing phases, as the left-hand side of the bubble graph shows. however, other types of non-functional requirements can be specified as srps at different sdlc phases, e.g., usability aspects with automated support for code and test case generation;
3. the application of research results on srps in the software industry is scarce (right-hand side of the figure); except for the studies s7 and s10, the remaining are at the proof-of-concept level.

5 threats to validity

finding all relevant research on a topic and selecting evidence of quality are significant problems in systematic studies. three procedures were carried out throughout the planning and conduction phases to reduce the potential threats to the validity of this slm. first, we performed an automatic search strategy that combines six relevant sources of studies with search string terms based on the sevocab standard vocabulary. searching the gray literature (e.g., dissertations, theses, and technical reports) is not part of the protocol because we assume that good-quality research is mostly published in journals or conferences. secondly, we were aware that searches could be extended to two additional relevant sources of research, i.e., sciencedirect and web of science.
surprisingly, even after introducing those two new sources, the number of relevant studies resulting from the automatic search increased only from 9 to 10 (the study s10, retrieved from web of science). as a means of retrieving a higher number of papers, we extended the search strategy again by performing the snowballing technique over those ten relevant studies. in spite of this, the hybrid search strategy included no new research. thirdly, we assessed the quality of primary studies as a means of reducing a likely bias in the analysis and synthesis steps of this slm. the quality criteria we defined and the scores we calculated for each relevant study allowed us to better weight the importance of individual studies when the results were synthesized. for instance, the value of empirical evidence and the reporting of lessons learned convey a higher maturity level to the study s10 in comparison to s3, which partly explains the difference between their respective quality scores.

finally, to mitigate possible biases in this research, three researchers participated in the planning and conduction phases of this slm as follows:

a: with 14 years of experience in requirements engineering, she performed the protocol planning, the study selection, and the data extraction and synthesis.
b: with 13 years of experience in software engineering, he also performed the protocol planning; still, his contribution was mostly the verification of the results of the selection, extraction, and synthesis activities.
c: the team leader, with more than 20 years of experience in software engineering, helped with the synthesis and writing of the results.

whenever divergences arose, a, b, and c resolved them together.

6 final remarks

in the past few years, the literature has demonstrated the positive impacts of software requirement patterns on requirements specification quality, team productivity, and elicitation and specification costs, among others (barros-justo et al., 2018; irshad et al., 2018). this paper presents a revisited version of a recent work (kudo et al., 2019a) that investigates whether those benefits of srp usage have also been studied for software design, construction, testing, and maintenance. here, we expand the scope of the search strategy with two additional and pertinent sources of studies and the application of the snowballing technique. besides, we carry out a quality assessment activity supporting the data extraction of relevant studies.

by adding other databases to the search strategy, we obtained only one new relevant paper (s10) in comparison with our previous slm. notably, the study s10 got the highest quality score, is the only one that defines srp-related metrics, and is also classified as validation research. concerning the overall snowballing procedure, in spite of scanning both the reference and the citation lists of the relevant studies as a means of finding further research, none of the 260 papers found was suited to our purposes. this strengthens our claim that the effective use of srps in software design, construction, testing, and maintenance constitutes a gap for future research. we also conclude that the studies' quality scores reflect the maturity of each piece of research described.
the highest-quality-score studies (s7 and s10) provide more empirical evidence and lessons learned than the remaining investigations of srps in the software design phase (studies s1 to s3, s5, s6, and s9). in general, we are confident that our results are valuable not only for new secondary studies on this same subject but also for future primary research. to promote further research on srps in the whole software development process, we continue to suggest that the academic community approach the software industry to match the latter's expectations effectively. researchers should also establish more metrics that corroborate the advantages of srp usage, such as reduced design time, automatic source code generation, standardized testing, and improvement in the quality of specifications in general (kudo et al., 2019c). finally, we conclude that the concrete results of srp usage in practice can be better experienced through two more lines of action: srp-based innovative development tools, and the enhancement of current development methodologies so that they integrate srps along the sdlc. our current efforts include the reuse of agile concepts and practices of behaviour-driven development (bdd) for the description of srps whose behavior is described as test patterns (kudo et al., 2019b).

as future work, we plan the inclusion of the term "analysis pattern" (and its variants) in the search string of this systematic mapping to augment the group of relevant studies. the main reason is that analysis patterns and requirement patterns are complementary approaches (pantoquilho et al., 2003) in such a way that the former can be transformed into the latter to migrate to the implementation details level.

acknowledgements

this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001. renato bulcão-neto is grateful for the scholarship granted by capes/fapeg (88887.305511/2018-00), in the context of the postdoctoral internship held at the dept. of computation and mathematics of ffclrp-usp. alessandra macedo is grateful for the financial support of fapesp (16/13206-4) and cnpq (302031/2016-2 and 442533/2016-0). the authors would also like to thank all the anonymous referees for their valuable comments and suggestions on this paper.

references

asnar, y., paja, e., and mylopoulos, j. (2011). modeling design patterns with description logics: a case study. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), volume 6741 lncs, pages 169-183, london, united kingdom.
barros-justo, j. l., benitti, f. b. v., and leal, a. c. (2018). software patterns and requirements engineering activities in real-world settings: a systematic mapping study. comp. standards & interfaces, 58:23-42.
beckers, k., côté, i., and goeke, l. (2014). a catalog of security requirements patterns for the domain of cloud computing systems. in proceedings of the acm symposium on applied computing, pages 337-342.
bourque, p. and fairley, r. e., editors (2014). swebok: guide to the software engineering body of knowledge. ieee computer society, los alamitos, ca, version 3.0 edition.
chang, f. and gan, r. (2009). an architectural framework of the integrated transportation information service system. in 2009 ieee international conference on grey systems and intelligent services, gsis 2009, pages 1342-1346, nanjing, china.
conejero, j. m., rodríguez-echeverría, r., sánchez-figueroa, f., linaje, m., preciado, j. c., and clemente, p. j. (2013). re-engineering legacy web applications into rias by aligning modernization requirements, patterns and ria features. journal of systems and software, 86(12):2981-2994.
couto, r., ribeiro, a. n., and campos, j. c. (2014). application of ontologies in identifying requirements patterns in use cases. in electronic proceedings in theoretical computer science, eptcs, volume 147, pages 62-76, grenoble, france.
curcio, k., navarro, t., malucelli, a., and reinehr, s. (2018). requirements engineering: a systematic mapping study in agile software development. journal of systems and software, 139:32-50.
fabbri, s., silva, c., hernandes, e. m., octaviano, f., thommazo, a. d., and belgamo, a. (2016). improvements in the start tool to better support the systematic review process. in proceedings of the 20th international conference on evaluation and assessment in software engineering, ease 2016, limerick, ireland, june 01-03, 2016, pages 21:1-21:5.
fabbri, s. c. p. f., felizardo, k. r., ferrari, f. c., hernandes, e. c. m., octaviano, f. r., nakagawa, e. y., and maldonado, j. c. (2013). externalising tacit knowledge of the systematic review process. iet software, 7(6):298-307.
franch, x., palomares, c., quer, c., renault, s., and de lazzer, f. (2010). a metamodel for software requirement patterns. in wieringa, r. and persson, a., editors, requirements engineering: foundation for software quality, pages 85-90, berlin, heidelberg. springer berlin heidelberg.
hauge, a. a. and stølen, k. (2011). sacs: a pattern language for safe adaptive control software. in proceedings of the 18th conference on pattern languages of programs, plop '11, pages 7:1-7:22, new york, ny, usa. acm.
irshad, m., petersen, k., and poulding, s. (2018). a systematic literature review of software requirements reuse approaches. inf. softw. technol., 93(c):223-245.
iso/iec (2018). iso/iec 27000:2018 information technology - security techniques - information security management systems - overview and vocabulary.
iso/iec/ieee (2017). iso/iec/ieee 24765:2017 systems and software engineering - vocabulary.
jamshidi, p., ghafari, m., ahmad, a., and pahl, c. (2013). a framework for classifying and comparing architecture-centric software evolution research. in 2013 17th european conference on software maintenance and reengineering, pages 305-314.
kitchenham, b. a. and brereton, p. (2013). a systematic review of systematic review process research in software engineering. information & software technology, 55(12):2049-2075.
kitchenham, b. a., budgen, d., and brereton, o. p. (2010). the value of mapping studies: a participant-observer case study. in 14th international conference on evaluation and assessment in software engineering, ease 2010, keele university, uk, 12-13 april 2010.
knote, r., baraki, h., söllner, m., geihs, k., and leimeister, j. m. (2016). from requirement to design patterns for ubiquitous computing applications. in proceedings of the 21st european conference on pattern languages of programs.
konrad, s. and cheng, b. h. c. (2002). requirements patterns for embedded systems. in proceedings ieee joint international conference on requirements engineering, pages 127-136.
kudo, t. n., bulcão-neto, r. f., macedo, a. a., and vincenzi, a. m. r. (2019a). padrão de requisitos no ciclo de vida de software: um mapeamento sistemático. in proceedings of the xxii iberoamerican conference on software engineering, cibse 2019, la habana, cuba, april 22-26, 2019, pages 420-433.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2019b). a conceptual metamodel to bridging requirement patterns to test patterns. in proceedings of the xxxiii brazilian symposium on software engineering, sbes 2019, salvador, brazil, september 23-27, 2019, pages 155-160.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2019c). requirement patterns: a tertiary study and a research agenda. iet software, pages 1-9. https://doi.org/10.1049/iet-sen.2019.0016.
okubo, t., kaiya, h., and yoshioka, n. (2011). effective security impact analysis with patterns for software enhancement. in 2011 sixth international conference on availability, reliability and security, pages 527-534.
palomares, c., quer, c., and franch, x. (2017). requirements reuse and requirement patterns: a state of the practice survey. empirical software engineering, 22(6):2719-2762.
palomares, c., quer, c., franch, x., renault, s., and guerlain, c. (2013). a catalogue of functional software requirement patterns for the domain of content management systems. in proceedings of the 28th annual acm symposium on applied computing, sac '13, coimbra, portugal, march 18-22, 2013, pages 1260-1265.
pantoquilho, m., raminhos, r., and araújo, j. (2003). analysis patterns specifications: filling the gaps. in viking plop, bergen, norway.
petersen, k., vakkalanka, s., and kuzniarz, l. (2015). guidelines for conducting systematic mapping studies in software engineering: an update. information and software technology, 64:1-18.
tockey, s. (2015). insanity, hiring, and the software industry. computer, 48(11):96-101.
trakhtenbrot, m. (2017). mutation patterns for temporal requirements of reactive systems. in proceedings 10th ieee international conference on software testing, verification and validation workshops, icstw 2017, pages 116-121.
wen, y., zhao, h., and liu, l. (2011). analysing security requirements patterns based on problems decomposition and composition. in 2011 1st international workshop on requirements patterns, repa'11, pages 11-20, trento, italy.
withall, s. (2007). software requirement patterns. best practices. microsoft press, redmond, washington.
wohlin, c. (2014). guidelines for snowballing in systematic literature studies and a replication in software engineering. in 18th international conference on evaluation and assessment in software engineering, ease '14, london, england, united kingdom, may 13-14, 2014, pages 38:1-38:10.
yang, h., liu, k., and li, w. (2010). adaptive requirement-driven architecture for integrated healthcare systems. journal of computers, 5(2).

journal of software engineering research and development, 2023, 11:7, doi: 10.5753/jserd.2023.2657 this work is licensed under a creative commons attribution 4.0 international license.
education, innovation and software production: the contributions of the reflective practice in a software studio

aline andrade [ pontifícia universidade católica do paraná | alinesf.andrade@gmail.com ]
alessandro maciel schmidt [ pontifícia universidade católica do paraná | alessandromacielschmidt@hotmail.com ]
tania mara dors [ pontifícia universidade católica do paraná | taniadors@ppgia.pucpr.br ]
regina albuquerque [ pontifícia universidade católica do paraná | regina.fabia@pucpr.br ]
fabio binder [ pontifícia universidade católica do paraná | fabio.binder@pucpr.br ]
dilmeire vosgerau [ pontifícia universidade católica do paraná | dilmeire.vosgerau@pucpr.br ]
andreia malucelli [ pontifícia universidade católica do paraná | malu@ppgia.pucpr.br ]
sheila reinehr [ pontifícia universidade católica do paraná | sheila.reinehr@pucpr.br ]

abstract

the growth of the mobile phone market has been generating great demand for professionals qualified in application (app) development. the required profile includes technical skills, also known as hard skills, and behavioral or soft skills. training these professionals at the speed, quantity, and quality demanded by the market poses a significant challenge for educational institutions. apple and pucpr have established a partnership to build a software studio to develop such talents using the challenge based learning (cbl) method and associated practices, whose effects need to be studied. this research aims to analyze the contributions of reflective practice in a software studio to teaching the main professional competencies regarding app development, including hard and soft skills. the research method was the case study, based on semi-structured interviews with 28 participants in three cycles. the collected data were analyzed with open and axial coding from grounded theory, supported by the atlas.ti tool. the results demonstrate that reflective practice, applied in a software studio environment that uses cbl, was able to help students map new ideas and acquire valuable hard and soft skills. the study pointed out that reflective practice is an effective instrument for developing the skills required by the app market, which demands innovation and quality at high speed.

keywords: reflective practice, software studio, challenge based learning, software quality education, app development

1 introduction

the demand for technological products has been growing in recent years, which requires better training of computing professionals, especially for developing applications for mobile devices. among the most requested competencies are technical knowledge (hard skills) and behavioral skills (soft skills), such as teamwork, collaboration, and communication. these abilities are crucial, since information technology (it) professionals tend to be more introspective. although students perceive the development of soft skills as relevant, studies show that there is not the same degree of concern about acquiring these skills compared to more technical skills (lima and porto 2019).

the apple developer academy, or simply academy, is a technological innovation project run in partnership between university environments and apple through a course that offers a complete education to students, allowing them to learn how to code, test, and publish applications based on their ideas. the academy is a software studio (bull et al. 2013) that uses active and collaborative learning methods and tools to contribute to the students' practical learning and skills development.
its staff consists of instructors, both programmers and designers, available at the studio daily. fifty students were selected for the two-year extension course: graduates or students due to graduate within six months. there are designers, developers, and devigners, i.e., students able to work both as designers and as developers (dors et al. 2020).

the academy uses the challenge based learning (cbl) method to support the mobile application development process, which is based on challenges proposed to students. the use of cbl is established by apple as a contractual item. one of the practices associated with cbl is reflective practice. reflective practice is a feature of the software studio supported by formal and informal feedback from teachers to students, which can take the form of mentoring or critiquing to improve the outcome (bull et al. 2013). in this learning environment, students are exposed to social interactions, group work, oral presentations, and discussions of their work with peers (kuhn et al. 2002).

the course instructors use an approach that follows the guidelines established by the partnership and those related to the studio concepts. students receive theoretical content through workshops. throughout the development of the challenges, instructors use coaching and mentoring reflectively with the students, according to the software studio concepts. the instructors encourage the students to reflect and find solutions independently, i.e., the instructors do not give the answers but rather the tools and conditions for the students to develop them.

the presence and importance of reflective practice are recognized in the software engineering educational literature, according to dors et al. (2020) and bull and whittle (2014). it is a form of reflection-based learning that ranges from constant questioning, teamwork, peer review, and collaborative learning to group problem-solving. the concept was initially proposed by donald schön when observing architecture studios. he suggested thinking about professional practice, relating theory to practice, and coined terms such as reflection-in-action, reflection-on-action, and conversation with the material (schön 1983). reflective practice in architecture has proven effective in assisting soft skills development, improving performance, and helping students acquire an artistic talent essential for professional competencies (schön 1983; hazzan 2002). the contributions of such an approach to computer science education are described by bull and whittle (2014) regarding the technical and attitudinal skills of the software engineer; the latter include improved decision-making, teamwork, communication, planning, and time management skills.

with so many positive results in teaching mobile application development through the software studio education approach, an interest arose in deepening the understanding of the results provided by reflective practice, extending the study of dors et al. (2020) to the analysis of a more extensive set of students and projects.
dors et al. (2020) analyzed data from the academy's 2017-2018 student class, conducting a face-to-face ethnographic observation study. the present study analyzed data from the 2019-2020 and 2021-2022 classes, obtained through semi-structured interviews and constituting a longitudinal study. new findings were identified, complementing dors et al. (2020), as can be seen later in this article.

2 background

the different methodological approaches used by teachers throughout history have intrinsically aimed at the same scenario: to enable the learner to act autonomously in diverse professional situations. however, traditional methods that focus only on lectures are not enough to develop the competencies required by today's society. active methodologies, which place the student at the center of the learning process, have been widely used in universities worldwide and, in recent years, also in brazil. they seem to offer better results for the development of the required competencies. regardless of the methodology used, ferraz and belhot (2010) stated that focusing on the content to be covered is not enough to conclude the teaching-learning process efficiently. it is necessary to plan and structure the activities to be developed, the resources available, the methodologies adopted, and the evaluation tools used.

2.1 collaborative learning

collaborative learning is one way to overcome the challenges faced by traditional teaching methodologies. according to barkley et al. (2014), a collaborative approach meets the following criteria: (i) the activity design must be intentional and carefully undertaken by the faculty member, not just limited to assigning some group activity; (ii) all group members must effectively engage in the activity and contribute equally to the outcome; and (iii) meaningful learning, related to the learning objectives of the discipline, must occur. briefly, collaborative learning is about "two or more students working together and sharing the workload equally as they progress toward the intended learning outcomes" (barkley et al. 2014). the process of collaborative learning relates partly to metacognition, which means getting the students to reflect on their learning process. besides choosing the technique, implementing collaborative approaches implies properly defining how to organize the groups, encourage collaboration, and conduct the assessment.

2.2 hard and soft skills

it is common for undergraduate courses to focus on developing technical skills so that the future professional can work in his/her field. these are also known as hard skills. however, they alone are insufficient to form an excellent professional able to meet current market demands. behavioral skills, also called soft skills, are equally or even more relevant in this journey. according to agante (2015), soft skills are non-technical competencies such as communication, creating empathy and trust within groups, and resilience in a work environment. competency is "a set of capabilities (knowledge, skills, attitudes, and values) mobilized for a delivery, which adds value to both the individual and the organization" (fernandes 2013). a survey conducted in july 2021 by the american company careerbuilder with 2,138 managers and human resources professionals pointed out that 77% of interviewees believe that soft skills are essential for the job.
carter, ferzli, and wiebe (2007) state that although communication skills are vital for an effective professional, these skills usually fall short of employers' expectations of recent technology graduates. several universities already recognize the need for computer science students to acquire these skills and incorporate teaching methods that favor their development. studies show the importance of communication in technology because students learn what it means to think like computer scientists and to be professionals in the field (burge et al. 2012).

2.3 software studio

one of the approaches to developing these competencies is the software studio, which comes from the historical tradition of the école des beaux-arts and the bauhaus with its atelier model (dors et al. 2020). according to tomayko (1991), the software studio emphasizes the development of reflective skills and sensibilities, reflective practice being the essence of the atelier concept. collaborative learning in the studio helps students to develop their skills through practice. furthermore, the dynamic interconnection of elements in a studio, such as people, software tools, development methodologies, processes, techniques, and products, provides a network in which software development knowledge and skills are created (prior et al. 2014).

reflection generally occurs in cycles of experience followed by consistent reflection to learn from that experience, during which the developer can explore comparisons, weigh alternatives and diverse perspectives, and generate inferences, especially in new and/or complex situations (dybå et al. 2014). according to schön (1983), reflection-in-action occurs during problem-solving, with doing and thinking as complementary ways. reflection-on-action is about thinking of a different approach to an already executed process. finally, conversation with the material refers to a conversation with the product that has been developed.

reflection-in-action is the reflective form of knowing-in-action: reflecting during the problem-solving process. in the reflection-in-action process, doing and thinking are complementary. knowing-in-action is the knowledge built into and revealed by our performance of everyday action routines (schön 1983). sometimes it is labeled as intuition, instinct, or motor skills. in such cases, one continually controls and modifies one's behavior in response to changing conditions (schön 1987). "this capacity to do the right thing ... exhibiting the more that we know in what we do by the way in which we do it, is what we mean by knowing-in-action. and this capacity to respond to surprise through improvisation on the spot is what we mean by reflection-in-action" (schön 1987).

carbone and sheard (2002) reported the reactions of first-year students to being exposed to a new learning environment that consisted of a differentiated physical space, a new teaching approach, it facilities, and a new assessment method. this space was a workshop whose approach was established in 2000 in the school of management and information systems, bachelor of management and information systems (bims), at monash university (australia). the studio-based teaching and learning approach adopted was based on the bauhaus school of design model.
the bauhaus school introduced a radical change from the traditional art education model, completely reshaping the teaching and learning spaces of the time. the atelier aims to allow students to develop strategies to cooperate and collaborate. the authors concluded that, in general, most first-year students enjoyed learning in the studio environment. an unexpected finding was the evidence of students developing metacognitive skills.

danielewicz-betz and tatsuki (2014) analyzed reflective practice concerning the outcomes of a software workshop in undergraduate and graduate software courses. the analysis focused on the interaction between students and clients to determine how and to what degree students were transformed through collaborative project-based learning. during the final self-reflection, students reported improving their project management, communication, presentation, writing, business, and software development skills. the reflective practice analysis focused on collaborative learning and students' relationships with clients.

prior et al. (2019) described a study based on open-ended interviews and ethnographic observations in studio sessions to understand how this experience impacted students' employability. students observed that the studio experience helped enhance their technical and non-technical employability skills. in addition, from interviews with mentors and academics, the study corroborated the students' views. they concluded that collaboration and communication, project management, mutual support in solving technical problems with help from industry mentors and academics, the social aspects of the work, reflection skills, and technical skills are essential employability skills.

according to marques et al. (2018), adopting reflective practice (reflexive weekly monitoring, rwm) is a way to improve learning for computer science students. the authors followed nine semesters of a project discipline and concluded that the approach effectively improved student coordination, effectiveness, sense of belonging, and satisfaction.

3 research method

the research was conducted in a case study format based on data collection through semi-structured interviews (yin 2017). this method constitutes a research strategy that aims to understand the dynamics of a contemporary context over which the researcher has no control. it is appropriate for answering "how" and "why" questions. the study's main objective was to understand the contributions of reflective practice to technical and non-technical skills development in a software studio. we followed the steps defined by yin (2017): (i) definition and planning; (ii) preparation, data collection, and analysis; (iii) cross-analysis and conclusions.

the research planning involved case selection and preparation of the research protocol. the informed consent form (icf) and the non-disclosure agreement (nda) were prepared and signed by all researchers involved in the project. the project went through analysis by the research ethics committee, receiving its approval (number 4.209.411) on august 12th, 2020.

the underlying question for this study was: how is reflective practice performed in software development studios? the following complementary questions arise from this general question: how is reflective practice carried out in software studio environments? how can reflective practice contribute to the learning of computer science students?
the present study was characterized by a prospective design in a qualitative research format. the first and second data collection rounds occurred between january and may 2021 and referred to the 2019-2020 class. the third collection cycle occurred from december 2021 until january 2022 and referred to the 2021-2022 class. the sample was determined by convenience. according to the inclusion and exclusion criteria, individuals considered eligible to participate in the study answered a semi-structured interview.

the unit of analysis of this project is the apple developer academy (called academy), a software studio constituted in the scope of the partnership between apple and pucpr. candidates go through a selection process that identifies the most appropriate profiles. these students then undergo a two-year training period, exposing them to several challenges. included in the research were academy students over 18 who agreed to collaborate with the study. as selection criteria, we required that the students had participated in active learning using reflective practice and had attended the class workshop immediately before.

the project was divided into three cycles. in the first cycle, ten students participated in the interview. this stage was considered a pilot project and aimed to understand the benefits of reflective practice from the students' point of view. the first author initially developed the semi-structured script, which the other authors later revised. the script contained ten open questions to understand the academy students' perspective on reflective practice, covering usefulness, learning, and future applications outside the academic environment. at the end of this stage, a preliminary data analysis was performed to adjust for the second collection cycle.

for the second cycle, the questions were adjusted, allowing a deeper exploration of items that emerged from the first collection. the second cycle comprised eight interviews that took place remotely and synchronously, each lasting 30 to 60 minutes, as had already happened in the first cycle. no adverse effects were perceived because the interviews took place online. the third and last cycle was conducted with twenty students from the 2021-2022 class. the interviews were undertaken in a synchronous remote way in two phases. the interviews for all cycles were recorded in audio format and later transcribed with the interviewees' permission. the information was mapped and analyzed with the support of the atlas.ti tool.

the analysis of the results used the open and axial coding of grounded theory (strauss and corbin 2007). open coding is a microanalysis of the transcribed interviews, performed line by line, identifying concepts and recording memos (researcher's notes) about the meaning of the codes and categories. axial coding, in turn, refers to grouping codes with shared properties in the form of networks (a schematic sketch of both steps appears at the end of this section). the results present the behavioral and technical skills acquired from applying reflective practice.

the study participants had no direct benefit from the project. however, the research contributed to the planning and development of future actions aimed at improving the skills of technology students in the researched environment. an indirect benefit for the academy students was their reflection on reflective practice, prompted by the researcher's inquiries.
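as referenced above, the following schematic sketch illustrates the two coding steps; the quotations, codes, and categories are invented examples, and the study itself performed this analysis with the atlas.ti tool rather than with code.

```python
# invented example of open and axial coding; only a schematic
# illustration of the grounded theory steps described above.
from collections import defaultdict

# open coding: transcribed interview lines are tagged with concept codes
open_codes = [
    ("[...] i learned to manage my time [...]", ["time management"]),
    ("[...] listening to my colleagues helped [...]", ["active listening"]),
    ("[...] we split the tasks as a team [...]", ["teamwork"]),
]

# axial coding: codes with shared properties are grouped into categories,
# forming the networks drawn in the figures
axial = {
    "time management": "soft skills",
    "active listening": "soft skills",
    "teamwork": "soft skills",
}

networks = defaultdict(list)
for quotation, codes in open_codes:
    for code in codes:
        networks[axial[code]].append((code, quotation))

for category, items in networks.items():
    print(category, "->", sorted(code for code, _ in items))
```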
4 results

this section details the process of analyzing and triangulating the collected data. it is organized into subsections according to the collection cycles.

4.1 first cycle

ten students between 18 and 24 years old were interviewed in the first cycle. of these, five were male and five were female. the interviews followed the script shown in table 1.

table 1. script for the semi-structured interview - cycle 1.

1. have you done activity reflections before joining the apple developer academy?
2. how was your first reaction when you discovered that you would need to reflect on your challenges? why?
3. considering programming, design, and business, what have you learned technically using reflective practice?
4. how do you think reflective practice contributed to developing your behavioral skills?
5. how was the critique carried out? can you explain to me what it was like to receive and give a review of a challenge?
6. after doing some reflection, did you avoid any mistakes in the development of the challenge and, consequently, did you see new attitudes for the following activities? if yes, explain how.
7. how was your last reflection compared to the first one?
8. in the future, do you intend to continue using reflective practice in new projects? can you explain how you intend to use it?
9. finally, could you tell me the most significant benefits of reflective practice?

interviewees answered about the process of reflective practice based on a semi-structured script with more comprehensive questions that would allow for more in-depth results. most students stated that they had conducted reflections before participating in the academy. when asked about their reactions to finding out that reflection would be mandatory, most students used positive words such as "cool" and "productive". only three participants used negative expressions, such as "confusing", "complicated", or "boring".

from the open coding analysis, we obtained the axial network that reflects soft skills learning, as presented in figure 1. this category comprises the following subcategories: leadership, communication, and teamwork. teamwork was very evident during this study. students could develop technical and behavioral skills during team interactions, such as collaborating with colleagues, managing time, and improving communication. examples of these statements can be seen later in this text. regarding teamwork, we found evidence of learning about conflict management. this emerged from statements concerning arguments and stress between colleagues on the same team. through reflection, these students were able to manage these disagreements better.

figure 1. soft skills (behavioral competencies) - cycle 1.

figure 2 presents the results of the technical knowledge obtained through reflective practice. the most cited terms were technical learning, writing learning, and project management. we also observed that many students could identify the depth of their knowledge about programming after doing the reflections.

figure 2. hard skills (technical competencies) - cycle 1.

the technical competence considered most relevant by the interviewees was project management. usually, students have no previous practical experience with project management except for graduation work.
the studio provides them with practical experience throughout the course. they perceived this theme as involving planning skills, standardization of activities, error analysis, problem identification, and time management. therefore, the results concerning personal development obtained from reflective practice were: improvement in the students' technical performance, self-knowledge, and learning evolution in hard and soft skills, which led to a continuous improvement process for the students. students realized that, through reflection, self-knowledge could be developed, which brings increased self-confidence. the evolution of learning, whether in the development of soft or hard skills, is also perceived as a cause of personal growth that makes one learn to deal better with feelings.

during the interviews, many students stated that they reflected in search of improvements in their technical and behavioral results. with this, they started to analyze themselves more critically, finding mistakes and successes, to handle the following activities differently and obtain future learning and improved performance. self-knowledge was highly mentioned among the interviewees. through the practice of reflection, students could know themselves, understand their preferences, and learn what made them more confident in performing the academy activities. table 2 presents some interview quotations regarding the soft skills found in this first cycle. an "s" followed by a number replaces the student's identity (e.g., s1, s2).

table 2. soft skills quotations - cycle 1.

self-knowledge: "[...] self-knowledge for sure, more patience to understand my process, understand my time, decrease anxiety and stress [...]" (s2); "[...] i believe the greatest benefit that reflective practice has brought to me is the self-knowledge [...]" (s6)
planning: "[...] it is assuredly very favorable to prepare yourself for the next challenge better [...]" (s8)
self-confidence: "[...] fell me more confident regards the skills that i possessed." (s1)
time management: "[...] learn how to manage time [...]" (s8)
communication: "[...] and another thing was a personal relationship. i believe you think about what interests you, what you talk about, the way you talk. i learned better this relationship with other people [...]" (s8); "[...] learn how to communicate verbally better [...]" (s2)
learning evolution: "[...] i think a very cool thing was seeing the evolution over time [...]" (s1); "[...] not making the same mistake twice [...]" (s2)
personal development improvement: "[...] creating a concept, doing the more artistic part, doing the development part, doing the presentation later, so you interact with various aspects. and i think through reflective practice i was able to understand better how my development process was [...]" (s2)
technical development improvement: "[...] think, learn to communicate better verbally, [...], not making the same mistake twice, improve what you do, like something that worked [...]" (s5)
critical thinking: "[...] to make that change in behavior which is in line with the critical analysis [...]" (s3)

the encodings were performed by one of the authors and reviewed by the others. these can be grouped in the category that donald schön (1983) defines as reflection-on-action, the reflection after the action is performed.

4.2 second cycle

in the second cycle, eight more students were interviewed. the basis for these interviews was the revised script shown in table 3.

table 3. script for the semi-structured interview - cycle 2.
1. have you ever done any reflection before joining the apple developer academy?
2. what was your first reaction when discovering that you had to reflect on your challenges? why was that?
3. in the day-to-day life of the academy, did you notice yourself reflecting during the execution of your activities? how was it?
4. what were the moments when you needed to record the reflections?
5. what was the difference between reflecting on the short-term and the long-term challenge?
6. what are the differences between doing reflections with guidelines for their execution and those with a free format? what are the benefits of both types of reflections?
7. what are the differences between learning when working as a team and individually?
8. how were the reflections during the planning of the challenges? what about during development? and in the delivery of the final product?
9. there were sessions with the instructor to reflect on progress during the development of challenges. how were these moments? what was it like to receive feedback from the instructor?
10. what skills have you developed to perform presentations in public sessions?
11. what kind of learning or skill was developed or required to perform the division of tasks in teams?
12. what kind of challenge reflections, learnings, or experiences were helpful in another challenge?
13. what benefits have you brought to your professional life using reflective practice?
14. in addition to encouraging individual reflection, the academy has specific moments for students to reflect critically on other students'/teams' work. what was it like to make and receive criticism about the cbl and design in the review sessions?
15. how have the reviews contributed to your personal development or the development of the challenge/final product?
16. were you able to avoid any mistakes in any challenge after reflecting and, consequently, to see new attitudes for the following activities? if so, explain how.
17. how was your last reflection compared to the first one?
18. do you continue to use reflective practice in your projects? how did that become part of your day-to-day life?
19. finally, could you tell me the most significant benefits the reflective practice has brought you?

with the changes implemented, nine additional results were found and added to the networks obtained in cycle 1. the analysis allowed us to divide them into three technical skills, three behavioral skills, and three classified as personal benefits of reflective practice. the ability to speak english was the only property of the technical skills that differed from cycle 1. the other changes were related to project management; the following were cited as relevant points: clarity in the execution of project activities and division of tasks, as shown in figure 3. as can be seen, the network was refined due to a better understanding of the scenario.

figure 3. hard skills (technical competencies) - cycle 2.

figure 4 presents the soft or behavioral skills mapped in cycle 2. a notable factor is knowing how to listen to colleagues, since listening to different opinions is fundamental for good communication. all interviewees mentioned this skill. in addition, this ability is in great demand in the professional market and is essential in personal development through feedback.
the students identified that improving the organization of presentations results in behavioral skills development, especially of those skills that facilitate communication. this was a skill developed because they had to make several presentations of their projects throughout the course.

figure 4. soft skills (behavioral competencies) - cycle 2.

the reflective practice significantly impacted the students' personal evolution and behavior improvement. the active methodology also helped the students to evolve their learning through reading their classmates' reflections. as can be seen, the networks evolved to incorporate the findings from the analysis of the cycle 2 interviews regarding the hard and soft skills developed from reflective practice. table 4 presents the quotations extracted from the interviewees' speeches and represents the most frequently mentioned skills in the cycle 2 interviews. knowing how to listen to one's colleagues was highly cited among the interviewees in this second cycle. through reflection, students could better understand their colleagues and the importance of listening to them.

table 4. hard skills quotations - cycle 2.

active listening: "[...] it is a lot about listening to other people you know, and understanding their thinking in a good way..." (s13); "[...] you learn to listen to what people want..." (s15); "[...] what i learned most from it was to listen to others before doing" (s17)
presentation organization: "i learned to organize for the presentations right, and go-getting the hang of getting better [...]" (s17)
behavior improvement: "i learned personally because we are dealing with people. so, i learned how to express myself or the form of my posture [...] with people [...]" (s11)
technical performance improvement: "i was able to improve my results, so, so for me, it's one of the main things, to improve results effectively, quickly [...]" (s14)
learning by reading colleagues' reflections: "[...] for me, learning from the mistake of others is also valid. i read the reflections of people who seemed interested in learning from them, so sometimes someone wrote and seemed dissatisfied, or someone who seemed very satisfied. i liked to read these reflections [...]" (s12)
ideas sharing: "[...] i knew then to present my thoughts [...]" (s12); "[...] i was able to expose better what i was thinking." (s15); "[...] i think the skill that i learned the most for the presentation, like this, was losing the shame, you know. lose the fear of presenting or speaking my ideas." (s13)
english speaking: "[...] i developed a little bit in how to present in english, keep learning, and lose the fear right." (s18)

another interesting issue that emerged when analyzing the interviews was that it is also relevant to learn from the mistakes and successes of colleagues, which was possible by reading the reflections they wrote. the ability to organize also appears in the reflections about improving oral presentations: the students learned self-organization to show their work to colleagues and realized this throughout the reflection process. it is also interesting to note that, in the students' perception, by reflecting on and analyzing the situations in which they were involved with their colleagues, they learned to deal with people and to adopt a more appropriate posture towards the team.
4.3 third cycle

the third cycle refers to 10 students participating in the 2021-2022 academy class. this cycle aimed to identify the contributions of each reflective practice concept, such as reflection-in-action, reflection-on-action, conversation with the material, and knowing-in-action. we expanded the set of quotations and subsequently refined the second-cycle networks to capture the interrelationship of codes in the third cycle, so new codes emerged from the interpretation.

in this collection cycle, ten students answered questions from a semi-structured script about their studio showcase experience from the perspective of reflective practice contributions (see table 5). the showcase is a studio session where students present their projects to other studio students; the best project receives an award. the students were also interviewed in this cycle to investigate the conversation with the material.

table 5. script for the semi-structured interview - cycle 3.

1. please comment on your experience with reflections on challenges.
2. have you ever been introduced to reflection as a teaching methodology throughout your education?
3. what aspects of your formal education would have been different if you had used reflective practice?
4. how do you believe that reflections with colleagues help create more interesting projects? has this ever happened to you?
5. what techniques would you use in case of imminent team conflict?
6. how did the reflections affect the relationship between the team members?
7. what technical skills did you develop?
8. what interpersonal skills do you think you developed throughout the studio activities?
9. how will the materials produced, such as projects, codes, slides, presentations, and so on, influence the development of future materials?
10. what is your process of revisiting the materials produced at the academy like?
11. have you changed the materials produced due to reflections between challenges?
12. how have the team reflections helped to develop your creativity?
13. what are the main lessons or learnings from the studio?

table 6 shows some of the quotations extracted from the interviews in cycle 3.

table 6. quotations - cycle 3.

decision making: "[...] i believe would have... take more assertive decisions [...] i had wasted less time because i would have had to stop to really focus on my ability to think about what's going on [...]" (s19); "[...] i think it brought me these insights into what i should do from now on, [...]" (s20)
conflict management: "[...] but basically, it was through conversations that we solved these problems [...] the third person, who was the one who brought the conflict, admitted that they could have brought the situation in another way" (s24); "then through the reflections, i realized that in most cases, this was not a good alternative, and i started opting for these conversations 100% of the times [...]" (s20)
reflection on mistakes and successes: "[...] i think they kind of force you to look at everything you've done, look at all you did, and analyze what you did right or wrong. so, analyzing these practices, you can think on what keep doing, or what behaviors should i stop doing [...]" (s22)
active listening: "[...] learning to give and receive feedback, to stop and hear feedback, was something i already was working on before, but the academy gave me interesting ways to develop this [...]" (s19)
collaboration: "[...] i think collaboration is a major one, work as a team, [...]" (s21)
communication: "[...] i learned a lot about how to communicate myself, [...]" (s22)
self-confidence: "[...] also works for me to have confidence and believe in my own potential, [...]" (s21)
motivation: "[...] it influenced me a lot to keep myself motivated [...]" (s19)
upcoming artifacts production: "[...] i think it influences a lot; i think that everything is a reference [...]" (s25)
professional impacts: "this helps a lot in the development of the personal portfolio." (s22); "[...] when you need to reference these projects in some professional opportunity." (s26)
academic impacts: "[...] like a portfolio, this is where you expose your projects, but more than that, you delve into a retrospective of how the project was carried out (why and how was the project developing). and this helps you not only document all the process [...]" (s29); "[...] something for the future, something i would apply to a future project or team." (s24); "the biggest influence is on future projects, where i can use what was written and learned during the projects i participated in." (s20)
self-knowledge: "[...] to see better this way, how we are doing in a specific subject, to dedicate ourselves more or better understand what we like, and what we do not want to work on." (s22)
synthesis skill: "[...] the skills i developed in the construction of this activity is self-knowledge and synthesis. [...]" (s29)
priority management: "i started to apply this model of reflection [...] not only in my work but also in my study things. then i realized that i had a much clearer vision of what i had to do or the alternatives i would prioritize in the steps forward." (s20)

from the analysis of the interviews, it was possible to observe that reflection-in-action promotes behavioral improvement and the student's continuous development. throughout the challenge development, students had to manage situations involving divergence of ideas, conflicts of leadership, and the teams' expectations, improving their conflict management skills. students said this usually occurs at the beginning of the challenge, while doing the project design. conflict resolution comes through conversations and, sometimes, through a voting strategy. the students stated that, based on reflection-in-action, they realized they could make more assertive decisions regarding the project and their behaviors. it also develops leadership, which contributes to engagement and teamwork.

students pointed out reflection-on-action as a powerful tool to better understand their motivations, interests, and capabilities, contributing to their self-knowledge development. students wrote a self-reflection at the end of each challenge to stimulate their reflection, promoting reflection-on-action. reflection-on-action promotes self-knowledge, as well as self-reflection and creativity. these self-reflections promote work process comprehension, professional development monitoring (the ability to keep track of what one is learning), reflections on mistakes and successes (which helped the students not to repeat the same mistakes and to emulate behaviors that had positive results in the past), and creativity. the reflective practice supported by studio sessions encourages multiple views on a given theme, which promotes creativity. interaction with creative colleagues also contributes to developing creativity.
sharing ideas throughout the studio sessions stimulates teams to be open to different ideas. in addition, creativity is responsible for the creation of innovative projects. figure 5 shows these reflection-on-action findings.

figure 5. reflection-on-action cycle 3.

the third collection cycle's purpose was to explore the conversation with the material and knowing-in-action, the other reflective practice concepts not covered in cycle 2. from knowing-in-action, we could observe hard and soft skills development, as shown in figure 6 and figure 7. concerning hard skills, as shown in figure 6, knowing-in-action promotes the development of project management, project development, and presentation organization skills. regarding project development, the development of design skills was noticed; some students had never made a mobile app design before the academy course. in addition, they started to learn about code reuse-oriented development, software architecture, database development, app prototyping with tools such as miro, figma, or the adobe package, algorithmic logic and programming, object-oriented programming, and programming in swift/swiftui.

figure 6. knowing-in-action – hard skills cycle 3.

figure 7 shows the knowing-in-action findings regarding soft skills: students actively listen to colleagues' ideas, communicate, collaborate, and exercise self-criticism.

figure 7. knowing-in-action – soft skills cycle 3.

in this cycle, students reported soft skills that had not appeared in the previous cycle: responsibility and autonomy, storytelling, and communicating feedback. there was an evolution in their commitment to the project regarding deadlines and the execution of assigned tasks, improving responsibility and autonomy skills. another soft skill developed by the students was storytelling: students developed the art of telling stories while preparing app designs and project presentations. they had to create an appealing backstory for the app and even engage colleagues in their presentations. the academy's activities are collaborative by nature, and the other new skills the students presented relate to this specific characteristic. through the development of collaborative tasks, students showed considerable improvement in communicating feedback to their colleagues, solving issues, and maintaining a respectful work environment. as shown in figure 8, the showcase stimulates studio students to exercise conversation with the materials. as a result, the showcase experience positively influenced upcoming artifacts production and promoted priority management, synthesis skill, motivation, self-confidence, learning by shared experience, and self-knowledge, which positively influenced professional and academic contexts.

figure 8. conversation with the material cycle 3.

since the students were supposed to create brief presentations about complex projects, they had to develop priority management and an understanding of the presentations' purpose and structure, not only prioritizing the most relevant items but also using synthesis skills to tell a story as effectively as possible. enrolling in such activities helped students to boost their motivation and self-confidence. sharing experiences in the showcase helps students learn from colleagues' experiences, which positively influences professional and academic contexts.
written self-reflection supports reflective practice in the studio and can be carried out in two ways: in a free format or with guiding questions. students perceive the development of these skills through the exercise of their metacognition. metacognition is "being aware of and able to monitor the development of one's own learning and the application of that learning to their practice" (parsons and stephenson, 2005). the difficulties in applying reflective practice found in the analysis are related to the students' lack of experience and to physical fatigue. the latter arises because students tend to work hard to complete the project on time; consequently, they get tired after the delivery, and writing the reflections becomes difficult.

5 discussion

this study aimed to identify the contributions of reflective practice in a software studio, analyzing its benefits for mobile application development and for the acquisition of professional skills demanded by the market. the cbl active methodology is reflection-based learning that builds on the students' relationship with their experiences. it was identified that students reflect to find new results and knowledge, so the practice aims to improve the students' abilities for the following activities. the results showed that reflective practice positively affects software development and the acquisition of professional skills. the study highlighted that collaborative learning helps students develop their own skills through practice and that groups interested in other teams' work acquire new knowledge and skills. in addition, among the main contributions of reflective practice is the development of skills such as teamwork, collaboration, communication, time management, planning, problem identification, decision-making, and self-knowledge. these contributions are crucial for a computer science practitioner to succeed.

5.1 related work

compared to other studies in the literature, it was possible to notice that conflict management and time management are essential skills in executing a project. these points were also cited in the work of dors et al. (2020) and were confirmed in the present study; our findings confirm and extend their results. the authors' main results were that reflective practice promotes the emergence of new ideas and contributes to the practice and development of skills such as collaboration, oral and written communication, commitment, interpersonal relationships, adaptability, flexibility, and teamwork. it also develops problem-solving, decision-making, planning, project management, time management, scope management, outsourcing development management, and new technical skills. in addition, reflective practice emphasizes hands-on learning, supports the development of technical skills, and appears to be an authentic environment connecting academic disciplines and real-world experiences, where students can practice and learn by doing, preparing them for the real world. the interviewees in this study highlighted the importance of behavioral competencies for their development. in this study, students emphasized the relevance of self-knowledge, knowing how to listen, awakening behavioral improvement, confidence, and communication as benefits of reflective practice.
the second cycle highlighted communication as a fundamental part of personal development and teamwork (carter, ferzli, and wiebe, 2007). in the present study, it was observed that students recognize that public speaking and listening skills are equally relevant to personal and professional life. the analysis of the results showed that students perceived that decision-making improves with each new reflection made, since pondering one's virtues and weaknesses is critical to improving the timing and quality of personal deliberations. this is consistent with what was obtained by dors et al. (2020). this research made it possible to observe the students' acceptance of and motivation to use reflective practice. these findings differ from the study by prior et al. (2016), who presented the results of action research conducted at the university of technology sydney with three software development studios. the main challenges in terms of motivation that the authors encountered were (i) time pressures that made it difficult to record the journals and (ii) difficulties in making the journal entries: some students only did the reflections when reminded, and others refused. they also identified the following patterns among students: (i) refusers (do not write reflections), (ii) recounters (have difficulties in doing reflection), and (iii) instinctive reflectors (able to reflect naturally). one intervention that proved to be effective was the 10-minute reflection session, which consisted of having students express how their written reflections were going, any problems, and a stimulus question, which was answered during the session. in the study reported in this article, this difficulty was not observed because the students felt motivated by the reflective practice. in this research, it was also possible to observe improvement in technical performance; that is, reflective practice positively impacted student performance. this finding is consistent with the research conducted by nylén et al. (2017), in which the authors studied how students approach the reflective practice task. two categories were identified in students' recording of critical incidents: progress and expansion. progress refers to progress reporting, divided into "what i am doing", status reports, and daily-type categories; these subcategories grow in sophistication, respectively. expansion indicates students' reports on learning items and reflections on those items; this category is divided into keywords (how-to, and knowledge about generic, personal, and theoretical language). the authors concluded that journal recording induces reflection on learning and positively affects students' awareness of their professional knowledge. on the other hand, their students found it challenging to identify learning. this did not occur in the present study, as students could clearly identify their growth through the application of reflective practice.

5.2 threats to validity

according to yin (2017), research developed using the case study method can be evaluated under four criteria: (i) construct validity; (ii) internal validity; (iii) external validity; and (iv) reliability. a threat to the construct validity of this study was the use of narratives to identify the acquisition of professional competencies through reflective practice. it was recognized that the narrative approach is compatible with the need to assess the complexity of organizations.
however, it was necessary to rely on the interviewees' memories to understand the benefits of this practice both for professional competencies and for application and software development in general. another limitation of the narrative approach is that people often rationalize the facts while telling the story. the construction of meaning from the facts causes individuals to interpret past events, try to find explanations for what happened, and perhaps confuse what occurred. to lessen this threat, students were asked during the interviews to recount specific moments to confirm their interpretation of the facts concerning their reflection. similarly, the validation performed with academy students also helped confirm the perception of their interpretations of the sequence of occurrences, understanding the benefits and uses of reflective practice. regarding external validity, since this study is about a specific environment in specific circumstances, the ability to generalize is limited. it is possible that similar results could be obtained when studying teaching and innovation environments that operate under similar conditions, for example, using the software studio concept and reflective practice.

5.3 implications to software engineering education, industry, and academia

from the point of view of software engineering education, we were able to observe that the cbl method, and especially reflective practice, is an effective means of developing technical and non-technical skills. providing students with challenges that provoke them to go further, search for answers, create, and reflect can lead to valuable knowledge. this can inspire educators worldwide to rethink their educational practices in the classroom. for industry, our study reveals that cbl and reflective practice lead to the development of highly demanded soft skills such as communication, conflict resolution, autonomy, and responsibility, among others. these methods can be used to develop such skills in the academic environment or in professional education. from the academic perspective, we presented a model that describes our findings, expressed in the form of the relationship networks shown in the previous sections.

6 conclusion

technology changes continuously, so computer professionals must increasingly deal with new methods, tools, platforms, user expectations, and software markets. thus, more advanced education is needed to prepare these professionals for the coming decades and new demands. in this respect, reflective practice has proven to be an effective method to help develop hard and soft skills, improving performance and assisting students in acquiring talents essential for professional competencies. the analyses performed in this study refer to the students who participated in the software studio in 2019-2020 and 2021-2022. part of the activities of these students occurred in a face-to-face manner, and part happened remotely; this may have caused some effects unknown to the researchers. a further study with the 2021-2022 cohort is already being initiated. it may reveal whether, for students who entered in a fully remote manner, the results differ from those obtained in the present study.

acknowledgments

we thank the apple developer academy for allowing us to conduct the study, the students who participated in the interviews for making their time available, and cnpq for providing the grant for this study.
references

agante, l. (2015). a importância das soft skills na vida profissional. dinheiro vivo. available at: https://www.dinheirovivo.pt/gestao-rh/a-importancia-das-soft-skills-na-vida-profissional-12665712.html. accessed 4 july 2021.
barkley, e. f.; major, c. h.; cross, k. p. (2014). collaborative learning techniques – a handbook for college faculty. 2nd ed. san francisco: jossey-bass – a wiley brand, 417 p.
burge, j. e.; gannod, g. c.; anderson, p. v.; rosine, k.; vouk, m. a.; carter, m. (2012). characterizing communication instruction in computer science and engineering programs: methods and applications. in: frontiers in education conference proceedings, pp. 1-6. doi: 10.1109/fie.2012.6462496.
bull, c.; whittle, j. (2014). supporting reflective practice in software engineering education through a studio-based approach. ieee software, v. 31, n. 4, pp. 44-50.
bull, c. n.; whittle, j.; cruickshank, l. (2013). studios in software engineering education: towards an evaluable model. in: international conference on software engineering (icse '13), pp. 1063-1072.
carbone, a.; sheard, j. (2002). a studio-based teaching and learning model in it. in: proceedings of the 7th annual conference on innovation and technology in computer science education (iticse '02), v. 34, n. 4, pp. 213-217.
carter, m.; ferzli, m.; wiebe, e. n. (2007). writing to learn by learning to write in the disciplines. journal of business and technical communication, v. 21, n. 3, pp. 278-302.
danielewicz-betz, a.; kawaguchi, t. (2014). gaining hands-on experience via collaborative learning: interactive computer science courses. in: 2014 international conference on interactive collaborative learning (icl), ieee, pp. 403-409.
dors, t. m.; van amstel, f. m. c.; binder, f.; reinehr, s.; malucelli, a. (2020). reflective practice in software development studios: findings from an ethnographic study. in: 2020 ieee 32nd conference on software engineering education and training (csee&t).
dybå, t.; maiden, n.; glass, r. (2014). the reflective software engineer: reflective practice. ieee software, v. 31, n. 4, pp. 32-36.
fernandes, b. h. r. (2013). gestão estratégica de pessoas com foco em competência. rio de janeiro: elsevier.
ferraz, a. p.; belhot, r. v. (2010). taxonomia de bloom: revisão teórica e apresentação das adequações do instrumento para definição de objetivos instrucionais. gestão da produção, v. 17, n. 2, pp. 421-431.
hazzan, o. (2002). the reflective practitioner perspective in software engineering education. journal of systems and software, v. 63, n. 3, pp. 161-171.
lima, t.; porto, j. b. (2019). análise de soft skills na visão de profissionais da engenharia de software. in: workshop sobre aspectos sociais, humanos e econômicos de software (washes), 4., belém. porto alegre: sociedade brasileira de computação, pp. 31-40. doi: 10.5753/washes.2019.6407.
marques, m.; ochoa, s. f.; bastarrica, m. c.; gutierrez, f. (2018). enhancing the student learning experience in software engineering project courses. ieee transactions on education, v. 61, n. 1, pp. 63-73.
nylén, a.; isomöttönen, v. (2017). exploring the critical incident technique to encourage reflection during project-based learning. in: proceedings of koli calling 2017, koli, finland, november 16-19, 10 pages.
parsons, m.; stephenson, m. (2005). developing reflective practice in student teachers: collaboration and critical partnerships. teachers and teaching: theory and practice, v. 11, n. 1, pp. 95-116.
prior, j.; connor, a.; leaney, j. (2014). things coming together: learning experiences in a software studio. in: proceedings of the 2014 conference on innovation & technology in computer science education, pp. 129-134.
prior, j.; suman, l.; leaney, j. (2019). what is the effect of a software studio experience on a student's employability? in: proceedings of the 21st australasian computing education conference (ace '19), acm, sydney, nsw, australia, pp. 28-36.
prior, j.; ferguson, s.; leaney, j. (2016). reflection is hard: teaching and learning reflective practice in a software studio. in: acsw '16: proceedings of the australasian computer science week multiconference. doi: 10.1145/2843043.2843346.
schön, d. a. (1983). the reflective practitioner: how professionals think in action. new york: basic books, 374 p.
schön, d. a. (1987). teaching artistry through reflection-in-action. in: educating the reflective practitioner: toward a new design for teaching and learning in the professions. 1st ed. san francisco, ca, us: jossey-bass.
strauss, a.; corbin, j. (2007). basics of qualitative research: techniques and procedures for developing grounded theory. 3rd ed. london: sage publications.
tomayko, j. e. (1991). teaching software development in a studio environment. in: proceedings of the twenty-second sigcse technical symposium on computer science education (sigcse '91), v. 23, n. 1, pp. 300-302.
kuhn, s.; hazzan, o.; tomayko, j. e.; corson, b. (2002). the software studio in software engineering education. in: 15th conference on software engineering education and training (csee&t 2002), proceedings, kentucky, usa, pp. 236-238.
yin, r. (2017). case study research: design and methods (applied social research methods), 6th ed. los angeles: sage publications.

journal of software engineering research and development, 2020, 8:1, doi: 10.5753/jserd.2019.457 this work is licensed under a creative commons attribution 4.0 international license.

supporting a hybrid composition of microservices: the eucaliptool platform

pedro valderas [ pros research center – universitat politècnica de valència, spain | pvalderas@pros.upv.es ]
victoria torres [ pros research center – universitat politècnica de valència, spain | vtorres@pros.upv.es ]
vicente pelechano [ pros research center – universitat politècnica de valència, spain | pele@pros.upv.es ]

abstract

to provide complex and elaborated functionalities, microservices may cooperate with each other either by following a centralized (orchestration) or a decentralized (choreography) approach. it seems that the decentralized nature of microservices makes the choreography approach more appropriate to achieve such cooperation, where lighter solutions based on events and message queues are used.
however, orchestration through the usage of a process model facilitates the analysis of the composition when it is modified. to benefit from the advantages of both approaches, this paper presents a hybrid solution based on the choreography of business process pieces that are obtained from a previously defined description of the complete microservice composition. to support this solution, the eucaliptool platform is presented.

keywords: microservice, composition, choreography, orchestration

1 introduction

companies such as amazon, airbnb, twitter, netflix, apple, uber, and many others have shifted towards a microservices architecture intending to be more agile in doing their business. the technology and functionality independence acquired when applying this architecture allows companies to replace, scale, and upgrade their applications easily and very fast (newman, 2015; bucchiarone et al., 2018; shadija et al., 2017). however, to provide their customers with valuable services, developer teams are forced to build microservice compositions due to the small granularity level at which these operate (dragoni et al., 2017). many organizations define such compositions programmatically, in an ad-hoc way. the major problem when creating compositions in this way is that their complexity grows, making their visualization, understanding, and maintenance more difficult. this complexity has forced many companies to build their own solutions to compose microservices. among these solutions, we find zeebe (the evolution of the camunda project to orchestrate microservices), netflix conductor, ing baker, and uber cadence. apart from zeebe, these solutions have been developed by non-software companies to deal with the growing number of microservices handled by each company to develop its business. in general, to achieve microservice compositions we can find two major approaches: choreography and orchestration. as a motivating example, let us consider a process designed to place orders in a webshop, which is supported by four microservices: customers, payment, inventory, and shipment. the sequence of steps to process an order is the following:
1. a customer places an order in the webshop.
2. the customers microservice checks customer data and logs the request.
3. if the customer is accepted, the payment microservice starts to collect the money. if required, the customer can be asked for payment details. in any case, the customer must be informed.
4. as soon as the payment is performed, the inventory microservice starts to fetch the ordered items. if some problem occurs, the customer is informed and the order is canceled.
5. finally, once the items are fetched correctly, the shipping microservice creates a shipment order and assigns a driver.
when following the choreography approach (dragoni et al., 2017; butzin et al., 2016), the logic of the composition is distributed through the microservices, which communicate with each other through an event bus (usually supported by a message queue). thus, once the client places an order in the webshop (see fig. 1), an "order created" event is issued in the queue. the customers microservice, which is listening to this event, reacts by performing its assigned tasks, and a "customer accepted" event is triggered when the customer data is ok. then, the payment microservice, which is listening to this event, performs its tasks and generates the event that makes the next microservice in the composition perform the next tasks, and so on.
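to make this concrete, the following is a minimal sketch of how one participant of this choreography could be implemented with spring amqp on top of rabbitmq (the broker we use in the evaluation of section 7); the class, queue, and method names are illustrative assumptions, not part of eucaliptool:

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

// hypothetical handler inside the customers microservice: it reacts to the
// "order created" event and emits "customer accepted" for the next participant.
@Component
public class CustomersEventHandler {

    private final RabbitTemplate rabbitTemplate;

    public CustomersEventHandler(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // react to the event issued when the client places an order
    @RabbitListener(queues = "order.created")
    public void onOrderCreated(String orderId) {
        checkCustomerData(orderId);  // task assigned to this microservice
        logRequest(orderId);
        // emit the event the payment microservice is listening to
        rabbitTemplate.convertAndSend("customer.accepted", orderId);
    }

    private void checkCustomerData(String orderId) { /* domain logic */ }
    private void logRequest(String orderId) { /* domain logic */ }
}

note how the composition logic stays implicit in this style: each microservice only knows the events it consumes and the events it emits.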
let us now suppose that our company wants to provide special treatment to its vip customers, so that they can proceed with the payment at the end of the process. to maintain these low-coupled microservices, this small change would imply the introduction of several changes in different microservices: the customers microservice should generate a different event depending on the type of customer to allow the participation of either the payment microservice (regular customers) or the inventory microservice (vip customers); in the same way, the shipment microservice should generate a different event to proceed with either the payment or the delivery of the order; and the payment microservice should also be modified to allow delivering the order in the case of vip customers. note how a single change requires the modification of several microservices. the major problem with this approach is that there is no clear picture of how microservices participate in the process, since the composition is hard-coded and distributed along multiple microservices. therefore, when engineering decisions need to be taken, it is difficult to analyze the composition's flow.

figure 1. microservice collaboration through choreography.

on the other hand, when building compositions with the orchestration approach (singhal et al., 2019; hamidehkhan, 2019), the logic of the microservice composition is centralized in an orchestrator microservice. one possible solution for this approach is to define compositions as bpmn models and endow the orchestrator microservice with a bpmn engine that is in charge of executing them. the bpmn representation of the motivating example presented above is shown in fig. 2.

figure 2. bpmn representation of the motivating example.

in this case (see fig. 3), a client asks the orchestrator microservice to process an order, and this microservice executes the bpmn model that describes the microservice composition that manages customer orders. according to the logic of this composition, the first step the orchestrator takes is asking the customers microservice to check the customer data; it then waits for a response from this microservice. once the response from the customers microservice is received, the orchestrator asks the payment microservice to collect the money and waits for a response. and so on. with this approach, the logic of the microservice composition is centralized in the orchestrator microservice. if we want to change the composition to support vip customers, we just need to update the bpmn model accordingly. however, all microservices depend on the orchestrator, reducing the degree of decoupling among them. also, there are some misconceptions within the microservice community that can make the adoption of this solution difficult: (1) many times, the task of process modeling is considered an overhead for a software project; and (2) bpm tools are considered to be heavyweight and to take weeks to set up.

figure 3. orchestration to support microservice collaboration.

in this paper, we face the challenge of defining a hybrid solution to compose microservices that combines the benefits of both approaches. this solution is based on the following:
1. developers describe the complete microservice composition by means of a centralized model.
this allows having the big picture of the composition, which facilitates later maintenance and analysis tasks.
2. the centralized model of the composition is split into different pieces whose execution responsibility is delegated to the different participating microservices. each microservice is in charge of executing its piece and informing the other microservices about its execution. to do so, an event-based choreography is proposed, which provides a higher degree of decoupling among microservices than orchestration solutions do.
to support this solution, we present the eucaliptool platform, which includes the following:
1. an authoring tool to define microservice compositions through a domain-specific modeling language (dsml) that facilitates the modeling activity. this tool has been developed to alleviate the misconceptions of using a process model for composing microservices: developers can design the whole composition using constructors that are easier to use than business modeling elements. this tool also supports the transformation of descriptions based on our dsml into executable bpmn specifications, and their splitting into pieces.
2. a microservice architecture that facilitates both the deployment of each bpmn piece into the corresponding microservice and the distributed execution of the microservice compositions through an event-based choreography. it also supports the maintenance and evolution of the microservice composition.
the remainder of the paper is structured as follows. section 2 outlines the hybrid solution proposed in this work to achieve microservice compositions. section 3 presents the architecture designed to support this solution. section 4 presents the authoring tool proposed to model microservice compositions. section 5 explains how a microservice composition is transformed into bpmn and split into pieces to be deployed in the proposed microservice architecture. section 6 analyzes how the evolution of microservice compositions is supported. section 7 introduces the related work. finally, section 8 concludes the paper and provides insights into directions for future work.

2 a hybrid approach to compose microservices

in this section, we present a hybrid approach to achieve microservice compositions. the stages proposed in this approach are the following:
1. developers define a centralized description of the complete microservice composition.
2. the centralized description is split into bpmn pieces, and these pieces are distributed among the microservices.
3. the microservice composition is executed through an event-based choreography of bpmn pieces.
to illustrate the proposed approach, we make use of the motivating example. first, developers start by defining the microservice composition in a centralized model. in the case of the motivating example, developers should create a composition like the one shown in fig. 2. note that this microservice composition is defined with bpmn; however, we propose a dsml to facilitate this modeling activity, which is presented in section 4. once developers have described the complete microservice composition, the bpmn model is split into pieces whose execution responsibility is delegated to the different participating microservices. as fig. 4 shows, the bpmn model of the motivating example is split into four pieces that must be executed by the different microservices.
figure 4. microservice orchestration split into different fragments.

an event-based choreography of bpmn pieces is proposed to support the execution of a microservice composition. in this sense, each microservice is in charge of executing its piece and informing the others about it. following the motivating example, once the client places an order in the webshop (see fig. 5), an "order process" event is issued in the message broker. the customers microservice, which is listening to this event, reacts by executing its associated bpmn piece, and the "piece1_completed" event is triggered if the customer data is ok. then, the payment microservice, which is listening to this event, performs its bpmn piece and generates the event that makes the next microservice in the composition execute the next piece. and so on. note that current business process management (bpm) tools provide little support to create a business process model and split it into pieces that can be deployed into different microservices. there is also little help to implement the communication mechanisms that are required to coordinate the execution of the different pieces to complete a process. in addition, note that we propose to keep two versions of the composition: on the one hand, the model of the whole microservice composition; on the other hand, a split version that is distributed along the microservices. thus, when the microservice composition needs to evolve due to changes in requirements, both versions must be updated, which implies additional effort for developers. therefore, if we want developers to adopt our proposal, we need to provide them with tools that facilitate the modeling tasks and provide a high degree of automation to deploy composition pieces and configure the execution environment. to achieve this, we present the eucaliptool platform. the next section introduces the supporting microservice architecture.

figure 5. event-based choreography of bpmn pieces.

3 supporting microservice architecture

in a microservice architecture, applications are structured as a collection of loosely coupled services, which implement the business capabilities of a system. apart from those business microservices, it is usual to find in this type of architecture other microservices that are focused on supporting infrastructure issues. examples of this type of microservice are the service registry, which gives support to service discovery by containing the network locations of microservice instances; an api gateway, which provides addressability capabilities; an authentication server, which is in charge of controlling the access to the microservices; and a configuration server, which manages microservice configuration on the cloud. in addition, it is also common to use tools to monitor microservices' status and log their executions, as well as to deploy a message queue to manage asynchronous communication among microservices. finally, microservices are usually complemented with a client-side load balancer and some library that implements the circuit breaker pattern to support fault tolerance. microservice architectures have already been used to build business process modeling and analysis tools (alpers et al., 2015). in this work, we extend the typical microservice architecture with three main elements (see red-colored blocks in fig. 6):
1. eucaliptool composer.
it is a microservice endowed with an authoring tool to facilitate the creation of microservice compositions. this microservice is also in charge of transforming the compositions created through the authoring tool into bpmn executable specifications, splitting them into bpmn pieces, and sending these pieces to the eucaliptool server. in addition, this microservice stores the whole description of the microservice composition created with the authoring tool.
2. eucaliptool server. it is a microservice that plays the role of gateway between the business microservices and the eucaliptool composer. it is responsible for the following tasks:
a. receiving the split bpmn processes sent by the eucaliptool composer, registering them into a process repository, and distributing the pieces among the different microservices.
b. launching the execution of each process by triggering the first bpmn piece and delegating the responsibility of continuing the process to the corresponding microservice. to achieve this, a message queue is used.
c. providing the eucaliptool composer with the list of available microservices and their operations. to achieve this, microservices must be registered into this server using the eucaliptool client.
3. eucaliptool client. it is a client library that endows each microservice with (1) a lightweight activiti (https://www.activiti.org/) bpmn engine and (2) a microservice composition authoring tool. the bpmn engine is included to support the execution of bpmn pieces. the authoring tool is included to support the evolution of these pieces by the developers of each microservice. this library is also in charge of automatically registering the microservice's operations into the eucaliptool server.

figure 6. the eucaliptool microservice architecture.

to satisfy the responsibilities associated with each architectural element, the elements must interact with each other. this interaction is done through the http protocol; thus, each architectural element is in charge of publishing the required http end-points. for instance, the eucaliptool client library is in charge of publishing an http end-point that allows the eucaliptool server to send the bpmn pieces to each microservice. in the same way, the eucaliptool server must publish an http end-point that allows the eucaliptool client library to register the operations of each microservice.

3.1 supporting technology

one of the most important supporters of the microservice architecture is netflix. this video streaming company has developed its software infrastructure using microservices and has published all its supporting tools as open source. one of the main characteristics of these tools is their ease of use. they are based on the spring boot framework (https://projects.spring.io/spring-boot/) and are distributed as java libraries (https://netflix.github.io/). they propose the use of simple annotations and configuration files to develop and deploy the different components of the architecture. for instance, to build a service registry that supports microservice discovery, it is enough to create a spring boot java class and annotate it with @EnableEurekaServer. then, you just need to define some parameters in a configuration file and the "magic" is done: you have a functional service registry. we want to follow the same strategy to facilitate the use of the eucaliptool infrastructure in a real microservice architecture.
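for reference, the netflix-style service registry mentioned above amounts to little more than the following minimal sketch using spring cloud netflix; the class name is our own illustration, and the remaining parameters (port, hostname) live in a configuration file:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.server.EnableEurekaServer;

// a functional eureka service registry: the annotations do all the work
@SpringBootApplication
@EnableEurekaServer
public class ServiceRegistry {
    public static void main(String[] args) {
        SpringApplication.run(ServiceRegistry.class, args);
    }
}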
thus, we have created three java packages that encapsulate the functionality of the three proposed architectural elements, complemented with the following three annotations: @EucaliptoolComposer, @EucaliptoolServer, and @EucaliptoolClient. to create these microservices, developers just need to create a spring boot java class, use these annotations and, in some cases, define some configuration parameters. for instance, to create a eucaliptool server microservice, developers just need to import the corresponding java libraries and create a java class as follows:

@EucaliptoolServer
public class Server {
    public static void main(String[] args) {
        SpringApplication.run(Server.class, args);
    }
}

the SpringApplication class is a spring utility that creates a java application with an embedded tomcat. when the above code is executed, we intercept the run method and search for our annotations by using reflection capabilities. when @EucaliptoolServer is found, we deploy the functionality of this component into the embedded tomcat. we also create an http controller that publishes the end-points required to interact with the rest of the architectural elements. the configuration required by this component consists of the end-points of the components it needs to interact with, in particular the api gateway, the service registry, and the message queue. this configuration is done through a yml file. by using and configuring the other two annotations we achieve the following:
@EucaliptoolComposer. it creates a spring application with the eucaliptool composer deployed into the embedded tomcat. this annotation needs a config file that indicates the end-points of the eucaliptool server that (1) provide the list of microservices and their operations, and (2) allow sending a split composition. it also creates an http controller that publishes the end-points required to interact with the eucaliptool server.
@EucaliptoolClient. it transforms a microservice into a eucaliptool client. to do so, it includes a lightweight version of the activiti engine to execute bpmn pieces. it also includes a web graphical editor deployed into the embedded tomcat. in addition, it creates an http controller that publishes end-points both to receive bpmn pieces and to subscribe the microservice to choreography events. this annotation needs a config file that indicates the end-points of the eucaliptool server in order to register the microservice's operations and to send bpmn pieces when they are modified.
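the annotation processing described in this section can be pictured with the following simplified sketch; it is our own illustration of the reflection step, not eucaliptool's actual bootstrap code, and the deployment details are elided:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// a runtime-visible marker annotation, analogous to @EucaliptoolServer
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface EucaliptoolServer { }

public class Bootstrap {
    // after intercepting the run method, inspect the application class
    // for our annotations via reflection and deploy the matching component
    static void deployIfAnnotated(Class<?> appClass) {
        if (appClass.isAnnotationPresent(EucaliptoolServer.class)) {
            // deploy the server functionality into the embedded tomcat and
            // publish its http controller (omitted in this sketch)
        }
    }
}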
4 specifying microservice compositions

the eucaliptool composer includes a web-based authoring tool that proposes a domain-specific modeling language (dsml) to facilitate the modeling of microservice compositions. it is based on a previous work that focuses on helping end-users to compose services by using a visual interface from a mobile device (valderas et al., 2017). next, we present the abstract syntax of the dsml (i.e., the conceptual elements) and the concrete syntax (i.e., the graphical components that define the web interface).

4.1 dsml abstract syntax

the abstract syntax of the dsml supported by the web graphical editor is based on the change patterns (weber et al., 2008) developed within the context of the process of process modeling. change patterns are high-level abstractions aimed at achieving flexible and easy adaptations of a business process. these abstractions are defined in terms of high-level change operations (e.g., the creation of a parallel branch) which are based on the execution of a set of change primitives (e.g., add/delete activity). as opposed to change primitives, change pattern implementations typically guarantee model correctness after each transformation (casati, 1998) by associating pre/post conditions with high-level change operations. usually, process modeling environments supporting the correctness-by-construction principle (e.g., dadam et al., 2009) just provide process modelers with those change patterns that transform a sound process model into another sound one. for this purpose, structural restrictions on process models (e.g., block structuredness) are imposed. in addition, the correct usage of change patterns allows speeding up the creation of the composition. some change patterns are (weber et al., 2008): insert process fragment, embed process fragment in loop, embed process fragment in conditional branch, etc. inspired by the concept of fragment introduced by change patterns, the abstract syntax of the dsml proposed to compose microservices is shown in fig. 7.

figure 7. domain specific language designed for eucaliptool.

a microservice composition is made up of compositionelements of two types: operations (of a microservice) and fragments. each operation has some inputs and one output. inputs are classified into three types depending on the source from which their value is obtained: the source can be the output of another operation, the value can be obtained at runtime, or it can be defined at design time. in the next subsection, this issue is explained with some examples. regarding fragments, there are four types: parallel, which has two or more branches of elements that must be executed in parallel; conditional, which has one or more branches of elements that must be executed when a condition is satisfied; loop, which has a branch of elements that must be executed while a condition is satisfied; and witherror, which has two branches of elements, a major one that is executed by default and a compensation one that is executed if some error occurs with one of the major branch's operations. the previouselement relationship between compositionelements allows establishing the sequence order between operations and fragments. to better understand the concepts of this metamodel, fig. 8 illustrates them in a process that is composed of a sequence of four operations followed by a parallel fragment. in turn, this parallel fragment is made up of a conditional fragment and two operations that are executed in parallel to it.

figure 8. dsl concepts applied in an example.

4.2 dsml concrete syntax

to create a composition of microservices, we have defined a web interface based on the "adding element" metaphor, where microservice developers just need to add a set of operations or fragments to a composition. to exemplify this interface, fig. 9 shows some of the screens needed to define the payment piece marked in green in fig. 4. fig. 9a shows the composition after adding the checkcustomer and logrequest operations of the customers microservice. to add more elements, designers just need to click on the "+" symbol. the types of elements that can be added to a composition are single operations and fragments (note that there are two tabs in fig. 9b).
fig. 9b shows a list of fragments that are ready to be used in the current composition. in this case, the designer is selecting a with error fragment. as a result, a fragment of this type is included after the existing operations (see fig. 9c). here, the designer should specify two things: the major branch of operations to perform and the compensation branch of operations in case the major branch fails. in this case, the designer selects the paymentprocess operation offered by the payment microservice to be included in the major branch (see fig. 9d). this operation is offered as a single operation from the available catalog. this list shows the microservice operations that the eucaliptool server sends to the eucaliptool composer; these operations are automatically registered into the eucaliptool server by the eucaliptool client library that is installed in each microservice. the selection of this single operation results in the screen shown in fig. 9e. at this point, the designer still has to specify what to do when the major branch fails. this can be specified by selecting the tab labeled with the warning icon and proceeding similarly to the definition of the major branch. in this case, the designer selects the operation changepaymentdetails. with this action, the second element of the composition is completed (see fig. 9f). at this point, the designer should continue by selecting the most appropriate operations or fragments until the composition is completely defined. once the microservice composition's flow is described, developers must define the inputs that some microservice operations require to be properly executed. to facilitate this, we provide a graphical component (see fig. 10) that allows: (1) linking an input with any compatible previous output; (2) indicating that the input value should be obtained at runtime; or (3) defining an input value at design time. for instance, let us consider that the operation cancelorder, which must be executed by the inventory microservice in case of error, needs two inputs: the customer id, which is a string, and the order number, which is an integer. let us also consider that all previous microservice operations generate a string value as output. fig. 10b shows the options that are available for the customer input: it can be associated with any previous operation, since their data types are compatible, and it can also be defined as an input to be obtained at runtime or as an input associated with a predefined value (defined at design time in this screen). fig. 10 also shows the options available for the order input: in this case, none of the previous operations is compatible, so they are not available to be associated with this input. if a developer selects the predefined value option for a microservice's input, an input component is shown to allow the developer to introduce the value associated with the input at design time. regarding the option of defining an input to be obtained at runtime, this implies that the value must be obtained when executing the microservice composition, from a data source different from the operations included in the composition themselves. currently, we consider that this data source is the client that launches the microservice composition (see fig. 5); thus, any time a microservice needs to execute an operation that has some input to be obtained at runtime, the corresponding bpmn piece generates an event to ask the client for this data. in further work, we want to consider other data sources, such as the results of other microservice compositions or physical devices in the context of the internet of things.

figure 9. example of the dsml concrete syntax to create microservice compositions.

figure 10. configuration of microservice operation's inputs.
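to make the abstract syntax of fig. 7 and the input classification above concrete, the following sketch renders them as plain java types; this is a simplified illustration of our own, not eucaliptool's actual classes, and details such as branch conditions are omitted:

import java.util.List;

// sequence order between elements is given by the previouselement relationship
abstract class CompositionElement {
    CompositionElement previousElement;
}

class Operation extends CompositionElement {
    String microservice;        // owning microservice, e.g. "payment"
    String name;                // operation name, e.g. "paymentProcess"
    List<Input> inputs;
    Output output;
}

// the three sources from which an input value can be obtained
enum InputSource { PREVIOUS_OUTPUT, RUNTIME, DESIGN_TIME }

class Input {
    String name;
    InputSource source;
    Operation linkedOperation;  // set when source == PREVIOUS_OUTPUT
    Object predefinedValue;     // set when source == DESIGN_TIME
}

class Output {
    String type;                // used to check compatibility with inputs
}

abstract class Fragment extends CompositionElement {
    List<List<CompositionElement>> branches;
}

class Parallel extends Fragment { }    // two or more branches run in parallel
class Conditional extends Fragment { } // branches guarded by conditions
class Loop extends Fragment { }        // one branch repeated while a condition holds
class WithError extends Fragment { }   // a major branch plus a compensation branch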
5 supporting the execution of split bpmn processes

once a microservice composition is defined with the eucaliptool composer, three main stages are followed to distribute the responsibility of the process execution: (1) generation, where the composition is transformed into a set of bpmn pieces; (2) distribution, where the bpmn pieces are sent to the eucaliptool server, which registers the process and deploys the pieces into the corresponding microservices; and (3) choreography, where each microservice participates in the composition through an event-based choreography.

5.1 generation of bpmn pieces

the eucaliptool composer analyzes each process defined with the dsml and creates groups of actions according to the microservices that support them. each of these groups is transformed into a bpmn piece. for instance, let us consider the composition presented in the motivating example (cf. fig. 11). in this case, the first two operations must be executed by the customers microservice and, therefore, they constitute the first piece. the second piece is defined by the third and fourth elements of the composition (a with error boundary block and a single operation), which must both be executed by the payment microservice. the third piece is defined from the operations that the inventory microservice must execute, i.e., fetching the items and the compensation actions in case of error. finally, the fourth piece is made up of the last two operations, which the shipment microservice must perform.

figure 11. identification of bpmn pieces.

for each bpmn piece, the eucaliptool composer generates a specification with the bpmn tasks to be performed as well as additional tasks to trigger the events that manage the choreography. for instance, let us consider the operations that the inventory microservice must perform (the third bpmn piece). this microservice must fetch the items of the order and, in case of error, inform the user and cancel the order. fig. 12 shows the definition built with the eucaliptool composer and the generated bpmn process model. as we can see, two additional bpmn tasks are in charge of (1) triggering an ok event in case there is no error, and (2) triggering a fail event if some problem occurs. these tasks are preconfigured to publish the event in a message queue.

figure 12. generated piece of bpmn.

the eucaliptool composer internally manages each composition in json format. to transform json descriptions into bpmn (which is based on xml), it uses java parsers for json and xml. the json description is parsed into a structure of java objects that is maintained in memory; next, this structure is analyzed to generate a bpmn specification by using the xml parser. in particular, we generate bpmn specifications that will be executed in the activiti engine, i.e., the engine included in each microservice by the eucaliptool client library.
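as an illustration of this transformation step, the following simplified sketch emits the skeleton of a bpmn piece from an in-memory list of operation names. it is our own illustration, not eucaliptool's actual generator: sequence flows, namespace declarations, gateways, and error events are omitted, and the activiti:class values are hypothetical.

import java.util.List;

public class PieceEmitter {

    // one <serviceTask> per operation, followed by the preconfigured task
    // that publishes the "piece completed" choreography event
    public String toBpmn(String pieceId, List<String> operations) {
        StringBuilder xml = new StringBuilder();
        xml.append("<process id=\"").append(pieceId).append("\">\n");
        xml.append("  <startEvent id=\"start\"/>\n");
        for (String op : operations) {
            xml.append("  <serviceTask id=\"").append(op)
               .append("\" activiti:class=\"tasks.").append(op).append("\"/>\n");
        }
        xml.append("  <serviceTask id=\"publishCompleted\"")
           .append(" activiti:class=\"tasks.PublishEvent\"/>\n");
        xml.append("  <endEvent id=\"end\"/>\n");
        xml.append("</process>");
        return xml.toString();
    }
}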
5.2 distribution of bpmn pieces

once the set of bpmn pieces has been generated, the eucaliptool composer sends them to the eucaliptool server. to do so, the latter publishes an http end-point that accepts this data through post connections. when the eucaliptool server receives a split composition, it performs the following actions (see fig. 13): (1) it registers the composition into its repository and creates an http end-point to launch it; (2) it deploys each bpmn piece into the corresponding microservice; (3) it defines an event to launch the first bpmn piece and configures the first microservice to listen to it; and (4) for each event generated by a bpmn piece, it configures the microservice that must execute the next piece to listen to this event. note that the eucaliptool server must interact with the microservices to deploy each bpmn piece as well as to configure each microservice to listen to specific events. this can be done using the set of http end-points that each microservice makes available when including the eucaliptool client.

figure 13. actions done by the eucaliptool server.

5.3 choreography of bpmn pieces

the choreography of the bpmn pieces deployed in microservices is done as follows (see fig. 14): (1) a client accesses the end-point published by the eucaliptool server; (2) the eucaliptool server launches the start event for this process; (3) the microservice that is listening to this event executes the first bpmn piece, and this execution finishes by triggering an event that indicates that the execution of the first bpmn piece is completed; (4) the microservice that is listening to this event launches its bpmn piece (the second one) and, when executed, generates another event that indicates that the execution of the second bpmn piece is completed; (5) the microservice that is waiting for this event does the same as the previous one: it launches its corresponding bpmn piece and generates an event that indicates its execution; (6) and so on, until the process is completed.

figure 14. event-based choreography of a split bpmn process.
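on the microservice side, steps (3) to (5) essentially amount to reacting to a choreography event by starting the local bpmn piece in the embedded activiti engine. a minimal sketch, assuming spring amqp and illustrative queue and process names (the class is our own illustration, not part of the eucaliptool client library):

import org.activiti.engine.ProcessEngine;
import org.activiti.engine.ProcessEngines;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class PieceExecutor {

    private final ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();

    // when the previous piece completes, run this microservice's own piece;
    // the deployed piece itself ends with the task that publishes the
    // "piece2_completed" event for the next participant
    @RabbitListener(queues = "piece1_completed")
    public void onPreviousPieceCompleted(String orderId) {
        engine.getRuntimeService().startProcessInstanceByKey("piece2");
    }
}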
6 supporting the evolution of microservice compositions

following the proposed hybrid approach, we keep two descriptions of a microservice composition. on the one hand, we have the whole picture of the composition, which is stored by the eucaliptool composer; this centralized description helps developers analyze the whole composition to take engineering decisions. on the other hand, we have the split version of the composition, which is distributed through the different microservices; this split description provides a high degree of decoupling among microservices when the composition is executed through the event-based choreography. one of the most important challenges to be faced within this context is the evolution of the microservice composition and the synchronization of both descriptions. our main goal is to propose a solution that provides developers with a high degree of flexibility to perform changes, so that these can be done either on the centralized composition, i.e., on the whole composition, or at the microservice level, i.e., on the pieces deployed in each microservice. to achieve this, as introduced in section 3, the following mechanisms are provided by the three proposed architectural elements: the eucaliptool client library includes a web editor like the one shown in section 4, where developers can independently evolve their composition pieces; the eucaliptool server publishes an http end-point to receive modified composition pieces from microservices and send them to the eucaliptool composer; and the eucaliptool composer publishes an http end-point to receive modified composition pieces from the eucaliptool server and update the whole version of the composition. thus, the evolution of a microservice composition can be done in two ways:
1. developers update the whole description of the composition from the eucaliptool composer microservice (see fig. 15a). in this case: (1.1) the eucaliptool composer generates the corresponding bpmn pieces and sends those that have changed to the eucaliptool server; (1.2) the eucaliptool server distributes the pieces among the corresponding business microservices; and (1.3) the microservices that receive a new version of a piece replace the old version with the new one.
2. developers change a composition piece from a business microservice (see fig. 15b). in this case: (2.1) the microservice sends the new version of the piece to the eucaliptool server; (2.2) the eucaliptool server sends the received piece to the eucaliptool composer; and (2.3) the eucaliptool composer updates the whole description with the changes introduced by the modified piece.

figure 15. evolution of a microservice composition.

to update the whole composition when an updated bpmn piece is received, the eucaliptool composer applies the inverse of the transformation used to generate the bpmn pieces and obtains a json representation of the piece. this json representation is based on the dsml presented above, so the eucaliptool composer just needs to replace the elements of the whole description that correspond to the updated piece. note that updating the whole description of the microservice composition is easy, since pieces are composed of operations and fragments that are added to a container; there are no connections with previous or further elements that need to be managed, as can happen in a bpmn model. to better understand this aspect, fig. 16 illustrates how the composition of the motivating example is updated with a new piece 2.

figure 16. example of composition update by replacing a piece.

7 evaluation

this section presents the experiment that we conducted to show the efficiency of our proposal in the development and evolution of microservice compositions. the experiment aimed to compare the efficiency measurements obtained by a development based on eucaliptool with those obtained by an ad-hoc implementation of an event-based choreography. this ad-hoc implementation was done using the technology provided by spring and netflix. to support the exchange of messages among microservices, a rabbitmq message broker was used in both cases. to conduct the experiment, we followed the guidelines presented by kitchenham et al. (1995) and wohlin et al. (2012). according to these guidelines, we divided the experiment into the following main phases: scoping, planning, operation, and analysis and interpretation.

7.1 scope

the scope of an experiment is set by defining its goal. to do so, we have used the template proposed by basili et al.
7 evaluation

this section presents the experiment we conducted to show the efficiency of our proposal in the development and evolution of microservice compositions. the experiment aimed to compare the efficiency obtained by a development based on eucaliptool with that obtained by an ad-hoc implementation of an event-based choreography. this ad-hoc implementation was done using the technology provided by spring and netflix. to support the exchange of messages among microservices, a rabbitmq message broker was used in both cases. to conduct the experiment, we followed the guidelines presented by kitchenham et al. (1995) and wohlin et al. (2012). according to these guidelines, we divided the experiment into three main phases: scoping; planning; and operation, analysis, and interpretation.

7.1 scope

the scope of an experiment is set by defining its goal. to do so, we used the template proposed by basili and rombach (1988). the goal of our experiment is characterized as follows:

analyze: our approach based on eucaliptool
for the purpose of: evaluating the impact of our approach compared to ad-hoc development
with respect to: efficiency
from the point of view of: microservice developers
in the context of: researchers in software engineering composing microservices

7.2 experimental design

in the planning activity, we must formalize the hypotheses, determine the dependent and independent variables, describe the context of the experiment and the instrumentation used, and consider the threats to validity we can expect.

hypotheses. the hypotheses defined for the experiment were the following. null hypothesis 1, h10: the efficiency of the eucaliptool approach for developing and evolving microservice compositions is the same as that of an ad-hoc development. alternative hypothesis 1, h11: the efficiency of the eucaliptool approach for developing and evolving microservice compositions is greater than that of an ad-hoc development.

identification of variables. we identified two types of variables. dependent variables: variables that correspond to the outcomes of the experiment. in this work, the efficiency in composing microservices was the target of the study, measured in terms of the following software quality factors: development time and evolution time. independent variables: variables that affect the dependent variables. the development method was identified as a factor that affects the dependent variables. this variable had two alternatives: (1) the eucaliptool approach and (2) an ad-hoc implementation.

context. the context of the experiment was the following. experimental subjects: ten subjects participated in the experiment, all of them researchers in software engineering. their ages ranged between 28 and 45 years old. the subjects had an extensive background in java programming and modeling tools; however, they had no experience in the use of eucaliptool. only 3 of them had experience in using the spring framework and message queues, and 4 of them had previously worked with bpmn. objects of study: the experiment was conducted using a case study similar to the motivating example used throughout the paper, i.e., the microservice composition to manage a purchase order in a webshop (see section 1). instrumentation: the instruments used to carry out the experiment were the following.

• a demographic questionnaire: a set of questions to assess the users' level of experience in java/spring programming, modeling tools, and bpmn.
• a work description: the description of the work the subjects should carry out in the experiment using eucaliptool and the ad-hoc solution. this work description covered two activities: (1) the development of the microservice composition to support purchase orders, and (2) the modification of this composition to support new requirements.
• a form: a form defined to capture the start and completion times of the proposed work. for each task proposed in the experiment, participants had to annotate the start and completion times using the computer clock. if interruptions occurred while performing the work, subjects wrote down the times every time they started and stopped the activity; the total time was then derived from these start and completion times (see the sketch after this list). finally, additional space was left after the completion time for comments by the subjects about the performed activity.
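deriving each activity's total time from the annotated start/stop pairs is a simple sum over intervals, excluding the gaps caused by interruptions; the helper below is a minimal sketch of that computation, with names of our own choosing.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.List;

// hypothetical helper: total working time from annotated (start, stop) pairs;
// time between a stop and the next start (an interruption) is not counted
record Interval(LocalTime start, LocalTime stop) {}

final class TimeSheet {
    static Duration totalTime(List<Interval> intervals) {
        return intervals.stream()
                .map(i -> Duration.between(i.start(), i.stop()))
                .reduce(Duration.ZERO, Duration::plus);
    }
}
```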
threats to validity. our experiment was threatened by the random heterogeneity of subjects. this threat appears when some users within a group have more experience than others. it was minimized with a demographic questionnaire that allowed us to evaluate the knowledge and experience of each participant beforehand. this questionnaire revealed that all the users had experience in java programming and modeling techniques. some of them had experience with the technologies related to the implementation of choreographies, while others did not; this could affect the evaluation of the development with an ad-hoc solution, since this type of development requires these technologies. some participants had experience in bpmn, which could affect the evaluation of the development based on eucaliptool, since it is based on some abstractions of bpmn. to minimize this threat, all subjects participated in training sessions about both choreography implementation technologies and eucaliptool. in addition, to minimize the effect of the order in which the subjects applied the approaches, the order was assigned randomly to each subject. however, in order to have a balanced design, the same number of subjects was assigned to start with each approach. to do so, the ten participants were randomly divided into two groups, and each group was initially assigned to a development type. then, each group switched development types to perform the same tasks again. in this way, we minimized the threat of learning from previous experience. finally, our experiment was threatened by the reliability-of-measures threat: objective measures, which can be repeated with the same outcome, are more reliable than subjective measures. in this experiment, the precision of the measures may have been affected since the activity completion time was measured manually by the users with the computer clock. to reduce this threat, we observed the subjects while they performed the different tasks to guarantee their exclusive dedication to the activities and to supervise the times they wrote down.

7.3 execution

we followed a within-subjects design where all subjects were exposed to every treatment/approach (eucaliptool solution and ad-hoc solution). the main advantage of this design is that it allows statistical inference to be made with fewer subjects, making the evaluation more streamlined and less resource-heavy (wohlin et al., 2012). to perform the experiment, we arranged a three-day workshop with two sessions per day (see table 1).
table 1. sessions of the experiment.

day 1, session 1 (4h): all participants: training in choreography implementation.
day 1, session 2 (4h): all participants: training in eucaliptool.
day 2, session 1 (5h): group a: development of a microservice composition with an ad-hoc solution; group b: development of a microservice composition with eucaliptool.
day 2, session 2 (3h): group a: evolution of a microservice composition with an ad-hoc solution; group b: evolution of a microservice composition with eucaliptool.
day 3, session 1 (5h): group a: development of a microservice composition with eucaliptool; group b: development of a microservice composition with an ad-hoc solution.
day 3, session 2 (3h): group a: evolution of a microservice composition with eucaliptool; group b: evolution of a microservice composition with an ad-hoc solution.

during the first day, we had two sessions of 4 hours in which participants were asked to fill in a demographic questionnaire to capture their background and were trained in choreography technologies and eucaliptool. in particular: regarding choreography technologies, we provided the subjects with the necessary tutorials and tools to learn the basics of the spring and netflix technologies needed to develop the case study. we also gave an introduction to message queues and rabbitmq. the subjects also participated in the implementation of some guided examples to gain experience with the technologies. regarding eucaliptool, we provided the subjects with a tutorial explaining the web authoring tool included in the eucaliptool composer. the subjects also worked with some examples to gain experience with the dsml of this tool. we also explained the proposed architecture, how the proposed eucaliptool architectural elements interact among themselves, and how they need to be configured.

during the second and third days, participants were randomly divided into two groups, a and b, and two sessions of five and three hours, respectively, were held each day. we ran the same experiment on both days: on one day, group a used an ad-hoc solution to develop and evolve a microservice composition while group b used eucaliptool; on the second day, the groups swapped development methods. the tasks designed for the experiment were initiated with a short presentation in which general information and instructions were given. afterward, the work description and the form were given to the subjects, and they started to develop and evolve the microservice composition following the development method (eucaliptool or ad-hoc) indicated for each group. the microservice composition that participants had to develop was described textually. after performing this work, participants filled in a form to capture the development times. once the subjects had developed the composition, they started to modify it to evaluate the evolution. for these activities, they also filled in the form to capture the time taken to evolve the composition. to properly support this work, we had previously developed the microservice architecture required for the case study using netflix's technology. the eucaliptool composer and the eucaliptool server microservices were also created, and every business microservice was defined as a eucaliptool client. in more detail, the activities carried out with each development approach were the following:

ad-hoc development: from the case study description, the participants started the implementation of the microservice composition for the management of purchase orders.
generally, they identified the operations that each microservice should perform and defined for each of them both a starting event and an ending event. once this was clear, they updated each microservice with the classes required to connect to rabbitmq and listen for the starting event that launches the operations corresponding to each microservice. to execute these operations, they implemented classes that call the corresponding methods; these classes were also in charge of launching the ending event. once they had modified each microservice and achieved the compilation of the code, they spent some time testing the composition and detecting code errors. finally, we provided a set of requirement changes for the composition to evaluate the evolution. in particular, we asked them to support vip customers in the way introduced in section 1. in this activity, the participants changed the code of the involved microservices to support the new requirements. then, the participants tested the new composition and corrected the errors.

eucaliptool-based development: following this approach, the participants first designed the microservice composition with the eucaliptool composer according to the case study description. then, they asked the eucaliptool composer to deploy the composition. afterward, they spent some time testing the composition and detecting errors in the composition design. finally, we asked participants to support the same new requirements as in the previous activity. in this case, the participants changed the composition built with the eucaliptool composer and deployed it again. then, the participants tested the new composition and corrected the errors.

7.4 analysis of results

in this subsection, we analyze and compare the usefulness of both approaches based on the time used for the development and evolution of a microservice composition. the results were studied based on the comparison of time means and the standard deviation. table 2 presents the descriptive statistics for each of the studied quality factors.

table 2. descriptive statistics for each quality factor.

quality factor    | dev. method  | mean (hours) | num. of subjects | std. dev. (hours)
development time  | ad-hoc       | 4.38         | 10               | 0.52
development time  | eucaliptool  | 1.15         | 10               | 0.44
evolution time    | ad-hoc       | 1.55         | 10               | 0.69
evolution time    | eucaliptool  | 0.29         | 10               | 0.05

next, we provide further analysis of the results for each measured software quality factor. development time: the development time following the ad-hoc approach differed according to the subjects' implementation experience, ranging from 3.25 hours (the most experienced subject) to 5 hours. following the eucaliptool approach, the development activity ranged from 75 minutes to 2.10 hours. the difference between the two approaches was large, since developing the microservice composition in an ad-hoc way was more complex and difficult for the participants: they had to implement all the composition logic manually as well as all the code required to connect with rabbitmq to participate in the event-based choreography. the eucaliptool approach allowed participants to focus on the requirements instead of solving technological problems. note that, following this approach, none of the participants had to implement anything to manage the invocation of operations nor the events required to participate in the choreography.
regarding the standard deviation, it was low for both development approaches (see table 2), indicating that development times tended to be close within each development approach. evolution time: concerning the ad-hoc development, this activity took subjects from 1.10 to 2.3 hours, since they had to identify the microservices that must be updated and modify the corresponding code. changing the eucaliptool description of the microservice composition took less than 30 minutes for all the subjects (a very low standard deviation was obtained). this is because evolving the microservice composition to fit the new requirements was as easy as modifying the whole description with the web authoring tool. in this case, participants again focused only on requirements; they did not need to identify microservices and hardcode changes.

overall, with the eucaliptool approach, the subjects took, on average, 1.44 hours to develop the case study, whereas with an ad-hoc implementation the subjects took 5.93 hours. therefore, the process of automating and evolving microservice compositions is more efficient using the eucaliptool approach than using an ad-hoc solution. in order to verify whether we can reject the null hypothesis, we performed a paired t-test (statistical analyses using spss: http://www.ats.ucla.edu/stat/spss/whatstat/whatstat.htm#1sampt) with ibm spss statistics v20 at a confidence level of 95% (α = 0.05). this test is a statistical procedure used to make a paired comparison of two sample means, i.e., to see whether the means of two samples differ from one another (see the sketch of the statistic below). for our study, this test examines the difference in mean times for every subject with the different approaches to test whether the means of an ad-hoc development and the eucaliptool approach are equal. when the critical level (the significance) is higher than 0.05, we accept the null hypothesis because the means are not statistically significantly different. for our experiment, the significance of the paired t-test for the total time means is 0.000 (calculated with ibm spss statistics), which means that we can reject the null hypothesis h10 (the efficiency of the eucaliptool approach for developing and evolving microservice compositions is the same as that of an ad-hoc development). based on this test, we have strong evidence that the kind of development influences efficiency. specifically, the efficiency using the eucaliptool approach is significantly better than using an ad-hoc solution, i.e., the mean values for all the measures are lower when using the eucaliptool approach; thus, the alternative hypothesis h11 is fulfilled: the efficiency of the eucaliptool approach for developing and evolving microservice compositions is greater than that of an ad-hoc development.
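for reference, the statistic behind this paired comparison has the following form; the standard deviation of the per-subject differences, s_d, is not reported in the paper, so only the structure of the computation is shown, with the mean difference taken from the reported totals:

```latex
t \;=\; \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} \;=\; \overline{t_{\text{ad-hoc}} - t_{\text{eucaliptool}}}
      \;=\; 5.93 - 1.44 \;=\; 4.49 \ \text{hours}, \qquad
n = 10 \ \text{subjects (9 degrees of freedom)}.
```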
7.5 conclusions

the experiment presented above evaluated our approach to developing and evolving microservice compositions against ad-hoc solutions based on choreographies. we validated that our approach is more efficient than ad-hoc solutions and confirmed the expected benefits suggested in the introduction. on the one hand, having the big picture of the composition facilitated its analysis to support its evolution when requirements changed. on the other hand, the visual editor of eucaliptool, as well as the supporting infrastructure to manage event-based communication, significantly facilitated the definition and execution of choreographed microservice compositions. note that we evaluated ad-hoc solutions based on choreographies since the decentralized nature of microservices seems to make choreographies more appropriate for defining microservice compositions (dragoni et al., 2017; butzin et al., 2016). a similar experiment focusing on orchestration will be considered in further work.

8 related work

rajasekar et al. (2012) presented the integrated rule-oriented data system (irods) to orchestrate microservices within data-intensive distributed systems. a microservice choreography is defined as a set of textual event-condition-action (eca) rules. each rule defines the data management actions that a microservice must execute. these actions generate events within the system that trigger the rules associated with other microservices. the authors also proposed the use of recovery microservices to maintain transactional properties. the main drawback of this work is that the logic of the process is distributed across the different rules that each microservice implements, making maintenance and evolution difficult to perform.

yahia et al. (2016) introduced medley, an event-driven lightweight platform for microservice orchestration. they propose a textual domain-specific language (dsl) for describing orchestrations using high-level constructs and domain-specific semantics. these descriptions are compiled into low-level code that runs on top of an event-driven, process-based, lightweight platform. the main drawback of this approach is that developers need to explicitly manage service orchestration issues at the modeling level. our solution allows developers to focus only on modeling business requirements; also, a choreography solution is proposed to obtain a greater level of independence among microservices.

kouchaksaraei et al. (2018) present pishahang, a framework for jointly managing and orchestrating cloud-based microservices. this framework introduces tools to easily integrate sonata (dräxler et al., 2017), an orchestration framework, with terraform (2019), a multi-cloud tool. however, tools for modeling business processes and supporting them within a decoupled microservice infrastructure are not provided.

indrasiri and siriwardena (2018) introduce ballerina, an emerging technology built as a programming language that aims to make it easy to write programs that integrate and orchestrate microservices. however, although they propose an environment to design microservice integrations with sequence diagrams, most of the communication issues among microservices need to be managed at the programming level. our solution automatically generates the implementation artifacts required to support microservice communication from business process models.

petrasch (2017) presents an approach based on uml to design microservices and the communication among them. however, complex business processes involving multiple microservices cannot be modeled.

guidi et al. (2017) argue for the need for specific programming languages aimed at microservice composition. the authors claim that these languages should include concepts such as communication, interfaces, and dependencies. they instantiate their proposal in terms of the jolie (2019) programming language. similar work is presented by safina et al. (2016), which extends the jolie programming language to support data-driven workflows.
this means that the flow of microservice compositions is controlled at message-passing time according to the nature of the message structure and type. our work differs from these two approaches in that we provide a solution based on business process modeling instead of programming languages to create ad-hoc solutions.

finally, it is worth noting that this paper presents an extended version of the work proposed in valderas et al. (2019). in the current work, we introduce the evolution of microservice compositions from both a top-down perspective (i.e., from the eucaliptool composer to the microservices) and a bottom-up strategy (i.e., from the microservices to the eucaliptool composer). we have improved the dsml by defining how the inputs and outputs of microservices can be linked. we also present the development infrastructure implemented to support developers in the composition of microservices using our approach. in addition, our approach has been evaluated through a complete experiment that compares it with ad-hoc solutions to compose microservices.

9 conclusion and further work

in this work, we have presented a hybrid solution that combines the choreography and orchestration approaches to deal with microservice compositions through eucaliptool. the main reason to follow such a hybrid solution is that we want to take advantage of the strengths of each approach: that is, we want to maintain the flexibility and decoupled nature offered by choreographies but also keep the global vision and management of the composition offered by an orchestration approach. for this purpose, the eucaliptool platform has been presented and integrated into a typical microservice architecture to provide: 1) tool support for the specification of microservice compositions, 2) mechanisms to automate the distributed deployment of microservice compositions and their execution through an event-based choreography, and 3) support for the evolution of compositions following a top-down strategy (i.e., from the global vision of the composition) or a bottom-up strategy (i.e., from a piece of a specific business microservice). in addition to the evaluation based on the motivating example, it would be very interesting to also evaluate the performance of the designed architecture in a real scenario. furthermore, since our objective is to improve how compositions are made, as future work we plan to enrich eucaliptool with goal-oriented capabilities. this way, instead of specifying compositions, users would just need to state their goals. then, based on them, eucaliptool would propose an initial composition intended to satisfy the user-stated goals.

acknowledgments

this work has been developed with the financial support of the spanish state research agency under project tin2017-84094-r and co-financed with erdf.

references

alpers, s., becker, c., oberweis, a., schuster, t. (2015). microservice based tool support for business process modeling. edoc workshops: 71-78.
basili, v. r., rombach, h. d. (1988). the tame project: towards improvement-oriented software environments. ieee trans. softw. eng. 14(6), 758-773.
bucchiarone, a., dragoni, n., dustdar, s., larsen, s. t., mazzara, m. (2018). from monolithic to microservices: an experience report from the banking domain. ieee software, vol. 35, no. 3, pp. 50-55.
butzin, b., golatowski, f., timmermann, d. (2016). microservices approach for the internet of things.
in 2016 ieee 21st international conference on emerging technologies and factory automation (etfa) (pp. 1-6). ieee.
casati, f. (1998). models, semantics, and formal methods for the design of workflows and their exceptions. phd thesis, milano.
dadam, p., reichert, m. (2009). the adept project: a decade of research and development for robust and flexible process support. computer science - research and development 23: 81-97.
dragoni, n., giallorenzo, s., lluch-lafuente, a., mazzara, m., montesi, f., mustafin, r., safina, l. (2017). microservices: yesterday, today, and tomorrow. present and ulterior software engineering: 195-216.
dräxler, s., karl, h., peuster, m., kouchaksaraei, h. r., bredel, m., lessmann, j., ..., xilouris, g. (2017). sonata: service programming and orchestration for virtualized software networks. in 2017 ieee international conference on communications workshops (icc workshops) (pp. 973-978). ieee.
guidi, c., lanese, i., mazzara, m., montesi, f. (2017). microservices: a language-based approach. in present and ulterior software engineering (pp. 217-225). springer, cham.
hamidehkhan, p. (2019). analysis and evaluation of composition languages and orchestration engines for microservices (master's thesis).
indrasiri, k., siriwardena, p. (2018). integrating microservices. in microservices for the enterprise (pp. 167-217). apress, berkeley, ca.
jolie. (2019). a service oriented language. url: https://www.jolielang.org/ last accessed: november 2019.
kitchenham, b., pickard, l., pfleeger, s. l. (1995). case studies for method and tool evaluation. ieee software, vol. 12, no. 4, pp. 52-62.
newman, s. (2015). building microservices. o'reilly media inc., february 2015.
petrasch, r. (2017). model-based engineering for microservice architectures using enterprise integration patterns for inter-service communication. in 2017 14th international joint conference on computer science and software engineering (jcsse) (pp. 1-4). ieee.
rajasekar, a., wan, m., moore, r., schroeder, w. (2012). microservices: a service-oriented paradigm for data intensive distributed computing. in challenges and solutions for large-scale information management (pp. 74-93). igi global.
safina, l., mazzara, m., montesi, f., rivera, v. (2016). data-driven workflows for microservices: genericity in jolie. in 2016 ieee 30th international conference on advanced information networking and applications (aina) (pp. 430-437). ieee.
shadija, d., rezai, m., hill, r. (2017). towards an understanding of microservices. icac 2017: 1-6.
singhal, n., sakthivel, u., raj, p. (2019). selection mechanism of micro-services orchestration vs. choreography. international journal of web & semantic technology (ijwest), 10(1), 25.
terraform. (2019). url: https://www.terraform.io/ last accessed: november 2019.
valderas, p., torres, t., mansanet, m., pelechano, v. (2017). a mobile-based solution for supporting end-users in the composition of services. multimedia tools appl. 76(15): 16315-16345.
valderas, p., torres, v., pelechano, v. (2019). hybrid composition of microservices with eucaliptool. proceedings of the xxii iberoamerican conference on software engineering, cibse 2019, la habana, cuba, april 22-26, 2019: 2-15.
weber, b., reichert, m., rinderle, s. (2008). change patterns and change support features - enhancing flexibility in process-aware information systems. data and knowledge engineering 66: 438-466.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., wesslén, a. (2012). experimentation in software engineering. springer.
yahia, e. b. h., réveillère, l., bromberg, y. d., chevalier, r., cadot, a. (2016). medley: an event-driven lightweight platform for service composition. in international conference on web engineering (pp. 3-20). springer, cham.

journal of software engineering research and development, 2021, 9:11, doi: 10.5753/jserd.2021.1892
this work is licensed under a creative commons attribution 4.0 international license.

a requirements engineering technology for the iot software systems

danyllo valente da silva [federal university of rio de janeiro | dvsilva@cos.ufrj.br]
bruno pedraça de souza [federal university of rio de janeiro | bpsouza@cos.ufrj.br]
taisa guidini gonçalves [federal university of rio de janeiro | taisa@cos.ufrj.br]
guilherme horta travassos [federal university of rio de janeiro | ght@cos.ufrj.br]

abstract

contemporary software systems (css), such as internet of things (iot) based software systems, incorporate new concerns and characteristics inherent to the network, software, hardware, context awareness, interoperability, and others, compared to conventional software systems. in this sense, requirements engineering (re) plays a fundamental role in ensuring the correct development of these software systems regarding business and end-user needs. several software technologies supporting re are available in the literature. however, many do not cover all css specificities, notably those based on iot. this paper presents retiot (requirements engineering technology for the internet of things based software systems). it aims to provide methodological, technical, and tooling support to produce iot software system requirements documents. in addition, it comprises an iot scenario description technique, a checklist to verify iot scenarios, construction processes, and templates for iot software systems. a feasibility study was carried out in iot system projects to observe its templates and identify improvement opportunities. the results indicate the feasibility of the retiot templates when used to capture iot characteristics. however, further experimental studies represent research opportunities to strengthen confidence in its elements (construction process, techniques, and templates) and capture end-user perception.

keywords: software engineering, requirements engineering, internet of things, iot software systems, software systems specification, software system requirements document, software technology

1 introduction

contemporary software systems, such as those inherent to the internet of things (iot) paradigm, are complex compared to conventional software systems. this complexity comes from the inclusion of new concerns and characteristics related to network, software, hardware, context awareness, interface, interoperability, and others (motta et al., 2019a) (nguyen-duc et al., 2019). iot-based software systems seek to promote the interlacing of technologies and devices that, through a network, can capture and exchange data, make decisions, and act. with these actions, they unite the real and virtual worlds through objects and tags. however, building iot software systems is not a trivial activity due to their specific technological characteristics. it requires adapted and/or innovative software technologies to create and guarantee the quality of the built product (motta et al., 2019a). the quality of contemporary software systems' development depends on software technologies that respond to these systems' new concerns and characteristics.
as with any other product built on engineering principles, a key activity in developing iot software systems is constructing the requirements document. defects present in the requirements document can cause increased time, cost, and effort for the project; dissatisfied customers and end-users; low reliability of the software system; a high number of failures; among others (vegendla et al. 2018) (arif et al. 2009). requirements engineering (re) is responsible for the life cycle of the requirements document and ensures its proper construction (vegendla et al. 2018) (pandey et al., 2010). the re phases and activities may differ according to the application domain, people involved, processes, and organizational culture. however, we can observe some recurring re phases and activities, such as conception/design, elicitation, negotiation, analysis, specification, verification, validation, and management. the technical literature presents several software technologies to support re for software systems. however, not all of them cover the different re phases and, mainly, the specificities of iot software systems. in this work, the term "software technology" refers to the methodological, technical, and tooling support offered by the works to support the construction of requirements documents for iot software systems.

considering the need for appropriate software technologies to develop iot software systems and understanding the importance of the requirements document for the stability, adequacy, and quality of a project, this work proposes retiot (requirements engineering technology for the internet of things software systems). retiot consists of a requirements specification technique based on the description of iot scenarios, scenariot (silva 2019); an iot scenario inspection technique, scenariotcheck (souza 2020); a construction process; and templates to support the process activities and build the requirements document. the scenariot and scenariotcheck techniques were previously evaluated through experimental studies, which indicated their feasibility (souza et al. 2019a) and usefulness (souza et al. 2019b). moreover, they have been used in iot software system projects developed by the experimental software engineering (ese) group in the context of delfos, the observatory of engineering contemporary software systems, at coppe/ufrj. based on the experiences with these projects, the construction process and templates of retiot evolved.

this article extends a previous publication on retiot (silva et al. 2020b). the first version (silva et al. 2019) encompasses many re activities, focusing on the definition of project scope, the iot system, and the iot system requirements. the second version (silva et al. 2020b) focuses on eight re phases: conception, iot elicitation, iot analysis, iot specification, iot verification, negotiation, validation, and management. the templates of this version are evaluated through a feasibility study (section 4). the third version (silva et al. 2020a) involves the different re phases through an engineering cycle divided into eight phases: iot ideation and conception, iot elicitation, iot analysis, iot specification, iot verification, negotiation, iot evaluation, and management. its templates were evaluated through a proof of concept.
the fourth and current version of the technology includes improvements in the construction process and templates. optional activities and tasks were analyzed and incorporated into the construction process. the focus and perspective of the process were changed to the iot product and project, including product ideation and evaluation concepts. also, the engineering cycle was compacted to simplify the construction process, which now includes four phases instead of the eight of the third version. the templates evolved in the current version to address the gaps and improvements identified during the proof of concept (silva et al. 2020a). for the sake of completeness and applicability, this paper presents the current (fourth) version of retiot, including a feasibility study comparing three retiot templates with regular ones used to build requirements documents for conventional software systems. the results indicate that the retiot templates allow capturing the information needed for iot software systems and that they are mature enough to be evaluated in the construction of such requirements documents. furthermore, it is also possible to observe that the technology covers the main re phases and activities concerning iot-based projects.

beyond this introduction, this article presents six other sections. section 2 describes the technological basis of retiot. next, section 3 introduces and details retiot. section 4 demonstrates the feasibility study. section 5 presents some related works found in the literature. section 6 discusses some research opportunities. finally, section 7 presents future work and concludes the article.

2 the technological basis of retiot

this section presents the technological basis used to build retiot to support re in iot software systems. such a requirements technology is inserted in the context of a systems engineering approach, which concerns the major development stages of iot software systems (motta et al. 2020). its technological basis is composed of two empirically evaluated techniques: scenariot and scenariotcheck.

2.1 scenariot

conventional software scenarios can be used in any software system and development stage. they can cover different purposes, such as eliciting requirements, specifying requirements, validating requirements, and testing (glinz 2000) (behrens 2002) (alexander and maiden 2004). a scenario is a sequence of events describing the system behavior and its environment (burg and van de riet 1996) or an ordered set of interactions between partners, usually systems and external actors (glinz 2000). when applied to requirements engineering, it represents requirements through stories describing the system from the users' perspective (glinz 2000) (alexander and maiden 2004). scenarios offer many advantages: i) they are based on the users' point of view; ii) they make partial specifications possible; iii) they are easy to understand; iv) they enable short feedback loops; and v) they provide a basis for testing the system (glinz 2000). thus, scenarios constitute a good basis for communication with clients and laypeople (non-technical) because they can be easily understood and do not require prior knowledge. therefore, everyone involved, at different levels and functions, can express opinions and identify problems (glinz 2000) (behrens 2002) (alexander and maiden 2004). scenariot (silva 2019) is a specification technique that adapts conventional scenarios to support the specification of iot software systems.
it considers the characteristics (adaptability, connectivity, privacy, intelligence, interoperability, mobility, among others) and behaviors (identification, sensing, and actuation) specific to these software systems (motta et al., 2019b). the combination of characteristics and behaviors led to the creation of nine iot interaction arrangements (iias). iot interaction arrangements represent frequent interaction flows between things and other non-iot elements, such as conventional software systems and end-users. each iia has a catalog containing all the relevant information captured and used in the scenario's description. the cardinality between arrangements and scenarios is a many-to-many relationship (m:n): many arrangements (isolated or combined) relate to one or more iot scenarios, and an iot scenario can be linked to one or more arrangements (illustrated in the sketch at the end of this subsection). the iias, together with their catalogs, guide software engineers in capturing essential information about the system: i) the identification of the "things" and system components; ii) the types of data that will be collected and displayed; iii) the actions that will be performed in the environment; iv) aspects related to decision making in a particular system context; and v) the actors (end-users, software systems, things, among others) who will access the data; among others. figure 1 shows the "iia-1: display of iot data" arrangement and its catalog.

figure 1. "iia-1: display of iot data" arrangement (silva 2019).
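to make the m:n cardinality concrete, the fragment below sketches the relationship between scenarios and arrangements as a plain java data model; all names are ours and purely illustrative, since scenariot defines these concepts in documents rather than code.

```java
import java.util.List;

// hypothetical catalog entry: the kind of information an iia catalog captures
record CatalogEntry(String thing, String dataType, String action, String actor) {}

// an iot interaction arrangement (iia) with its catalog,
// e.g. id = "iia-1: display of iot data"
record InteractionArrangement(String id, List<CatalogEntry> catalog) {}

// one scenario references one or more arrangements (isolated or combined);
// conversely, the same arrangement may be referenced by many scenarios,
// which is the m:n cardinality described above
record IotScenario(String title, List<InteractionArrangement> arrangements) {}
```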
2.2 scenariotcheck

scenariotcheck is a checklist-based software inspection technique specialized in verifying iot software system scenarios (souza 2020). this technique aims to assist inspectors in detecting defects in iot scenario descriptions, guaranteeing their quality. it was created to work together with the scenariot technique, since scenariot produces the input to scenariotcheck. the scenariotcheck checklist consists of two parts. the first part (general questions) aims to identify defects related to project information and the systemic solution, such as: i) problem domain; ii) interaction and identification among actors, system, hardware, and devices; iii) alternative and exception flows; among others. table 1 shows the first part of the questionnaire.

table 1. general questions on the scenariot specification technique.

01. has the overall application domain been established? (health, leisure, traffic)
02. is the specific purpose of the system correctly described? (data visualization, decision making, and/or actuation only)
03. is the type of data collected specified? (temperature, humidity, pollution, and so on)
04. is it possible to identify who or what collects the data? (sensors, qr code readers, and so on)
05. is it possible to identify who or what manages the data collected? (administrator, decision-maker, users, and so on)
06. is it possible to identify who or what accesses the data collected? (things, software systems, or users)
07. is the user interface device that displays the data described? (dashboard, smartphone, tablet, and so on)
08. is it possible to identify who is viewing the data? (things, software systems, users, and so on)
09. is it possible to identify the source from which the data is provided? (chairs, tables, automobiles, houses, buildings, and so on)
10. are the roles involved in the system described? (things, software systems, users, and so on)
11. is there any description of each actor in the specified scenarios?
12. is it possible to identify the source of data provision?
13. has each action within the scenario been described clearly, containing no extraneous information?
14. is there a sequence of ambiguous actions in the scenarios?
15. are the actors described in the scenarios consistent with the actors described in the arrangements? (things, software systems, users)
16. are the scenarios related to the arrangements consistently?
17. do the scenarios seek to be accurate by presenting a title and flows? (presenting the purpose and actions of the system directly and explicitly)
18. are adverbs avoided so as not to generate more than one possible interpretation of the scenarios? (probably, possibly, supposedly)
19. are condition terms (such as "if", "go to", "while") used correctly?
20. when words like "things" and "data" are used in the scenario, do they have the same meaning in other parts of the same scenario?
21. is it possible to identify "things" described with one function in the arrangements representing another function in the described scenarios?
22. are the alternative and/or exception flows described?
23. does the scenario specification identify the matching arrangement id? (iia-1, iia-2, ..., iia-9)

the second part (specific questions) considers the non-functional properties (iot facets) of iot software systems discussed in motta et al. (2019a). table 2 presents the questions of the second part of the scenariotcheck checklist.

table 2. specific questions on iot facets.

24. is it possible to identify the specific context in which the system is embedded? (smart room, smart greenhouse, autonomous vehicle, healthcare, and so on)
25. are the limitations of the environment described? (e.g., lack of connectivity structure, lack of hardware structure, inadequate infrastructure)
26. are the technologies associated with system objects described? (smartphones, smartwatches, wearables)
27. are the events that the system handles identified? (e.g., turning an object on/off, sending data)
28. what kind of communication technology does the system use in the scenarios? (bluetooth, intranet, internet, ...)
29. does the proposed communication technology meet the geographic/physical specifications of the system? (large, medium, or small scale)
30. is it possible to identify how the system will react to changes in the environment?
31. are the interactions between the system and the environment represented in the scenarios?
32. is it possible to identify the interaction between actors?

after specifying the iot scenarios, inspectors can apply the scenariotcheck technique to verify the scenario descriptions. the identified non-conformities are described in the inspection report. finally, after the discrimination meeting (defects identification), the iot scenario specification document is corrected. the application process of the two techniques is shown in figure 2.

figure 2. application process of the scenariot and scenariotcheck techniques (souza et al. 2019a).

scenariotcheck complements scenariot by providing a template for the specification of iot scenarios. this template resembles a use-case description document with some additional fields: i) identification of the iot software system elements; ii) problem domain description; iii) role description of each actor involved in the scenario; and iv) descriptions of the interaction between the actors (end-users, things, software systems, others) and the iot software system.
3 the retiot

the retiot (requirements engineering technology for the internet of things based software systems) comprises the techniques described in section 2, a construction process, and templates to build the requirements document following re principles. the construction process of the requirements document is based on the main re phases (pressman and maxim 2014) (sommerville 2015): conception/design, elicitation, analysis, specification, negotiation, verification, validation, and management. however, retiot adapts these and includes new activities to meet the specificities of iot software systems.

3.1 construction process

the current version of the technology encompasses product ideation and evaluation concepts, such as low- and high-level prototypes and the creation of mvps (minimum viable products) for the desired product. in addition, the construction process incorporates aspects and characteristics inherent to iot software systems found in the literature review. it also covers the different re phases (pressman and maxim 2014) (sommerville 2015) through an engineering cycle divided into four phases: iot ideation, conception, and elicitation; iot analysis and specification; iot negotiation and evaluation; and management. figure 3 presents an overview of the construction process engineering cycle with two dimensions: main and transversal (performed in parallel). the main dimension corresponds to the activities and tasks required to build the iot requirements document.

figure 3. construction process overview: phases.

the transversal dimension (see figure 4) offers three management activities and tasks focused on artifact and process management. the activities and tasks do not have a specific, predetermined time to be performed; everything depends on the need identified by the user through the main process flow. the technology proposes version control of the artifacts and traceability between requirements, iot scenarios, iot interaction arrangements, and iot use cases in the management phase. besides, retiot offers change management so that modifications in requirements can be reflected in the generated artifacts. the construction of the iot requirements document is performed iteratively and incrementally. thus, the engineering cycle is executed three times, where each execution is called a stage.

figure 4. management phase overview.

3.2 construction process stages

each stage performs the common phases (see figure 3 and figure 4) of the engineering cycle, generating intermediate artifacts. the result of stage 3 is the iot requirements document. however, each stage has specific objectives, activities, and tasks. the construction process can be executed for one idea or a set of requirements (see figure 5). in the first case, the process execution yields an intermediate version of the iot requirements document. also, the construction process can be adapted for use in different contexts; for example, this proposal can be applied with any development methodology. in projects that use an agile methodology, iot use cases may not be applicable and may demand more cost and effort. in these contexts, iot use cases are not mandatory, and the activities and tasks that support building them can be skipped. this can have both positive impacts (decreased time and effort) and negative ones (absence of important information). therefore, the user of the process needs to evaluate these impacts.
besides, the current retiot version integrates ten templates: eight of them are defined/adapted from the project templates currently used in projects of the ese group/pesc/coppe and other templates used by the software engineering group/pesc/coppe, and two of them were defined by the scenariotcheck technique (souza 2020). figure 6 shows an overview (idef0 diagram) of the three stages with their inputs, outputs, templates (presented in the next paragraphs), and controls (management procedures and feasibility strategy). the management phase performs the management procedures. the feasibility strategy represents the milestone of each stage.

3.2.1 stage 1

the first stage is to understand the problem. it aims to understand the problem or opportunity, analyze the stakeholders and their needs, elicit the business needs, and carry out the project feasibility analysis. it is composed of 12 activities and 27 tasks distributed throughout the engineering cycle. figure 7 presents an overview of the activities performed in the first stage. this stage offers three templates: iot canvas, iot project feasibility analysis, and requirements checklist. its milestone is the feasibility analysis, performed by four activities (analyze market demand, analyze economic feasibility, analyze impact and risks, and analyze technical feasibility).

figure 5. construction process overview: stages.
figure 6. idef0 diagram of the three stages.
figure 7. first stage overview of the construction process.

3.2.2 stage 2

the second stage is to describe the solution. it aims to transform business needs, stakeholders' needs, and general requirements into detailed, classified, and organized requirements. iot scenarios, arrangements, and components are used for the specification and are verified during this stage. subsequently, the requirements are negotiated and evaluated, attesting that a common understanding of the system has been reached. this stage is composed of 12 activities and 39 tasks distributed throughout the engineering cycle. the scenariot technique (silva 2019) supports the requirements identification and the description of the system behavior. this technique is executed during the following activities: define iot scenarios and specify iot scenarios. figure 8 presents an overview of the activities performed in the second stage. this stage defines three templates: iot project detail, iot solution proposal, and change analysis report. the scenariotcheck technique (souza 2020) contributes two templates (a verification checklist template and an inspection record template) used in this stage during the verify iot scenarios activity. its milestone is the low-level prototype, produced by the activity "define low-fidelity prototype." this stage presents optional activities since the construction process can be used with any development methodology.

3.2.3 stage 3

the third stage is to detail the solution. it transforms iot requirements and scenarios into iot use cases. the iot use case diagram, the list of iot use cases, and their descriptions are generated during this stage. subsequently, the generated artifacts are checked and evaluated, attesting that a common understanding of the system has been achieved. this stage is composed of ten activities and 24 tasks distributed throughout the engineering cycle.
figure 9 presents an overview of the activities performed in the third stage. two templates are defined for it: iot use case description and iot diagram and use cases checklist. also, the change analysis report can be used in this stage. this stage's milestone is the high-level prototype, produced by the activity "define an evolved prototype." this stage presents optional activities since the construction process can be used with any development methodology.

figure 8. second stage overview of the construction process.
figure 9. third stage overview of the construction process.

4 evaluating the retiot's templates feasibility

retiot aims to support software engineers during re activities. the main techniques presented in section 2 and used to compose the software technology have already been empirically evaluated and used in iot software system projects (souza et al. 2019a) (souza et al. 2019b). however, the inclusion of new facilities to support re with retiot requires an initial observation before using them in projects and conducting further experimental studies. thus, this section presents a feasibility study of the retiot templates.

4.1 templates

in this feasibility study, we considered the structure of two artifact templates for conventional software systems, the requirements list (rl) and the iot use-cases description (iotucd1), which were used in iot software system projects. we compared their structure with the structure of the retiot templates: project scope (ps), solution proposal (sp), and iot use-cases description (iotucd2). the full versions of all templates are available at http://bit.ly/393sghx.

4.1.1 the retiot templates

this section presents three retiot templates (silva et al. 2020b) regarding the activities of elicitation (eli): the "project scope (ps)" template (see figure 10); and analysis (ana) and specification (spe): the "solution proposal (sp)" template (see figure 11) and the "iot use-cases description (iotucd)" template (see figure 12). the conception/design (con), negotiation (neg), and validation (val) activities are minimally covered by the "project scope" template. the "solution proposal" and "iot use-cases description" templates support the management (man) activities, maintaining traceability between requirements and analysis models. besides, the techniques described in section 2 support the elicitation (eli), specification (spe), and verification (ver) activities. the following items present the global description of the templates (project scope, solution proposal, and iot use-cases description) as defined in retiot.

• project scope template: this template supports the documentation of the project's initial activities, the problem to be solved, those involved in the project, the user profiles, user needs, and business needs. in addition, it includes identifying and describing system requirements (functional, non-functional, restrictions, others) and business rules. also, the requirements document's validation is made through an explicit agreement (signature or email copy). finally, it provides two fields (status and priority) to support the negotiation of functional and non-functional requirements. figure 10 presents an extract of this template.

figure 10. extract from the "project scope" template.
the proposed template is used in the activities of conception/design (con), elicitation (eli), negotiation (neg), and validation (val).

• solution proposal template: this template supports the description of the solution. it identifies and describes, using the scenariot technique, the iot scenarios, the iot components, and the iot interaction arrangements (iias) of the system. also, it provides the details of the iias chosen for each iot scenario via the corresponding catalogs. thus, the traceability between requirements, iot scenarios, iot interaction arrangements, and their respective catalogs is maintained. figure 11 presents an extract of this template. it should be used in the elicitation (eli), analysis (ana), specification (spe), and management (man) activities to identify, describe, and refine the system's behavior while maintaining requirements traceability.

figure 11. extract from the "solution proposal" template.

• iot use-cases description template: this template includes the description of the iot use cases. use cases are identified and described, providing a view of the system's behavior. in addition, the use-case diagram is inserted in this template. traceability between requirements, iot scenarios, iot interaction arrangements, and iot use cases is maintained. figure 12 presents an extract of this template. it should be used in the analysis (ana), specification (spe), verification (ver), and management (man) activities. the scenariotcheck technique is applied during the verification activities to identify inconsistencies in the description of the iot scenarios and their components and in the choice of iias.

figure 12. extract from the "iot use-cases description" template.

4.1.2 projects and teams

the rl and iotucd1 artifacts were built using conventional templates in three iot-based software projects:

• project a supports the collection of environmental markers (e.g., temperature, humidity, particulates, co2 level, and toxic gases).
• project b monitors a high-performance computing environment (data center), collecting information such as temperature, humidity, energy consumption, and energy supply quality.
• project c collects temperature, humidity, wind speed, and wind direction in different city regions.

all three projects represent real demands. a stakeholder (totally external to the course and the research group) worked with the developers, including on the requirements acceptance. undergraduate students produced the rl and iotucd1 artifacts during a software engineering course at ufrj. the course had the participation of 21 students in the fourth year of information and computer engineering. the subjects were organized into three development teams with seven participants each. the teams were balanced, with participants having equivalent levels of knowledge and skills regarding software and hardware. training on different topics in software engineering and mentoring throughout the project were available. there was no intervention by the mentors in the artifacts' content. all ethical issues and consent forms were addressed and made available. the course's topics included requirements engineering, iot scenarios, a verification technique for iot systems, and uml (unified modeling language) diagrams, among others. in addition, the scenariot and scenariotcheck techniques were presented to the participants, although they were not required to use them. the teams were free to organize their projects.
the requirements document represented one of the design milestones. a minimum viable product (mvp) represented one of the concrete results delivered at the end of the course.

4.2 execution

the researchers (the paper's authors) analyzed the requirements documents (rl and iotucd1 artifacts) after the three teams constructed them. the information found in the generated artifacts was compared with the information requested in the structure of the project scope (ps), solution proposal (sp), and iot use-cases description (iotucd2) templates. a working checklist, presented in the next subsection, was used to compare the templates. three researchers carried out the comparison – two master's students and one ph.d. who work in the software engineering and iot domains. after that, a fourth researcher (ph.d. and expert in the software engineering and iot domains) reviewed the analysis of the results.

4.3 results and discussion

table 3 presents the checklist used to compare the template structures (conventional and retiot) and the analysis result. it indicates that:
• the rl template does not address the project/system objective and the problem domain. however, knowing the problem domain is essential for building an iot software system (motta et al. 2019) (nguyen-duc et al., 2019).
• the rl template presents a partial description of the stakeholders. however, it does not include profile descriptions of the different users, which are important for system development and user interface design.
• the rl template does not address the description of business/stakeholder needs. the identification of business/stakeholder needs represents the initial stage of the project. in this step, we seek to understand the client's real need, which will be transformed into system requirements in the future.
• retiot allows identifying the requirements that will guide the iot solution from the beginning (project scope template), unlike the rl template, which does not identify iot requirements.

table 3. mapping checklist of the template structure.
project/system information | conventional templates (rl, iotucd1) | retiot templates (ps, sp, iotucd2)
project name/project responsible: t t t t t
version control: t t t t t
explicit agreement: t t
project/system objective: n t
problem domain: n t
project scope: t t
glossary: t t
stakeholders description: p t
business and stakeholders needs description: n t
functional requirements: p t
non-functional requirements: t t
requirements negotiation (prioritization): t t
business rules: n t t t
project analyses: n p
iot scenarios: p t
iot components description: n t t
iot interaction arrangements: p t t
iot use-cases diagram: n t
iot use-cases description: n t
requirements traceability: p p t
references (other project documents): t n
legend: p partially collected; t totally collected; n does not collect information; gray not applicable for this template.

• the iotucd1 template treats iot scenarios and iot arrangements partially but does not address the iot components' description. in contrast, retiot treats this information entirely in the solution proposal (sp) and iot use-cases description (iotucd2) templates.
• conventional templates do not treat the iot use-cases diagram and iot use-cases description, but retiot fully treats them.
• requirements traceability is partially treated by the iotucd1 template, partially treated by retiot in the sp template, and fully treated in the iotucd2 template.
• the rl template presents a field for references (other documents), which retiot does not address.
the different convergence and divergence points between conventional system templates (rl and iotucd1) and retiot templates (ps, sp, iotucd2) offer indications that the retiot can be more robust because it deals with iot information since the beginning of the project. according to the results, retiot presents a good potential for supporting iot software systems' specifications (silva et al. 2020a) (silva et al. 2020b), because its templates capture specific iot information, differently from conventional ones.

4.4 templates' evolution

this study allowed us to evolve the existing templates by reorganizing sections and inserting new sections and new fields. first, we identified missing information and redundancies in the templates. then, some fields were added, moved, or removed. besides that, we started to think about mvps and prototypes applied to iot projects, which caused changes in some templates, and we added the "iot canvas" template to the technology. in the second version, we started to think about requirements negotiation, reuse, and traceability. consequently, we identified the need to insert new fields to better address these points. also, the templates of the second version did not cover project feasibility and requirements verification, because the technology did not cover these points. to fill these gaps, we proposed the "project feasibility analysis," "requirements checklist," and "iot diagram and use cases checklist" templates. finally, we identified the need to register and track requirements changes. the second version presented activities and tasks to support this, but no template was defined. in that sense, one more template, named "change analysis report," was defined to support these activities. in addition, the project scope template was renamed to iot project detail and the solution proposal template to iot solution proposal. in the iot project detail template, we included a new field, "project description." the "glossary" and "stakeholders" sections have been changed to include fields to support the capturing of specific information. the "potential stakeholders" section has been renamed to "stakeholders" and now includes two new fields to capture each stakeholder's interest and influence in the system. the "project scope" section has been removed from the template. the "canvas iot" section was added, allowing the insertion of an image or photo of the iot canvas built in stage 1. new fields ("reused requirement?" and "related requirement id") have been added to the "system requirements" section to enable requirements traceability (functional and non-functional). in addition, for functional requirements, two fields ("cost" and "effort") were added to make negotiation feasible. in the previous version, requirements were classified into iot requirements and non-iot requirements. in the new version, this classification has been removed, and the "iot characteristic" field has been included. therefore, when describing a non-iot requirement, this field should not be filled; for an iot requirement, the iot characteristic must be described as identification, sensing, performance, connectivity, or processing. in addition, the "dependency between requirements" field has also been added to the "functional requirements" section.
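to make the evolved "functional requirements" section concrete, the sketch below models one requirement entry with the fields just introduced; all type and field names are our own illustrative assumptions, not retiot definitions.

import java.util.List;

// illustrative only: the field names mirror the template fields above,
// but this type is not part of retiot
enum IotCharacteristic { IDENTIFICATION, SENSING, PERFORMANCE, CONNECTIVITY, PROCESSING }

record FunctionalRequirement(
        String id,
        String description,
        boolean reused,                      // "reused requirement?" field
        String relatedRequirementId,         // "related requirement id" field
        int cost,                            // "cost" field (supports negotiation)
        int effort,                          // "effort" field (supports negotiation)
        IotCharacteristic iotCharacteristic, // left null for non-iot requirements
        List<String> dependsOn               // "dependency between requirements"
) {}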
the non-functional requirement "scalability" was added to the new template. the requirements "portability and compatibility" and "security and privacy" have been adapted. the section "annex non-functional requirements" has been added to support the identification of non-functional requirements. the section "scope not covered by the project" has been added, and the section "project analysis" has been removed. in the "business rules" section, the "related needs id" field has been added to allow business rules' traceability. in the iot solution proposal template, the fields "actors," "actions," and "interaction arrangements" were added in the "iot scenarios" section. the section "iot system components" was removed because it duplicated the arrangement catalogs' information. the "related functional requirements," "precedencies," and "dependencies" fields have been added in the "iot scenarios description" section to enable the traceability of iot scenarios. the field "collected data and actions performed" was divided into two fields: "collected data" and "actions performed." the "interaction sequence" field was changed because it resembles a use-case structure (main, alternative, and exception flows). finally, the "environment" and "connectivity" fields have been removed. in the iot use-cases description template, the "business rules" field was moved from the "interaction sequence" section to a separate section. in addition, the section "customer or customer representative agreement" has been added to this template. five new templates were also defined to support the construction process activities that had not yet been contemplated. the new templates (see section 3) correspond to iot canvas, iot project feasibility analysis, requirements checklist, change analysis report, and iot diagram and use cases checklist. table 4 shows each change and the rationale for it. however, to ensure the technology's validity, further experimental evaluation is necessary to verify whether the retiot construction process with the templates is useful, complete, correct, and intuitive.

4.5 threats to validity

an internal validity threat is the study design itself, even though experimental studies have evaluated part of the retiot technology. still, the results indicate that the retiot templates can capture more relevant information than conventional templates regarding project artifacts. an external validity issue concerns the participants (undergraduate students) who were invited to participate in the study. we cannot claim that the information provided is complete from the project's point of view, nor that the participants understood all the topics taught during the course. to mitigate this threat, the projects treated in the study represented real problems. besides, each team had contact with a stakeholder of each addressed problem. regarding construct validity, there was no control over the creation of the artifacts produced during the course and used in the study. to mitigate this threat, however, the projects were equivalent in size, complexity, and iot technologies used. also, it can be highlighted that the teams received equivalent training and mentoring in re. finally, the conclusion validity concerns the study interpretation and sample size. we had a small and inhomogeneous sample, so it was impossible to apply statistical tests to carry out a deeper analysis of the results obtained. also, the study conclusion is limited to the researchers' interpretation. these items limit the generalization of the study results.
to mitigate this threat, we aim to perform future experimental studies to collect feedback on the retiot.

5 literature analysis

5.1 re phases

this section presents related works found in the technical literature, which address technologies for the different re phases mentioned above. table 5 presents a comparison of seventeen (17) technologies found in the technical literature. we can observe that the conception, negotiation, verification, validation, and management phases need more attention regarding iot concepts and characteristics. figure 13 synthesizes the information presented in table 5, showing the number of technologies per re phase. again, we can highlight that a high number permeate the elicitation (nine), analysis (ten), and specification (eight) phases. in contrast, a small number is concentrated in the conception/design (four), negotiation (one), verification (five), validation (three), and management (three) phases.

figure 13. technologies x re phases.

regarding the conception phase (con), the gsem-iot (zambonelli 2017) (laplante et al. 2018) and ignite (giray et al. 2018) technologies carry out the analysis of the stakeholders involved in the system. in addition, feasibility analysis is partially addressed by the iot methodology (giray et al., 2018). also, the ignite and core (hamdi et al. 2019) technologies provide business analysis mechanisms.

table 4. templates' evolution.
template name | previous element | new element | change description | rationale
iot project detail | | project description | included new field | allow getting a simple and brief description of the project
iot project detail | glossary and stakeholders | | included fields to support the capturing of specific information | enable capturing specific information about terms used in the project and about stakeholders
iot project detail | potential stakeholders | stakeholders | changed the section name and included two new fields to capture interest and influence | simplify this section and capture more data about the stakeholders
iot project detail | project scope | | removed from the template | avoid redundancy; the described requirements can identify the project scope
iot project detail | | canvas iot | included new field | allow the insertion of an image or photo of the iot canvas built in stage 1
iot project detail | | reused requirement? and related requirement id | new fields added to the "system requirements" section | enable requirements' reuse and traceability
iot project detail | | cost and effort | included new fields | make negotiation feasible
iot project detail | iot requirements and non-iot requirements | iot characteristic (only for iot requirements) | the classification has been removed, and the "iot characteristic" field has been included | simplify this section and enable capturing identification, sensing, performance, connectivity, and processing characteristics of requirements
iot project detail | | dependency between requirements | field added to the "functional requirements" section | enable requirements' traceability
iot project detail | non-functional requirements | "scalability", "portability and compatibility", and "security and privacy" | the section has been improved | the fields' descriptions have been adapted and improved to better attend iot systems
iot project detail | | annex non-functional requirements | included new section | support the identification of non-functional requirements
iot project detail | | scope not covered by the project | included new section | enable capturing and describing the scope not covered by the project
iot project detail | project analysis | | the section has been moved to another template | a new template (iot project feasibility analysis) has been created;
as a result, the information presented in this section has been adapted and moved to a specialist template
iot project detail | | related needs id | a new field has been added to the "business rules" section | allow business rules' traceability
iot solution proposal | | actors, actions, and interaction arrangements | new fields added in the "iot scenarios" section | enable traceability between iot scenarios' information
iot solution proposal | iot system components | | this section has been removed | avoid redundancy with the arrangement catalogs' information
iot solution proposal | | related functional requirements, precedencies, and dependencies | new fields added in the "iot scenarios description" section | enable traceability between iot scenarios
iot solution proposal | collected data and actions performed | collected data; actions performed | the field was divided into two fields | simplify this section and separate specific information
iot solution proposal | interaction sequence | | removed alternative and exception flows | simplify this section, because it is like a use-case structure (main, alternative, and exception flows)
iot solution proposal | environment and connectivity | | these sections have been removed | simplify this section; we believe this information is not relevant at this point and must be collected during the projects' design phase
iot use-cases description | business rules | | the field was moved from the "interaction sequence" section to a separate section | simplify this section
iot use-cases description | | customer or customer representative agreement | this section has been added | enable getting an explicit agreement about the iot diagram and use cases
iot canvas | | | this template has been added | enable supporting the projects' description and idea validation in an easy and fast way
iot project feasibility analysis | | | this template has been added | enable supporting projects' decision-making about feasibility regarding market demand, cost, impact, risks, and technology
requirements checklist | | | this template has been added | enable verifying whether requirements are correct, understandable, and consistent
change analysis report | | | this template has been added | enable managing requirements' changes through the project life cycle
iot diagram and use cases checklist | | | this template has been added | enable verifying whether the iot diagram and use cases are correct, understandable, and consistent

table 5. technologies x re phases.
technology | con | eli | neg | ana | spe | ver | val | man
(aziz et al. 2016) | | | | x | x | | | x
(mahalank et al. 2016) | | | | | x | | |
(takeda and hatakeyama 2016) | | | | x | x | | |
(touzani and ponsard 2016) | | | | x | | | |
iot-rml (costa et al. 2017) | | | | x | x | x | |
(yamakami 2017) | | | | | | x | |
gsem-iot (zambonelli 2017) | x | x | | | | | |
(carvalho et al. 2018) | | | | | | x | |
(curumsing et al. 2019) | | x | | x | | x | | x
ignite (giray et al. 2018) | x | x | x | x | x | | x |
iot methodology (giray et al. 2018) | x | x | | | | | x |
(laplante et al. 2018) | x | x | | x | | | x |
iotreq (reggio 2018) | | x | | x | x | | |
core (hamdi et al. 2019) | | x | | x | | | |
scenariot (silva 2019) | | x | | x | x | | |
scenariotcheck (souza 2020) | | | | | | x | |
trustapis (ferraris and fernandez-gago 2020) | | x | | | x | | | x

several technologies address the elicitation phase (eli): ignite (giray et al. 2018), iot methodology (giray et al. 2018), (laplante et al. 2018), iotreq (reggio 2018), (curumsing et al. 2019), core (hamdi et al. 2019), scenariot (silva 2019), and trustapis (ferraris and fernandez-gago 2020), which offer resources for collecting requirements.
in addition, gsem-iot (zambonelli 2017), iotreq, and iot methodology propose mechanisms to transform users' needs into requirements. for the negotiation phase (neg), ignite (giray et al., 2018) addresses impact and risk analysis but does not provide further details on conducting this activity. in the analysis phase (ana), the (takeda and hatakeyama 2016) and (touzani and ponsard 2016) technologies, ignite (giray et al. 2018), (laplante et al. 2018), iotreq (reggio 2018), (curumsing et al. 2019), and core (hamdi et al. 2019) use uml diagrams to develop the analysis models. the scenariot technology (silva 2019) comprises scenario analysis based on iot interaction arrangements. the works of (aziz et al. 2016) and iot-rml (costa et al., 2017) address the reuse of artifacts and models. the specification phase (spe) is addressed by several technologies: (takeda and hatakeyama 2016), iot-rml (costa et al. 2017), iotreq (reggio 2018), and trustapis (ferraris and fernandez-gago 2020) use formal models for specifying requirements. the technologies proposed by (aziz et al. 2016), (mahalank et al. 2016), and ignite (giray et al. 2018) provide templates for specifying requirements. scenariot (silva 2019) proposes scenario specification using iot interaction arrangements. in the verification phase (ver), we found that (carvalho et al. 2018) and scenariotcheck (souza 2020) propose mechanisms to verify requirements. the technologies proposed by (yamakami 2017), iot-rml (costa et al. 2017), (carvalho et al. 2018), and (curumsing et al. 2019) offer mechanisms for checking conflicts between requirements. the validation phase (val) is addressed by ignite (giray et al., 2018), iot methodology (giray et al., 2018), and (laplante et al., 2018), which propose prototyping techniques to ensure that the product meets users' needs. for the management phase (man), (aziz et al. 2016), (curumsing et al. 2019), and trustapis (ferraris and fernandez-gago 2020) offer mechanisms to enable traceability. in addition, trustapis also provides a mechanism for requirements change management.

5.2 techniques and methods

a quasi-systematic literature review (lim et al. 2018) identified 12 relevant publications and 37 elicitation techniques normally applied in iot systems development. the most frequently used techniques are interviews and prototypes, where the latter can also be used to validate requirements. we can also highlight other techniques and methods applied during the elicitation phase: scenarios, use cases, and frameworks. this work also presents a brief contribution regarding conflict resolution among stakeholders. the authors emphasize using interview and prototyping techniques to encourage discussions and find alternative ways to identify conflicts. in this way, we analyzed the 17 technologies to identify which techniques/methods are used, and in which re phases, in iot systems development. figure 14 shows our findings, where we can observe 14 items; the most used are process (thirteen), use cases (eight), and models (seven).

figure 14. technologies x techniques/methods.

table 6 shows in which re phases the techniques/methods found are applied. the elicitation (28), analysis (30), and specification (22) phases offer a greater number of techniques/methods. it is important to highlight that some technologies offer more than one technique for one or more re phases.
retiot permeates the eight phases previously described, offering methodological and technical support through its construction process, techniques, and templates. analyzing the current version of retiot (see section 3), we can say that it proposes and integrates several techniques/methods: prototyping, the iot canvas, iot scenarios based on the iot scenario specification technique scenariot (silva 2019), use-case diagrams and descriptions, templates, the iot scenario inspection technique scenariotcheck (souza 2020), and a construction process.

6 research opportunities

analyzing the technologies found in the technical literature, we can observe that only one technology discusses the negotiation phase. it represents a research opportunity. few technologies offer project management, validation, test case elaboration, and decision-making related to the system's design and architecture. these topics can be explored through future research. we can also observe that not all technologies cover all re activities, and they present gaps regarding the different activities necessary to build an iot system requirements document. among these gaps, we can observe the lack of: i) methodological support for the design and ideation of iot products (nguyen-duc et al. 2019); ii) stakeholder identification and description and business needs (silva et al. 2020b); iii) iot system characteristics and behaviors (motta et al. 2019a), as well as requirements refinement; iv) high-level (new iot interaction arrangements) and low-level (iot use-case diagram) analysis models; v) project feasibility analysis (silva et al. 2020); vi) prototypes, as suggested by (nguyen-duc et al. 2019) (lim et al. 2018); and vii) explicit agreements with the client (silva et al. 2020). these technologies also do not fully meet the iot software systems' specificities and characteristics: i) the components and actors' description (curumsing et al. 2019) (aziz et al. 2016); ii) the description of the behaviors of the different levels of each object (curumsing et al. 2019) (reggio 2018); iii) the identification of the systemic characteristics (sensing, identification, performance, processing, and connectivity); and iv) the detailed specification of each feature.

7 conclusion and future works

this paper presented the retiot. it provides a construction process, techniques (iot scenario specification and verification techniques), and tools (templates) to support the requirements engineering of iot software systems. besides, this work sought to accomplish an initial observation of this technology, focusing on analyzing and evaluating only the templates. a feasibility study was performed to compare three templates defined in the second version of retiot with conventional software system templates (not specific to iot software systems). their comparison provided indications that the artifacts generated by retiot may be complete regarding the capture of iot information.

table 6. techniques/methods x re phases (column order: con, eli, neg, ana, spe, ver, val, man).
interview: 2
prototyping: 3
canvas: 1 1
scenarios: 1 3
use cases: 2 2 7 1
class diagram: 1
activity diagram: 1
state diagram: 1
sysml language: 1 1 1
formal models: 2 4 2 1 1
templates: 2 3 5 1 2
goal model: 2 4 1 1
framework: 2 3 1 2 1 2
catalogs: 1 1 1
process: 3 11 1 9 7 3 3 3
total: 9 28 2 30 22 7 9 7

the experimental study was planned to analyze the process and templates of retiot. however, it was not possible to conduct this study due to the covid-19 pandemic.
some of the future works planned for the retiot are:
i) the (re)design and execution of experimental studies to evaluate the technology in more robust iot software system projects (in both academic and industrial contexts). a comparative study of the retiot with traditional technologies will be carried out to verify the efficiency and effectiveness of the retiot in terms of capturing relevant system and project information. such a study should also evaluate the retiot's usefulness and suitability according to the users' perception;
ii) integrating retiot with a testing technique to support software engineers with the specification of context-aware test cases – cats# (context-aware test suite design) (doreste and travassos 2020); and
iii) developing tooling support integrating the construction process, the iot scenario specification and verification techniques, and the templates. the tool will facilitate traceability among iot requirements, iot interaction arrangements, iot scenarios, and iot use-cases.

acknowledgments

the authors would like to thank the national council for scientific and technological development (cnpq). taisa gonçalves received a postdoctoral scholarship (154004/2018-9). prof. travassos is a cnpq researcher (304234/2018-4).

references

alexander i, maiden n (2004) scenarios, stories, and use cases: the modern basis for system development. computing and control engineering 15:24–29. https://doi.org/10.1049/cce:20040505
arif s, khan q, gahyyur sak (2009) requirements engineering processes, tools/technologies, & methodologies. international journal of reviews in computing 2:41–56
aziz mw, sheikh aa, felemban ea (2016) requirement engineering technique for smart spaces. in: international conference on internet of things and cloud computing. acm press, cambridge, united kingdom, p 54:1-54:7
behrens h (2002) requirements analysis using statecharts and generated scenarios. in: doctoral symposium at ieee joint conference on requirements engineering
burg jfm, van de riet rp (1996) a natural language and scenario based approach to requirements engineering. in: proceedings of workshop in natuerlichsprachlicher entwurf von informationssystemen
carvalho rm, andrade rmc, oliveira km (2018) towards a catalog of conflicts for hci quality characteristics in ubicomp and iot applications: process and first results. in: 12th international conference on research challenges in information science (rcis). ieee, nantes, pp 1–6
costa b, pires pf, delicato fc (2017) specifying functional requirements and qos parameters for iot systems. in: 15th intl conf on dependable, autonomic and secure computing, 15th intl conf on pervasive intelligence and computing, 3rd intl conf on big data intelligence and computing and cyber science and technology congress. ieee, orlando, fl, pp 407–414
curumsing mk, fernando n, abdelrazek m, et al (2019) emotion-oriented requirements engineering: a case study in developing a smart home system for the elderly. journal of systems and software 147:215–229. https://doi.org/10.1016/j.jss.2018.06.077
doreste ac de s, travassos gh (2020) towards supporting the specification of context-aware software system test cases. in: xxiii ibero-american conference on software engineering. springer, curitiba, brazil (online), p s10 p1:8 pages
ferraris d, fernandez-gago c (2020) trustapis: a trust requirements elicitation method for iot. international journal of information security 19:111–127.
https://doi.org/10.1007/s10207-019-00438-x
giray g, tekinerdogan b, tüzün e (2018) iot system development methods. in: hassan q, khan ar, madani sa (eds) internet of things. crc press/taylor & francis, new york, pp 141–159
glinz m (2000) improving the quality of requirements with scenarios. in: proceedings of the second world congress on software quality. pp 55–60
hamdi ms, ghannem a, loucopoulos p, et al (2019) intelligent parking management by means of capability oriented requirements engineering. in: wotawa f, friedrich g, pill i, et al. (eds) advances and trends in artificial intelligence. from theory to practice. iea/aie 2019. springer international publishing, cham, pp 158–172
laplante nl, laplante pa, voas jm (2018) stakeholder identification and use case representation for internet-of-things applications in healthcare. ieee systems journal 12:1589–1597. https://doi.org/10.1109/jsyst.2016.2558449
lim t-y, chua f-f, tajuddin bb (2018) elicitation techniques for internet of things applications requirements: a systematic review. in: vii international conference on network, communication and computing. acm press, taipei city, taiwan, pp 182–188
mahalank sn, malagund kb, banakar rm (2016) non functional requirement analysis in iot based smart traffic management system. in: international conference on computing communication control and automation. ieee, pune, india, pp 1–6
motta rc, oliveira km, travassos gh (2019a) on challenges in engineering iot software systems. journal of software engineering research and development 7:5:1-5:20. https://doi.org/10.5753/jserd.2019.15
motta rc, oliveira km, travassos gh (2020) towards a roadmap for the internet of things software systems engineering. in: proceedings of the 12th international conference on management of digital ecosystems. acm, virtual event, united arab emirates, pp 111–114
motta rc, silva vm, travassos gh (2019b) towards a more in-depth understanding of the iot paradigm and its challenges. journal of software engineering research and development 7:3:1-3:16. https://doi.org/10.5753/jserd.2019.14
nguyen-duc a, khalid k, shahid bajwa s, lønnestad t (2019) minimum viable products for internet of things applications: common pitfalls and practices. future internet 11: paper 50. https://doi.org/10.3390/fi11020050
pandey d, suman u, ramani ak (2010) an effective requirement engineering process model for software development and requirements management. in: international conference on advances in recent technologies in communication and computing. ieee, kottayam, india, pp 287–291
pressman rs, maxim b (2014) software engineering: a practitioner's approach, 8th edition. mcgraw-hill education, new york, ny
reggio g (2018) a uml-based proposal for iot system requirements specification. in: 10th international workshop on modelling in software engineering. acm press, gothenburg, sweden, pp 9–16
silva dv da, goncalves tg, rocha arc da (2019) a requirements engineering process for iot systems. in: xviii brazilian symposium on software quality. acm press, fortaleza, brazil, pp 204–209
silva dv da, gonçalves tg, travassos gh (2020a) a technology to support the building of requirements documents for iot software systems. in: xix brazilian symposium on software quality. acm press, são luís, brazil (online), article no 4, pp 1-10
silva dv da, souza bp de, gonçalves tg, travassos gh (2020b) uma tecnologia para apoiar a engenharia de requisitos de sistemas de software iot.
in: xxiii ibero-american conference on software engineering. curitiba, brazil (online), p s09 p3:14 pages
silva vm (2019) scenariot: support for scenario specification of internet of things-based software systems. master's dissertation, federal university of rio de janeiro
sommerville i (2015) software engineering, 10th edition. pearson, harlow
souza bp (2020) scenariotcheck: uma técnica de leitura baseada em checklist para verificação de cenários iot. master's dissertation, federal university of rio de janeiro
souza bp de, motta rc, costa d, travassos gh (2019a) an iot-based scenario description inspection technique. in: xviii brazilian symposium on software quality. acm press, fortaleza, brazil, pp 20–29
souza bp de, motta rc, travassos gh (2019b) the first version of scenariotcheck: a checklist for iot based scenarios. in: xxxiii brazilian symposium on software engineering. acm press, salvador, brazil, pp 219–223
takeda a, hatakeyama y (2016) conversion method for user experience design information and software requirement specification. in: marcus a (ed) design, user experience, and usability: design thinking and methods. duxu 2016. springer, cham, pp 356–364
touzani m, ponsard c (2016) towards modelling and analysis of spatial and temporal requirements. in: 24th international requirements engineering conference. ieee, beijing, china, pp 389–394
vegendla a, duc an, gao s, sindre g (2018) a systematic mapping study on requirements engineering in software ecosystems. journal of information technology research 11:4:1-4:21. https://doi.org/10.4018/jitr.2018010104
yamakami t (2017) horizontal requirement engineering in integration of multiple iot use cases of city platform as a service. in: 2017 ieee international conference on computer and information technology (cit). ieee, helsinki, finland, pp 292–296
zambonelli f (2017) key abstractions for iot-oriented software engineering. ieee software 34:38–45. https://doi.org/10.1109/ms.2017.3

journal of software engineering research and development, 2022, 10:8, doi: 10.5753/jserd.2022.2133
this work is licensed under a creative commons attribution 4.0 international license.

accessibility mutation testing of android applications

henrique neves da silva [ federal university of paraná | henriqueneves@ufpr.br ]
silvia regina vergilio [ federal university of paraná | silvia@inf.ufpr.br ]
andré takeshi endo [ federal university of são carlos | andreendo@ufscar.br ]

abstract

smart devices and their apps are present in many everyday activities and play an important role for people with some disabilities. however, making apps more accessible is still a challenge for developers. automated accessibility testing tools can help in this task but present some limitations: they produce reports on accessibility faults which usually cover only a subset of the app, because they are dependent on the test set available.
in order to help in the improvement and/or assessment of the generated test suites, as well as to contribute to increasing the performance of accessibility testing tools, this work introduces a mutation testing approach. the approach includes a set of mutant operators derived from faults corresponding to the negation of the wcag standard's principles and success criteria. it also includes a process to analyse the mutants regarding the original app. evaluation results with 7 open-source apps show the approach is applicable in practice and contributes to significantly improving the number of faults revealed by the test suites accompanying the apps.

keywords: mobile apps, mutation testing, accessibility

1 introduction

in the last decade, we have observed a growing number of smartphones, and studies show this number is expected to increase even more in the next years (cisco, 2017). smart devices and their apps have become a key component in people's daily lives. this is not different for people with some disabilities. for instance, people with some visual impairment have relied on smartphones as a vital means to foster independence in carrying out various tasks, such as understanding text document structure, communicating through social media apps, identifying products on supermarket shelves, and moving between obstacles (acosta-vargas et al., 2020). the world health organization (who) estimated that more than one billion people, around 15% of the world's population, are affected by some form of disability (hartley, 2011). thus, it is fundamental to engineer software so that all the advantages of technology are accessible to every individual. mobile accessibility refers to making websites and apps more accessible to people with disabilities when using smartphones and other mobile devices (w3c, 2019). progress has been made with accessibility because of mandates from government regulations (e.g., u.s. section 508 of the rehabilitation act), standards (such as the british broadcast corporation standards, the brazilian accessibility model, and the web content accessibility guidelines), widespread industrial awareness, technological advances, and accessibility-related lawsuits (yan and ramachandran, 2019). however, developers still have the challenge of providing more accessible software on mobile devices. according to ballantyne et al. (2018), much of the research on software accessibility is dedicated to the web and its sites (grechanik et al., 2009; wille et al., 2016; abuaddous et al., 2016), even though there is a recurring effort on the accessibility of mobile apps (vendome et al., 2019). moreover, studies point to the lack of adequate tools, guides, and policies to design, evaluate, and test the accessibility of mobile apps (acosta-vargas et al., 2020). automated accessibility testing tools are usually based on existing guidelines. one of the most popular standards is the wcag guide (w3c's web content accessibility guidelines) (kirkpatrick et al., 2018). the wcag guide covers recommendations for people with blindness and low vision, deafness and hearing loss, limited movement, cognitive limitations, and speech and learning disabilities. wcag encompasses several guidelines, each one related to different success criteria, grouped into four accessibility principles. some tools produce, given a set of executed test cases, a report of accessibility violations for the app. examples of these tools are the accessibility google scanner (google, 2020), espresso (google, 2018), a11y ally (toff, 2018), and mate (eler et al., 2018).
they can perform static or dynamic analysis (silva et al., 2018). a limited number of violations can be checked by static tools, while dynamic analysis tends to be more costly. another limitation is that the accessibility faults checked by the tools are limited by the test cases used: they cover only a subset of the app due to weak test scripts or limited input test data generation algorithms (silva et al., 2018). tools generally used for test data generation, such as monkey (moher et al., 2009), sapienz (mao et al., 2016), stoat (su et al., 2017), and ape (gu et al., 2019), are focused on functional behavior, code coverage, or crashes. in this sense, this work hypothesizes that a mutation approach specific to accessibility testing can help in the improvement and/or assessment of generated test suites and contribute to increasing the performance of accessibility testing tools. the idea behind mutation testing is to derive versions of the program under test p, called mutants. each mutant describes a possible fault and is produced by a mutation operator (jia and harman, 2011). the objective is to generate test cases capable of distinguishing p from its mutants, that is, test cases that, when executed with each mutant m, produce a different output from the output of p. if p's result is correct, it is free from the fault described by m. if the output is different, m is said to be killed. at the end, a measure called mutation score is calculated, related to the number of mutants killed (the conventional definition is given below). this measure can be used to design test cases, or to evaluate the quality of an existing test suite and consider whether a program has been tested enough. mutation testing has been proved to be effective in different domains and contexts (jia and harman, 2011). more recently, it has been used in the testing of non-functional properties such as performance regarding execution time (lisper et al., 2017) and energy consumption (jabbarvand and malek, 2017). there are some initiatives exploring mutation testing of android apps (wei, 2015; deng et al., 2015; jabbarvand and malek, 2017; luna and el ariss, 2018; escobar-velásquez et al., 2019), but these works are not focused on accessibility testing. given the context and motivation described above, this paper introduces a mutation approach for the accessibility testing of android apps. the underlying fault model is related to non-compliance with wcag principles and success criteria. we propose a set of 6 operators that remove some selected code elements, the most commonly used in the apps, and whose absence may imply accessibility violations. we also define a mutant analysis process that uses tools' accessibility reports to distinguish killed mutants. the process is implemented using the reports produced by espresso (google, 2018), and evaluated with 7 open-source apps. the results show our approach is applicable in practice and contributes to improving the quality of the test suites accompanying the selected apps. we observe a significant improvement regarding the number of faults revealed by using the mutant-adequate test suites. in this way, the present work introduces a mutation approach that encompasses a set of mutant operators and a mutation process implemented by a tool.
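for reference, the conventional definition of the mutation score mentioned above (jia and harman, 2011), which the text leaves implicit, is:

\[
MS(P, T) = \frac{DM(P, T)}{M(P) - EM(P)}
\]

where \(DM(P, T)\) is the number of mutants killed by the test set \(T\), \(M(P)\) is the total number of mutants generated for the program \(P\), and \(EM(P)\) is the number of equivalent mutants, i.e., mutants that no test case can distinguish from \(P\).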
the approach (i) can be used as a criterion for test data generation and/or assessment, helping developers measure the quality of their test suites or generate tests from an accessibility perspective; (ii) can be explored to evaluate the accessibility tools available in the market and in academia; and (iii) contributes to the emergent area of mutation testing for non-functional properties, and represents a first step towards accessibility mutation testing, serving as a basis to direct future research and encourage the academic community to create tools that further explore this field of research. the remainder of this paper is organized as follows. section 2 gives an overview of related work. section 3 introduces our mutation testing approach. section 4 details the evaluation and its main results. section 5 discusses the threats to validity, and section 6 concludes the paper.

2 related work

related work can be classified into two main categories: mutation testing of apps (section 2.1) and accessibility testing (section 2.2).

2.1 mutation testing of android apps

in the literature, there are some mutation approaches for android apps. deng et al. (2015) define 4 classes of mutation operators specific to the android context. the proposed workflow differs from the traditional mutation testing process: once the mutants are generated, it is necessary to install each mutant m on the android emulator. the test cases are implemented through the frameworks robotium (reda, 2019) or junit (gamma and beck, 2019). while deng's approach requires the app source code, wei (2015) proposes mudroid, a tool that requires only the apk file of the app. linares-vásquez et al. (2017) define a list of 38 mutation operators, implemented by the tool mdroid+ (moran et al., 2018). first, a static analysis of the java code using abstract syntax trees (ast) is performed to find a potential fault profile (pfp) that describes a source code location that can be changed by an operator. pfps are used to apply the transformation corresponding to each operator in the java code or xml file. mdroid+ creates a clone of the android project and applies a single mutation to a pfp specified in the cloned project, resulting in a mutant. finally, a report is generated associating the name of the created clone with the applied operator. the tool does not offer a way to compile and execute the mutants, nor does it calculate the mutation score. in a follow-up study, escobar-velásquez et al. (2019) introduce mutapk, which requires as input the apk of the android app and implements the same operators of mdroid+ (linares-vásquez et al., 2017; moran et al., 2018); the corresponding implementation considers the smali representation. like mdroid+, mutapk does not include a mutant analysis strategy. both allow the creation of customized mutation operators. some works have explored aspects of a specific nature within the android platform. the edroid tool (luna and el ariss, 2018) implements 10 mutation operators oriented to varying configuration files and gui elements. the analysis of the mutants is done manually: if the mutant's ui components can be distinguished from the original's, the mutant is classified as dead. µdroid is a mutation tool to identify energy-related problems (jabbarvand and malek, 2017). the tool implements a total of 50 mutation operators corresponding to 28 classes defined as energy consumption anti-patterns. µdroid has a fully automated mutation testing process. while the test is performed in the original app, energy consumption is monitored.
when the test is executed on the mutant, the energy consumption of the original app is compared to that of the mutant. if the consumption profile is different enough, the mutant is considered dead. most tools may be extended to have integrated support for the mutation testing process, mainly automatic mutant execution and analysis. most of them generate mutants but do not offer automatic support for the analysis of the mutant output, which is mainly conducted manually. in addition, there are some initiatives exploring mutation testing of apps for non-functional properties, such as energy consumption, but they do not address accessibility faults. based on elicited results about mutation testing of mobile apps (silva et al., 2021), and as far as we are aware, there is no mutation approach for mobile accessibility testing and evaluation.

2.2 accessibility evaluation of android apps

there are few studies on the accessibility assessment of mobile apps. this small number of studies is due to the lack of adequate tools, guides, and policies to evaluate apps (acosta-vargas et al., 2020; eler et al., 2018). such guides are generally used as oracles to check whether the app meets accessibility requirements during an accessibility evaluation, which can be conducted manually or by automated tools. below, we present some works that analyse those guides and report the main accessibility problems, as well as automated tools that take them into consideration. ballantyne et al. (2018) compile a super-set of guides and normalize them to eliminate redundancy. the result lists 11 categories of testable accessibility elements: text, audio, video, gui elements, user control, flexibility and efficiency, recognition instead of recalling, gestures, system visibility, error prevention, and tangible interaction. damaceno et al. (2018) perform a similar mapping that identifies 68 problems associated with different aspects of the interaction of people with visual impairments on mobile devices. these problems are mapped into 7 groups: buttons, data entry, gesture-based interaction, screen size, user feedback, and voice command. the group with more problems is related to the interaction made of formal gestures. vendome et al. (2019) elaborate a taxonomy of accessibility problems by mining 13,817 android apps from github. the authors observe that 36.96% of the projects did not have elements with descriptive label attributes, and only 2.08% imported at least one accessibility api. the main categories listed in the fault model are: support for visual limitation, support for motor limitation, hearing limitation, and other aspects of accessibility. alshayban et al. (2020) present the results of a large-scale study to understand accessibility from three complementary perspectives: apps, developers, and users. first, they analyze the prevalence of accessibility violations in over 1,000 android apps. then they investigate developer sentiments through a survey. in the end, they investigate user ratings and app popularity. their analysis revealed that inaccessibility rates for apps developed by big companies are relatively similar to inaccessibility rates for other apps. the works of acosta-vargas et al. (2019, 2020) evaluate the use of wcag 2.1 and the accessibility google scanner, a tool that suggests accessibility improvements for android apps. the authors conclude that the wcag guide helps achieve digital inclusion on mobile platforms.
however, the accessibility problems must be fixed before the application goes into production, and they recommend the use of wcag throughout the development cycle. the most recent version, wcag 2.1, includes suggestions for web access via a mobile device (kirkpatrick et al., 2018). wcag principles are grouped into 4 categories: (i) perceivable, that is, "the information must be presentable to users in ways they can perceive"; (ii) operable, "user interface components and navigation must be operable"; (iii) understandable, "information and the operation of user interface must be understandable"; and (iv) robust, "content must be robust enough that it can be interpreted by a wide variety of user agents, including assistive technologies". these principles are the core tenets of accessibility. to follow the accessibility principles, we must achieve the success criteria defined within their respective guideline and principle. automated tools commonly use the wcag success criteria as testable statements to check for guideline violations. they can perform static or dynamic analysis (silva et al., 2018). static analysis can quickly analyze all assets of an app (google, 2018), but it cannot find violations that can only be detected during runtime (e.g., low color contrast). in contrast, dynamic analysis tends to be time-consuming. in this sense, eler et al. (2018) define a set of accessibility criteria and implement mate (mobile accessibility testing), a tool that automatically explores and verifies the accessibility of mobile apps. developers can also manually assess accessibility properties using the google scanner (google, 2020). it allows testing apps and getting suggestions on how to improve accessibility (to help those who have limited vision, speech, or movement). first, the app is activated; then it displays the main handling instructions. finally, with the mobile app running, google scanner highlights the gui element on the screen and the accessibility property it has not fulfilled. the a11y ally app (toff, 2018) checks the accessibility of the running app. from its integration via the command line, a11y ally generates a json file at the end of its execution. this file contains the list of gui elements and which accessibility criteria have been violated. the framework espresso (google, 2018) allows the recording of automated tests that assess the accessibility of the mobile app: the accessibility of a gui element, or simply widget, will be checked if the test action triggers/interacts with the widget in question (see the sketch below). the tools for accessibility testing and evaluation present some limitations. the most noticeable one is that the kind and number of accessibility violations determined by the tools are dependent on the test set used to execute the app and produce the reports. in this sense, the use of mutants describing potential accessibility faults can guide the test data generation and help in the improvement or assessment of an existing test set regarding this non-functional property.

3 a mutation approach for accessibility testing

this section introduces our approach and describes its main elements, which are usually required for any mutation approach: (i) the underlying fault model, related to accessibility faults; (ii) the mutation operators; (iii) the mutation testing process, adopted to analyze the mutants; and (iv) automation aspects, essential to allow the use of the approach in practice.
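as a minimal illustration of the espresso support mentioned above (a sketch based on the public androidx api, not code taken from the paper or its tool), accessibility checking is enabled once per test class and then runs on every view interaction:

import androidx.test.espresso.accessibility.AccessibilityChecks;
import org.junit.BeforeClass;
import org.junit.Test;

public class AccessibilityEnabledTest {

    @BeforeClass
    public static void enableAccessibilityChecks() {
        // run the accessibility checks on every subsequent view action;
        // checking the whole hierarchy from the root view also catches
        // issues in widgets the action does not touch directly
        AccessibilityChecks.enable().setRunChecksFromRootView(true);
    }

    @Test
    public void anyInteractionIsChecked() {
        // any onView(...).perform(...) executed here now triggers the
        // checks, and violations are reported in the test output
    }
}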
3.1 fault model

in this step, we searched the literature for different accessibility guides that establish good practices, and for experiments that used them (see section 2.1). in general, a guide summarizes the main recommendations for making the content presented by a mobile app more accessible. as a result of our search, we observe that the wcag guide was adopted as a reference to build mobile accessibility guides such as emag (brazilian government, 2007), the list of accessibility guidelines for mobile applications (ballantyne et al., 2018), the bbc mobile accessibility guideline (bbc, 2017), and the sidi accessibility guideline (sejasidier, 2015). in this way, the wcag guide was chosen for the following reasons: i) as mentioned before, it encompasses success criteria written as testable statements; ii) it is constantly updated, and a new version of the guide maintains compliance with its previous one; and iii) it has been considered by many authors as the most popular guide (acosta-vargas et al., 2019, 2020). once the success criteria are known, we can start building a fault model by negating these criteria. an unsatisfied criterion may imply one or more accessibility faults, as exemplified in table 1.

table 1. negating wcag success criteria
principle | success criterion | success criterion denial
perceivable | content description | absence of content descriptions for non-text elements
operable | recommended touch area size | not recommended touch area size
understandable | labels or instructions | absence of labels or instructions
robust | status messages | absence of status messages

as observed in table 1, the denial of the criterion "labels or instructions" causes one or more faults related to the absence of a label. within android's mobile development, different code elements characterize the use of a label for a gui element. these code elements can be either xml attributes or java methods. for instance, one way to satisfy the success criterion "labels or instructions" is setting the xml attributes :hint and :labelfor, or using the java methods sethint and setlabelfor (see the sketch below). such elements are the key to the generation of mutants, in order to capture the faults of our model. in this way, more than one mutation operator can be derived from the negation of a criterion, such as "labels or instructions". each mutation operator, in its turn, can be applied to more than one element in the code, generating distinct mutants. to select the code elements and propose the mutation operators of our approach, we refer to the work of silva et al. (2020). this work maps the wcag principles and success criteria to code elements of the native android api, and analyzes the prevalence of the mapped elements in 111 open-source mobile apps. the study identifies code elements that impact accessibility, and shows that apps which adopt different types of code elements tend to have a smaller density of accessibility faults. this means that code elements associated with wcag are related to accessibility faults and justify mutation operators based on these code elements.

3.2 mutation operators

the main objective in defining the accessibility mutation operators is to make sure that the test suite created by the tester exploits all, or at least most, of the app's gui elements, as well as checks the correct use of the code elements related to the accessibility success criteria. in this way, the operators can be used to guide the generation of test cases or to assess the quality of existing ones.
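as an illustration of these code elements (our own sketch, with hypothetical view ids and strings rather than code from the evaluated apps), the java methods named above can be used inside an activity as follows:

import android.widget.EditText;
import android.widget.TextView;

// called, e.g., from an activity's onCreate(); ids are hypothetical
private void labelUsernameField() {
    TextView usernameLabel = findViewById(R.id.username_label);
    EditText username = findViewById(R.id.username);

    // link the label to the field it describes, so a screen reader
    // announces the label when the field gains focus
    usernameLabel.setLabelFor(R.id.username);

    // temporary label for the editable field itself; talkback reads it
    // so the user knows which input the app expects
    username.setHint("email address");
}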
to this end, and following the work of silva et al. (2020), we selected a set e of code elements, the most adopted in the apps, to propose an initial set of operators. these operators are defined considering aspects of android apps' accessibility and can be improved in the future by adding other code elements and success criteria. the selected code elements are presented in table 2; they correspond to the most used ones in the apps for each principle (silva et al., 2020). the table also shows the corresponding mutation operator.

table 2. selected code elements and corresponding wcag principles and success criteria.
principle | success criteria | xml attribute | java method | mutation operator
perceivable | resize text | :textsize | settextsize | missing textsize
perceivable | identify input purpose | :inputtype | setinputtype | missing inputtype
operable | keyboard; focus order | :nextfocusdownid | setnextfocusdownid | missing nextfocusdownid
understandable | label or instructions | :labelfor | setlabelfor | missing labelfor
understandable | label or instructions | :hint | sethint | missing hint
robust | status messages | :importantforaccessibility | setimportantforaccessibility | missing importantforaccessibility

the labelfor element is a label that accompanies a view object. it can be defined via the xml file or the java language. in general, it provides description and exploration labels for some screen elements. the hint element is a temporary label assigned to editable fields only. it is necessary for talkback, or any other screen reader, to correctly report what information the app needs. we can set or change a textview font size with the element textsize; the recommended dimension type for text is "sp" for scaled pixels (e.g., 15sp). the element inputtype specifies the input type for each text field in order for the system to display the appropriate soft input method (e.g., an on-screen keyboard). the app, by default, looks for the closest element to receive the next focus; the next element is not always the most logical. in these cases, we need to give the app custom navigation, and we can define the next view to focus on using the code element nextfocusdownid. the element importantforaccessibility describes whether or not a view is important for accessibility. if the value is set to "yes", the view fires accessibility events and is reported to accessibility services (e.g., talkback) that query the screen. the idea of the operators is to remove the corresponding code element e ∈ e when present. we opted for statement deletion operators, as previous studies gave evidence that such operators produce fewer yet effective mutants (delamaro et al., 2014). for each code element removed, we have a unique generated mutant. table 3 presents the operators; in the examples accompanying them, snippets of code are presented and the ones to be removed are preceded by "–".

table 3. mutation operator description
mutation operator | description
mts | missing textsize
mit | missing inputtype
mnfd | missing nextfocusdownid
mlf | missing labelfor
mh | missing hint
mia | missing importantforaccessibility

it is important to emphasize that if a mutation operator cannot be applied to the app source code, this may indicate that the project/developer team puts low priority on accessibility. now, imagine that the developer has taken care to define the accessibility code elements in the app. even if they are defined, it is very important to ensure that the test set includes a test that performs an action and interacts with the corresponding gui element, checking they are defined properly.

3.3 mutation process

the testing process for the application of the proposed operators is depicted in figure 1. it encompasses three steps. the first one is the mutant generation using the accessibility mutation operators defined. this step produces a set of mutant apps m. in the second step, the original app and the mutants in m are executed with a test set t, which can be designed
with the tester's preferred strategy. however, for the mutant analysis, our process requires that T is implemented and executed using an accessibility checker tool, such as the ones reported in section 2.2. the third step, mutant analysis, allows calculating the mutation score by comparing the accessibility reports produced by an accessibility checker for the original and mutant apps. if the accessibility logs differ, that is, if different accessibility faults are encountered, the mutant can be considered dead. the accessibility report generated by espresso contains some temporal information that may cause a non-deterministic output. to correct this, we post-process the output so that only the essential information is taken into account, namely the code element id and its reported accessibility issue. therefore, if the original app's accessibility log is the same as that of the mutant app, resulting in a live mutant, the test suite probably needs to be revised and improved. if the score is not satisfactory, the tester can add new test cases or modify existing ones in T so that more mutants are killed.

figure 1. testing process of the proposed approach: (1) mutant generation from the android app, producing the set M of mutant apps; (2) execution of the test set T on M and on the original app, producing accessibility logs of the visited screens; (3) analysis of the mutants, yielding the mutation score, based on which the tester decides whether to improve T.

3.4 implementation

to evaluate and use our approach, we implemented a prototype tool named accessibilitymdroid. it receives as input the source code of the android app under test. accessibilitymdroid implements the proposed operators by extending mdroid+ (moran et al., 2018), which is used for mutant generation (step 1). to build and execute the tests, as well as to produce the accessibility log (step 2), the espresso framework is used. we chose tests implemented with espresso because it is the default framework for gui testing in android studio and includes embedded accessibility checking. as T is executed, the accessibilitycheck class allows us to check for accessibility faults. at the end of the run, espresso generates a log of the accessibility problems used in step 3. the tool compares the logs automatically, and a list of killed mutants is produced.

to illustrate our approach, we use a sample app built with android studio. a piece of code for this app is presented in figure 2. with the application of operator mh (missing hint), which removes the hint code element from the gui element, line 22 (in red in the figure) disappears in the mutant m.

figure 2. a mutant generated by operator mh
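before the worked example in figures 3 to 6, the sketch below summarizes how the mutant analysis of step 3 can be implemented. the class, method names, and raw log format are our assumptions; they do not mirror accessibilitymdroid's actual api.

import java.util.List;
import java.util.stream.Collectors;

// illustrative sketch of step 3 (mutant analysis); names and the raw log
// format are hypothetical and do not mirror accessibilitymdroid's real api.
public class MutantAnalysis {

    // post-processing: keep only the essential information of a log entry,
    // i.e., the code element id and its reported accessibility issue,
    // discarding the temporal data that makes raw logs non-deterministic
    // (here a raw entry is assumed to look like "timestamp|element-id|issue").
    static String essential(String rawEntry) {
        String[] parts = rawEntry.split("\\|", 3);
        return parts.length == 3 ? parts[1].trim() + ": " + parts[2].trim() : rawEntry.trim();
    }

    // a mutant is considered dead when the post-processed logs differ
    static boolean isKilled(List<String> originalLog, List<String> mutantLog) {
        List<String> original = originalLog.stream()
                .map(MutantAnalysis::essential).sorted().collect(Collectors.toList());
        List<String> mutant = mutantLog.stream()
                .map(MutantAnalysis::essential).sorted().collect(Collectors.toList());
        return !original.equals(mutant);
    }

    // one plausible mutation score: killed mutants over generated ones,
    // discounting unreachable mutants (section 4 excludes them as well)
    static double mutationScore(int killed, int generated, int unreachable) {
        return (double) killed / (generated - unreachable);
    }
}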
@Test
public void loginTest() {
    var appCompatEditText = onView(allOf(
        withId(R.id.username),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 1),
        isDisplayed()));
    appCompatEditText.perform(replaceText("email"), closeSoftKeyboard());

    var appCompatEditText2 = onView(allOf(
        withId(R.id.password),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 2),
        isDisplayed()));
    appCompatEditText2.perform(replaceText("123456"), closeSoftKeyboard());

    var appCompatEditText3 = onView(allOf(
        withId(R.id.password), withText("123456"),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 2),
        isDisplayed()));
    appCompatEditText3.perform(pressImeActionButton());
}
figure 3. test case using espresso

AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
figure 4. accessibility log for the original app

@Test
public void loginTest() {
+   onView(withId(R.id.nickname)).perform(typeText("nick"),
+       closeSoftKeyboard());
    var appCompatEditText = ...
}
figure 5. changed test (added lines are marked with "+")

suppose that for this app, a test, as depicted in figure 3, is available. when T is executed with espresso on the mutant m (step 2), a log is generated. this log is compared to the log generated by executing T on the original app (step 3). from the difference between the two accessibility logs, it is possible to determine the mutant's death. in this case, T was not enough to show the difference between the original app and the mutant: as both produce the same log (figure 4), the mutant is still alive. the tester now tries to improve T and realizes that the existing tests do not interact with one of the
app's input fields. after changes in T (illustrated in figure 5), step 2 is executed again and the log for m is now the one shown in figure 6; it differs from the original one in its first line.

+ AppCompatEditText{id=2131230902,res-name=nickname}: view is missing speakable text needed for a screen reader
AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
figure 6. accessibility log for the mutant m

by employing a similar procedure to kill accessibility mutants, T achieves a higher mutation score, covers more gui elements, and potentially reveals other accessibility faults.

4 evaluation

the main goal of the proposed operators is to serve as a guide for the evaluation and improvement of test suites regarding accessibility faults. to evaluate these aspects properly, as well as our implementation using espresso, we formulated three research questions, as follows.

rq1: how applicable are the accessibility mutation operators? this question aims to investigate whether the proposed operators and process are applicable in practice. to answer it, we evaluate the approach's application cost by analysing the number of mutants generated by each operator, as well as the number of required test cases.

rq2: how adequate are existing test suites with respect to accessibility mutation testing? this question evaluates the use of the proposed operators as an evaluation criterion: they are used for quality assessment of the test suites accompanying the selected open source apps with respect to accessibility. to this end, we analyse the ability of the existing tests to kill the mutants generated by our approach.

rq3: how much do the mutation operators contribute to revealing new accessibility faults? this question looks at the effectiveness of mutant-adequate test suites in revealing accessibility violations.

4.1 study setup

we sampled open source apps from f-droid (https://www.f-droid.org), last updated in 2019/2020, containing espresso test suites. we refer to the test suite accompanying the project as T. we removed apps that failed to build and whose tests were not compatible with the accessibility checking feature. the replication package is available at https://osf.io/vfs2d/. the seven apps are: alarmclock, an alarm clock for android smartphones and tablets that brings a pure alarm experience; anymemo, a spaced repetition flashcard learning software; authorizer, a password manager for android; equate, a unit converting calculator; kolabnotes, a note taking app; piwigo, a photo gallery app for the web; and pleestracker, a sleep tracker.

for each app, we used accessibilitymdroid to generate the mutants, run T, produce the accessibility logs for each mutant, and compare them with the original log. in this way, we obtained the set of mutants killed by T. after this, we manually inspected the alive mutants and realized that, many times, some of the test cases in T exercised the mutated code but produced no difference in the log due to espresso limitations (e.g., the limited set of accessibility criteria that are detected and printed in the accessibility log). in such cases, we marked the corresponding mutant as covered.
other mutants were marked as "unreachable" since their mutations are related to widgets that are not reachable in the app (e.g., dead code). so, we counted the numbers of generated, killed, covered, and unreachable mutants for T. then, we extended T so that all mutants were killed or at least covered. we refer to this extended test suite as xT. the inclusion of a test case was conducted in the following way: (i) pick an alive mutant (not covered, not killed by T); (ii) manually record a test that exercises the mutation using the espresso test recorder in android studio and, if needed, refactor the test code to make it repeatable (the code generated by the espresso test recorder may be too specific and fail in re-runs); (iii) analyze whether the mutant is killed by the new test; if not, mark it as covered. the mutant information was then collected again for xT.

as cost indicators, we collected the number of tests of a test suite, tc(T), and its size, given by its number of lines of test code, loc(T). as for effectiveness, we counted per test suite the number of accessibility faults reported by the espresso accessibility check.

table 4 shows information on the seven selected apps. authorizer is the app with the greatest loc (28,286), while anymemo has the most activities (30, column #act.). alarmclock is the app with the smallest loc (1,349), and equate has only 2 activities. the table also shows the number of test cases (#tc) and loc for the original set T and the extended one xT. notice that alarmclock has 41 tests and 1,068 lines of test code (loc(T)). kolabnotes has only one test, yet anymemo has the smallest loc(T) (76). concerning xT, alarmclock and authorizer require the most tests (both 43) and the most loc(xT) (1,341 and 1,700, respectively). pleestracker has the smallest number of test cases (5) and loc(xT) (345). however, authorizer required the most additional test cases (32), while piwigo required only one.

table 4. selected apps
app | loc | #act. | #tc(T) | loc(T) | #tc(xT) | loc(xT)
alarmclock | 1,349 | 5 | 41 | 1,068 | 43 | 1,341
anymemo | 19,751 | 30 | 3 | 76 | 13 | 932
authorizer | 28,286 | 7 | 11 | 652 | 43 | 1,700
equate | 5,826 | 2 | 6 | 511 | 9 | 709
kolabnotes | 11,025 | 9 | 1 | 494 | 6 | 884
piwigo | 4,744 | 7 | 8 | 408 | 9 | 579
pleestracker | 1,868 | 5 | 2 | 89 | 5 | 345
∗ github projects: alarmclock (https://github.com/yuriykulikov/alarmclock), anymemo (https://github.com/helloworld1/anymemo), authorizer (https://github.com/tejado/authorizer), equate (https://github.com/evanrespaut/equate), kolabnotes (https://github.com/konradrenner/kolabnotes-android), piwigo (https://github.com/piwigo/piwigo-android), pleestracker (https://github.com/vmiklos/plees-tracker).

4.2 analysis of results

table 5 summarizes the main results of the evaluation and is used in this section to answer our rqs. this table shows the number of mutants that were generated (columns g), killed by some test (columns k), covered but alive (columns c), and unreachable (columns u). notice that the results are shown for 4 out of the 6 operators described in table 3; operators mlf and mnfd did not generate any mutants for the selected apps. for each app, two rows are presented: one for the results obtained by T and the other for xT. the last four columns list the totals for all operators, while the last rows bring the totals for all apps. for instance, for the app anymemo the operator mts generated 64 mutants, 11 of them unreachable. the test set T was not capable of killing any mutant but covered 14. the set xT covered 52; that is, 38 additional mutants could be covered. considering all operators, only one mutant could be killed by xT, and 70 mutants were covered out of 84 generated mutants. for this app, four mutants change a screen that is reached only when integrated with a third-party app. as exercising these mutants would require other tools beyond espresso, we were not able to cover them.
however, they cannot be classified as unreachable. because of this, the sum of killed, covered-but-alive, and unreachable mutants does not equal the number of generated mutants for this app, unlike all the other apps.

table 5. summary of the results per operator
app | suite | mts g/k/c/u | mit g/k/c/u | mh g/k/c/u | mia g/k/c/u | total g/k/c/u
alarmclock | T | 12/–/9/– | 1/–/–/– | 1/–/–/– | –/–/–/– | 14/–/9/–
alarmclock | xT | –/–/12/– | –/–/1/– | –/1/–/– | –/–/–/– | –/1/13/–
anymemo | T | 64/–/14/11 | 22/–/–/– | –/–/–/– | –/–/–/– | 86/–/14/11
anymemo | xT | –/1/52/– | –/–/18/– | –/–/–/– | –/–/–/– | –/1/70/–
authorizer | T | 18/–/1/– | 27/–/3/– | 18/–/3/– | 9/–/2/– | 72/–/9/–
authorizer | xT | –/–/18/– | –/–/27/– | –/6/12/– | –/–/9/– | –/6/66/–
equate | T | 3/–/–/– | 2/–/–/– | 2/1/–/1 | –/–/–/– | 7/1/–/1
equate | xT | –/–/3/– | –/–/2/– | –/1/–/– | –/–/–/– | –/1/5/–
kolabnotes | T | 23/–/8/– | 13/–/3/– | 12/–/–/– | –/–/–/– | 48/–/11/–
kolabnotes | xT | –/–/23/– | –/–/13/– | –/8/4/– | –/–/–/– | –/8/40/–
piwigo | T | 1/–/–/– | 3/–/3/– | 1/1/–/– | –/–/–/– | 5/1/3/–
piwigo | xT | –/–/1/– | –/–/3/– | –/1/–/– | –/–/–/– | –/1/4/–
pleestracker | T | 24/–/8/– | –/–/–/– | –/–/–/– | –/–/–/– | 24/–/8/–
pleestracker | xT | –/–/24/– | –/–/–/– | –/–/–/– | –/–/–/– | –/–/24/–
total | T | 145/–/40/11 | 68/–/9/– | 34/2/3/1 | 9/–/2/– | 256/2/54/12
total | xT | –/1/133/– | –/–/64/– | –/17/16/– | –/–/9/– | –/18/222/–
g = generated, k = killed, c = covered but alive, u = unreachable mutants, for the original test suite T and the extended one xT; "–" marks cells left blank in the original table. the mutation operators are: missing textsize (mts), missing inputtype (mit), missing hint (mh), and missing importantforaccessibility (mia).

rq1 – approach applicability. to answer rq1, we evaluate the number of mutants generated by each operator. we observe in table 5 that operator mts generated the most mutants (145 in total), followed by mit (68), mh (34), and mia (9). mts generated mutants for all apps, mit for 6, and mh for 5 apps. operator mia generated mutants only for authorizer. in total, 256 mutants were generated, with anymemo having the most mutants (86) and piwigo the fewest (5). this means that the selected apps contain more code elements associated with the principle perceivable (operators mts and mit), which may indicate that: (i) developers worry about content descriptions for non-text elements more than about the principle robust (operator mia, which generated mutants for only one app) or operable (operator mnfd, which did not generate any mutant); (ii) user experience (ux) and user interface (ui) documents include a more significant amount of code elements of the perceivable principle in their guidelines.

operators mit and mia generated mutants that were not killed; only one mutant of mts was killed, and 17 out of the 34 mutants generated by mh were killed. the process using espresso was capable of distinguishing the great majority of the mutants generated by removing the code element :hint. analysing the alive mutants, we identified 222 as covered and 12 as unreachable. unreachable mutants were generated mainly for anymemo and are related to implementation smells like dead code. for a deeper analysis, table 6 contains the number of mutants generated by each operator divided by the kloc of each app.

table 6. efforts to build xT
app | mts/kloc | mit/kloc | mh/kloc | mia/kloc | total/kloc | a-tc | a-loc
alarmclock | 8.9 | 0.7 | 0.7 | 0.0 | 10.37 | 2 | 273
anymemo | 3.2 | 1.1 | 0.0 | 0.0 | 4.35 | 10 | 856
authorizer | 0.6 | 0.9 | 0.6 | 0.3 | 2.58 | 32 | 1,048
equate | 0.5 | 0.3 | 0.3 | 0.0 | 1.20 | 3 | 198
kolabnotes | 2.0 | 1.1 | 1.0 | 0.0 | 4.35 | 5 | 390
piwigo | 0.2 | 0.6 | 0.2 | 0.0 | 1.05 | 1 | 171
pleestracker | 12.8 | 0.0 | 0.0 | 0.0 | 12.8 | 3 | 256
average | 4.0 | 0.67 | 0.4 | 0.043 | 5.42 | 8 | 456
a-tc stands for the number of test cases added to T to obtain xT. a-loc stands for the number of loc added to T to obtain xT.
the last two columns of table 6 present information regarding the effort required to add new test cases so that an accessibility-mutant-adequate test suite is obtained; the last row brings the average values. we can see that the operators generate, on average, 5.42 mutants per kloc and, in the worst case, 12.8 for pleestracker. notice that a greater number of mutants is generated for the largest apps in terms of loc and number of activities: anymemo, authorizer, and kolabnotes.

given that the proposed operators only remove code elements, the number of mutants tends to be equal to the number of existing elements associated with the accessibility wcag success criteria. due to this characteristic, it is unlikely that the operators generate equivalent mutants. this is an advantage, because the identification of such mutants is usually costly. moreover, we have not found either stillborn or trivial mutants: the former are mutants that do not compile, and the latter are mutants that crash at initialization.

we also measured the effort of adding new test cases, considering the values in table 4. as table 6 shows, authorizer demanded the most effort, requiring 32 additional tests (with 1,048 a-loc), followed by anymemo, which required 10 additional tests (with 856 a-loc), and kolabnotes, with 5 tests (390 a-loc). these apps are the greatest in terms of size.

response to rq1: the number of mutants is related to the size of the app, mainly to the number of gui elements and code elements associated with the accessibility success criteria. operators mts and mit, related to the principle perceivable, produce more mutants, while no mutant is generated by operator mnfd, related to the operable principle. moreover, we did not observe any stillborn, trivial, or equivalent mutants.

implications: the operators are deletion-style and depend on the use of accessibility-related code elements. the number of generated mutants grows proportionally to the number of accessibility code elements used in the app. operators mts and mit generated more mutants, which may indicate that code elements related to the principle perceivable are the most used in the selected apps. our set of operators represents a first proposal; we intend to improve it with other kinds of operators, for instance operators that add or modify code elements, and other code elements and success criteria could be considered as well. the proposed operators do not generate equivalent mutants due to their conception characteristics, and we did not observe any stillborn or trivial mutants. this is important because such mutants imply additional cost, and they are very common in android mutation testing (linares-vásquez et al., 2017). we observed espresso's limited ability to detect accessibility faults and, as a consequence, a reduced number of killed mutants. because of this, other accessibility testing tools should be used in future versions of accessibilitymdroid. we also intend to implement mechanisms to automatically determine covered mutants. the analysis of dead mutants is a drawback of most mutation testing approaches for android apps; the great majority do not offer an automatic way to perform this task, nor even a criterion for considering a mutant killed.

rq2 – adequacy of existing test suites. rq2 evaluates the adequacy of the test suites concerning the proposed operators.
the answer can shed some light on the quality of the test cases regarding accessibility faults and on whether developers worry about testing such a non-functional property. to answer this question, table 7 presents the percentage of mutants killed and covered by T, per app. unreachable mutants were not considered. on average, the original sets were capable of killing only 5.23% of the mutants. the killed percentage reaches 20% for piwigo, the app with the fewest mutants, but it is equal to zero for five apps. the percentages of covered mutants are better: 30.24% on average. the best percentages were achieved by alarmclock (64.3%) and piwigo (60%); the other five apps achieved percentages lower than 35%.

table 7. adequacy results of original test suites
app | killed | covered
alarmclock | 0.0% | 64.3%
anymemo | 0.0% | 18.67%
authorizer | 0.0% | 12.5%
equate | 16.67% | 0%
kolabnotes | 0.0% | 22.91%
piwigo | 20% | 60%
pleestracker | 0.0% | 33.33%
average | 5.23% | 30.24%

response to rq2: the existing test suites of the studied apps killed or covered only a small fraction of the accessibility-related mutants. in other words, they had a low mutation score.

implications: in general, there are opportunities to improve the quality of gui tests in mobile apps. while code coverage and mutation testing have better support at the unit test level, more tool support is required at the gui level. as the accessibility mutants demand better test coverage at the gui level, the results herein presented helped to expose those weaknesses.

rq3 – accessibility faults. by answering rq2, we observed that the existing tests obtained a small coverage of accessibility mutants, and new tests are required to obtain adequate test suites. however, it is important to know whether such additional tests and effort improve the test quality in terms of revealed accessibility faults. rq3 aims to answer this question. table 8 shows the number of accessibility faults reported by espresso when the original (T) and extended (xT) test sets are used; the last column shows the percentage of improvement. for T, alarmclock has the most accessibility faults (126), while pleestracker has only 2. on average, we have 45.28 accessibility faults per app. concerning the mutant-adequate test suite xT, piwigo has the most faults (447); pleestracker presented the best percentage of improvement (3,650%), while the smallest improvement was obtained for alarmclock. on average, xT revealed 186.4 accessibility faults. the improvements varied from 3.2% to 3,650%.

table 8. accessibility faults detected by T and xT
app | #faults(T) | #faults(xT) | improv.
alarmclock | 126 | 130 | 3.2%
anymemo | 24 | 355 | 1,479%
authorizer | 65 | 201 | 209.2%
equate | 19 | 27 | 42.1%
kolabnotes | 43 | 70 | 62.8%
piwigo | 38 | 447 | 1,076.3%
pleestracker | 2 | 75 | 3,650%
average | 45.28 | 186.4 | 931.8%

response to rq3: mutant-adequate test suites contribute to meaningful improvements in the number of accessibility faults detected. on average, the extended test suites increased the number of accessibility faults revealed, compared to the original test suites, by around 932%.

implications: the results gave evidence that the use of the mutation operators contributed to an increase in the number of revealed accessibility faults. we anticipate that the quality of the test suite is improved too, beyond the accessibility point of view.

5 threats to validity

there are some threats to the validity of our study.

sample selection. it is not easy to guarantee the representativeness of the apps.
in addition, the adopted sample has only android native apps with espresso test suites. to mitigate this, we selected the apps from f-droid, a diverse set of open-source apps with recent updates; f-droid has been used in other studies (mao et al., 2016; zeng et al., 2016; gu et al., 2019).

limited oracle. the mutant analysis strategy is linked to the espresso tool. however, the proposed approach is also compatible with other tools that monitor the running app and produce accessibility logs, like mate (eler et al., 2018) and a11y ally (toff, 2018); we plan to integrate them in the future.

manual determination of covered elements. this task was performed manually and is subject to errors. to minimize this threat, the analysis was carefully conducted and double-checked.

flaws in the implementation. there may be implementation errors in any of the tools or routines used in our study, like the mdroid+ extension, the android emulator management, and espresso.

the number of mutation operators. the set of accessibility mutation operators proposed represents only a fraction of all accessibility violations that can occur in a mobile app. we created this initial deletion set to validate the proposed tool; this set of deletion mutation operators was tested and validated as effective in practice.

6 concluding remarks

this paper presented an approach for accessibility mutation testing of android apps. first, we defined a set of six accessibility mutation operators for android apps. then, for an android app, we generated the mutants. based on the original test suite, we checked which mutants are killed or at least covered. following our approach, we extended the original test suite to cover more mutants. the empirical results show that the original test suites cover only a small part of the accessibility-related mutants. besides, mutant-adequate test suites contribute to meaningful improvements in the number of accessibility faults detected.

as future work, we plan to extend the tool support to handle apk files and commercial (closed source) apps. the mutation operators may also be described more generically so that the approach can be extended to other mobile development languages and frameworks (e.g., swift, react native, kotlin). another direction is to experiment with different oracles (e.g., mate (eler et al., 2018)), besides the espresso accessibility check used in this study. finally, different accessibility mutation operators can be defined, now focused on including and changing code elements.

acknowledgment

this work is partially supported by cnpq (andre t. endo, grant nr. 420363/2018-1, and silvia regina vergilio, grant nr. 305968/2018-1).

references

abuaddous, h. y., jali, m. z., and basir, n. (2016). web accessibility challenges. international journal of advanced computer science and applications (ijacsa).
acosta-vargas, p., salvador-ullauri, l., jadán-guerrero, j., guevara, c., sanchez-gordon, s., calle-jimenez, t., lara-alvarez, p., medina, a., and nunes, i. l. (2020). accessibility assessment in mobile applications for android. in nunes, i. l., editor, advances in human factors and systems interaction, pages 279–288, cham. springer international publishing.
acosta-vargas, p., salvador-ullauri, l., perez medina, j. l., zalakeviciute, r., and perdomo, w. (2019). heuristic method of evaluating accessibility of mobile in selected applications for air quality monitoring. in international conference on applied human factors and ergonomics, pages 485–495. springer.
alshayban, a., ahmed, i., and malek, s. (2020). accessibility issues in android apps: state of affairs, sentiments, and ways forward. in proceedings of the acm/ieee 42nd international conference on software engineering, icse '20, pages 1323–1334, new york, ny, usa. association for computing machinery.
ballantyne, m., jha, a., jacobsen, a., hawker, j. s., and elglaly, y. n. (2018). study of accessibility guidelines of mobile applications. in proceedings of the 17th international conference on mobile and ubiquitous multimedia, pages 305–315. acm.
bbc (2017). the bbc standards and guidelines for mobile accessibility. https://www.bbc.co.uk/accessibility/forproducts/guides/mobile.
brazilian government (2007). accessibility model in electronic government. https://www.gov.br/governodigital/pt-br/acessibilidade-digital/modelo-de-acessibilidade.
cisco (2017). cisco visual networking index: global mobile data traffic forecast update, 2017–2022 white paper. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-738429.html.
damaceno, r. j. p., braga, j. c., and mena-chalco, j. p. (2018). mobile device accessibility for the visually impaired: problems mapping and recommendations. universal access in the information society, 17(2):421–435.
delamaro, m. e., offutt, j., and ammann, p. (2014). designing deletion mutation operators. in 2014 ieee seventh international conference on software testing, verification and validation, pages 11–20.
deng, l., mirzaei, n., ammann, p., and offutt, j. (2015). towards mutation analysis of android apps. in proceedings of the eighth international conference on software testing, verification and validation workshops, icstw, pages 1–10. ieee.
eler, m. m., rojas, j. m., ge, y., and fraser, g. (2018). automated accessibility testing of mobile apps. in 2018 ieee 11th international conference on software testing, verification and validation (icst), pages 116–126.
escobar-velásquez, c., osorio-riaño, m., and linares-vásquez, m. (2019). mutapk: source-codeless mutant generation for android apps. in 2019 ieee/acm international conference on automated software engineering (ase).
gamma, e. and beck, k. (2019). the new major version of the programmer-friendly testing framework for java. https://junit.org.
google (2018). espresso. https://developer.android.com/training/testing/espresso.
google (2018). improve your code with lint checks. https://developer.android.com/studio/write/lint.
google (2020). accessibility scanner. https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.auditor&hl=en_u.
grechanik, m., xie, q., and fu, c. (2009). creating gui testing tools using accessibility technologies. in 2009 international conference on software testing, verification, and validation workshops, pages 243–250.
gu, t., sun, c., ma, x., cao, c., xu, c., yao, y., zhang, q., lu, j., and su, z. (2019). practical gui testing of android applications via model abstraction and refinement. in proceedings of the 41st international conference on software engineering, icse '19, pages 269–280. ieee press.
hartley, s. d. (2011). world report on disability (who). technical report, who and world bank.
jabbarvand, r. and malek, s. (2017). µdroid: an energy-aware mutation testing framework for android. in proceedings of the 11th joint meeting on foundations of software engineering, esec/fse, pages 208–219. acm.
jia, y. and harman, m. (2011). an analysis and survey of the development of mutation testing. ieee trans. software eng., 37(5):649–678.
kirkpatrick, a., o'connor, j., campbell, a., and cooper, m. (2018). web content accessibility guidelines (wcag) 2.1. https://www.w3.org/tr/wcag21/.
linares-vásquez, m., bavota, g., tufano, m., moran, k., di penta, m., vendome, c., bernal-cárdenas, c., and poshyvanyk, d. (2017). enabling mutation testing for android apps. in proceedings of the 2017 11th joint meeting on foundations of software engineering, esec/fse, pages 233–244, new york, ny, usa. acm.
lisper, b., lindstrom, b., potena, p., saadatmand, m., and bohlin, m. (2017). targeted mutation: efficient mutation analysis for testing non-functional properties. in proceedings 10th ieee international conference on software testing, verification and validation workshops (icstw), pages 65–68.
luna, e. and el ariss, o. (2018). edroid: a mutation tool for android apps. in proceedings of the 6th international conference in software engineering research and innovation, conisoft, pages 99–108. ieee.
mao, k., harman, m., and jia, y. (2016). sapienz: multi-objective automated testing for android applications. in proceedings of the 25th international symposium on software testing and analysis, issta 2016, pages 94–105, new york, ny, usa. association for computing machinery.
moher, d., liberati, a., tetzlaff, j., and altman, d. g. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. bmj, 339.
moran, k., tufano, m., bernal-cárdenas, c., linares-vásquez, m., bavota, g., vendome, c., di penta, m., and poshyvanyk, d. (2018). mdroid+: a mutation testing framework for android.
in proceedings of the 40th international conference on software engineering: companion proceedings, pages 33–36. acm.
reda, r. (2019). robotiumtech: android ui testing. https://github.com/robotiumtech/robotium.
sejasidier (2015). guide to the development of accessible mobile applications. http://www.sidi.org.br/guiadeacessibilidade/index.html.
silva, c., eler, m. m., and fraser, g. (2018). a survey on the tool support for the automatic evaluation of mobile accessibility. in proceedings of the 8th international conference on software development and technologies for enhancing accessibility and fighting info-exclusion, dsai 2018, pages 286–293. acm.
silva, h. n., endo, a. t., eler, m. m., vergilio, s. r., and durelli, v. h. r. (2020). on the relation between code elements and accessibility issues in android apps. in proceedings of the v brazilian symposium on systematic and automated software testing, sast.
silva, h. n., prado lima, j. a., endo, a. t., and vergilio, s. r. (2021). a mapping study on mutation testing for mobile applications. software testing, verification and reliability.
su, t., meng, g., chen, y., wu, k., yang, w., yao, y., pu, g., liu, y., and su, z. (2017). guided, stochastic model-based gui testing of android apps. in proceedings of the 11th joint meeting on foundations of software engineering, esec/fse, paderborn, germany, september 4–8, pages 245–256.
toff, d. (2018). a11y ally. https://github.com/quittle/a11y-ally.
vendome, c., solano, d., liñán, s., and linares-vásquez, m. (2019). can everyone use my app? an empirical study on accessibility in android apps. in 2019 ieee international conference on software maintenance and evolution (icsme), pages 41–52.
w3c (2019). w3c accessibility standards overview. https://www.w3.org/wai/standards-guidelines/.
wei, y. (2015). mudroid: mutation testing for android apps. technical report, ucl-uk. undergraduate final year individual project.
wille, k., dumke, r. r., and wille, c. (2016). measuring the accessability based on web content accessibility guidelines. in 2016 joint conference of the international workshop on software measurement and the international conference on software process and product measurement (iwsm-mensura), pages 164–169.
yan, s. and ramachandran, p. g. (2019). the current status of accessibility in mobile apps. acm transactions on accessible computing, 12.
zeng, x., li, d., zheng, w., xia, f., deng, y., lam, w., yang, w., and xie, t. (2016). automated test input generation for android: are we really there yet in an industrial case? in proceedings of the 2016 24th acm sigsoft international symposium on foundations of software engineering, fse 2016, pages 987–992.

journal of software engineering research and development, 2021, 9:8, doi: 10.5753/jserd.2021.1893 this work is licensed under a creative commons attribution 4.0 international license.
on the test smells detection: an empirical study on the jnose test accuracy

tássio virgínio [ federal institute of tocantins | tassio.virginio@ifto.edu.br ]
luana martins [ federal university of bahia | martins.luana@ufba.br ]
railana santana [ federal university of bahia | railana.santana@ufba.br ]
adriana cruz [ federal university of lavras | adriana.cruz@estudante.ufla.br ]
larissa rocha [ federal university of bahia / state univ. of feira de santana | larissa@ecomp.uefs.br ]
heitor costa [ federal university of lavras | heitor@ufla.br ]
ivan machado [ federal university of bahia | ivan.machado@ufba.br ]

abstract

several strategies have supported test quality measurement and analysis. for example, code coverage, a widely used one, enables verification of the test case to cover as many source code branches as possible. another set of affordable strategies to evaluate the test code quality exists, such as test smells analysis. test smells are poor design choices in test code implementation, and their occurrence might reduce the test suite quality. a practical and large-scale test smells identification depends on automated tool support; otherwise, test smells analysis could become a cost-ineffective strategy. in an earlier study, we proposed the jnose test, automated tool support to detect test smells and analyze test suite quality from the test smells perspective. this study extends the previous one in two directions: i) we implemented the jnose-core, an api encompassing the test smells detection rules; through an extensible architecture, the tool is now capable of accommodating new detection rules or programming languages; and ii) we performed an empirical study to evaluate the jnose test effectiveness and compare it against the state-of-the-art tool, the tsdetect. results showed that the jnose-core precision score ranges from 91% to 100%, and the recall score from 89% to 100%. it also presented a slight improvement over the tsdetect detection rules for test smells detection at the class level.

keywords: tests quality, test evolution, test smells, evidence-based software engineering

1 introduction

ensuring end-user satisfaction, detecting software defects before go-live, and increasing software or product quality are among the most commonly reported software testing objectives, as reported in the annual survey of a global consulting firm (capgemini, 2018). a recently published report estimates the impact of poor software quality on the united states economy at over $2 trillion for the year 2020, referencing publicly available source material (cisq, 2021). such data illustrate the need for employing software testing techniques in software development processes, as they can anticipate bug identification and fixing, thus reducing the likely effects of bugs during implementation (or even when existing functionalities are under evolution) (palomba et al., 2018; spadini et al., 2018; grano et al., 2019).

in a well-defined software engineering process, test code should co-evolve with the production code, as high-quality test code is essential to ease the maintenance and evolution of both production and test code (yusifoğlu et al., 2015; guerra calle et al., 2019). however, maintaining test code might be time-consuming and cost-ineffective (yusifoğlu et al., 2015; guerra calle et al., 2019). several approaches have been proposed in the literature to assess the quality of test suites. for example, code coverage measurement has been widely used to check the quality of automated tests.
it measures the test suite quality based on how much a test covers structural elements, such as functions, instructions, branches, and lines of code (gopinath et al., 2014). nonetheless, even with high code coverage, the test code might encompass poor design choices in its implementation, the so-called test smells. the presence of smells in test code may reduce the quality of test suites and, consequently, the quality of the production code (deursen et al., 2001). additionally, poorly-written tests can be challenging to comprehend and make it onerous for testers to maintain the code and detect faults (bavota et al., 2015; grano et al., 2019).

the software testing literature has introduced a set of tools focused on validating the quality of test suites, mainly through metrics analysis. for example, codecover (available at: https://codecover.org) is an open-source java tool for code coverage, executed via a graphical user interface (within the eclipse ide) and the command line; tsdetect (available at: https://testsmells.github.io) is a command-line tool for test smells detection. other tools use code coverage results to predict test smells, such as teredetect (negar and garousi, 2010) and tecrevis (koochakzadeh and garousi, 2010). generally, these tools have many different data outputs, which might make it hard for testers to establish a relationship between code coverage and internal test code quality. moreover, several types of test smells have not been investigated in conjunction with code coverage yet, but could also provide opportunities to improve test code quality.

in previous studies (virginio et al., 2019, 2020), we introduced the jnose test, a tool to analyze the quality of test suites from the test smells perspective. the jnose test provides an automated test strategy focused on (i) identifying possible test design flaws, (ii) analyzing the software project quality evolution, and (iii) reducing the effort for performing quality assurance of a test suite. the jnose test integrates a conceptual framework which encompasses strategies for test smells prevention, identification, refactoring, and visualization to improve the test code quality; the raide (santana et al., 2020; available at https://raideplugin.github.io) and tsvizzevolution (available at https://github.com/arieslab/tsvizzevolution) tools are part of this framework.

in this study, we proposed the jnose-core, an api (application programming interface) to detect test smells in the test code. it provides a flexible architecture to support the insertion of new test smells detection rules. the jnose test implements the interface methods the jnose-core provides and organizes the data flow in a web-based user interface. in this new version, our tool: i) detects test smells at different code granularities (line, method, block, and class); ii) detects test smells more accurately according to the literature definitions; and iii) presents the outputs in a more user-friendly interface.
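to make the notion of a test smell concrete before the catalog in section 2, consider the toy junit 4 example below. it is our own illustration, not code from the studied projects, and the account class is hypothetical. the first test exhibits the assertion roulette smell (undocumented assertions whose failure is hard to attribute), while the second documents each assertion with a message.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// toy illustration of a test smell (assertion roulette); the account class
// below is hypothetical and exists only to make the example self-contained.
public class AccountTest {

    @Test
    public void depositSmelly() {
        Account account = new Account();
        account.deposit(100);
        // assertion roulette: if one of these fails, the report
        // does not say which condition was violated
        assertEquals(100, account.getBalance());
        assertEquals(1, account.getTransactionCount());
    }

    @Test
    public void depositDocumented() {
        Account account = new Account();
        account.deposit(100);
        // each assertion carries an explanation message
        assertEquals("balance after deposit", 100, account.getBalance());
        assertEquals("number of transactions", 1, account.getTransactionCount());
    }
}

class Account {
    private int balance;
    private int transactions;

    void deposit(int amount) {
        balance += amount;
        transactions++;
    }

    int getBalance() { return balance; }

    int getTransactionCount() { return transactions; }
}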
additionally, we extended our previous work by validating the test smells detection rules implemented in the jnose test tool. we conducted an empirical evaluation to investigate two objectives: (i) verify the jnose test accuracy compared with the tsdetect, in terms of precision and recall, at the class level; and (ii) verify the jnose test accuracy compared with a manual analysis, in terms of precision and recall, at a fine-grained level. the results show that, at the test class level, the jnose test obtained slightly better results than the tsdetect for specific types of test smells, such as assertion roulette, lazy test, and eager test. when analyzing the test smells at a fine-grained level, our tool showed higher accuracy in detecting the test smells location.

the remainder of this paper is structured as follows. section 2 introduces the test smells concept and types. section 3 presents an overview of the jnose-core api. section 4 presents the jnose test, a web application for test smells detection. section 5 describes the empirical study to evaluate the jnose test accuracy. section 6 presents the results. section 7 discusses related work. section 8 presents the threats to the validity of our study. finally, section 9 draws concluding remarks.

2 background

test code development is not a trivial task (palomba et al., 2018; virginio et al., 2019). in real-world practice, developers are likely to use anti-patterns during test development (bavota et al., 2012; junior et al., 2020). those anti-patterns may negatively impact the test code quality and maintenance and reduce its capability for detecting software faults (bell et al., 2018; spadini et al., 2020).

several studies have investigated different types of test smells. initially, deursen et al. (2001) defined a catalog of 11 test smells and refactorings to remove them from the test code. next, several authors extended this catalog and analyzed the test smells effects on the production and test code (meszaros et al., 2003; bavota et al., 2012; greiler et al., 2013; bavota et al., 2015; bell et al., 2018; virginio et al., 2019; spadini et al., 2020). as a result of the researchers' efforts to identify anti-patterns, garousi and küçük (2018) listed more than 190 test smells in a literature review. in this study, we selected twenty-one types of test smells currently discussed in the literature (peruma et al., 2019):

• assertion roulette (ar). it occurs when a test method contains non-documented assertions. if an assertion fails, it can be difficult to identify which one failed;
• conditional test logic (ctl). it occurs when a test method contains conditional expressions or loop structures. conditions within the test method may alter its behavior, which leads the test to fail;
• constructor initialization (ci). it occurs when a test method contains a constructor;
• default test (dt). it occurs when a test class is created by default;
• dependent test (dept). it occurs when the test being executed depends on other tests' success;
• duplicate assert (da). it occurs when a test method tests the same condition multiple times within the same test method;
• eager test (et). it occurs when a test method checks more than one method of the production class;
• empty test (ept). it occurs when a test method does not contain executable statements;
• exception catching throwing (ect).
it occurs when a test method is explicitly dependent on the production method throwing an exception;
• general fixture (gf). it occurs when the test methods only access part of the test case fixture (setup method);
• ignored test (igt). it occurs when a test method is suppressed from running;
• lazy test (lt). it occurs when several test methods check the same production method;
• magic number test (mnt). it occurs when assert statements contain numeric literals;
• mystery guest (mg). it occurs when a test method utilizes external resources (e.g., a file containing test data), and thus it is not self-contained;
• print statement (ps). it occurs when unit tests contain print statements;
• redundant assertion (ra). it occurs when the test method contains an assertion statement that is always true or always false;
• resource optimism (ro). it occurs when a test method makes optimistic assumptions about the existence and state of external resources;
• sensitive equality (se). it occurs in test methods that contain an equality check using a tostring() method; the test may fail when the tostring() method is changed;
• sleepy test (st). it occurs when the execution of a test method is paused for a certain period (e.g., to simulate an external event) and then continues its execution;
• unknown test (ut). it occurs when a test method does not encompass an assertion statement;
• verbose test (vt). it occurs when the tests use too much code to do what they are supposed to do. in other words, the test code is not clean and simple.

3 jnose core

in our previous work (virginio et al., 2020), we introduced the first version of the jnose test, a web application for the detection and coverage calculation of test smells. we reused and also expanded the test smells detection rules from the tsdetect (peruma et al., 2020). therefore, the jnose test provides: (i) a graphical interface to facilitate the interaction between user and tool, (ii) the amount and location of the detected test smells, and (iii) support for the test smells analysis through several project versions.

when improving the detection rules from tsdetect, we faced some challenges regarding the coupling and dependency between the test framework and the test code. test frameworks, specifically the junit framework (a java library for testing source code, which has advanced to the de-facto standard in unit testing; available at https://junit.org/), require different implementations depending on the version used. for example, junit 4 uses the tag @ignore to disable a test class or test method, while junit 5 uses the tag @disabled. regarding the assertions, junit 4 accepts an optional error message parameter as the first argument, while junit 5 uses the last argument in the method signature.

therefore, to facilitate the expansion of the detection rules and their reuse by other tools, we implemented the jnose-core api (available at https://github.com/arieslab/jnose-core). it is beneficial for the conceptual framework we are working on to evaluate the test code quality: the detection module is the framework base, and the test smells detected are the same that should be removed by the refactoring module (raide tool) and presented to the user by the visualization module (tsvizzevolution).

3.1 architecture

we designed the jnose-core as a maven project (maven is a software project management and comprehension tool, able to manage a project's build, reporting, and documentation from a central piece of information; available at https://maven.apache.org/) to simplify and standardize the build process. additionally, we provide a compiled version of the jnose-core that can be imported by other projects built with maven.
the requirement to use the compiled version is to import the library in the pom.xml of the project, as listing 1 shows. as a result, the jnose-core provides methods to instantiate the test smells detection.

listing 1: pom.xml configuration to use jnose-core

<dependency>
    <groupId>br.ufba.jnose</groupId>
    <artifactId>jnose-core</artifactId>
    <version>0.7-SNAPSHOT</version>
</dependency>

the jnose-core is licensed under the gnu general public license, and its architecture comprises four packages, as follows (figure 1):

• core. it implements the jnosecore, a facade class that receives an instance of the config interface; the config interface contains the method signatures for the test smells detection;
• detector. it implements a structure to detect the smelly elements and contains classes to support a static analysis of the test code through an ast (abstract syntax tree) generated by javaparser (available at: https://javaparser.org/);
• smell. it implements the detection rules for junit 4 and improves the detection rules from tsdetect (section 2) to identify test smells at different granularity levels. several classes are implemented (one for each type of test smell) and use javaparser to collect additional information on the location and number of test smells;
• dto (data transfer object). it implements the classes responsible for transferring data among the packages.

figure 1. jnose-core api internal architecture

3.2 detection rules

we revisited the test smells definitions in the literature to identify how we should improve the detection rules from tsdetect. table 1 shows the detection rules and the granularity levels that we defined to detect the exact test smells location in the test code, as follows: (i) line, test smells that occur in a specific line; (ii) block, test smells that occur at a statement block level, e.g., try/catch and conditional statements; (iii) method, test smells that occur at the method level; and (iv) class, test smells that occur at the test class level.

table 1. test smells detection rules
name | detection rule | granularity
assertion roulette | a line with assertion statements without the explanation/message parameter | line
constructor initialization | a method that is a constructor declaration | method
conditional test logic | a code block with conditional statements | block
duplicate assert | a line with an assertion whose parameters equal those of another assertion inside the same test method | line
default test | a method called exampleunittest() or exampleinstrumentedtest() | method
dependent test | a method that depends on the previous execution of another test method | method
empty test | a method that does not contain a single executable statement | method
eager test | a line that contains a call to another production method | line
exception catching throwing | a block that contains either a throw statement or a catch clause | block
general fixture | a line with a field instantiated within the setup() method that is not utilized by all test methods | line
ignored test | a method that contains the @ignore annotation | method
lazy test | a line of a method that calls the same production method called by another test method | line
mystery guest | a method that accesses object instances of file and database classes | method
magic number test | a line with an assertion method that contains a numeric literal as an argument | line
print statement | a line that invokes the print(), println(), printf(), or write() method of the system class | line
redundant assertion | a line containing an assertion statement in which the expected and actual parameters are the same | line
resource optimism | a method that uses an external resource without checking the state of the object | method
sensitive equality | a method that contains an assertion that invokes the tostring() method of an object | method
sleepy test | a line that invokes the thread.sleep() method | line
unknown test | a method that uses the @test annotation but does not contain an assertion statement | method
verbose test | a method with more than 30 lines, counting non-executable statements and annotations | method
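before detailing the improvements, the toy junit 4 test below (ours, not taken from any evaluated project; the parser class is hypothetical) shows several smells from table 1, annotated with the granularity at which they would be reported:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// toy junit 4 test (ours) annotated with test smells from table 1 and the
// granularity level at which each one is reported; parser is hypothetical.
public class ParserTest {

    @Test
    public void parseEvenNumbers() throws InterruptedException {
        Parser parser = new Parser();

        // block granularity: conditional test logic (loop and conditional)
        for (int i = 0; i < 3; i++) {
            if (i % 2 == 0) {
                parser.feed(i);
            }
        }

        // line granularity: sleepy test (thread.sleep call)
        Thread.sleep(1000);

        // line granularity: assertion roulette (no explanation message)
        // and magic number test (numeric literal as an argument)
        assertEquals(2, parser.count());
        // line granularity: duplicate assert (same condition tested again)
        assertEquals(2, parser.count());
    }
}

class Parser {
    private int count;

    void feed(int n) {
        count++;
    }

    int count() {
        return count;
    }
}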
additionally, we made improvements in the test smells detection rules. we next detail the main modifications we performed:

• nested structures. we improved the rules for detecting the ctl, ect, and mnt test smells to consider nested structures. when the tool reports a nested conditional structure as one test smell, it might be hard to identify which part of the test code needs refactoring at first glance. if the nested conditional is too long, the user may refactor only parts of it; when rerunning the tool, the user will see that the problem is still there, making the refactoring process longer. therefore, the tool presents one test smell for each structure;
• empty or non-assertive. the ut and ept test smells present similar definitions: the ut test smell identifies methods without assertions, and the ept test smell identifies methods without executable statements. test methods without a body contain neither executable statements nor assertions. therefore, we added another rule to separate both definitions; the ut test smell identifies methods that contain a body but no assertions;
• general fixture. the gf test smell occurs when test methods use only part of the setup method, representing the cohesion among the test class's methods. therefore, we improved the detection rules to show whether all the test class methods use the setup fixtures; this allows the user to identify the test method to which a fixture should be moved;
• missing structures. each version of the test framework requires the static analysis of different code structures. the assert structures used in junit 3 are different from those in junit 4, which are also different from junit 5.
4 jnose test

the jnose test (available at https://jnosetest.github.io) enables test code quality analysis through test smells detection and code coverage over several software project versions. therefore, it is possible to compare whether a project's test quality has either improved or declined throughout its life cycle. the jnose test operation involves three key processes (figure 2): (i) data input, which receives the settings for the tool execution, i.e., the list of types of test smells, the analysis mode (by testclass, by testsmell, by testfile, and evolution), and the project to be analyzed; (ii) project analysis, which calls the jnose-core, an api to perform the project analysis according to the selected analysis mode; and (iii) data output, which shows the execution status and the analysis results.

figure 2. schematic overview of the jnose test tool and its main features

4.1 processes description

java development kit (jdk) 11 and maven 3 (or higher) are necessary to install the jnose test. upon installation, the user is able to use jetty (embedded in maven) to build and run the jnose test.

after starting the tool, the user must configure the data input (figure 2). first, the user should import the projects to be analyzed (figure 3a, step 1). the jnose test clones the repository directly from github and allows the user to manage it (figure 3a, step 2). second, the user selects the analysis mode, i.e., by testclass, by testsmells, by testfile, or evolution (figure 3a, step 3). each analysis mode provides a menu where the user chooses the repositories to be analyzed. by default, the tool detects twenty-one types of test smells, but the user can configure this feature as well (figure 3a, step 4).

after completing the project import and defining the detection settings, the tool starts the project analysis (figure 2). for each analysis mode, the jnose test tool presents an interface with (i) a list of cloned projects (figure 3b, step 1), (ii) a menu with specific analysis mode settings (figure 3b, step 2), and (iii) a menu with the data output options (figure 3b, step 3). the project analysis considers the analysis mode selected by the user, as described below.

(1) by testclass. in the data input process, the user can enable the coverage metrics calculation and select the projects to be analyzed. then, to analyze the project by test class, the project analysis calls the jnose-core and optionally executes the code coverage module. finally, the data output process generates a view that contains a table with the number of test smells by test class. that table presents a row for each test class, and each column represents the type of parameter collected: project name, test class and production class location, twenty-one columns for the types of test smells, the number of test class lines, the number of test methods, and five columns with coverage data. that table can be downloaded as a .csv file (a post-processing sketch is given at the end of this subsection). additionally, the user can view a chart, or download it as a .png file, with the amount of each test smell in the project.
(2) by testsmells. the project analysis process only calls the jnose-core to analyze the project by test smell. during the data input process, the user needs to select the projects to explore. unlike the previous analysis, by testsmells provides the exact location of a test smell. lastly, the data output offers a view with the data analysis results, which can also be downloaded as a .csv file. each row of the table represents a test smell, and it has five columns to show the type of parameter collected: the project name, the test class location, the production class location, the test smell name, and the test smell location.

(3) by testfile. the project analysis process only calls the jnose-core to analyze the project by test file. during the data input process, the user should select a test class and, optionally, its respective production class. although the production class selection is optional, the eager test and lazy test test smells are not detected without it. then, the data output provides a view containing a row for each detected test smell and its location.

(4) evolution. the project analysis process executes the git mining module and the jnose-core to analyze the project by version. during the data input, the user should select the projects to explore and the search to be applied (by commits or by tags). this analysis provides the test smell detection for each project version, in addition to data about the author who committed the test smell. the data output process provides a view containing the data analysis results by test smells, downloadable as a .csv file. the table rows represent the test classes by commit. the columns encompass the following parameters: project name, test class and production class location, number of test smells, commit identification, authorship, date, and message. additionally, the user can view a chart, and download it as a .png file, with the amount of test smells in each project version or the number of test smells committed by an author. the tool also automatically calculates the authorship of a test smell by guilt, i.e., the tester who last modified the method and did not fix it.

different analysis modes allow different data visualizations. therefore, the data output generates tables or charts depending on the analysis mode. tables are generated for all analysis modes (figure 3c). charts are generated for by testclass and evolution: by testclass charts present the total amount of test smells inserted in a project, and evolution charts present the amount of test smells by project version or by author.

figure 3. jnose test process execution: (a) data input, cloning projects from github; (b) project analysis, configuring the by testclass analysis mode; (c) data output, an excerpt of the table with the by testclass results
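because the by testclass view exports to .csv, the results can be post-processed outside the tool. the sketch below is only illustrative: the file name is hypothetical, and the column position used (test class name in the second column) is an assumption based on the column order described above:

import java.nio.file.Files;
import java.nio.file.Path;

public class ByTestClassCsvReader {
    public static void main(String[] args) throws Exception {
        // skip the header row and print the (assumed) test class column of each row;
        // jnose-by-testclass.csv is a hypothetical export file name
        Files.lines(Path.of("jnose-by-testclass.csv"))
             .skip(1)
             .map(row -> row.split(","))
             .forEach(cols -> System.out.println(cols[1]));
    }
}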
4.2 tool architecture

the jnose test is implemented as a java project and comprises five packages, as figure 4 shows: (i) base, responsible for instantiating the jnose-core interface implementation and calculating the coverage metrics; (ii) page, responsible for presenting the web pages and their content; (iii) dtolocal, responsible for encompassing the classes used in dto; (iv) entity, responsible for the domain objects persistence in the database; and (v) business, responsible for applying the business rules to present the results.

figure 4. packages of the jnose test

the base package implements the project analysis (figure 2) and was split into three other packages, as follows:

• coverage. it applies the rules necessary to calculate coverage. it runs the jacoco library (https://www.eclemma.org/jacoco/) to calculate code coverage in the java language. it performs dynamic analysis of the production code branches (bc), instructions (ic), lines (lc), complexity (cc), and methods (mc) to determine which ones are either missed or covered by the tests (virginio et al., 2019);
• git mining. it applies the business rules for github mining. it uses the github api for java library (https://github-api.kohsuke.org/) to clone the projects from github and extract information about the project's tags, commits, and authors (a minimal sketch follows at the end of this subsection);
• jnose-core. it performs test code static analysis through an ast generated by javaparser (https://javaparser.org/). then, it extracts information about the code structure to apply the rules for the test smells detection, and it collects additional information about the location and number of test smells. the detection rules were improved from the tsdetect tool (section 2) to identify test smells at different granularity levels (table 1).

the jnose test interface was implemented in the page package, based on apache wicket (https://wicket.apache.org/), a framework for web application development in java. we also used html5 and css3 to develop the web pages. this package implements the data input (figure 2).

the business package implements utility classes responsible for generating the results. it is possible to generate a different type of report for each analysis mode. this package implements the data output (figure 2).

in the dto package, we have the classes used to transfer data among the project layers. that package implements the communication among data input, project analysis, and data output (figure 2). additionally, a local database stores the data generated by those processes, with the persistence rules implemented in the entity package.

the jnose test execution uses parallel processes, i.e., the tool creates threads for each uploaded project, for each test class, and so on. with parallel processing, the jnose test can be used to analyze a massive set of projects in a short time (virginio et al., 2019).
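as an illustration of what the git mining module builds on, the github api for java library can enumerate a repository's commits with a few calls. this is not the jnose test's actual mining code, only a minimal sketch (the repository name is an example, and anonymous access is assumed):

import org.kohsuke.github.GHCommit;
import org.kohsuke.github.GHRepository;
import org.kohsuke.github.GitHub;

public class CommitLister {
    public static void main(String[] args) throws Exception {
        // connect without credentials and open an example repository
        GitHub gh = GitHub.connectAnonymously();
        GHRepository repo = gh.getRepository("apache/commons-io");
        // list commit id, author, and date: the raw material for a
        // version-by-version test smells analysis
        for (GHCommit commit : repo.listCommits()) {
            System.out.printf("%s %s %s%n",
                    commit.getSHA1(),
                    commit.getCommitShortInfo().getAuthor().getName(),
                    commit.getCommitShortInfo().getCommitDate());
        }
    }
}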
4.3 running example

in previous work, we carried out an experimental study to verify the correlation between the coverage metrics and test smells. we selected eleven software projects to perform that study, in which we collected twenty-one test smells and five coverage metrics using the jnose test. this section presents an example considering the different types of analysis modes supported by the jnose test. we used the commons-io project (release 2.7-rc1; https://github.com/apache/commons-io), a library of utilities to assist i/o development. we next discuss each supported mode.

4.3.1 by testclass analysis

we ran the jnose test by testclass to analyze which type of test smells would achieve the highest diffusion over the commons-io project. therefore, we took the following steps: (i) select all types of test smells; (ii) select the project path; and (iii) enable code coverage. the tool returned 58 test classes. we checked the number of classes where each test smell was present to understand the test smell type diffusion. for example, the ect test smell was present in 23 classes, followed by the ar test smell in 17 test classes, and the et test smell in 16 test classes. each type of test smell can occur many times in a test class; those three types of test smell presented the highest occurrence in the project, counting 316, 175, and 157 times, respectively. table 2 shows the five test classes with the highest number of ect, ar, and et test smells. for example, the test class proxycollectionwritertest contains the highest number of those test smells. additionally, most test classes achieved good code coverage when considering the ic, lc, and mc coverage metrics (>70%). therefore, even with high coverage, the test code might present low quality.

4.3.2 by testsmell

once we found that the ect, ar, and et test smells had the highest diffusion numbers in the commons-io project test classes, we could improve the test code quality by fixing the problems. then, we executed the jnose test by testsmell by taking the following steps: (i) select the ect, ar, and et test smells; and (ii) select the project. table 3 shows a results excerpt filtered by the proxycollectionwritertest test class.

4.3.3 by testfile

in the previous example (by testsmells), we filtered the results to present only the ones related to the proxycollectionwritertest test class. in the by testfile analysis, that class can be analyzed individually. therefore, we executed the jnose test by taking the following steps: (i) select the ect, ar, and et test smells; and (ii) select the proxycollectionwritertest and proxycollectionwriter files. the results are the same as the filtered ones presented in table 3.

listing 2 shows the proxycollectionwritertest test class with the testarrayioexceptiononappendchar1() test method (lines 39-53). we observed that the assertequals() method is called twice within the test method (lines 50-51). each one checks a different condition, but there is no explanation message for them. thus, if the test method fails, there is no clue to identify which assertion caused the failure; that issue refers to the ar test smell. moreover, those assertions are also related to the ect test smell, because they may fail when a specific exception occurs. furthermore, a test method is supposed to check just one production class method; otherwise, the code has an et test smell (proxycollectionwriter() on line 43 and append() on line 46).

4.3.4 evolution analysis

the evolution analysis might help us identify whether the commons-io test suite has improved over time. we should take the following steps to perform this analysis: (i) select all test smells, (ii) select the analysis by commit, and (iii) select the project path.
table 2. classes with high diffusion of test smells (by testclass).

testfilename | ... | loc | met | ut | igt | ro | ... | st | lt | da | et | ar | ctl | ci | dt | ept | ect | gf | mg | ps | dpt | ic | bc | lc | cc | mc
proxycollectionwritertest | ... | 448 | 23 | 1 | 0 | 0 | ... | 0 | 61 | 1 | 23 | 21 | 1 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 72 | 0 | 76 | 100 | 100
trewritertest | ... | 448 | 23 | 1 | 0 | 0 | ... | 0 | 30 | 1 | 2 | 21 | 1 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 100 | 0 | 100 | 100 | 100
proxywritertest | ... | 275 | 21 | 3 | 0 | 0 | ... | 0 | 23 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 83 | 0 | 87 | 93 | 93
boundedreadertest | ... | 246 | 22 | 1 | 1 | 1 | ... | 0 | 48 | 1 | 8 | 3 | 2 | 0 | 0 | 0 | 16 | 0 | 1 | 0 | 0 | 100 | 100 | 100 | 100 | 100
endianutilstest | ... | 316 | 22 | 1 | 0 | 0 | ... | 0 | 46 | 8 | 20 | 15 | 1 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 100 | 100 | 100 | 100 | 100

table 3. test smells location in proxycollectionwritertest.

testfilename | ... | testsmell | methodlocationname | lines
proxycollectionwritertest | ... | ar | testarrayioexceptiononappendchar1 | 50,51
proxycollectionwritertest | ... | ar | testarrayioexceptiononappendchar2 | 66,67
proxycollectionwritertest | ... | ar | testarrayioexceptiononappendcharse | 82,83
proxycollectionwritertest | ... | et | testarrayioexceptiononappendchar1 | 50,51
proxycollectionwritertest | ... | et | testarrayioexceptiononappendchar2 | 66,67
proxycollectionwritertest | ... | et | testarrayioexceptiononappendcharse | 82,83
proxycollectionwritertest | ... | ect | testarrayioexceptiononappendchar1 | 45-52
proxycollectionwritertest | ... | ect | testarrayioexceptiononappendchar2 | 61-69
proxycollectionwritertest | ... | ect | testarrayioexceptiononappendcharse | 77-84

37 public class ProxyCollectionWriterTest {
38
39     @Test
40     public void testArrayIOExceptionOnAppendChar1() throws IOException {
41         final Writer badW = new BrokenWriter();
42         final StringWriter goodW = mock(StringWriter.class);
43         final ProxyCollectionWriter tw = new ProxyCollectionWriter(badW, goodW, null);
44         final char data = 'a';
45         try {
46             tw.append(data);
47             fail("Expected " + IOException.class.getName());
48         } catch (final IOExceptionList e) {
49             verify(goodW).append(data);
50             assertEquals(1, e.getCauseList().size());
51             assertEquals(0, e.getCause(0, IOIndexedException.class).getIndex());
52         }
53     }

listing 2: proxycollectionwritertest test class (excerpt)

the project has 2,337 commits, 52 releases, and 56 contributors from its beginning until release 2.7-rc1. we filtered the results for the five test classes with the most ect, et, and ar test smells (table 4). figure 5 shows the evolution of those classes and of the project. the proxycollectionwritertest, trewritertest, and proxywritertest test classes are stable, as no test smell was either inserted or fixed. however, the boundedreadertest test class presented novel test smells during 2014-2016, which were fixed during 2016-2020. we could observe that the number of test smells increased over time, which might indicate that the people involved in the project test suite development have not worked to get rid of test smells yet. in addition, authorship is calculated by guilt, so the authors in this example might not have inserted all the detected test smells.

table 4. classes with high diffusion of test smells (evolution).

testfilename | ... | testsmell | commitid | commitname | commitdate
proxycollectionwritertest | ... | 153 | b739ce7c | adam retter | 03:39:47 2020
proxycollectionwritertest | ... | 153 | bcb36041 | david georg | 00:09:03 2018
trewritertest | ... | 101 | b739ce7c | adam retter | 03:39:47 2020
trewritertest | ... | 101 | bcb36041 | david georg | 00:09:03 2018
proxywritertest | ... | 59 | b739ce7c | adam retter | 03:39:47 2020
proxywritertest | ... | 59 | bcb36041 | david georg | 00:09:03 2018
boundedreadertest | ... | 92 | b739ce7c | adam retter | 03:39:47 2020
boundedreadertest | ... | 96 | 51f13c84 | kristian rose | 15:36:15 2016
boundedreadertest | ... | 83 | 9a9b8385 | gary d. greg | 01:17:05 2014
endianutilstest | ... | 118 | b739ce7c | adam retter | 03:39:47 2020
endianutilstest | ... | 117 | 8940848g | gary d. greg | 18:47:06 2018

figure 5. evolution of the commons-io project and classes with high diffusion of test smells
5 empirical evaluation

this empirical evaluation aims to investigate the jnose test accuracy in detecting test smells. we designed the empirical study in four steps, as figure 6 shows: (i) dataset selection, in which we defined the test classes to analyze; (ii) oracle definition, in which we manually detected the test smells instances; (iii) data collection, in which we applied the jnose test and the tsdetect to collect the test smells instances; and (iv) data analysis, in which we analyzed the collected data to investigate our objectives.

figure 6. steps to conduct the experiment

5.1 dataset selection

for this analysis, we used the dataset made available by peruma et al. (2020), which contains 65 test classes extracted from github projects. as we initially reused the jnose test detection rules from the tsdetect, we decided to use the same dataset they used, to perform a fair comparison between both tools and to assess the jnose test effectiveness. to build the dataset, peruma et al. (2020) selected android apps that were neither duplicated nor forked. upon the smells identification in a test file, they randomly selected 65 test classes from the selected projects and followed the definitions to detect the test smells. although the tsdetect implements detection rules for twenty-one types of test smells, only nineteen were validated: it did not detect the dt and dpt test smells. the same limitation applies to our study.

since we did not have access to the results of the manual detection performed by peruma et al. (2020), we created a new oracle using the same test and production classes for this study. even if we had access to the peruma et al. (2020) manual detection results, we would have to detect the test smells at a fine-grained level to validate the jnose test. the reason is that the jnose test detects the test smells' exact location, rather than just their presence (like the tsdetect).

5.2 oracle definition

to manually detect the test smells instances, we followed a not fully crossed design to assign coders to the subjects, i.e., different subjects are analyzed by different subsets of coders (hallgren, 2012). the subjects are the 65 test classes, and four authors of this study served as coders. the coders are experts in test smells with at least three years of experience. additionally, their java programming experience ranged from 4 to 15 years, including unit test development.

we organized the coders into two groups of two coders each, where one group analyzed 32 test classes and the other group 33 test classes. two coders individually analyzed each test class. they collected data regarding the test smells type and location, following the definitions from table 1. as a result, each coder generated a document with all the test smells detected. subsequently, the coders compiled the individual records into one document after discussing the divergences. the review process of the manually detected test smells was time- and effort-consuming (about 60 minutes per test class). the final oracle version supports the detection of eighteen types of test smells: in addition to the non-existence of the dpt and dt test smells in the dataset, previously reported by peruma et al. (2020), we did not detect any igt test smell instances.
the analysis process of the test classes and the discussion about the classification divergences took about 60 hours.

5.3 data collection

data collection consisted of analyzing the 65 test classes in two different ways: detection with the tsdetect and detection with the jnose test tool.

detection with tsdetect. we downloaded the tsdetect version 2.0 to collect the data. it executes three modules: (i) the test file detector, to detect the test classes; (ii) the test file mapping, to link the test classes to production classes; and (iii) the tsdetect, to detect the test smells. all modules were executed sequentially by command line in the terminal. as a result, the tsdetect generates a file that contains a boolean value for each type of test smell detected in the test class; therefore, the result provided by the tsdetect has a class-level granularity. the detection process took about 7 minutes, considering the tool execution time and the participants' expertise with the operating system terminal to run the necessary commands.

detection with jnose test. we used the jnose test version 2.1 to detect the test smells. after running the tool, the output file encompassed each test smell detected for each test class. the test smells detection granularity followed table 1. the automated detection with the jnose test took about 1 minute, due to the unified process to detect the test classes, production classes, and test smells. a friendly graphical interface makes this process easier.

5.4 data analysis

we used the oracle to calculate the jnose test and tsdetect accuracy against the manual analysis. both tools present distinct granularity levels to detect test smells: the tsdetect indicates whether a test class contains a test smell instance, i.e., it returns a boolean value for each test smell in a class, while the jnose test detects all instances of a test smell with their exact location (line, block, method, or class). therefore, we carried out what follows:

1. we compared the jnose test and tsdetect accuracy considering the class level. we treated the jnose test output to show boolean values at the class level, to compare with the tsdetect. as the jnose test detection rules were reused from the tsdetect, our goal is to determine the extent to which we improved those detection rules. in this comparison, the accuracy is given at the class level in terms of precision and recall.

2. we compared the jnose test and manual analysis accuracy considering a fine-grained level. for example, the ar test smell is detected at the line level; therefore, we collected the corresponding data at the line level, both manually and automatically. our goal is to show the jnose test accuracy in indicating the test smells location. therefore, we provide the accuracy value at a fine-grained level in terms of precision and recall.

6 results

this section reports the results of our empirical study. the data for replication purposes are available online (virgínio et al., 2021).

6.1 comparison between jnose and tsdetect

table 5 reports the accuracy, in terms of precision and recall, when detecting test smells with the jnose test and with the tsdetect. this comparison was made at the test class level.
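for reference, and assuming the standard definitions over true/false positives and negatives (tp, fp, tn, fn) with the oracle as ground truth, the metrics reported in tables 5 and 6 are commonly computed as:

\[
\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{precision} = \frac{TP}{TP + FP},
\]
\[
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.
\]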
table 5. jnose test and tsdetect comparison at the class level (all values in %).

test smell | acc. jnose | acc. tsdetect | prec. jnose | prec. tsdetect | rec. jnose | rec. tsdetect | f1 jnose | f1 tsdetect
ar | 100 | 75.38 | 100 | 90 | 100 | 75 | 100 | 78
ci | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
ctl | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
da | 98.46 | 96.92 | 99 | 98 | 98 | 97 | 99 | 97
ect | 100 | 46.15 | 100 | 92 | 100 | 46 | 100 | 55
et | 95.38 | 86.15 | 95 | 87 | 95 | 86 | 95 | 86
ept | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
gf | 98.46 | 98.46 | 99 | 99 | 98 | 98 | 99 | 99
lt | 100 | 93.85 | 100 | 94 | 100 | 94 | 100 | 94
mg | 90.77 | 90.77 | 92 | 92 | 91 | 91 | 89 | 89
mnt | 95.38 | 90.77 | 96 | 92 | 95 | 91 | 95 | 90
ps | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
ra | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
ro | 89.23 | 89.23 | 91 | 91 | 89 | 89 | 88 | 88
se | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
st | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
ut | 100 | 93.85 | 100 | 94 | 100 | 94 | 100 | 94

the results obtained with the tsdetect diverge from those reported by peruma et al. (2020). that study yielded precision values from 85.71% to 100% and recall values from 95% to 100%, and could detect nineteen types of test smells. when using our oracle, the tsdetect achieved a precision from 87.71% to 100% and a recall from 46% to 100% for eighteen types of test smells. as we mentioned earlier, we did not detect any igt test smell instances with either of the tools. those divergences highlight the challenges of building an oracle, due to the different interpretations that a coder may have about the test smells definitions.

regarding the results obtained with the jnose test, the precision ranged from 91% to 100%, and the recall from 89% to 100%, to detect eighteen types of test smells. as we reused the tsdetect detection rules, we showed the improvements we achieved. considering the f1-score metric, the jnose test presented an accuracy improvement of 45% for the ect test smell, followed by 22% for the ar test smell, 11% for the vt test smell, 9% for the et test smell, 6% for the lt and ut test smells, 5% for the mnt test smell, and 2% for the da test smell. the other test smells detection rules did not present any relevant improvement at the test class level.

next, we show the reason for the divergence between the results obtained by the tools for the ect test smell detection. the jnose test considers three compliant solutions to handle exceptions (listing 3): (i) the use of the @test annotation with the expected parameter (lines 1-4), (ii) the use of the assertthrows statement (lines 6-9), or (iii) throwing the exception in the method signature (lines 11-14). as a non-compliant solution, it considers the try/catch structure within the method body (lines 16-23). the tsdetect considers both the try/catch structure and the throw in the method signature as non-compliant solutions (lines 11-23).

1  @Test(expected = Exception.class)
2  public void tag_usage() {
3      // some code
4  }
5
6  @Test
7  void throws_statement_usage() {
8      assertThrows("exception message", Exception.class, parameter);
9  }
10
11 @Test
12 public void throws_signature_usage() throws Exception {
13     // some code
14 }
15
16 @Test
17 public void try_catch_usage() {
18     try {
19         // some code
20     } catch (MyException e) {
21         Assert.fail(e.getMessage());
22     }
23 }

listing 3: (non)compliant solutions for ect considered by the jnose test

regarding the ar test smell, we identified that the tsdetect does not consider the junit overloaded methods when using an assert statement. for example, assertequals (listing 4) asserts that (i) two objects are equal (lines 1-9) or (ii) two objects are equal within a positive delta (lines 11-19); the optional value is a string that describes the assertion. the tool simplifies the number of parameters expected by the assert statement: it detects as a test smell only methods with two parameters (lines 1-4). the problem occurs because the tool always classifies assertequals as a non-test smell when the assert has three parameters; however, it is necessary to verify the fourth parameter to decide whether it is a test smell or not. we improved the jnose test in this direction.

1  @Test
2  public void two_parameters() {
3      assertEquals(float expected, float actual)
4  }
5
6  @Test
7  public void three_parameters_with_message() {
8      assertEquals(String message, float expected, float actual)
9  }
10
11 @Test
12 public void four_parameters() {
13     assertEquals(String message, float expected, float actual, float delta)
14 }
15
16 @Test
17 public void three_parameters_no_message() {
18     assertEquals(float expected, float actual, float delta)
19 }

listing 4: solutions for ar considered by the jnose test
additionally, there was a conflict in the ept and ut test smells definitions. the ept test smell is a test method without executable statements (an empty method); the ut test smell is a test method with executable statements but no assertions. the tsdetect considers methods without a body as both ept and ut. therefore, we implemented the rules necessary to differentiate those test smells.

we also performed some minor fixes to detect other types of test smells. for example, for the vt test smell, the tsdetect considers a class with more than 123 lines as one verbose test. as the jnose test detects the test smells at a fine-grained level, we defined that a test method with more than 30 lines is verbose; therefore, we found more instances because of our definition.

6.2 jnose and manual analysis comparison

table 6 reports the accuracy, through precision and recall values, when detecting test smells with the jnose test and with the manual analysis. this comparison considered the granularity level defined for each test smell.

table 6. jnose test and manual analysis comparison at the fine granularity level (all values in %).

test smell | accuracy | precision | recall | f1-score
ar | 100 | 100 | 100 | 100
ci | 100 | 100 | 100 | 100
ctl | 100 | 100 | 100 | 100
da | 94.12 | 100 | 94 | 97
ect | 100 | 100 | 100 | 100
et | 89.13 | 100 | 89 | 94
ept | 100 | 100 | 100 | 100
gf | 90 | 100 | 90 | 95
lt | 96.55 | 100 | 97 | 98
mg | 50 | 100 | 50 | 67
mnt | 94.74 | 100 | 95 | 97
ps | 100 | 100 | 100 | 100
ra | 100 | 100 | 100 | 100
ro | 47.06 | 84 | 47 | 60
se | 100 | 100 | 100 | 100
st | 100 | 100 | 100 | 100
ut | 100 | 100 | 100 | 100
vt | 100 | 100 | 100 | 100

at a fine-grained level, the jnose test precision score ranges from 84% to 100%, and the recall ranges from 47% to 100%. at the class level, the detection difficulties related to specific cases are not evident, because the tool returns a boolean value for test smells in the whole test class. however, when we performed a more detailed test smell detection, we noticed some test code-specific characteristics that the tool does not detect.

the most divergent results between the class- and fine-granularity levels are the mg and ro test smells. at the class level, those test smells have an accuracy of 90.77% and 89.23%, respectively; at the fine-grained level, however, they present an accuracy of 50% and 47.06%, respectively. both test smells deal with external resources. a test method that makes optimistic assumptions about the existence of external resources has the ro test smell (listing 5, lines 10-21). a test method that uses external resources has the mg test smell (listing 5, lines 2-5).
as the jnose test performs test code static analysis, we only considered the direct calls to external resources (listing 5, lines 1-15). however, if a test method calls a production class from any part of the project and that class calls external resources, the test class uses external resources indirectly (listing 5, lines 17-21). in this scenario, the mg and ro test smells detection needs additional work to determine the indirect calls.

1  @Test
2  public void external_file() {
3      File file = openFile("config.xml");
4      if (file.exists()) {
5          XmlPullParser config = XmlParserFactory.fromFile(file);
6          // some code
7      }
8  }
9
10 @Test
11 public void external_file_without_checking() {
12     File file = openFile("config.xml");
13     XmlPullParser config = XmlParserFactory.fromFile(file);
14     // some code
15 }
16
17 @Test
18 public void external_resource_indirectly() {
19     XmlReader reader = new XmlReader("xml/config.xml");
20     // some code
21 }

listing 5: mystery guest and resource optimism

we also identified a specific characteristic that can produce false positive instances of the da test smell. that false positive occurs when a test method uses an assertion structure, implemented by a json library, that is similar to the assertion structure implemented by junit. this is because junit has the assertthat(string reason, t actual, m matcher) structure, while the jsonassert library implements assertthat(string).contains(string). when performing the static analysis, all the statements that start with assert were considered a junit assertion. therefore, we may improve the tool by detecting the libraries imported in the test class; currently, the tool might miss test smell instances if a test class uses another assert library.
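to illustrate the ambiguity, compare the two assertthat shapes below. the junit 4 call compiles against org.junit.assert and hamcrest; the json-library call is shown only as a comment, schematically following the description above, since we do not pin a concrete jsonassert dependency here. a purely syntactic analysis sees the same method name in both cases:

import static org.hamcrest.CoreMatchers.equalTo;
import static org.junit.Assert.assertThat;

public class AssertThatAmbiguity {

    public void junitStyle(String actual) {
        // junit 4 form: assertThat(String reason, T actual, Matcher<? super T> matcher)
        assertThat("values should match", actual, equalTo("expected"));
    }

    public void jsonLibraryStyle(String actualJson) {
        // schematic fluent form from the json assertion library described above:
        // assertThat(actualJson).contains("expected fragment");
        // without inspecting imports, a static analyzer cannot tell this
        // assertThat apart from the junit one
    }
}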
other types of test smells required minor fixes. the lt and et test smells miss some instances due to default constructors. we considered that, in the same way that different test methods should not call the same production class method, a class should not be instantiated several times in different test methods; if many test methods need to instantiate the same object, that instantiation should be moved to a setup method. therefore, we need to improve the jnose test to detect calls to default constructors.

7 related work

in large-sized test suites, software engineers barely perform manual detection of test smells; this practice is rather time-consuming and infeasible in many scenarios. therefore, the research community has proposed automated tool support for detecting test smells. the test smell detector (tsd) detects nine types of test smells (bavota et al., 2015). the tsd detection rules overestimate the presence of test smells in the code to ensure a high recall (87%); it returns a list of candidate affected classes. similarly, the tsdetect, the state-of-the-art tool to detect test smells, identifies twenty-one types of test smells (section 2). it indicates whether a particular test smell appears in the test class, with a precision score ranging from 85% to 100% and a recall score from 90% to 100% (peruma et al., 2020).

other tools correlate test smells with structural and coverage metrics. the intellij plug-in coined vitrum (visualization of test-related metrics) is an extension of the tsdetect; it collects a set of seven types of test smells and structural metrics (pecorelli et al., 2020). teredetect (negar and garousi, 2010) and tecrevis (koochakzadeh and garousi, 2010) use code coverage analysis, supported by codecover, to detect test smells related to code duplication.

our tool uses a rule-based test smells detection instead of a metric- or coverage-based detection. it extends the tsdetect tool in several respects. for example, our tool provides the number of test smells identified in a test class, and the method name and line of each test smell's location. moreover, it supports the test suite analysis through several project versions, by mining git to provide information about when and by whom the test smells were introduced. additionally, our tool supports other tools for test smells refactoring (raide) (santana et al., 2020) and visualization (tsvizzevolution). the raide is an eclipse ide plugin to detect and refactor the ar and da test smells. the tsvizzevolution is a test smells visualization tool that aims to help the user understand problems in the test code by using three visualization techniques (graph view, treemap view, and timeline view); it represents the twenty-one types of test smells detected by the jnose test.

8 threats to validity

internal validity. in the manual analysis to construct the oracle, there may have been divergences among the researchers' analyses. we mitigated this threat by resolving disagreements collectively. after collecting data with the jnose test and tsdetect tools, we checked whether any test smells detected by the tools had not been considered in the manual analysis.

external validity. our study results may not generalize to other suites of test classes or other types of test smells. to mitigate this threat, we used the same dataset used in the study that validated the tsdetect tool (peruma et al., 2020).

conclusion validity. although the jnose test detects twenty-one types of test smells, this study only validated eighteen of them, because the dataset used did not contain the dpt, dt, and igt test smells. on the other hand, we used the same dataset used to evaluate the tsdetect (peruma et al., 2020).

construct validity. although we used four coders to build the oracle, they were experts with more than three years of experience with test smells. they were also aware of the code of the test smells detection tools.

9 conclusion

this paper presents the jnose test and its api, the jnose-core. the api supports the detection of twenty-one types of test smells and provides a flexible architecture to support the insertion of new test smells detection rules. the jnose test tool is a web application to detect test smells and calculate coverage for java projects.

to validate the detection rules implemented by the jnose-core, we conducted an empirical study to compare our tool's accuracy with the state-of-the-art tool and with manual analysis. we built an oracle to perform the comparison, containing sixty-five test classes analyzed by specialists in the subject. the comparison between the jnose test and the tsdetect was made at the class level. the results showed that the jnose test presented higher accuracy than the tsdetect, in terms of precision and recall. as we reused the detection rules from the tsdetect to implement the jnose test, the results indicate that we successfully improved them. additionally, the jnose test also detects test smells at a fine-grained level.
as the tsdetect does not support this feature, we could only compare the fine-grained level detection against the manual analysis. the results showed a high accuracy in determining the exact line location, but further improvements are still needed.

there are many opportunities for further investigation. for example, it would be interesting to validate our tool's efficiency in a real-world environment through a user study; such a study could also consider significant usability concerns. there is also open room for introducing new features in the jnose test, in terms of both detection and refactoring and, as necessary, in terms of how it behaves in practice considering quality attributes.

acknowledgements

this research was partially funded by ines 2.0; cnpq grants 465614/2014-0 and 408356/2018-9 and fapesb grants jcb0060/2016 and bol0188/2020.

references

bavota, g., qusef, a., oliveto, r., de lucia, a., and binkley, d. (2015). are test smells really harmful? an empirical study. empirical software engineering, 20(4):1052–1094.

bavota, g., qusef, a., oliveto, r., lucia, a., and binkley, d. (2012). an empirical analysis of the distribution of unit test smells and their impact on software maintenance. in 28th ieee international conference on software maintenance (icsm).

bell, j., legunsen, o., hilton, m., eloussi, l., yung, t., and marinov, d. (2018). deflaker: automatically detecting flaky tests. in ieee/acm 40th international conference on software engineering (icse), pages 433–444.

capgemini (2018). world quality report 2018-19. https://www.capgemini.com/service/world-quality-report-2018-19/. accessed: march 1st, 2021.

cisq (2021). the cost of poor software quality in the us: a 2020 report. https://www.it-cisq.org/pdf/cpsq-2020-report.pdf. accessed: march 1st, 2021.

deursen, a., moonen, l. m., bergh, a., and kok, g. (2001). refactoring test code. cwi (centre for mathematics and computer science), amsterdam, the netherlands.

garousi, v. and küçük, b. (2018). smells in software test code: a survey of knowledge in industry and academia. journal of systems and software, 138:52–81.

gopinath, r., jensen, c., and groce, a. (2014). code coverage for suite evaluation by developers. in proceedings of the 36th international conference on software engineering (icse), new york, ny, usa. acm.

grano, g., palomba, f., di nucci, d., de lucia, a., and gall, h. c. (2019). scented since the beginning: on the diffuseness of test smells in automatically generated test code. journal of systems and software, 156:312–327.

greiler, m., van deursen, a., and storey, m. (2013). automated detection of test fixture strategies and smells. in ieee sixth international conference on software testing, verification and validation, pages 322–331.

guerra calle, d., delplanque, j., and ducasse, s. (2019). exposing test analysis results with drtests. in international workshop on smalltalk technologies, pages 1–5, cologne, germany. hal.

hallgren, k. a. (2012). computing inter-rater reliability for observational data: an overview and tutorial. tutorials in quantitative methods for psychology, 8(1):23.

junior, n. s., rocha, l., martins, l. a., and machado, i. (2020). a survey on test practitioners' awareness of test smells. in proceedings of the xxiii iberoamerican conference on software engineering, cibse 2020, pages 462–475. curran associates.

koochakzadeh, n. and garousi, v. (2010). tecrevis: a tool for test coverage and test redundancy visualization. in bottaci, l. and fraser, g., editors, testing – practice and research techniques, pages 129–136, berlin, heidelberg. springer berlin heidelberg.
meszaros, g., smith, s. m., and andrea, j. (2003). the test automation manifesto. in maurer, f. and wells, d., editors, extreme programming and agile methods - xp/agile universe 2003, berlin, heidelberg. springer berlin heidelberg.

negar, k. and garousi, v. (2010). a tester-assisted methodology for test redundancy detection. advances in software engineering, 2010.

palomba, f., zaidman, a., and lucia, a. d. (2018). automatic test smell detection using information retrieval techniques. in ieee international conference on software maintenance and evolution (icsme), pages 311–322, madrid, spain. ieee.

pecorelli, f., di lillo, g., palomba, f., and de lucia, a. (2020). vitrum: a plug-in for the visualization of test-related metrics. in proceedings of the international conference on advanced visual interfaces, new york, ny, usa. acm.

peruma, a., almalki, k., newman, c. d., mkaouer, m. w., ouni, a., and palomba, f. (2019). on the distribution of test smells in open source android applications: an exploratory study. in proceedings of the 29th annual international conference on computer science and software engineering (cascon), riverton, nj, usa. ibm.

peruma, a., almalki, k., newman, c. d., mkaouer, m. w., ouni, a., and palomba, f. (2020). tsdetect: an open source test smells detection tool. acm, new york, ny, usa.

santana, r., martins, l., rocha, l., virginio, t., cruz, a., costa, h., and machado, i. (2020). raide: a tool for assertion roulette and duplicate assert identification and refactoring. in proceedings of the 34th brazilian symposium on software engineering (sbes). acm.

spadini, d., palomba, f., zaidman, a., bruntink, m., and bacchelli, a. (2018). on the relation of test smells to software code quality. in international conference on software maintenance and evolution (icsme), pages 1–12. ieee.

spadini, d., schvarcbacher, m., oprescu, a.-m., bruntink, m., and bacchelli, a. (2020). investigating severity thresholds for test smells. in proceedings of the 17th international conference on mining software repositories (msr). acm.

virginio, t., martins, l., soares, l. r., railana, s., costa, h., and machado, i. (2020). an empirical study of automatically-generated tests from the perspective of test smells. in proceedings of the xxxiv brazilian symposium on software engineering (sbes), new york, ny, usa. acm.

virginio, t., santana, r., martins, l. a., soares, l. r., costa, h., and machado, i. (2019). on the influence of test smells on test coverage. in proceedings of the xxxiii brazilian symposium on software engineering (sbes), pages 467–471, new york, ny, usa. acm.

virgínio, t., martins, l., santana, r., cruz, a., rocha, l., costa, h., and machado, i. (2021). on the test smells detection: an empirical study on the jnose test accuracy [dataset]. available at: https://doi.org/10.5281/zenodo.4570751.

yusifoğlu, v. g., amannejad, y., and can, a. b. (2015). software test-code engineering: a systematic mapping. information and software technology, 58:123–147.
journal of software engineering research and development, 2020, 8:2, doi: 10.5753/jserd.2019.459
this work is licensed under a creative commons attribution 4.0 international license.

requirements engineering base process for a quality model in cuba

yoandy lazo alvarado [ centro nacional de calidad de software (calisoft) | yoandy.lazo@calisoft.cu ]
leanet tamayo oro [ centro nacional de calidad de software (calisoft) | leanet.tamayo@calisoft.cu ]
odannis enamorado pérez [ centro nacional de calidad de software (calisoft) | odannis.enamorado@calisoft.cu ]
karine ramos [ alloy digital product development & marketing technology | karine.rb19@gmail.com ]

abstract

a high percentage of software projects worldwide fail or are canceled due to incorrect requirements engineering. incorporating good practices into the requirements engineering process provides the appropriate mechanism to understand and analyze what stakeholders want and need. this process also allows evaluating and negotiating a reasonable solution, and specifying, validating, and managing the requirements as they become a functional system. the objective of this research is to elaborate a requirements engineering process for the quality model for software development that contributes to raising the percentage of successful projects in cuban software development organizations, regarding the fulfillment of the agreed requirements. to reach the desired goal, a bibliographic review about the requirements engineering discipline was carried out, as well as interviews and surveys of the roles related to this activity in cuban software development organizations. the solution was evaluated by experts in a focus group and put into practice, as a pilot, in three organizations. as a result, a base requirements engineering process was obtained that contains specific requirements divided by the three maturity levels of the model, and a graphic and textual description of the process. the satisfaction of the end user was measured through the application of the iadov technique, obtaining a group satisfaction index equal to 1, meaning maximum user satisfaction with the process.

keywords: requirement, requirements engineering, software, process

1. introduction

a significant percentage of software development projects worldwide are canceled or fail, according to studies carried out. between the years 2011 and 2015, the canceled ones accounted for 39%, 46%, 40%, 47%, and 45%, respectively, and the unsuccessful ones represented 22%, 17%, 19%, 17%, and 19%, respectively (rosato, 2018; the standish group international, 2015). the behavior of projects in 2018 was similar to previous years, since canceled projects reached 36% and 20% were reported as unsuccessful (the standish group international, 2018). an investigation carried out by lehtinen et al. suggests that the causes of failures in projects occur in several processes, including management, sales and requirements, and implementation.
it also states that the failures are related to the project environment, people, methods, and tasks (lehtinen, mäntylä, vanhanen, itkonen, & lassenius, 2014; mcleod & macdonell, 2011). an analysis of the standish group's 2014 publication, on a study of more than 2,000 projects in 1,000 companies, showed that, although a project is considered successful regarding compliance with delivery deadlines, budget, and agreed requirements, the percentages of utilization of the functionalities that compose the systems are: 7% always, 13% often, 16% sometimes, 19% rarely, and 45% never (the standish group international, 2014). according to del toro, this happens because: 1) the client did not request the functionality, but it could appear due to a misinterpretation of a requirement, or because the developers considered that it could be useful or interesting; or 2) the client requested it, (i) but later realized that he did not describe it correctly and does not want it anymore; (ii) the client described it correctly, but when he saw it implemented he realized that he asked for something wrong; or (iii) the client described it correctly, but now wants something different (del toro, 2018).

the authors of this research agree with del toro on the importance of maintaining adequate feedback with stakeholders during the software development life cycle to reduce the effects of the volatility of the requirements. to guarantee that feedback, the requirements engineering (re) process is an indispensable bridge between the stakeholder needs and the design and development of the product. it provides an appropriate mechanism to understand and analyze the stakeholders' needs, evaluate the feasibility, negotiate a reasonable solution, specify the solution without ambiguities, validate the specification, and manage the requirements as they become a functional system (pressman, 2010).

a diagnosis performed in 2014 by the software quality national center (calisoft) on a sample of 43.75% of cuban software development organizations allowed characterizing these organizations through interviews and surveys applied to the roles involved in the re process (calisoft, 2014). among the evaluated aspects is the fulfillment of the activities that compose the re process proposed by pressman (pressman, 2010). an analysis of the obtained results allowed the authors of this research to know that 7.14% of the organizations do not identify the stakeholder requirements, 14.26% do not specify them, 21.43% do not validate them, 28.57% do not implement change control, and 57.14% do not maintain traceability with the requirements. another result describes the behavior of the projects completed in the period 2011-2014, identifying that 16.42% of the projects did not complete all the requirements agreed with the client; 11.94% were delivered late and did not complete all the agreed requirements; and 1.49% did not complete all the requirements and were over budget (calisoft, 2014; pérez & aveleira, 2016). another diagnosis, performed in 2017 on a sample of 28.13% of cuban software development organizations, allowed knowing that 48% of the completed projects did so successfully, 20% were canceled, and 32% failed. it also allowed identifying that the percentage of implementation of the re process reached 48% (calisoft, 2017).
for all the above, it can be stated that the process in question is not mature, and the need is recognized to establish activities that provide more feedback with the client. the software development organizations in cuba have used nc-iso 9001 and cmmi to reach maturity levels in their development processes (y. a. lazo, 2016). at the same time, calisoft researchers work on the development of the quality model for the development of computer applications (mcdai), to provide the industry with a model based on international best practices. mcdai takes into account national characteristics and is based on the following principles: being easy to understand, being easy to apply, and serving as a basis for evaluations in other internationally recognized models (pérez, 2014). that being said, the objective of this research is to develop a requirements engineering process for the mcdai that contributes to raising the percentage of successful projects in cuban software development organizations, regarding compliance with the agreed requirements.

2. theoretical framework

2.1. requirements engineering in software development

as part of the construction of the theoretical framework of the research, a bibliographic review was carried out (goguen, 1994; ieee, 2014; iso, iec, & ieee, 2017; oficina nacional de normalización, 2015b; sommerville, 2011; team, 2010). this allowed conceptualizing the term requirement as: a need or expectation established, generally implicit or mandatory, expressing a condition or capacity demanded by the stakeholders or the organization, which a process, product, or product component must comply with or possess, to solve a problem or achieve an objective, and to satisfy a contract, standard, specification, or other formally imposed document.

the broad spectrum of tasks and techniques that lead to understanding requirements is called re. from the software process perspective, re is one of the important software engineering actions, which begins during the communication activity and continues into the modeling activity. pressman argues that, as part of re, seven different tasks are performed: inception, elicitation, elaboration, negotiation, specification, validation, and management (pressman, 2010). sommerville, in turn, identifies that the main activities of re are the acquisition, analysis, and validation of requirements; he also explains the importance of requirements management to plan the re process activities and control requirements changes (sommerville, 2011). the guide to the software engineering body of knowledge (swebok) contains the software requirements knowledge area (ka), which is concerned with the elicitation, analysis, specification, and validation of software requirements, as well as the management of requirements during the whole life cycle of the software product (ieee, 2014).

2.1.1. requirements engineering according to nc-iso 9001

nc-iso 9001:2015, quality management systems - requirements, uses the process approach, which incorporates the plan-do-check-act cycle and risk-based thinking. the organizations that use it do so with a strategic vision to improve their overall performance. this standard can be used in any organization, including those that develop software.
in this standard, the re process is not explicitly delimited, but it states that "the organization must plan, implement and control the processes necessary to meet the requirements for the provision of products and services, and implement the determined actions"; the aforementioned gives the organization the possibility to implement an re process for software development. also, it raises several requirements related to the re process: "8.2.2 determining the requirements for products and services", "8.2.3 review of the requirements for products and services", and "8.2.4 changes to requirements for products and services" (oficina nacional de normalización, 2015a). organizations that develop software can use iso/iec/ieee 90003:2018 as a guide for the application of iso 9001; iso/iec/ieee 90003 explains in detail how to comply with the requirements mentioned above (iso, iec, & ieee, 2018).

2.1.2. re according to iso/iec/ieee 12207 and iso/iec/ieee 15288

iso/iec/ieee 15288:2015 and iso/iec/ieee 12207:2017 contain the life cycle processes for systems and software, respectively. both international standards contain 30 processes, including two related to re. during the review of these standards, it was possible to identify that the activities they propose for re are similar.

1. the purpose of the stakeholder needs and requirements definition process is to define the stakeholder requirements for a system that can provide the capabilities needed by users and other stakeholders in a defined environment (iso, iec, & ieee, 2015; iso et al., 2017). for a project to declare full compliance with this process, it shall implement the following activities: a) prepare for stakeholder needs and requirements definition; b) define stakeholder needs; c) develop the operational concept and other life cycle concepts; d) transform stakeholder needs into stakeholder requirements; e) analyze stakeholder requirements; and f) manage the stakeholder needs and requirements definition.

2. the purpose of the system/software requirements definition process is to transform the stakeholder, user-oriented view of desired capabilities into a technical view of a solution that meets the operational needs of the user (iso et al., 2015, 2017). for a project to declare full compliance with this process, it shall implement the following activities: a) prepare for system/software requirements definition; b) define system/software requirements; c) analyze system/software requirements; and d) manage system/software requirements.

2.1.3. re according to the capability maturity model integration (cmmi)

cmmi is a process improvement maturity model for the development of products and services. it includes best practices that deal with development and maintenance activities covering the product's life cycle, from conception to delivery and maintenance. it was created by the software engineering institute at carnegie mellon university and is the result of the integration of several models (cmmi institute, 2015). cmmi has 22 process areas. a process area is a set of related practices that, when implemented collectively, satisfy a set of objectives considered important to improve that process area (cmmi institute, 2015). the process areas are composed of specific goals (sg) and specific practices (sp) that guide, in a more detailed way, how to achieve the goals. two of the model's areas work on the topic of software requirements: requirements development (rd) and requirements management (reqm).
the purpose of the rd process area is to elicit, analyze, and establish customer, product, and product component requirements (team, 2010). this process area includes three sg:

sg 1 develop customer requirements: sp 1.1 elicit needs; sp 1.2 transform stakeholder needs into customer requirements.
sg 2 develop product requirements: sp 2.1 establish product and product component requirements; sp 2.2 allocate product component requirements; sp 2.3 identify interface requirements.
sg 3 analyze and validate requirements: sp 3.1 establish operational concepts and scenarios; sp 3.2 establish a definition of required functionality and quality attributes; sp 3.3 analyze requirements; sp 3.4 analyze requirements to achieve balance; sp 3.5 validate requirements.

the purpose of the reqm process area is to manage the requirements of the project's products and product components and to ensure alignment between those requirements and the project's plans and work products (team, 2010). this process area is composed of a single sg:

sg 1 manage requirements: sp 1.1 understand requirements; sp 1.2 obtain commitment to requirements; sp 1.3 manage requirements changes; sp 1.4 maintain bidirectional traceability of requirements; sp 1.5 ensure alignment between project work and requirements.

2.1.4. re according to brazilian software process improvement (mps.br)

mps.br was created by the association for the promotion of the excellence of brazilian software (softex). it has three components: the reference model (mr-mps), the evaluation method (ma-mps), and the business model (mn-mps), and it is composed of 19 process areas. it targets companies of different sizes and characteristics, with special attention to micro, small, and medium enterprises. the model has two areas that address re: requirements development (dre) and requirements management (gre) (montoni, rocha, & weber, 2009).

the purpose of the dre process area is to define customer, product, and product component requirements. the expected results of the dre process are (softex, 2009a):

dre 1 the client's needs, expectations, and restrictions, both of the product and of its interfaces, are identified.
dre 2 a defined set of customer requirements is specified based on the needs, expectations, and restrictions identified.
dre 3 a set of functional and non-functional requirements of the product and product components, describing the solution to the problem to be solved, is defined and maintained based on the client requirements.
dre 4 the functional and non-functional requirements of each product component are refined, elaborated, and allocated.
dre 5 the internal and external interfaces of the product and of each product component are defined.
dre 6 operating concepts and scenarios are developed.
dre 7 the requirements are analyzed, using defined criteria, to balance stakeholder needs with the existing restrictions.
dre 8 the requirements are validated.

the purpose of the gre process area is to manage the requirements of the project's products and product components and to identify inconsistencies between the requirements and the project's plans and work products. the expected results of the gre process are (softex, 2009b):

gre 1 the requirements are understood, evaluated, and accepted together with the requirements providers, using objective criteria.
gre 2 the technical team's commitment to the approved requirements is obtained.
gre 3 bidirectional traceability between requirements and work products is established and maintained.
gre 4 revisions in the plans and work products of the project are carried out to identify and correct inconsistencies with the requirements.
gre 5 requirements changes during the project are managed.

2.1.5. re according to moprosoft and competisoft

the software industry processes model (moprosoft) emerged as part of the software industry development program of the ministry of economy of mexico, to help small and medium-sized mexican software development companies reach international levels of process capacity (oktaba, 2015). this model was the basis for the preparation of iso/iec 29110 – lifecycle profiles for very small entities, and of the process improvement model to promote the competitiveness of the ibero-american small and medium software industry (competisoft) (competisoft, 2006). moprosoft and competisoft are divided into three categories: senior management, management, and operation. the operation category contains the software development and maintenance process, which makes it possible to carry out requirements engineering activities systematically, through a set of activities whose purpose is to obtain the documentation of the requirement specification and of the system, so that the customer and the project share a common understanding. some of these activities in the moprosoft model are (competisoft, 2006; oktaba, 2005):

a2.2. document or modify requirement specifications: identify and query information sources (customers, users, previous systems, documents, etc.) to obtain new requirements; analyze the identified requirements to limit their scope and feasibility, considering the restrictions of the customer's or project's business environment; prepare or modify the user interface prototype; generate or update the requirement specifications.
a2.3. verify the requirements specification.
a2.4. correct defects found in the requirement specification based on the verification report and obtain approval of the corrections.
a2.5. validate the requirements specification.
a2.6. correct defects found in the requirements specification based on the validation report and obtain approval of the corrections.
a3.2. document or modify the analysis and design; generate or modify the traceability record.

competisoft incorporates other activities that complement moprosoft; e.g., a2.2 includes the task of identifying and establishing information security requirements to obtain the required level of security. in general, both describe similar activities for requirements engineering.

2.1.6. good practices extracted from the models and standards

re has a fundamental role in software development projects because it is the process that allows communication with stakeholders to obtain the requirements of the product under development. some models and standards group re activities into requirements development and requirements management. after analyzing the bibliography studied, it can be affirmed that cmmi and mps.br treat re similarly; the same happens with moprosoft and competisoft, as well as with the iso/iec/ieee 12207 and 15288 standards. table 1 identifies the good practices of re. requirements elicitation, specification, analysis, and validation stand out in requirements development. in the case of requirements management, the most common practices are achieving understanding, controlling changes, and maintaining bidirectional traceability.

table 1. good practices in the re process (own preparation). for each practice, the corresponding references in pressman, sommerville, nc-iso 9001, iso/iec/ieee 12207 and 15288, cmmi-dev and mps.br, moprosoft and competisoft, and swebok are listed.

requirements development:
1. requirements elicitation: pressman 5.1 (inquiry), 5.3; sommerville c4s5; nc-iso 9001 4.2, 5.1.2; iso/iec/ieee 12207 and 15288 6.4.2.3 (b [1, 2, 3, 4]), 6.4.2.3 (d [1, 2, 3]); cmmi-dev and mps.br rd (sg 1 [sp 1.1, sp 1.2]), dre 1; moprosoft and competisoft 9.2 (ope.2 a2.2), ds(a2.2); swebok chapter (3).
2. requirements specification: pressman 5.1 (specification); sommerville c4s2; iso/iec/ieee 12207 and 15288 6.4.3.3 (b [3, 4, 5]); cmmi-dev and mps.br rd (sg 2 [sp 2.1, sp 2.2]), dre 2, dre 4; moprosoft and competisoft 9.2 (ope.2 a2.2, a3.2), ds(a2.2, a3.2); swebok chapter (5).
3. elaboration of product requirements: sommerville c4s5; iso/iec/ieee 12207 and 15288 6.4.3.3 (b [1, 2, 3]); cmmi-dev and mps.br rd (sg 2 [sp 2.1, sp 2.2]), dre 3; moprosoft and competisoft ds(a3.2); swebok chapter (4.2).
4. identify interface requirements: pressman 5.1 (elaboration), 5.4, 7; iso/iec/ieee 12207 and 15288 6.4.3.3 (b [5]); cmmi-dev and mps.br rd (sg 2 [sp 2.3]), dre 5; moprosoft and competisoft 9.2 (ope.2 a2.2, a3.2), ds(a2.2, a3.2).
5. establish operational concepts and associated scenarios: pressman 5.1 (elaboration), 5.5, 6; sommerville c4s5; iso/iec/ieee 12207 and 15288 6.4.2.3 (c [1, 2]); cmmi-dev and mps.br rd (sg 3 [sp 3.1]), dre 6; moprosoft and competisoft ds(a3.2); swebok chapter (4.2).
6. establish a definition of required functionality and quality attributes: sommerville c4s1; iso/iec/ieee 12207 and 15288 6.4.3.3 (b [5]); cmmi-dev and mps.br rd (sg 3 [sp 3.2]); moprosoft and competisoft ds(a2.2); swebok chapter (7.3).
7. analysis and negotiation: pressman 5.1 (negotiation), 5.6; sommerville c4s5, c12s5; nc-iso 9001 8.2.1, 8.2.3; iso/iec/ieee 12207 and 15288 6.4.2.3 (e [1, 2, 3, 4]), 6.4.3.3 (c [1, 2, 3, 4]); cmmi-dev and mps.br rd (sg 3 [sp 3.3, sp 3.4]), dre 7; moprosoft and competisoft 9.2 (ope.2 a3.2), ds(a2.2, a3.2); swebok chapter (4).
8. validation of requirements: pressman 5.1 (validation), 5.7; sommerville c4s6; cmmi-dev and mps.br rd (sg 3 [sp 3.5]), dre 8; moprosoft and competisoft 9.2 (ope.2 a2.5, a2.6), ds(a2.5, a2.6); swebok chapter (6).

requirements management:
9. identify requirements source: pressman 5.1 (management); sommerville c4s5; nc-iso 9001 8.1; iso/iec/ieee 12207 and 15288 6.4.2.3 (a [1, 2]), 6.4.3.3 (a [1, 2]); cmmi-dev and mps.br reqm (sg 1 [sp 1.1]); moprosoft and competisoft 9.2 (ope.2 a2.2), ds(a2.2); swebok chapter (3.1).
10. understand requirements: nc-iso 9001 8.2.3; cmmi-dev and mps.br reqm (sg 1 [sp 1.1]), gre 1.
11. obtain commitment to requirements: nc-iso 9001 8.2.1; iso/iec/ieee 12207 and 15288 6.4.2.3 (f [1]), 6.4.3.3 (d [1]); cmmi-dev and mps.br reqm (sg 1 [sp 1.2]), gre 2.
12. manage requirements changes: sommerville c4s7; nc-iso 9001 8.2.4; iso/iec/ieee 12207 and 15288 6.4.2.3 (f [3]), 6.4.3.3 (d [3]); cmmi-dev and mps.br reqm (sg 1 [sp 1.3]), gre 5; swebok chapter (7.2).
13. maintain traceability of requirements: iso/iec/ieee 12207 and 15288 6.4.2.3 (f [2]), 6.4.3.3 (d [2]); cmmi-dev and mps.br reqm (sg 1 [sp 1.4]), gre 3; moprosoft and competisoft ds(a2.2, a3.2, a4.3, a4.6); swebok chapter (7.4).
14. ensure alignment between project work and requirements: nc-iso 9001 8.2.4; cmmi-dev and mps.br reqm (sg 1 [sp 1.5]), gre 4; moprosoft and competisoft 9.2 (ope.2 a1.1), ds(a1.1, a2.2, a2.3).

the tendency to include the reuse management approach, understood as a process of creating software systems from existing software by applying domain engineering, was also identified in these reference models. this approach has provided many organizations with competitive advantages in the market in terms of product quality, development time, and production costs, among others (bastarrica, 2011; manso martínez & garcía peñalvo, 2013; northrop et al., 2007; salazar, 2017). requirements are also developed and managed during the application of domain engineering. a fundamental element of domain engineering is the application of the domain analysis technique, which allows capturing the critical information about the entities, data, and processes that characterize a particular business area and then developing and specifying the requirements (brun, 2007). the main result of the application of this technique is the domain model, which describes, at a high level of abstraction, the common elements and variants of the family for a correct management of the variability of the resulting products (montoni et al., 2009).
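to fix ideas, the following is a minimal sketch of such a domain model (hypothetical python, not drawn from any of the cited models; the class name, features, and "billing" domain are purely illustrative). it distinguishes the common (mandatory) elements of the family from its optional variants and checks whether a concrete product configuration stays within the domain boundaries:

```python
from dataclasses import dataclass, field

@dataclass
class DomainModel:
    # common elements shared by every product of the application family
    mandatory: set[str]
    # variant elements that a concrete product may or may not include
    optional: set[str] = field(default_factory=set)

    def valid_product(self, features: set[str]) -> bool:
        # a configuration is valid if it covers all common elements
        # and adds nothing outside the domain boundaries
        return self.mandatory <= features and features <= self.mandatory | self.optional

# illustrative domain: an invoicing family with two optional variants
billing_domain = DomainModel(
    mandatory={"invoice", "customer"},
    optional={"multi-currency", "e-signature"},
)
print(billing_domain.valid_product({"invoice", "customer", "e-signature"}))  # True
print(billing_domain.valid_product({"invoice"}))                             # False
```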
most of the studied models and standards are designed for large software development organizations, since they need long implementation periods and great assimilation effort. they also carry a high cost of certification and consulting, so it is difficult for cuban organizations, which have limited resources, to adopt them. for this reason, countries characterized by a majority presence of small and medium enterprises (sme), such as mexico and brazil, have adapted the internationally recognized models to their needs, like moprosoft in the case of mexico and mps.br for the brazilian development companies. however, these two projects are adapted to the context and characteristics of those countries. moreover, most of the available models do not detail a strategy that provides organizations with an agile process that guides improvement and facilitates the work of process engineers.

2.2. model for the development of computer applications

the capacity of processes to adapt to the market or to clients leads quality-oriented management models to focus their attention on processes as the most powerful lever to act on results in an effective way, sustained over time (concepción, 2010; zaratiegui, 1999). pérez states that the mcdai has a process approach and considers it an accepted proposal for the software development industry in cuba (pérez, 2014). the mcdai is composed of: 1) a general guide that describes the model and its components; 2) an implementation guide that contains the general requirements that must be met by the twelve base processes composing the model, as well as the definition of each base process; and 3) an evaluation guide that describes the process and the evaluation method to determine the organization's maturity level and the capacity of its processes related to the model (see figure 1) (pérez, 2014).

figure 1. mcdai components.

the implementation guide groups the processes into the following categories: 1) organizational management gathers the base processes that have a direct influence on the organization and are executed at a high level or under management's responsibility; 2) project management gathers the base processes related to the organization of project work; 3) engineering gathers the technical base processes necessary for software development; and 4) support gathers the base processes that support software development (see figure 2).

figure 2. mcdai categories.

each base process contains a purpose, specific requirements, and a suggested process model that meets the requirements. the specific requirements are defined at basic, intermediate, and advanced levels (see figure 3).

figure 3. base process structure.

the specific requirements are divided into three parts: title, description, and recommended evidence. the recommended evidence consists of examples of what the work products could be. the specific requirements of each base process and the mcdai's generic requirements are used as a reference standard by evaluators to determine the capacity of the organization's processes. the organization's maturity level (basic, intermediate, or advanced) is determined by taking into account the capacity of all its processes.
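as a minimal sketch of this relationship between process capacity and organizational maturity (hypothetical python, not part of the mcdai; it assumes that maturity is bounded by the least capable base process, whereas the exact aggregation rule is defined in the evaluation guide):

```python
from dataclasses import dataclass

# capacity/maturity levels used by the mcdai, ordered from lowest to highest
LEVELS = ["none", "basic", "intermediate", "advanced"]

@dataclass
class BaseProcess:
    name: str
    capacity: str  # one of LEVELS, as rated by the evaluators

def organization_maturity(processes: list[BaseProcess]) -> str:
    # assumption: the organization's maturity is the lowest capacity
    # reached across all of its base processes
    return min((p.capacity for p in processes), key=LEVELS.index)

procs = [
    BaseProcess("requirements engineering", "intermediate"),
    BaseProcess("configuration management", "advanced"),
    BaseProcess("quality assurance", "basic"),
]
print(organization_maturity(procs))  # -> "basic"
```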
organizations that decide to adopt the mcdai shall implement the requirements corresponding to the desired maturity level and/or capacity. a suggested process model, with a graphic and a textual representation, is also shown as part of each base process; this model exemplifies how to implement the generic and specific requirements.

2.2.1. mcdai's generic requirements

table 2 shows the mcdai's generic requirements (gr) necessary to reach the desired capacity. every base process, including the re base process presented in this research, has to implement these requirements.

table 2. mcdai's generic requirements.
basic level
gr 1 define the process to follow.
gr 2 define roles and responsibilities.
gr 3 plan process execution.
gr 4 provide resources.
gr 5 monitor process execution.
gr 6 identify and preserve the configuration items.
gr 7 evaluate the execution of the established process.
gr 8 analyze the process status with the management.
intermediate level
gr 9 institutionalize the process.
gr 10 manage indicators.
gr 11 train staff.
gr 12 manage the knowledge generated by the process.
gr 13 identify and treat risks.
advanced level
gr 14 perform process improvement.

2.3. process representation

to model the re base process, it is necessary to analyze graphic and textual representation techniques: flow diagrams, notation lanes, idef, etvx, business process model and notation (bpmn), and textual description (losavio, guzmán, & matteo, 2011; manene, 2013; medina, 2012; murcia-oeste–arrixaca, 2013; silega, 2014; suárez, 2013). this analysis allowed the authors of this investigation to determine that the combination of bpmn and textual description is the most suitable variant because: it provides a graphic notation that describes the logic of the steps of a business process; it coordinates the sequence of processes and the messages that flow between the participants of the different activities; it allows processes to be modeled in a unified and standardized way, which facilitates understanding for everybody in the organization; and it explains the activities and covers the information about the needs of the process, when it begins, the people involved, the duration, how the activities are carried out, when it ends, and the different scenarios that may arise (y. a. lazo, 2016).

3. requirements engineering base process

this research proposes the re base process. it is part of the mcdai and is therefore aligned to its structure.

3.1. purpose and specific requirements

the purpose of the re base process is to identify the stakeholder requirements for a software product, so that it can provide the capabilities they need in a defined environment, and to transform the stakeholders' view into a technical vision that meets the operational needs of the users. to fulfill this purpose, and based on the good practices of re identified as part of the construction of the theoretical framework, specific requirements divided among the three mcdai maturity levels (basic, intermediate, and advanced) were proposed (see table 3). the requirements were divided into maturity levels to facilitate model adoption through stepwise process improvement with small changes.
table 3. specific requirements of the re base process. the statements are distributed by level; the numbers in parentheses relate each requirement to the good practices identified in table 1.
basic level
re 1 define the relevant stakeholder requirements. (1 and 9)
re 2 analyze and specify the requirements. (2, 6 and 7)
re 2.2 prioritize requirements. (7)
re 3 achieve understanding and commitment to technical requirements. (10 and 11)
re 4 validate technical requirements. (8)
cm 4 control changes. (12)
intermediate level
re 5 model the technical requirements. (3 and 5)
advanced level
re 2.1 approve technical requirements.
re 5.1 model requirements based on reuse.
re 6 establish bidirectional traceability. (13)
qa 6 perform inconsistency reviews. (14)

re 1 define the relevant stakeholder requirements. the appropriate sources and suppliers shall be identified to obtain the relevant stakeholder requirements. the requirements shall be defined based on the needs and expectations of the suppliers and on an analysis of the identified sources. recommended evidence: providers list and requirements list.

re 2 analyze and specify the requirements. the stakeholder requirements shall be analyzed taking into account whether they are necessary and sufficient to meet the objectives of the product; from this analysis, new derived and/or implicit requirements can be defined. the functional and non-functional requirements shall be formally specified, with sufficient technical detail. the viability of the technical requirements shall be reviewed. recommended evidence: requirements specification.

re 2.1 approve technical requirements. a benchmarking shall be carried out in the corresponding application domain to identify functionalities of similar products. the identified functionalities shall be matched against the technical requirements, and additional requirements that the product could contain to increase customer satisfaction shall be defined. recommended evidence: requirements specification.

re 2.2 prioritize requirements. priority shall be given to the requirements to be implemented, according to stakeholder needs, market conditions, and/or business objectives. recommended evidence: prioritization of requirements.

re 3 achieve understanding and commitment to technical requirements. an understanding of the requirements shall be achieved between the suppliers and the project team. conflicts arising between the requirements shall be resolved. the project team's commitment to implementing the current and approved requirements shall be obtained, as well as to making the necessary changes to plans, activities, and related work if the requirements evolve. recommended evidence: tasks in the management tool (assigned and accepted), meeting notes.

re 4 validate technical requirements. the technical requirements shall be validated to ensure that the resulting product meets the stakeholder needs and expectations and works as intended in the end user's environment. recommended evidence: requirements specification.

re 5 model the technical requirements. the technical requirements shall be modeled to obtain a better understanding of the product to be developed, and the requirements shall be grouped according to defined criteria. note: the requirements could be modeled following different paradigms, such as structured analysis or object-oriented analysis, among others. in the first case, models are created to represent the flow and content of the information (data and control), the product is divided into functional and behavioral partitions, and the essence of what is to be built is described.
for example: data flow diagrams (dfds); state transition diagrams; data dictionary. in the second case, the objective is to model the concepts (objects) of the product domain, their relationships, and their behaviors. that model is continuously refined until it has sufficient detail for its implementation in the form of executable code. for example: use case models and operation scenarios; class models; sequence and activity diagrams; state diagrams. recommended evidence: requirements realization document.

re 5.1 model requirements based on reuse. a domain model (or models) shall be defined and maintained that describes the boundaries of each domain with reuse potential and specifies its characteristics, capabilities, and common and variant elements, optional or mandatory. the domain model(s) shall be incorporated into a repository of reusable assets once they are formally evaluated and approved. recommended evidence: domain model.

re 6 establish bidirectional traceability. bidirectional traceability shall be established between the project's objectives, the stakeholder requirements, the technical requirements, the derived work products, and the tasks that will fulfill them. traceability shall be updated throughout the project as appropriate. recommended evidence: traceability tool with the built-in elements.

to obtain the desired capacity level of the re base process, in addition to fulfilling the specific requirements described above, the following shall be met: for the basic level, the specific requirement cm 4 (control changes), from the configuration management base process, to manage requested changes to the requirements; for the advanced level, the specific requirement qa 6 (perform inconsistency reviews), from the quality assurance base process, to ensure alignment between project work and requirements.

the specific requirements described above were constructed in three stages. first, the authors prepared a proposal taking into account their experience and the good practices identified in table 1. second, the proposal was presented to 22 researchers who were working on the definition of the mcdai, in order to divide the specific requirements among the three maturity levels of the model (basic, intermediate, and advanced) and to identify the relationship with the rest of the mcdai's base processes. the third stage was executed after updating the proposal with the feedback obtained: seven experts were identified, with an average of 7 years of experience in the re discipline, all of them computer science engineers with the scientific category of master. the specific requirements, and the proposal of the level at which each might be grouped, were presented to the experts to obtain their assessment. the experts' feedback allowed updating the specific requirements and the levels that group them. finally, the last version of the re base process, shown in the next section, was obtained.

3.2. process and activities

as part of the solution, the graphic description (see figure 4) and the textual description of the re base process are proposed as an example of how to put the specific requirements into practice.

figure 4. graphic description of the re process.

below is the textual description of the re process.

1. characterize and select the requirement sources.
the analyst and the client, taking into account the stakeholders identified in ppmc (project planning, monitoring, and control) (suárez et al., 2016), obtain the requirement sources and characterize them. for the advanced level, when domain engineering is applied, development is directed to an application family; therefore, the requirement sources vary from specific clients to market and business studies. when application engineering is applied, the sources are given by the domain assets and the specific client. the analyst, project manager, and client select the requirement sources, taking into account their characterization and, if applicable, the provider(s) that represent the client's interests and take responsibility for providing the requirements. the “requirement sources list” is obtained as a result of the execution of this activity.

2. obtain the stakeholder requirements. the analyst uses the “requirement sources list”, the “offer”, and/or the “technical project” prepared when the project was conceived to analyze the requirements that would be needed to comply with the project's goals. he also identifies the providers' needs and expectations, characterizes the organization's operating environments, and prepares a comprehensive list of them. this list is continuously updated, by monitoring any changes that may occur going forward and based on the suppliers' suggestions. in case the result is not satisfactory, the analyst re-identifies or improves the requirements with the help of other techniques, such as prototypes, focus groups, business use cases, and business process models, among others. for the advanced level, when domain engineering is applied, the requirements are obtained through market and business studies and the analysis of past projects; in these cases the project usually does not have a specific client, since a generic product is being developed, and for this reason the analysis to resolve conflicts between requirements is made with functional experts. when application engineering is applied, the requirements are obtained by analyzing the existing domain assets with the specific client, taking into account the common and variant elements, optional or mandatory, and, if necessary, adopting them or designing new requirements. the “requirement and restrictions list” is obtained as a result of the execution of this activity.

3. match requirements. from the advanced level, the analyst, taking into account the “requirement and restrictions list”, performs a benchmarking in the corresponding application domain to identify similar products. he also matches the stakeholders' requirements with similar product functionalities, to identify additional requirements and to verify that the identified functional needs correspond to this type of product. the “benchmarking” and the “requirement and restrictions list”, with new requirements where applicable, are obtained as a result of the execution of this activity.

4. achieve an understanding of the stakeholder requirements. the analyst, taking into account the “requirement and restrictions list”, identifies the conflicts between the requirements and, using functional experts, makes proposals on how to eliminate them. the analyst and the stakeholders meet to reach a consensus on the resolution of the identified conflicts, taking into account the proposals made. the project team and the requirements providers achieve a common understanding of the “requirement and restrictions list”.
the updated “requirement and restrictions list” is obtained as a result of the execution of this activity.

5. prioritize the requirements. the analyst, taking into account the “requirement and restrictions list”, identifies the appropriate method for prioritizing the stakeholder requirements (e.g., hierarchical analysis, cumulative voting, numerical assignment, value-based prioritization, cost and risks, among others) and prioritizes the requirements using it. the “requirement and restrictions list” with the prioritized requirements is obtained as a result of the execution of this activity.

6. analyze and specify the stakeholder requirements. the analyst, taking into account the prioritized “requirement and restrictions list”, groups the functional and non-functional requirements that correspond to the iteration, and analyzes whether requirements from previous projects can be reused. he identifies whether the requirements are necessary and sufficient to develop a product that satisfies the stakeholders and, if required, identifies new derived requirements (functional requirements). he refines the functional requirements in terms of their description and functionality details. he analyzes the “requirement and restrictions list” taking into account the software product quality model defined in nc-iso/iec 25010, to identify implicit requirements (non-functional requirements), and refines the non-functional requirements by assigning allowable values to the quality attributes that the product should have. he reviews the viability of the functional and non-functional requirements to determine whether they are complete, feasible, and verifiable. from the intermediate level, he also specifies the internal and external interface requirements of the system. the “requirements specification” is obtained as a result of the execution of this activity; hereafter these requirements will be treated as technical requirements.

7. achieve an understanding of the technical requirements. the analyst and the stakeholders, taking into account the “requirements specification”, meet to arrive at a common understanding of the described technical requirements. the updated “requirements specification” is obtained as a result of the execution of this activity.

8. validate technical requirements. the project manager, analyst, and client, taking into account the “requirements specification”, validate the technical requirements using the prototype technique, where candidate system interfaces and input and output elements are shown to the end user. in case of any indication or observation by the clients, the analyst updates the “requirements specification”. the signed “requirements specification” is obtained as a result of the execution of this activity.

9. model the technical requirements. from the intermediate level, the analyst defines the conceptual model, establishing the relationships of the entities of the system or subsystem and their fundamental attributes, future persistent classes, and candidates for the data model. he also models the requirements by making a technical description (use case model, operation scenarios, class model, and user stories, where applicable). the project manager and the architect distribute the requirements among the modules or subsystems of the project.
from the advanced level, the domain model corresponding to the application family is taken into account to define the analysis model of the product to be developed for a specific client. the “analysis model” is obtained as a result of the execution of this activity.

10. model technical requirements based on reuse. from the advanced level, the analyst defines the domain model, which describes its boundaries with other domains and specifies the characteristics, capacities, and common and variant elements, optional or mandatory. he defines the conceptual model, establishing the relationships of the entities that are part of the domain and their fundamental attributes, future persistent classes, and candidates for the standard data model of the application family. he also models the requirements by making a technical description of the application family (use case model, operation scenarios, class model, and user stories, where applicable). the project manager and the architect distribute the requirements among the project modules or subsystems. the “domain model” is obtained as a result of the execution of this activity.

11. qa-perform evaluation. the evaluation team verifies that the “domain model” is technically correct, guided by the sub-process qa-perform evaluation (y. a. lazo, 2016). the “evaluation file” is obtained as a result of the execution of this activity.

12. create/update the traceability system. at the advanced level, the analyst, following the traceability guide, inserts into the selected tool, as they are developed, the project objectives, the stakeholder requirements, the technical requirements, the work products, and the tasks that will fulfill the agreed requirements. he establishes the corresponding bidirectional relationships between the elements inserted in the tool, and updates the tool if changes to the requirements or work products arise. the traceability tool, with its established relationships, is obtained as a result of the execution of this activity (a minimal sketch of such a traceability structure is shown after this process description).

13. cm-control the changes. the change control committee analyzes the change requests on the requirements, as established by the sub-process cm-control the changes (garcía, 2017). the accepted “change request” is obtained as a result of the execution of this activity.

14. make changes to requirements. the analyst, taking into account the accepted “change request”, makes the corresponding changes to the requirements and the related work products at the basic and intermediate levels; at the advanced level, the changes are made using the traceability tool. the “requirements specification” and the related artifacts (updated) are obtained as a result of the execution of this activity.

15. qa-perform evaluation. at the advanced level, the evaluation team executes the inconsistency review between the requirements and the associated work products, taking into account the tool and the traceability guide, as established by the qa-perform evaluation sub-process (y. a. lazo, 2016). the “evaluation file” is obtained as a result of the execution of this activity.
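as announced in activity 12, the following is a minimal sketch of a bidirectional traceability structure (hypothetical python; the mcdai prescribes a traceability guide and a tool, but not an implementation, so all names here are illustrative):

```python
from collections import defaultdict

class Traceability:
    """bidirectional links between project objectives, stakeholder
    requirements, technical requirements, work products, and tasks."""

    def __init__(self):
        self.forward = defaultdict(set)   # e.g., objective -> requirements
        self.backward = defaultdict(set)  # e.g., requirement -> objectives

    def link(self, source: str, target: str) -> None:
        # every link is stored in both directions (bidirectional traceability)
        self.forward[source].add(target)
        self.backward[target].add(source)

    def traces_from(self, item: str) -> set:
        return self.forward[item]

    def traces_to(self, item: str) -> set:
        return self.backward[item]

trace = Traceability()
trace.link("objective-1", "stakeholder-req-3")
trace.link("stakeholder-req-3", "technical-req-7")
trace.link("technical-req-7", "task-12")
# impact analysis when technical-req-7 changes: what fulfills it, what motivates it
print(trace.traces_from("technical-req-7"))  # {'task-12'}
print(trace.traces_to("technical-req-7"))    # {'stakeholder-req-3'}
```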
3.3. re base process relationship with the mcdai

the re base process has a close relationship with the other base processes that compose the mcdai. this relationship provides input elements for other base processes (see figure 5); for example, the requirements specification and the domain model are input elements of the tsd base process and are taken into account for product design and implementation. in this relationship, it can also be appreciated that the results of other base processes are used in the re base process; for example, the change requests on requirements are accepted or rejected by the cm base process, among others.

figure 5. relationship of the re base process with other mcdai processes.

this relationship also ensures compliance with the model's generic requirements. as shown in figure 6, through the opm base process the re process is defined, its associated roles and responsibilities are defined, the resources necessary to execute the process are provided, and the re process is institutionalized throughout the organization. the ppmc base process plans and monitors the re process execution, as well as managing, internally to the project, the training of the personnel who execute the process. the cm base process identifies and preserves the configuration elements that are generated in the re process. the qa base process evaluates that the defined re process is being executed in the organization and keeps management informed of the status of that process. the mi base process defines the indicators necessary to measure the re base process and makes improvements to it. the km base process manages the project team's training on the re process that could not be satisfied within the project, as well as the knowledge generated by it. finally, the rm base process identifies and treats the risks associated with the re base process.

figure 6. re base process compliance with the generic requirements.

3.4. measuring the re base process

to measure the influence of the re base process on the success of software development projects, the requirement compliance index (rci) indicator is proposed. it aims to evaluate compliance with the requirements agreed with the clients. a requirement is understood to be unfulfilled when it has not been developed, or when the results obtained after its implementation are not those agreed upon. for this, the following base measures are identified: arq (agreed requirements quantity) and riq (requirements implemented quantity). the measurement function rci = riq / arq is used to calculate the rci, and projects are considered successful if rci > 0.95. this indicator was selected by 7 experts with an average of 7 years of experience working in the discipline of requirements engineering.
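as an illustration only (hypothetical python, not part of the mcdai definition), the rci computation and its success criterion can be expressed as two short functions, here fed with the figures later reported for project p2 in table 4:

```python
def rci(riq: int, arq: int) -> float:
    # requirement compliance index: implemented requirements over agreed ones
    return riq / arq

def successful(riq: int, arq: int, threshold: float = 0.95) -> bool:
    # a project is considered successful when its rci exceeds the threshold
    return rci(riq, arq) > threshold

print(rci(108, 120))         # 0.90
print(successful(108, 120))  # False
```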
4. validation

4.1. analysis of the proposal by a focus group

the authors of the present investigation consider that focus groups constitute a valuable and widely used technique to obtain information. for this reason, they decided to use one to find out whether the proposed solution uses the correct terminology and is technically viable. to form the group, the criteria issued by aigneren and méndez were taken into account (aigneren, 2009; méndez, 2007): the size of the group should range between 4 and 12 participants; all participants must have the possibility of expressing their criteria; and the group must be homogeneous, in order to ensure the diversity of ideas. to comply with the above, 12 specialists were summoned, all with more than 5 years of experience in the roles of analyst and architect. those selected represented the organizations calisoft, desoft, xetid, etecsa, transoft, eicma, aicros, and segurmática (lazo, tamayo, enamorado, pérez, & sánchez osorio, 2018). the final result was a re process enriched with the experiences of each participant, and the unanimous criterion that it is an accepted proposal that meets the needs of the software development organizations in cuba.

4.2. analysis of the implementation of the process in pilot projects

a pre-experiment was applied in pilot projects to evaluate whether, when the re base process is introduced in software development projects, project success is greater than 48% with respect to the requirement compliance index dimension of the variable. sampieri suggests that a pre-experiment can be done through a case study with a single measurement, or through a pre-test/post-test design with a single group (hernández, fernández, & baptista, 1991). when analyzing the two options, it was found that in the first one there is no manipulation of the independent variables and no previous reference to the situation before applying the stimulus; in the second one, there is an initial reference point showing the level the group had in the dependent variables before the stimulus, which allows a follow-up. taking the foregoing into account, the researchers selected the second variant, knowing that pre-experimental designs are not suitable for establishing relationships between independent and dependent variables, but considering it important because it can yield results that, when compared with those of other methods, help to reach conclusions.

to implement the pre-experiment, six projects from 3 different organizations (datys, aicros, and transoft) were selected in the period from january 2018 to july 2018. the projects were developing web applications, with teams of six people with mastery of the technologies used and more than 5 years of experience in the business; the average number of requirements, functional and non-functional, was 110. at the end of the pre-experiment, five of the six pilot projects were successful according to the rci indicator, which represents 83.33%. according to table 4, project 2 was the only one that did not reach rci > 0.95.

table 4. rci of the pilot projects.
project riq arq rci
p1 120 120 1.00
p2 108 120 0.90
p3 99 100 0.99
p4 110 110 1.00
p5 110 110 1.00
p6 97 100 0.97

a comparative analysis between the diagnosis made in 2017 and the result obtained shows that in the first case only 48% of the projects were completed successfully, while in the second case the indicator improved to 83.33%. however, an exhaustive analysis of the project that did not comply with the indicator showed that, in the review of adherence to the re process, it reached only 50% implementation of the activities, an aspect that could have influenced the results obtained. among the re process activities not executed in some project iterations were the analysis, negotiation, and validation of the requirements, due to the distance between the client and the project team. the absence of these activities meant that there was no understanding between the parties about the requirements at early stages, and that the client was dissatisfied with seven of the agreed requirements because they did not work as expected, while another five had problems related to usability. these results allowed the authors of the research to appreciate an improvement in the success of the projects, taking into account the rci dimension of the variable, after introducing the proposed process.

4.3. satisfaction of end users

the v.a. iadov technique was created by n.v. kuzmina in 1970 for the study of satisfaction with pedagogical careers.
subsequently, it has been used in several investigations to evaluate satisfaction in different contexts. iadov consists of five questions: three closed and two open. in this research, the technique is used to assess user satisfaction with the re process in the pilot projects. for this, a survey was applied to six analysts and six architects. the criteria measured in the survey are based on the relationships established between the three closed questions, related through the iadov logical table (see table 5).

table 5. iadov logical table for the re base process (modified by the authors of this research). the columns correspond to the answers to question 1, "do you consider the requirements engineering base process complex and difficult to understand?" (no / i don't know / yes), each subdivided by the answers to question 2, "if you were to carry out another project, would you use the proposed requirements engineering process?" (yes / i don't know / no). the rows correspond to the answers to question 3, "is the requirements engineering base process used to your liking?".
clearly pleased: 1 2 6 | 2 2 6 | 6 6 6
more pleased than unpleased: 2 2 3 | 2 3 3 | 6 3 6
not defined: 3 3 3 | 3 3 3 | 3 3 3
more unpleased than pleased: 6 3 6 | 3 4 4 | 3 4 4
clearly unpleased: 6 6 6 | 6 4 4 | 6 4 5
contradictory: 2 3 6 | 3 3 3 | 6 3 4

the number resulting from the interrelation of the three questions indicates the position of each respondent on the satisfaction scale. the following satisfaction scale was used, with a value assigned to each position to determine the group satisfaction index:
1. clearly pleased: +1
2. more pleased than unpleased: +0.5
3. not defined: 0
4. more unpleased than pleased: -0.5
5. clearly unpleased: -1
6. contradictory: 0

the group satisfaction index (gsi) is calculated with the following formula, where a, b, c, d, and e are the numbers of respondents at scale positions 1 to 5, contradictory answers score 0, and n is the total number of respondents:

gsi = (a(+1) + b(+0.5) + c(0) + d(-0.5) + e(-1)) / n = (12(+1) + 0(+0.5) + 0(0) + 0(-0.5) + 0(-1)) / 12 = 1

the group index yields values between +1 and -1 and is classified as follows: satisfaction, values between 0.5 and 1; contradiction, values between -0.49 and 0.49; dissatisfaction, values between -1 and -0.5. the result gsi = 1 means maximum satisfaction with the proposed re base process. this result was corroborated by the answers to open questions 4 and 5, where the respondents expressed that they would not change anything in the base process because it fits their needs.
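purely as an illustration (hypothetical python; the iadov technique itself does not define any implementation), the gsi computation for this survey can be written as:

```python
# value assigned to each satisfaction-scale position (contradictory counts as 0)
SCALE_VALUES = {1: 1.0, 2: 0.5, 3: 0.0, 4: -0.5, 5: -1.0, 6: 0.0}

def gsi(positions: list[int]) -> float:
    # group satisfaction index: mean of the individual scale values,
    # equivalent to (a(+1) + b(+0.5) + c(0) + d(-0.5) + e(-1)) / n
    return sum(SCALE_VALUES[p] for p in positions) / len(positions)

# the twelve respondents in this survey all landed on position 1 (clearly pleased)
print(gsi([1] * 12))  # 1.0
```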
5. conclusion

the good practices for re were grouped into requirements development and requirements management. the main requirements development practices are identifying stakeholder needs and specifying, analyzing, and negotiating requirements. the main requirements management practices are controlling changes and maintaining traceability.

the graphic and textual description of the re base process is a guide for adopting the mcdai's requirements, which are divided among the three maturity levels to facilitate their adoption. incorporating feedback activities with clients into the re process is a factor that influences the success of the project, because it allows identifying the necessary changes to the requirements at the appropriate time, so that the product responds to the client's needs and expectations. the validation of the proposal made it possible to verify user satisfaction with the proposed process and that its execution can contribute to project success. it is recommended to measure the impact of the process on requirements volatility, to contribute to the fulfillment of project planning.

references

aigneren, m. (2009). la técnica de recolección de información mediante grupos focales. la sociología en sus escenarios.
bastarrica, c. (2011). productividad en la industria tic. bits.
brun, r. e. (2007). técnicas de análisis de dominio: organización del conocimiento para la construcción de sistemas software. paper presented at la interdisciplinariedad y la transdisciplinariedad en la organización del conocimiento científico: actas del viii congreso isko-españa, león, 18, 19 y 20 de abril de 2007.
calisoft, c. n. d. c. d. s. (2014). cs-03-d (14-001) libro de diagnóstico.
calisoft, c. n. d. c. d. s. (2017). cs-03-d (17-001) libro de diagnóstico.
cmmi institute. (2015). retrieved 02/11/2015, from https://sas.cmmiinstitute.com/pars/pars_detail.aspx?a=25323
competisoft, p. (2006). competisoft-mejora de procesos para fomentar la competitividad de la pequeña y mediana industria del software de iberoamérica. versión 0.2. diciembre.
del toro, a. a. (2018). una mirada desde el desarrollo ágil a los requisitos de software. experiencias en datys villa clara. paper presented at taller 2 ingeniería de requisitos.
garcía, y. g. (2017). proceso base gestión de la configuración para un modelo de calidad en cuba. universidad de las ciencias informáticas.
goguen, j. a. (1994). requirements engineering as the reconciliation of social and technical issues. san diego: academic press professional.
hernández, s. r., fernández, c. c., & baptista, l. p. (1991). metodología de la investigación.
ieee. (2014). swebok. guide to the software engineering body of knowledge (version 3).
the standish group international. (2018). chaos report.
iso, iec, & ieee. (2015). iso/iec/ieee 15288, systems and software engineering — system life cycle processes.
iso, iec, & ieee. (2017). iso/iec/ieee 12207, systems and software engineering — software life cycle processes.
iso, iec, & ieee. (2018). iso/iec/ieee 90003, software engineering — guidelines for the application of iso 9001:2015 to computer software.
lazo, a. y., tamayo, o. l., enamorado, p. o., pérez, m. d., & sánchez osorio, y. (2018). apuntes sobre el modelo de la calidad para el desarrollo de aplicaciones informáticas (mcdai). paper presented at xvii convención y feria internacional informática 2018, la habana. http://www.informaticahabana.cu/es/node/3703
lazo, y. a. (2016). proceso base de aseguramiento de la calidad para el desarrollo de software en cuba. universidad de las ciencias informáticas.
lehtinen, t. o., mäntylä, m. v., vanhanen, j., itkonen, j., & lassenius, c. (2014). perceived causes of software project failures: an analysis of their relationships. information and software technology, 56(6), 623-643.
losavio, f., guzmán, j. c., & matteo, a. (2011). correspondencia semántica entre los lenguajes bpmn y grl. enl@ce, 8(1).
manene, l. m. (2013). los diagramas de flujo: su definición, objetivo, ventajas, elaboración, fases, reglas y ejemplos de aplicaciones.
manso martínez, m., & garcía peñalvo, f. j. (2013). medición en la reutilización orientada a objetos.
mcleod, l., & macdonell, s. g. (2011). factors that affect software systems development project outcomes: a survey of research. acm computing surveys (csur), 43(4), 24.
medina, y. t. (2012). modelado de procesos con idef en la metodología rup. serie científica-universidad de las ciencias informáticas, 5(2).
méndez, a. l. d. (2007). la entrevista y los grupos focales.
montoni, m. a., rocha, a. r., & weber, k. c. (2009). mps.br: a successful program for software process improvement in brazil. software process: improvement and practice, 14(5), 289-300.
murcia-oeste–arrixaca, á. i. (2013). manual para el diseño de procesos.
northrop, l., clements, p., bachmann, f., bergey, j., chastek, g., cohen, s., . . . little, r. (2007). a framework for software product line practice, version 5.0. sei. http://www.sei.cmu.edu/productlines/index.html
oficina nacional de normalización. (2015a). nc-iso 9001 sistema de gestión de la calidad — requisitos.
oficina nacional de normalización. (2015b). nc-iso 9000 sistema de gestión de la calidad — fundamentos y vocabulario.
oktaba, h. (2005). modelo de procesos para la industria de software (moprosoft), versión 1.3, agosto de 2005: nmx-059/01-nyce-2005.
oktaba, h. (2015). historia de una norma. moprosoft y sus primeros pasos. retrieved 2015, from http://sg.com.mx/content/view/390
pérez, d. m. (2014). guía general para un modelo cubano de desarrollo de aplicaciones informáticas. universidad de las ciencias informáticas. retrieved from https://repositorio.uci.cu/jspui/handle/ident/8725
pérez, d. m., & aveleira, d. q. (2016). evolución del modelo de la calidad para el desarrollo de aplicaciones informáticas. paper presented at xvi convención y feria internacional informática 2016, la habana. http://www.informaticahabana.cu/es/node/664
pressman, r. s. (2010). ingeniería de software. un enfoque práctico (séptima edición). méxico.
rosato, m. (2018). go small for project success. pm world journal, vii(v).
salazar, l. l. (2017). desarrollo del proceso solución técnica para los proyectos de desarrollo de la universidad de las ciencias informáticas. universidad de las ciencias informáticas (uci).
silega, m. n. (2014). método para la transformación automatizada de modelos de procesos de negocio a modelos de componentes para sistemas de gestión empresarial. universidad de las ciencias informáticas (uci).
softex. (2009a). mps.br mejora de proceso del software brasileño (guía de implementación – parte 4: fundamentos para implementación del nivel d del mr-mps).
softex. (2009b). mps.br mejora de proceso del software brasileño (guía de implementación – parte 1: fundamentos para implementación del nivel g del mr-mps).
sommerville, i. (2011). ingeniería de software (novena edición). méxico.
suárez, b. a. (2013). marco de procesos para las entidades de servicios de tecnología de la información de la universidad de las ciencias informáticas. universidad de las ciencias informáticas (uci).
suárez, b. a., sánchez, o. y., muñoz, r. m., ruenes, c. s. b., gómez, b. c., gutierrez, f. l. m., & calunga, á. a. (2016). modelo de calidad para el desarrollo de aplicaciones informáticas: categoría de gestión de proyecto. paper presented at xvi convención y feria internacional informática 2016, la habana.
team, c. p. (2010). cmmi® for development, version 1.3: improving processes for developing better products and services. no. cmu/sei-2010-tr-033. software engineering institute.
the standish group international. (2014). the standish group report.
the standish group international. (2015). chaos report 2015.

journal of software engineering research and development, 2021, 9:7, doi: 10.5753/jserd.2021.1049 this work is licensed under a creative commons attribution 4.0 international license.
representation of software design using templates: impact on software quality and effort

silvana moreno [ universidad de la república, uruguay | smoreno@fing.edu.uy ]
vanessa casella [ universidad de la república, uruguay | vcasella@fing.edu.uy ]
martín solari [ universidad ort uruguay | martin.solari@ort.edu.uy ]
diego vallespir [ universidad de la república, uruguay | dvallesp@fing.edu.uy ]

abstract

as a practice, software design seeks to contribute to developing quality software. during this software development stage, the requirements are translated into a representation of the software (also known as the design), whose quality can be evaluated and improved. for undergraduate students, design is difficult to understand and to carry out. in fact, building a good design seems to require a certain level of cognitive development that few students achieve. the aim of this study is to know the effort dedicated to detailed software design and the effect on software quality when graduating students use templates to represent their design. we conducted a controlled experiment where students developed eight projects following a defined process and recorded data from its execution in a software tool. we found that the use of design templates did not improve the quality of the code, measured as the defect density in the unit test phase. also, the use of templates did not reduce the number of code smells in the analyzed code. regarding effort, students who used templates dedicated greater development effort to designing than to coding, while students who did not use templates dedicated four times less effort to designing than to coding.

keywords: detailed design, software quality, graduating students

1 introduction

software design is one of the most important components to ensure the success of a software system (hu, 2013). between the requirements analysis phase and the software building phase, software design has two main activities: architectural design and detailed design. during architectural design, high-level components are structured and identified. during detailed design, every component is specified in detail (bourque and fairley, 2014). this work is focused specifically on detailed design.

design is a difficult discipline for undergraduate students to understand, and success (i.e., building a good design) seems to require a certain level of cognitive development that few students achieve (carrington and kim, 2003; hu, 2013; linder et al., 2006). students' ability to build a good design is related to their abstraction, understanding, reasoning, and data-processing abilities (kramer, 2007; leung and bolloju, 2005; siau and tan, 2005).

building quality software is increasingly relevant. we highly depend on software in our daily lives, and its quality has a great impact. a quality software design allows us to build quality software that has fewer defects and is more maintainable. industry practitioners are aware of the importance of software design quality, and they use clean code practices, reviews, and tools, among others, to contribute in this regard (brown et al., 1998; fowler, 2018; stevenson and wood, 2018).

knowing how undergraduate students design is of interest to several authors (chen et al., 2005; eckerdal et al., 2006a,b; loftus et al., 2011; tenenberg, 2005). most of their studies found that students do not manage to produce a good software design.
some of the problems detected are a lack of consistency between design artifacts and code, incomplete designs, and a lack of understanding of what kind of information to include when designing software (eckerdal et al., 2006a,b; loftus et al., 2011).

in this work, we study the software design practice of graduating students. we conducted an experiment within the context of some courses over three consecutive years to know the effort dedicated to software design and the effect that representing the design using specific templates has on software quality. we use the term graduating for our students because they are in the fourth year of the degree at the school of engineering of universidad de la república, in uruguay. the curriculum of the school of engineering is a five-year degree, similar to the ieee/acm proposal for the computer science undergraduate curriculum (joint task force on computing curricula acm and ieee computer society, 2013). students have already passed courses where detailed software design is taught: design principles, artifacts and design diagrams, uml, design patterns, etc.

this work is an extension of the article published at the ibero-american conference on software engineering (cibse) 2020: "the representation of detailed design using templates and their effects on software quality". our article was selected for publication in a special issue of the journal of software engineering research and development (jserd). below, we detail the extension of our work with respect to the cibse article.

the work presented at cibse 2020 aims to know the effect on software quality when graduating students use templates to represent the detailed design. in that work we presented an empirical study where students develop 8 projects following a defined process and record data from the execution in a tool. we found that the use of design templates did not improve the quality of the code, measured as the defect density in the unit test phase. neither did the use of templates manage to reduce the number of code smells present in the analyzed code. the extension carried out in this work consists, on the one hand, of expanding and deepening aspects that, for space reasons, are not in the cibse article. on the other hand, we add a new research question and its analysis, which makes it possible to know the effort that the use of design templates implies.

specifically, a new section explaining the experimental design in depth was added. the analysis of external quality was expanded and deepened: descriptive statistics were added and analyzed, and tables were added with the data of the average defect density in ut for the students. in addition, a statistical analysis was added within the between-group analysis that checks the homogeneity of the groups studied (trd, notrd). threats to validity were expanded, grouping them by type (construct, internal, external, conclusion), and the discussion and conclusions sections were expanded. a research question was added that seeks to know the effort that students dedicate to design, and how that effort varies with the use of templates.
to answer this question, the relationship between the effort dedicated to the design phase and the effort dedicated to the coding phase was studied. descriptive and statistical analyses are presented as part of the analysis of results. the results obtained are discussed and related to those previously obtained in the discussion section.
the document is structured as follows: section 2 presents related works; section 3 presents the research methodology; section 4 presents the results, and section 5 the discussion; threats to validity are mentioned in section 6, and section 7 presents the conclusions and future work.
2 related work
software design is an important activity to ensure the quality of a software system (hu, 2013; taylor, 2011). it involves identifying and abstractly describing the software system and its relationships. good designs help develop robust, maintainable software with few defects (pierce et al., 1991; sommerville, 2016). detailed software design is a creative activity that can be done in different ways: implicitly, in the developer's mind before coding; on a sketch on paper; through diagrams; using both formal and informal languages or tools (chemuturi, 2018).
software quality is the degree to which a software product meets stakeholders' needs, both explicit and implicit. quality models represent quality in terms of a set of elements of the model and their relationships (nistala et al., 2019). these models define internal and external software quality attributes. the internal ones are those that do not depend on the software execution (static), while the external ones are those that apply to the execution.
in recent years, the use of clean code practices and tools has contributed to improved design quality (stevenson and wood, 2018). code smells, anti-patterns and design flaws can be used to measure the quality of a software design (martin, 2002; gibbon, 1997; brown et al., 1998; fowler, 2018). sonarqube (campbell and papapetrou, 2013) and findbugs (ayewah et al., 2008) are some of the tools used to measure the quality of the code by detecting bad smells.
current industry practices require practitioners with the necessary skills to understand and build good software designs. however, students have difficulties designing. building good designs requires a certain level of cognitive development that few students achieve (carrington and kim, 2003; hu, 2013; linder et al., 2006). this cognitive development is related to the ability to recognize design patterns, architectural design styles, and related data and actions that can be extracted into appropriate design abstractions (hu, 2013). in fact, for students, learning to design is more difficult than learning to code. this difficulty occurs because, for most programming languages, students get compiler feedback and run-time errors; however, this does not happen with design (karasneh et al., 2015).
object-oriented design (ood) is one of the most widely used design approaches in the industry and one of the subjects normally taught in universities (flores and medinilla, 2017). by using oo modeling diagrams and languages, static and dynamic models of software systems can be created. several empirical studies analyze the understanding and benefits of using uml diagrams (budgen et al., 2011; fernández-sáez et al., 2013; arisholm et al., 2006; gravino et al., 2015; torchiano et al., 2017). in some studies, students failed to obtain design benefits using uml diagrams (gravino et al., 2015; torchiano et al., 2017).
gravino et al. found that students who use uml diagrams to design do not achieve significant improvements in source code comprehension tasks compared to students who do not use them. also, students who use diagrams spend twice as much time on the same source code comprehension task as students who do not use them. when analyzing the experience factor, they find that the most experienced students achieve an improvement in the understanding of the source code (gravino et al., 2015; soh et al., 2012).
for industry professionals, the use of uml continues to be resisted to a certain degree (stevenson and wood, 2018). a survey conducted with 50 software professionals indicates that although the quality of the software is an important aspect, the use of uml is selective (informal, only for a while, then it is discarded) and of low frequency (petre, 2013).
the use of the model-driven development (mdd) methodology to design software has shown improvements in software quality. panach et al. conducted an experiment and found that students using mdd achieve better-quality products (measured through test cases) than students using a traditional software development method (panach et al., 2021).
undergraduate students' design skills are reported by previous studies examining artifacts produced by them to learn how they design software (chen et al., 2005; eckerdal et al., 2006a,b; loftus et al., 2011; tenenberg, 2005). these studies use the same requirements specification for which students must produce a design. the studies use different approaches: designs produced individually, designs made in groups, and designs produced at different levels of training. in general, all the works mentioned agree on the fact that graduating students are not capable of designing a software system. lack of consistency between design artifacts and code, incomplete designs, and lack of understanding of what kind of information to include when designing software are some of the major difficulties reported (eckerdal et al., 2006a,b; loftus et al., 2011).
we believe, just as loftus et al. (loftus et al., 2011), that students do not precisely know what to do when they have to design software. besides, several authors analyzed the artifacts produced, and they agree on the fact that students do not know how to design (chen et al., 2005; eckerdal et al., 2006a,b; loftus et al., 2011; tenenberg, 2005). this motivated the work presented in this paper, in which we provide students with design templates as a support tool for design representation. unlike gravino and torchiano, who analyzed the benefits of using diagrams in code comprehension (gravino et al., 2015; torchiano et al., 2017), our approach tries to analyze the effort dedicated to designing and coding, and the impact of the use of templates on software quality. we studied quality from two perspectives: defects in the code and code smells. we also analyzed effort as the time in minutes that students dedicate to the design and code phases.
the focus of our research is ood at the class level, including source code organization, the identification of and relationships between classes, and the interaction of users with the system. as kitchenham pointed out (kitchenham and pfleeger, 1996), this corresponds to the "product view", an examination of the inside of a software product. we used an approach focused on objects because a large part of current software is developed using that technology (group, 2015).
3 research methodology
we studied the effect of design on software quality when graduating students represent their design using a specific set of templates, and the effort they dedicate to the design activity. we conducted three experiments within the context of three consecutive undergraduate courses, from 2015 to 2017.
3.1 course context
the course principles and foundations of personal software process (pfpsp) has the same format every year and lasts 9 weeks. in the first week (week 1), a base process is taught, and the dynamics of the practical work to be done throughout the remaining eight weeks are explained. students participate in the course on a voluntary basis.
the base process is a defined and disciplined process that intends to support the software development tasks and to collect product and process metrics. the process has different phases, scripts that guide the work in each phase, and logs that are used to collect data (see figure 1). the base process is divided into the following phases: plan, design, code, compile, unit test (ut), and postmortem.
figure 1. base process
to follow the process, students are provided with a set of scripts. a script is a one-page guide that establishes the inputs, outputs and activities to be carried out in each phase. scripts help students guide the development activities, but without demanding how they must be carried out. in each phase of the process, students must log the time dedicated to the phase, as well as data on the defects they remove (injection phase, removal phase, type of defect, and time spent to correct it). in the postmortem phase, students log the size in lines of code (loc) of the program built.
the practical work consists of each student developing 8 small projects following the base process and recording the process data in a tool. students carry out the projects individually and consecutively: project 2 does not begin until project 1 has been completed, and so on with the remaining projects. from week 2 to week 9, one project is assigned per week. at the beginning of each week, a teacher sends the student the requirements of the project. each student's submission must contain the code that solves the problem, the test cases executed, and an export of the data registered in the tool. once the student submits the solution, the teacher reviews the work and sends corrections back to the student if necessary. students carry out the projects at home and have an assigned teacher, who is responsible for assigning the projects, correcting them and answering questions.
before starting project 1, each student must choose the programming language to use throughout the course. our interest is to collect data on the execution of the development process with a programming language familiar to the student. projects are small in size and of low and similar difficulty, so the design phase refers to detailed design (i.e., identifying classes, attributes, operations, program scenarios, state diagrams, and pseudocode).
the nature of project 2 is different from the other projects. in project 2, students have to build size-measuring software, while in the remaining projects they must produce mathematical solutions (standard deviation, simpson's rule, correlation parameters).
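for a sense of the scale of these assignments, the sketch below (our own illustration, not course material) shows one such mathematical solution, composite simpson's rule, in a few lines of python:

```python
# composite simpson's rule; n must be even. the integrand f is assumed
# to be well behaved on [a, b].
def simpson(f, a, b, n=100):
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))  # odd nodes
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))  # even interior nodes
    return s * h / 3

print(simpson(lambda x: x ** 2, 0.0, 1.0))  # ~0.3333
```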
previous studies show that process measures and product measures in project 2 present greater difficulty than in the rest of the projects (i.e., project 2 is an outlier), and it is usually discarded in statistical analyses (grazioli et al., 2014b; moreno and vallespir, 2018). therefore, we excluded the data of this project from the analyses presented in this article. however, it is relevant to mention that project 2 is an integral part of our course: it is used by students from projects 3 to 8 to count the lines of code they produce in each project.
percentiles 5 and 95 of the data collected for all the students throughout the 8 projects are 26 loc and 242 loc, respectively.
each replication of the experiment corresponds to a different run of the course. students who participated in one course do not participate again in a later course. the teachers were the same throughout the three courses (2015-2017).
3.2 goals and research questions
the aims of the experiment are to know the effect on software quality when students represent their designs using templates, and to study the effort they dedicate to the design activity.
templates are documents with a predefined structure in which students have to represent their designs. the templates we used allow describing the detailed design of a project. we used four templates; a brief description of each of them is presented below:
• operational template: specifies the interaction between the program and the users. the content may look similar to a use-case description.
• functional template: the behavior of the program's invocations and returns is specified in this template. variables, functions, classes and methods are described. figure 2 presents an example of the use of this template for project 6.
• logical template: in this template, the pseudocode of each method that appears in the functional template is registered.
• state template: it can be used to define the transitions and conditions of the program's internal states. the content is similar to state machine diagrams.
the selected templates come from the personal software process (psp) framework (humphrey, 1995). the psp considers a design to be complete when it defines all four dimensions (internal-static, internal-dynamic, external-static, external-dynamic). the way to cover each of the four dimensions is by using the four templates (operational, functional, logical, state). completing the four templates allows describing the designs entirely and precisely (humphrey, 1995). several studies have shown an improvement in developer performance with the introduction of templates (hayes and over, 1997; prechelt and unger, 2001; gopichand et al., 2010).
in the experiment context, we proposed the following research questions and the corresponding research hypotheses:
rq1: is there an improvement in the quality of the products when students represent the design using templates?
rq2: what is the relation between the effort dedicated to designing and the effort dedicated to coding? are there any variations in effort when students use templates?
to answer rq1, we analyzed the external and internal quality of the software developed in each project.
to study the external quality, we considered the following research hypotheses:
h1.0: representing software design using design templates does not change the software defect density in unit testing.
h1.1: representing software design using design templates changes the software defect density in unit testing.
to study the internal quality, we descriptively analyzed certain code smells introduced by students when producing software (fowler, 2018). we are interested in knowing whether the use of templates to represent software design prevents students from incurring in some types of code smells.
to answer rq2, we studied the time spent on the design and code phases. we analyzed the following research hypotheses:
h2.0: the time spent on designing equals the time spent on coding.
h2.1: the time spent on designing does not equal the time spent on coding.
3.3 experimental design
our design is a repeated measures design with one factor (the base process) and two levels: with templates to represent the software design, and without templates to represent the software design. the response variables considered in this experiment are internal and external software quality, and the effort dedicated by the students to the design and code phases.
our experimental design implies that students develop 8 projects. the base process introduces practices in the first 2 projects that allow guiding the work and measuring the process; therefore, by the first or second project (depending on the subject), students are already following the process adequately.
people show high variability among themselves when applying software development techniques or processes (humphrey, 2005). when high variability among people exists in an experiment with human subjects, a within-subjects design is preferable to a between-subjects experiment (senn, 2002). moreover, in repeated measures experiments, subjects serve as their own control (jones and kenward, 2014). this reinforces the choice of our design, in which each student carries out several projects.
the effect of students' learning throughout these 8 exercises could be a problem in our experimental design. however, this was previously studied from different approaches, and the results indicate that the repetition of programming did not contribute to performance improvements (grazioli et al., 2014b; grazioli and nichols, 2012; grazioli et al., 2014a).
as we already mentioned, to evaluate the external quality we considered the defect density in the unit test phase of the base process; that is, the number of defects detected in that phase is counted and divided by the loc of the project. to evaluate the internal quality, we analyzed the code smells in which students incur. knowing the number of code smells present in the product's source code gives us an idea of future maintenance costs (fowler, 2018). the effort in design and code is measured as the time in minutes that the student dedicates to the phase in question.
the experimental design is presented in figure 3.
figure 3. experimental design
all students apply the base process in projects 1 to 4, in which submitting the design representation to the teachers is not required. when students finished project 4, they were divided randomly into two groups: the control group and the experimental group (a toy sketch of such an assignment is shown below). the control group, called "without templates to represent the design" (notrd), continues to apply the base process throughout projects 5 to 8. the experimental group, called "with templates to represent the design" (trd), starts to apply the templates in projects 5 to 8.
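purely as an illustration of this random assignment (the paper does not detail the assignment procedure; the roster below is hypothetical), a split of this kind can be sketched as:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible
students = [f"s{i:02d}" for i in range(1, 21)]  # hypothetical roster of 20
random.shuffle(students)
half = len(students) // 2
trd, notrd = students[:half], students[half:]  # odd rosters leave the groups unbalanced
```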
figure 2. functional template
the trd group attends a theoretical class where the four design templates are presented and explained (and examples are shown). the submission of the design representation was mandatory for this group (except for the state template, which was optional). when a student submitted a project, the assigned teacher checked the completeness of the templates and their consistency with the code. in this way, the risk of students designing one solution and then coding another is reduced. however, whether the design is complete and verifiable was not controlled.
our experimental design allows us to study the behavior of the groups before and after the use of the templates. on the one hand, we propose to analyze the trd (representing design using templates) and notrd (representing design without templates) groups during projects 1 to 4 to confirm that they are homogeneous groups; that is, that the quality of the software developed is similar in both groups in projects 1-4 (when students do not use templates in either group). on the other hand, we are interested in knowing whether students who use templates develop better-quality software. we propose studying the trd and notrd groups during projects 5 to 8 to know whether representing the design using templates has some effect on software quality.
3.4 operation
the experiment was replicated in the course for three years: 2015, 2016, and 2017. the number of students that took part in the experiment was 25, 17, and 19, respectively. out of the 61 students participating in the experiment, 29 are part of the trd group and 32 of the notrd group. this imbalance between the groups is due to the imbalance generated when students were assigned to the trd and notrd groups in each of the three replications.
4 analysis and results
to answer rq1, "is there any improvement in the quality of the products when students represent the design using templates?", we analyzed the quality from the internal and external points of view.
4.1 external quality
we measured the external quality as the defect density in ut, that is, the number of defects in ut per kloc (a small computational sketch of this measure is given below). to analyze the external quality, we defined the following research hypotheses:
h1.0: representing software design using design templates does not change the software defect density in ut.
h1.1: representing software design using design templates changes the software defect density in ut.
we analyzed the external quality in two ways: intra-group and between groups. between groups refers to knowing whether there is a significant difference in quality between the trd group and the notrd group. intra-group refers to studying the quality of the software in the trd group before and after the use of templates.
between groups
the analysis between groups consists, on the one hand, of analyzing the trd and notrd groups during projects 1, 3 and 4, and on the other hand, of analyzing the trd and notrd groups during projects 5 to 8. due to the difficulty of project 2 compared with the rest of the projects, we decided not to include this project's data in the analysis.
during projects 1, 3 and 4, both groups apply the base process, so comparing the software quality of both groups during those projects allows confirming that they are homogeneous groups, thus establishing the experimental frame.
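for concreteness, this response variable can be read as the following sketch (the defect and loc figures are hypothetical; in the experiment they come from the students' logs):

```python
def defect_density_ut(defects_ut, loc):
    """average defect density in ut, in defects per kloc, aggregated over
    a student's projects: 1000 * total ut defects / total loc."""
    return 1000 * sum(defects_ut) / sum(loc)

# hypothetical student, projects 1, 3 and 4
print(defect_density_ut(defects_ut=[2, 1, 3], loc=[120, 95, 150]))  # ~16.4
```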
for this analysis, we defined the following hypotheses of investigation:
h1.0: median (def. density in ut of notrd) = median (def. density in ut of trd)
h1.1: median (def. density in ut of notrd) ≠ median (def. density in ut of trd)
each sample corresponds to the average defect density in ut of a student considering projects 1, 3 and 4, computed as

$1000 \times \frac{\sum_{n=1}^{4} \#defects\_ut_n}{\sum_{n=1}^{4} \#loc_n}$ (1)

where n varies over 1, 3 and 4.
during the analysis, we detected that the data from a student of the trd group was not accurate, that is, the process followed had not been accurately recorded. data from that student was therefore eliminated from the analysis, leaving 28 students in the trd group.
the descriptive statistics of the trd and notrd groups considering projects 1, 3 and 4 are presented in table 1. the values of the mean and interquartile range indicate that there seems to be no great variability between the groups. to confirm this, we applied the mann-whitney test for independent samples, since the samples correspond to different students.

table 1. mean and interquartile range in projects 1, 3 and 4
        mean    interquartile range
trd     30.22   25.54
notrd   32.88   28.9

the result indicates a p-value = 0.3467, with which we cannot reject the null hypothesis (significance = 0.05). this result does not allow us to affirm that there is a difference in quality between the trd and notrd groups; we can assert that both groups have a similar, or homogeneous, behavior. this gives us more confidence to study the software quality between the trd and notrd groups after the use of templates, eliminating the possibility that the result is due to the behavior of the groups rather than to using or not using templates.
studying the trd and notrd groups during projects 5 to 8 aims to know whether representing the design using templates has some effect on software quality. for the analysis between groups during projects 5 to 8, we defined the following hypotheses of investigation:
h1.0: median (def. density in ut of notrd) = median (def. density in ut of trd)
h1.1: median (def. density in ut of notrd) ≠ median (def. density in ut of trd)
table 2 presents the average defect density in ut for the 28 students of the trd group and the 32 students of the notrd group in projects 5 to 8.

table 2. average defect density in ut for the students of the trd group and notrd group in projects 5 to 8
group  student  defect density    group   student  defect density
trd    1        8.83              notrd   1        27.98
trd    2        23.16             notrd   2        24.86
trd    3        33.78             notrd   3        23.59
trd    4        40.76             notrd   4        14.35
trd    5        83.33             notrd   5        21.37
trd    6        16.10             notrd   6        12.19
trd    7        5.74              notrd   7        22.79
trd    8        13.02             notrd   8        43.33
trd    9        28.07             notrd   9        27.02
trd    10       12.5              notrd   10       36.46
trd    11       9.49              notrd   11       38.98
trd    12       19.70             notrd   12       16.80
trd    13       11.70             notrd   13       37.65
trd    14       36.85             notrd   14       18.93
trd    15       20.53             notrd   15       18.25
trd    16       22.93             notrd   16       22.98
trd    17       11.80             notrd   17       47.12
trd    18       37.45             notrd   18       30.21
trd    19       26.05             notrd   19       35.03
trd    20       5.03              notrd   20       27.84
trd    21       23.35             notrd   21       12.22
trd    22       17.36             notrd   22       24.57
trd    23       10.08             notrd   23       15.65
trd    24       42.75             notrd   24       41.17
trd    25       33.43             notrd   25       44.89
trd    26       28.63             notrd   26       20.35
trd    27       44.02             notrd   27       38.80
trd    28       23.88             notrd   28       51.54
                                  notrd   29       7.85
                                  notrd   30       27.89
                                  notrd   31       24.24
                                  notrd   32       25.49

the values of the mean and of the interquartile range shown in table 3 indicate low variability between the groups. that is to say, the use of templates by the trd group does not produce a significant difference in defect density compared to the notrd group not using templates.

table 3. mean and interquartile range in projects 5 to 8
        mean    interquartile range
trd     24.65   21.2
notrd   27.57   16.9

to study the behavior of both groups, we used hypothesis tests. the samples are independent because they correspond to different students; thus, the mann-whitney test is applied. results indicate a p-value = 0.165; therefore, the null hypothesis cannot be rejected. thus, we cannot affirm that students who use the templates manage to develop software with a lower ut defect density than students who do not use templates.
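as an illustration of these between-group comparisons, the sketch below runs a mann-whitney test with python/scipy on a small subset of the table 2 values; the subset and the two-sided alternative are choices of this sketch, not the authors' actual analysis script:

```python
from scipy.stats import mannwhitneyu

# defect densities (defects/kloc) for a few students of each group,
# taken from table 2; independent samples (different students)
trd_density   = [8.83, 23.16, 33.78, 40.76, 83.33]
notrd_density = [27.98, 24.86, 23.59, 14.35, 21.37]

stat, p_value = mannwhitneyu(trd_density, notrd_density, alternative="two-sided")
if p_value < 0.05:
    print("reject h1.0: the groups differ in defect density")
else:
    print("cannot reject h1.0: no evidence of a difference")
```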
intra-group
as already mentioned, the intra-group analysis refers to knowing whether students of the trd group improve software quality after the use of templates to prepare the design. to know this, the defect density in ut of the trd group is analyzed in projects 1 to 4 (without project 2) and in projects 5 to 8. studying the behavior of the same group allows knowing whether there is a change in software quality after the use of templates. we define the following research hypotheses:
h1.0: median (def. density in ut of trd134) = median (def. density in ut of trd58)
h1.1: median (def. density in ut of trd134) ≠ median (def. density in ut of trd58)
where trd134 denotes the students of the trd group during projects 1, 3 and 4, and trd58 the same students during projects 5 to 8.
table 4 presents the defect density in ut for the students of the trd group in projects 1, 3 and 4, and for the same students in projects 5 to 8.

table 4. defect density in ut for the students of the trd group in projects 1, 3 and 4, and in projects 5 to 8
group  student  defect density 1, 3 and 4  defect density 5 to 8
trd    1        2.22                       8.83
trd    2        7.22                       23.16
trd    3        35.33                      33.78
trd    4        14.24                      40.76
trd    5        95.74                      83.33
trd    6        17.85                      16.10
trd    7        10.14                      5.74
trd    8        21.18                      13.02
trd    9        15.54                      28.07
trd    10       39.80                      12.5
trd    11       13.79                      9.49
trd    12       18.31                      19.70
trd    13       10.23                      11.70
trd    14       60.60                      36.85
trd    15       32.60                      20.53
trd    16       25.83                      22.93
trd    17       51.09                      11.80
trd    18       48.78                      37.45
trd    19       39.63                      26.05
trd    20       15.56                      5.03
trd    21       30.70                      23.35
trd    22       25.77                      17.36
trd    23       9.72                       10.08
trd    24       32.71                      42.75
trd    25       10.05                      33.43
trd    26       42.70                      28.63
trd    27       16.87                      44.02
trd    28       102.04                     23.88

the descriptive statistics presented in table 5 indicate some variability in defect density. even though the means are similar, it seems that using templates (from project 5 on) to represent the design yields products with fewer defects.

table 5. mean and interquartile range of the trd group per set of projects
projects    mean    interquartile range
1, 3 and 4  30.22   25.5
5 to 8      24.65   21.2

to statistically study the data, we applied the wilcoxon signed-rank test for paired samples (because, for this analysis, the data come from the same students). results indicate a value of v = 138 and p-value = 0.1438. since the p-value is higher than 0.05 (significance level), it is not possible to reject the null hypothesis. this indicates that we cannot affirm that students improve the quality of their software by using design templates.
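the paired test of this intra-group analysis can be illustrated as follows (a subset of the table 4 values; the actual analysis used all 28 students):

```python
from scipy.stats import wilcoxon

# defect density of the same trd students before and after templates
before = [2.22, 7.22, 35.33, 14.24, 95.74]   # projects 1, 3 and 4
after  = [8.83, 23.16, 33.78, 40.76, 83.33]  # projects 5 to 8

stat, p_value = wilcoxon(before, after)  # paired: same students
print(f"v = {stat}, p-value = {p_value:.4f}")
```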
4.2 internal quality
to evaluate the internal quality, we carried out an analysis of the code smells introduced by students when developing the course projects. the aim of this analysis is to investigate whether the use of design templates prevents students from incurring in certain code smells. the analysis presented here is preliminary and exploratory, seeking initial results that allow us to generate new research hypotheses.
the code smell types depend on the programming language. as students can choose the language in which they develop their projects, this analysis has to take into account the different languages used. with the aim of doing an initial analysis that added value to our research, the students who developed their projects in java, c#, c, c++ and ruby were selected, excluding those developed in php and python. we excluded php and python because they do not have many code smells in common with the other languages; had we added them, the number of code smells to analyze would have been reduced too much. this left a total of 45 students for the analysis: 19 from 2015, 14 from 2016, and 12 from 2017. of those 45 students, 21 belong to the trd group (9 in 2015, 6 in 2016 and 6 in 2017) and 24 to the notrd group (10 in 2015, 8 in 2016 and 6 in 2017).
to detect the code smells, the tool sonarqube¹ was used, since it is a free-software tool supporting a variety of programming languages, constantly updated by the community, and with extensive documentation, among other features.
¹ http://www.sonarqube.org
we selected 16 code smell types for the analysis. they are common to the programming languages we chose and are detectable by sonarqube. the code smell types are:
1) "if ... else if" statements must end with an "else" clause;
2) "switch"/"case" statements must not be nested;
3) "switch"/"case" statements must not have too many "case"/"when" clauses;
4) the cognitive complexity of functions or methods must not be too high;
5) collapsible "if" statements must be merged;
6) the "if", "for", "while", "switch" and "try" control-flow statements must not be nested too deeply;
7) expressions must not be too complex;
8) files must not have too many lines of code;
9) functions or methods must not have too many lines of code;
10) functions or methods must not have too many parameters;
11) lines of code must not be too long;
12) functions or methods must not be empty;
13) statements must be on separate lines;
14) two branches of one conditional structure must not have the exact same implementation;
15) unused parameters of a function or method must be eliminated;
16) unused local variables must be eliminated.
a more detailed description of each one is not provided for article-length reasons.
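purely as an illustration of this kind of measurement, the sketch below pulls a code-smell count from a sonarqube server through its web api; the endpoint, parameters, server url and project key here are assumptions to be checked against the server's own documentation, not the authors' actual setup:

```python
import requests

SERVER = "http://localhost:9000"  # assumed local sonarqube instance

def count_code_smells(project_key):
    # issues/search is sonarqube's generic issue-query endpoint;
    # filtering by type keeps only code-smell issues
    resp = requests.get(
        f"{SERVER}/api/issues/search",
        params={"componentKeys": project_key, "types": "CODE_SMELL"},
    )
    resp.raise_for_status()
    return resp.json()["total"]

print(count_code_smells("student01-project5"))  # hypothetical project key
```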
table 6 shows the percentage of students who incurred in at least one code smell, segmented by project (from 1 to 8) and by group (notrd and trd). code smells 3, 8 and 12 are not present in any of the projects analyzed.

table 6. percentage of students who incur in at least one code smell, by code smell type and student group
code smell  group   p1   p2   p3   p4   p5   p6   p7   p8
1           notrd   4%   29%  0%   4%   13%  13%  4%   13%
            trd     19%  19%  10%  0%   5%   5%   5%   5%
2           notrd   0%   0%   0%   0%   0%   0%   0%   0%
            trd     0%   0%   0%   0%   0%   0%   0%   5%
4           notrd   8%   58%  0%   13%  30%  46%  29%  50%
            trd     24%  43%  5%   10%  10%  43%  24%  95%
5           notrd   4%   21%  0%   0%   0%   0%   0%   0%
            trd     0%   24%  10%  0%   0%   5%   0%   5%
6           notrd   13%  63%  8%   29%  30%  38%  13%  42%
            trd     38%  67%  29%  29%  33%  52%  57%  62%
7           notrd   0%   25%  0%   0%   0%   4%   8%   0%
            trd     0%   19%  0%   0%   0%   5%   0%   5%
9           notrd   0%   4%   8%   17%  10%  21%  21%  67%
            trd     0%   10%  19%  14%  10%  29%  38%  71%
10          notrd   0%   0%   0%   0%   0%   0%   8%   54%
            trd     0%   0%   5%   0%   0%   0%   19%  38%
11          notrd   4%   46%  42%  8%   40%  4%   46%  75%
            trd     0%   29%  29%  0%   14%  5%   24%  62%
13          notrd   0%   0%   0%   0%   10%  0%   0%   4%
            trd     5%   0%   5%   0%   0%   0%   5%   19%
14          notrd   0%   8%   0%   0%   10%  0%   0%   0%
            trd     0%   0%   0%   0%   0%   0%   0%   0%
15          notrd   0%   0%   8%   4%   20%  0%   13%  17%
            trd     0%   0%   0%   0%   5%   0%   0%   0%
16          notrd   8%   13%  8%   8%   40%  8%   17%  29%
            trd     5%   5%   10%  10%  0%   0%   10%  10%

when comparing the notrd and trd groups in the table, as of project 5 (after the introduction of templates) great variability arises, whether considered per project or per code smell. for code smells 4, 7, 10 and 13, one group is better for certain projects and the other group is better for others. for code smells 1, 2, 5, 6, 9 and 14, the difference between groups is very small. to sum up, no changes after the introduction of templates are observed for any of these code smells.
for code smell 11, a much lower percentage is observed in projects 5 and 7, and a lower percentage in project 8, on the part of the group using templates. in project 6, both groups have almost identical behavior. from the point of view of the templates, it may be the pseudocode template that is helping the students decrease the introduction of this code smell.
code smells 15 and 16 show a similar behavior. in both cases, the trd group almost never incurs in them, while the notrd group does, sometimes in a high percentage. number 15 refers to unused parameters in methods, and 16 to unused local variables. clearly, these types of code smells can be avoided with good software design. from the point of view of the use of templates, it may be that the development of pseudocode (logical template) and the functional template are preventing the students of the trd group from incurring in these code smells. in any case, it is necessary to manually analyze the templates submitted by the students and to interview them to know whether this is happening for the reasons described; this has not been done yet.
however, when analyzing the table considering only the data of the trd group throughout the 8 projects, we do not see that the use of templates improves the internal quality. it is worth noting that this group normally did not incur in code smells 15 and 16 (or did so in a very low percentage). observing projects 1 to 4 and 5 to 8 separately, we do not see any difference between them; that is, the behavior of this group before using templates and during their usage does not change for these code smells. so, the difference presented in the previous analysis between the trd and notrd groups does not seem to respond to the use of templates. something similar happens with code smell 11: results do not show a decrease of this code smell when using templates.
it can be observed that in project 8, the percentage of occurrence of code smells 4, 9 and 10 increases significantly for both groups. this increase makes us think that project 8 is more complex for the students. these three code smells indicate that the code developed is too complex and long for its comprehension. that is, the use of templates did not help the students elaborate a less complex and more understandable design.
putting both analyses together, we conclude that the use of templates does not improve the internal quality. more precisely, the use of templates does not seem to have an effect on the code smells in which the students incur when designing software.
4.3 effort dedicated to designing and coding
to answer rq2, "what is the relation between the effort dedicated to designing and the effort dedicated to coding? are there any variations in effort when students use templates?", we analyzed the following hypothesis test:
h2.0: median (tcod) <= median (tdld)
h2.1: median (tcod) > median (tdld)
as part of the base process, each student registered the time spent in the design phase (tdld) and the time spent in the code phase (tcod) for each project. to know the effort dedicated to designing and coding by the group that uses the templates and the group that does not, we analyzed both groups independently during projects 5 to 8; that is, on the one hand we analyzed the trd group during projects 5 to 8, and on the other hand the notrd group during projects 5 to 8. for each student, we calculated the time spent in design and the time spent in code for projects 5 to 8. the calculation for each pair of data is the following:

$\left( \sum_{n=5}^{8} tdld_n,\ \sum_{n=5}^{8} tcod_n \right)$ (2)

where tdld_n is the time spent in the design phase of project n, tcod_n is the time spent in the code phase of project n, and n varies from 5 to 8.
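a small sketch of this per-student aggregation, with a hypothetical phase log:

```python
# hypothetical log for one student: project -> minutes per phase
minutes = {
    5: {"design": 90, "code": 60},
    6: {"design": 120, "code": 70},
    7: {"design": 80, "code": 55},
    8: {"design": 110, "code": 80},
}

# equation (2): one (tdld, tcod) pair per student over projects 5 to 8
tdld = sum(minutes[n]["design"] for n in range(5, 9))
tcod = sum(minutes[n]["code"] for n in range(5, 9))
print((tdld, tcod))  # (400, 265)
```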
table 7 presents the 28 data pairs (tdld, tcod) for the trd group, and the 32 data pairs (tdld, tcod) for the notrd group.

table 7. data pairs for the trd group and the notrd group
trd group        notrd group
tdld   tcod      tdld   tcod
178    263       60     172
748    217       44     369
940    621       51     446
522    249       63     350
178    61        16     245
204    221       53     302
163    371       100    427
295    212       67     289
665    265       64     243
175    272       23     464
626    329       31     350
407    169       65     460
757    407       23     248
238    228       18     184
392    269       132    347
288    249       163    225
212    210       140    197
278    150       116    354
573    274       69     205
518    199       33     229
336    398       193    226
453    108       58     329
401    222       103    206
330    360       83     168
515    493       43     241
327    242       92     187
160    169       21     481
296    213       107    304
                 35     236
                 205    468
                 64     224
                 168    194

table 8 presents the mean and the interquartile range for the trd group and the notrd group.

table 8. mean and interquartile range for the notrd and trd groups
group        mean    interquartile range
trd tdld     399.1   287.5
trd tcod     265.7   26.7
notrd tdld   19.5    15.7
notrd tcod   292.8   132

the mean value of the trd group shows that the use of templates demands more design time compared with the group that did not use templates. furthermore, the design time in the trd group exceeds the time spent on coding. regarding tcod's mean, even though it is similar in the trd and notrd groups, a decrease in the trd group is observed. although the decrease is not significant, the use of templates might have helped students code in less time.
to determine the statistical test that best fits the problem, the distribution of the data was studied first. applying the kolmogorov-smirnov test to the trd group gives a significance value of 0.00478, indicating that the values do not fit a normal distribution. applying the kolmogorov-smirnov test to the notrd group returns 7.713e-12 as a significance value, so these values do not fit a normal distribution either.
as the data of both groups do not follow a normal distribution, wilcoxon's test for paired samples is used. the samples of each group are paired, since the sampled pairs (tdld, tcod) correspond to the same student. we executed the test for the trd group and for the notrd group independently.
for the notrd group, we proposed to find the value of x such that tcod = x*tdld. we analyzed the following hypothesis test:
h2.0: median (tcod of notrd) <= median (x*tdld of notrd)
h2.1: median (tcod of notrd) > median (x*tdld of notrd)
when executing the test for the notrd group with x=1, the null hypothesis is rejected (p-value = 4.169e-07, significance level 0.05), confirming that the coding time is greater than the designing time. to know how much greater, i.e., the relationship between these times (tcod = x*tdld), we applied the test again, multiplying tdld by an integer x until the null hypothesis could no longer be rejected. table 9 presents the results of the wilcoxon test.

table 9. wilcoxon test for the notrd group in projects 5 to 8
x=1         x=2         x=3       x=4
4.169e-07   4.088e-05   0.03861   0.541

the results indicate that for x=1, x=2 and x=3 the null hypothesis is rejected, so the coding time is greater than 3 times the design time. for x=4, the null hypothesis cannot be rejected (p-value = 0.541). in other words, students who did not use templates generally spent at least 3 times more time on coding than on designing.
in the case of the trd group, the mean value shows that students tend to dedicate more time to design than to code. therefore, we carried out the analysis in the inverse way, calculating x such that x*tcod = tdld. we analyzed the following hypothesis test:
h2.0: median (x*tcod of trd) >= median (tdld of trd)
h2.1: median (x*tcod of trd) < median (tdld of trd)
when executing the wilcoxon test for the trd group with x=1, the null hypothesis is rejected (p-value = 0.0007155), confirming that the coding time is less than the designing time. to know how many times more students spent on designing, we applied the test again, multiplying tcod by an integer x until the null hypothesis could not be rejected. table 10 presents the results of the wilcoxon test applied to the trd group.

table 10. wilcoxon test for the trd group in projects 5 to 8
x=1         x=2
0.0007155   0.998

the results indicate that for x=2 the null hypothesis cannot be rejected (p-value = 0.998). so, students who use templates spend more time designing than coding, but not double.
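this iterative multiplier search can be sketched as follows (a hypothetical subset of the notrd pairs from table 7; the one-sided alternative mirrors h2.1):

```python
from scipy.stats import wilcoxon

tdld = [60, 44, 51, 63, 16, 53]        # design minutes, notrd subset
tcod = [172, 369, 446, 350, 245, 302]  # coding minutes, same students

for x in range(1, 6):
    scaled = [x * t for t in tdld]
    stat, p = wilcoxon(tcod, scaled, alternative="greater")  # paired, one-sided
    print(f"x={x}: p-value={p:.4f}")
    if p >= 0.05:
        break  # cannot reject h2.0: tcod is not greater than x*tdld
```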
this result indicates that the group that used templates dedicated a greater effort to design than the group that did not use templates. to confirm that the relationship between designing time and coding time previously obtained for the trd group is due to the use of templates, and not to another factor dependent on the group, we studied the relationship (tcod, tdld) during projects 1, 3 and 4 (without the use of templates). table 11 presents the mean and the interquartile range of the pairs (tdld, tcod) for the trd group in projects 1, 3 and 4.

table 11. mean and interquartile range of the pairs (tdld, tcod) for the trd group in projects 1, 3 and 4
        mean   interquartile range
tdld    43     41.5
tcod    242    118

the values of the descriptive statistics of the trd group in projects 1, 3 and 4 are similar to those of the notrd group. in other words, during projects in which students design without using templates, the time spent on design is significantly less than the time spent on coding. table 12 presents the results of executing wilcoxon's test to analyze the relation tcod = x*tdld for the trd group in projects 1, 3 and 4.

table 12. wilcoxon test for the trd group in projects 1, 3 and 4
x=1         x=2         x=3         x=4       x=5
3.725e-09   3.725e-08   0.0002701   0.01245   0.09678

the results indicate that for x=5 the null hypothesis cannot be rejected (p-value = 0.09678). students of the trd group in projects 1, 3 and 4 generally spent at least 4 times more time on coding than on designing.
this result shows that there is an increase in the time dedicated to design after the students of the trd group begin to use design templates.
5 discussion
in the context of our experiment, we found that design representation using templates produced an increase in the time spent designing (which we were expecting). however, it did not help to develop better-quality software products, neither from an internal nor from an external point of view. results show that the use of templates improved neither the number of defects in the developed code (measured as defect density in ut) nor the internal quality (measured as the number of code smells in the code). these results are related to those reported by gravino (gravino et al., 2015), where the use of uml diagrams did not achieve any improvement in the comprehension of the source code vis-à-vis not using them.
in addition, the analysis of the relation between the effort dedicated to coding and the effort dedicated to designing showed that the use of templates produced an increase in design time. students who did not use the templates tended to spend 3 times more on coding than on designing. students who used templates spent more time designing than coding. moreover, students in both groups spent similar time on coding, and before using templates the students in the trd group behaved similarly to the notrd group.
we can conclude, then, that using templates to represent design increases the effort dedicated to design but does not have a significant positive effect on quality or on reducing coding time. this can be due to several factors that we must analyze in the future. it could be, among other reasons, that students are not used to these templates and so did not get the expected benefit; that they just filled in the templates without, at that moment, caring to think through or develop a quality system; that students do not know how to design (as found in other studies); or, as mentioned by chaiyo (chaiyo and ramingwong, 2013), that the templates are difficult for students to use.
we believe that students do not have the habit of designing and thinking of a solution before coding. although we think that the use of templates would be helpful, we believe that the students filled them in to achieve the goal without thinking of a design solution. rather, we believe that the usual student practice is code-and-fix. even though more analysis is needed, we agree with several authors on the fact that graduating students have difficulties designing and do not seem to understand what type of information to include when designing software (eckerdal et al., 2006a,b; loftus et al., 2011).
6 threats to validity
most empirical studies are threatened by the way research is conducted (wohlin et al., 2012). this section describes the threats to validity we have detected.
internal validity threats: investigating with students involves several threats. on the one hand, the fact that the context of the experiment is a course implies that the students do not behave naturally. we tried to minimize this threat with a non-graded course, that is, the student either passed or failed. besides, we remarked on the importance of following and recording the process just as it was, and we emphasized that students' assessments would not depend on results, defects found, or effort spent. on the other hand, there is a threat that students share information or solutions to projects.
in this sense, the assigned teachers reviewed the submissions and compared them between students to ensure there were no duplicate submissions. in addition, students carry out their projects at home, which limits control by the teachers; to reduce this threat, we introduced supervision, corrections, and feedback between the student and the assigned teacher. besides, for the analysis we aggregated the data of the three courses, knowing that the different courses can influence the data collected, as this is a hierarchical model. we tried to reduce this threat through the use of a defined and disciplined process that the students followed, and by keeping the same material and the same teachers throughout the three courses.
external validity threats: experimenting with students of a course has the advantage that they are available and willing to participate in experiments, and the disadvantage that their characteristics cannot be generalized. in our experiment, students took part in the pfpsp course voluntarily and did not know that they were part of an experiment until they finished the course. this minimizes the bias they might have when feeling part of a research study. conversely, the results obtained in this experiment cannot be generalized to students' design practice in other contexts.
construct validity threats: this kind of threat is related to the way in which the response variables were measured. in our experiment, we measured effort as the time in minutes that the student spends on each phase, and quality as the number of defects in ut and the number of code smells in which students incur. to ensure correct data recording, we used a data recording tool and a framework that allows a disciplined and measurable process to be followed.
conclusion validity threats: the number of students in the research constitutes a threat to the statistical conclusions. 61 students participated during the three replications. this causes the statistical analysis to be carried out using non-parametric tests, whose statistical power is lower than that of parametric tests. as a measure against this threat, we complemented the non-parametric tests with descriptive statistics.
7 conclusions
this work is one step further towards the understanding of the software design practice. the results of our experiment show that graduating students do not improve software quality when using templates for design representation. however, using templates produces a significant increase in the time spent on the design phase without reducing coding time.
we analyzed the software quality from the internal and external points of view, and the effort dedicated to design. on the one hand, we statistically showed that using templates for design representation does not improve the external software quality, measured as the defect density in unit testing. from the internal quality perspective, the use of templates does not have a significant positive effect on the code smells in which students incur when designing software.
regarding the effort, students who used templates dedicated a greater effort to designing than to coding (although not double). meanwhile, students who did not use templates dedicated four times less effort to designing than to coding.
our results are related to those mentioned by gravino and torchiano (gravino et al., 2015; torchiano et al., 2017), where the use of uml diagrams to design does not yield significant improvements in source code comprehension tasks.
also, regarding effort, students who use diagrams spend twice as much time on the same source code comprehension task as students who do not use them. gravino analyzes the experience factor and finds that the most experienced students achieve an improvement in the understanding of the source code (gravino et al., 2015). although we did not analyze the experience factor of the graduating students, it could be an analysis to be performed in the future.
our research focuses on graduating students, most of them working in the uruguayan software industry as junior engineers. these engineers usually perform programming tasks, which include low-level design. the results obtained in our experiment cannot be generalized to all junior developers, and even less to senior developers.
our results raise new questions about the practice of software design: what do students usually design? what kind of information do they include when designing? is it possible for them to produce their designs mentally, without representing them? do they know the effect of a good design on software quality?
continuing with this line of research, in 2018 we executed an experiment that sought to know how students usually design. students performed the same 8 projects during this experiment and delivered the design representation made in a natural way (without templates). although we have not yet finalized the data analysis, in a preliminary analysis we have found that students do not deliver complete designs. in general, they use informal/natural language and, in a few cases, incomplete class diagrams. studying the students' habitual behavior when designing software should help identify potential problems in design practices and find better ways of teaching skills for developing quality software. in 2019 and 2020, no experiments could be performed, but in 2021 we are replicating the 2018 experiment to gather more data. as future work, we will finish the above-mentioned analysis to identify potential problems in the design practices and find better ways of teaching skills for developing quality software. also, we plan to analyze the designs produced with the templates to know what students design, and to conduct interviews with students to know their experience using templates. on the other hand, we find it interesting to experiment with a simple mdd tool to know its effect on software quality.
references
arisholm, e., briand, l. c., hove, s. e., and labiche, y. (2006). the impact of uml documentation on software maintenance: an experimental evaluation. ieee transactions on software engineering, 32(6).
ayewah, n., pugh, w., hovemeyer, d., morgenthaler, j. d., and penix, j. (2008). using static analysis to find bugs. ieee software, 25(5).
bourque, p. and fairley, r. e. (2014). guide to the software engineering body of knowledge swebok v3.0. ieee computer society, 2014 version edition.
brown, w. h., malveau, r. c., mccormick, h. w., and mowbray, t. j. (1998). antipatterns: refactoring software, architectures, and projects in crisis. john wiley & sons, inc.
budgen, d., burn, a. j., brereton, o. p., kitchenham, b. a., and pretorius, r. (2011). empirical evidence about the uml: a systematic literature review. software: practice and experience, 41(4):363-392.
campbell, g. a. and papapetrou, p. p. (2013). sonarqube in action. manning publications co, 2013 version edition.
carrington, d. and kim, s.-k. (2003). teaching software design with open source software. in 33rd annual frontiers in education.
chaiyo, y. and ramingwong, s. (2013). the development of a design tool for personal software process (psp). in 10th international conference on electrical engineering/electronics, computer, telecommunications and information technology, pages 1-4.
chemuturi, m. (2018). software design: a comprehensive guide to software development projects. crc press/taylor & francis group.
chen, t.-y., cooper, s., mccartney, r., and schwartzman, l. (2005). the (relative) importance of software design criteria. sigcse bull., 37(3):34-38.
eckerdal, a., mccartney, r., moström, j. e., ratcliffe, m., and zander, c. (2006a). can graduating students design software systems? in sigcse bull., pages 403-407. acm, association for computing machinery.
eckerdal, a., mccartney, r., moström, j. e., ratcliffe, m., and zander, c. (2006b). categorizing student software designs: methods, results, and implications. computer science education, 16(3):197-209.
fernández-sáez, a., genero, m., and chaudron, m. (2013). empirical studies concerning the maintenance of uml diagrams and their use in the maintenance of code: a systematic mapping study. information and software technology, 55:1119-1142.
flores, p. and medinilla, n. (2017). conceptions of the students around object-oriented design: a case study. in xii jornadas iberoamericanas de ingenieria de software e ingeniería del conocimiento.
fowler, m. (2018). refactoring: improving the design of existing code. addison-wesley professional.
gibbon, c. a. (1997). heuristics for object-oriented design. phd thesis, university of nottingham.
gopichand, m., swetha, v., and ananda rao, a. (2010). software defect detection and process improvement using personal software process data. in international conference on communication control and computing technologies, pages 794-799.
gravino, c., scanniello, g., and tortora, g. (2015). source code comprehension tasks supported by uml design models: results from a controlled experiment and a differentiated replication. journal of visual languages & computing, 28:23-38.
grazioli, f. and nichols, w. (2012). a cross course analysis of product quality improvement with psp. in team software process symposium 2012, pages 76-89.
grazioli, f., nichols, w., and vallespir, d. (2014a). an analysis of student performance during the introduction of the psp: an empirical cross-course comparison. in team software process symposium 2013, pages 11-21.
grazioli, f., vallespir, d., pérez, l., and moreno, s. (2014b). the impact of the psp on software quality: eliminating the learning effect threat through a controlled experiment. adv. soft. eng., 2014.
group, s. (2015). the chaos report.
hayes, w. and over, j. (1997). the personal software process (psp): an empirical study of the impact of psp on individual engineers. technical report cmu/sei-97-tr-001, software engineering institute, carnegie mellon university, pittsburgh, pa.
hu, c. (2013). the nature of software design and its teaching: an exposition. acm inroads, 4(2).
humphrey, w. (2005). psp: a self-improvement process for software engineers. addison-wesley professional.
humphrey, w. s. (1995). a discipline for software engineering. addison-wesley longman publishing co., inc.
joint task force on computing curricula acm and ieee computer society (2013). computer science curricula 2013: curriculum guidelines for undergraduate degree programs in computer science. association for computing machinery, new york, ny, usa.
jones, b. and kenward, m. g. (2014). design and analysis of crossover trials. chapman and hall/crc, 3rd edition.
karasneh, b., jolak, r., and chaudron, m. r. v. (2015). using examples for teaching software design: an experiment using a repository of uml class diagrams. in 2015 asia-pacific software engineering conference.
kitchenham, b. and pfleeger, s. l. (1996). software quality: the elusive target. ieee software, 13(1):12-21.
kramer, j. (2007). is abstraction the key to computing? commun. acm, 50(4):36-42.
leung, f. and bolloju, n. (2005). analyzing the quality of domain models developed by novice systems analysts. in 38th hawaii international conference on system sciences.
linder, s. p., abbott, d., and fromberger, m. j. (2006). an instructional scaffolding approach to teaching software design. journal of computing sciences in colleges, 21.
loftus, c., thomas, l., and zander, c. (2011). can graduating students design: revisited. in proceedings of the 42nd acm technical symposium on computer science education. acm.
martin, r. c. (2002). agile software development: principles, patterns, and practices. prentice hall.
moreno, s. and vallespir, d. (2018). ¿los estudiantes de pregrado son capaces de diseñar software? estudio de la relación entre el tiempo de codificación y el tiempo de diseño en el desarrollo de software. in conferencia iberoamericana de ingeniería de software 2018.
nistala, p., nori, k. v., and reddy, r. (2019). software quality models: a systematic mapping study. in 2019 ieee/acm international conference on software and system processes, pages 125-134.
panach, j. i., dieste, o., marín, b., españa, s., vegas, s., pastor, o., and juristo, n. (2021). evaluating model-driven development claims with respect to quality: a family of experiments. ieee transactions on software engineering, 47(1):130-145.
petre, m. (2013). uml in practice. international conference on software engineering, 35.
pierce, k., deneen, l., and shute, g. (1991). teaching software design in the freshman year. in software engineering education. springer berlin heidelberg.
prechelt, l. and unger, b. (2001). an experiment measuring the effects of personal software process (psp) training. ieee transactions on software engineering, 27(5):465-472.
senn, s. (2002). cross-over trials in clinical research. john wiley & sons, ltd, 2nd edition.
siau, k. and tan, x. (2005). improving the quality of conceptual modeling using cognitive mapping techniques. data & knowledge engineering, 55(3), special issue on quality in conceptual modeling.
soh, z., sharafi, z., van den plas, b., cepeda porras, g., guéhéneuc, y.-g., and antoniol, g. (2012). professional status and expertise for uml class diagram comprehension: an empirical study. in ieee international conference on program comprehension.
sommerville, i. (2016). software engineering. pearson.
stevenson, j. and wood, m. (2018). recognising object-oriented software design quality: a practitioner-based questionnaire survey. software quality journal, 26.
taylor, r. n. (2011). conference welcome message. in proc. 33rd international conference on software engineering. association for computing machinery.
tenenberg, j. (2005). students designing software: a multi-national, multi-institutional study. informatics in education, 4.
torchiano, m., scanniello, g., ricca, f., reggio, g., and leotta, m. (2017). do uml object diagrams affect design comprehensibility? results from a family of four controlled experiments. journal of visual languages & computing, 41.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer science & business media.
journal of software engineering research and development, 2021, 9:6, doi: 10.5753/jserd.2021.1094 this work is licensed under a creative commons attribution 4.0 international license.

everest: an automatic model-based testing tool for asynchronous reactive systems

adilson luiz bonifacio [ universidade estadual de londrina | bonifacio@uel.br ]
camila sonoda gomes [ universidade estadual de londrina | camilasonoda@uel.br ]

abstract

reactive systems are characterized by their interaction with the environment, where the exchange of input and output stimuli usually occurs asynchronously. systems of this nature, in general, require a rigorous testing activity in the development process. therefore model-based testing has been successfully applied to asynchronous reactive systems using input/output labeled transition systems (iolts) as the underlying formalism. in this work, we present a reactive testing tool to check conformance, generate test suites, and run test cases using iolts models. our tool can check whether the behavior of an implementation under test (iut) complies with the behavior of its respective specification. we have implemented two conformance relations in our tool: the classical ioco relation and a conformance relation based on regular languages. the tool also provides test suite generation in a black-box testing setting for finding faults in iuts within a specific fault domain. in addition, we describe some case studies to probe the tool's functionalities and give a comparative analysis. finally, we offer practical experiments to evaluate the performance of our tool in several scenarios.

keywords: model-based testing, conformance checking, test generation, reactive systems, automatic tool

1 introduction

several real-world systems are ruled by reactive behaviors that constantly interact with the environment by receiving input stimuli and producing outputs in response. systems of this nature, in general, are also critical, thus requiring precise and automatic support in the development process. model-based testing methods and their respective tools have been largely applied in the testing activity when developing such systems. the input/output labeled transition system (iolts) (tretmans, 2008) has been commonly employed as the formalism for modeling and testing asynchronous reactive systems. an iolts can specify the desired behaviors of an implementation candidate, and the testing task can then be applied to find faults in it.

one important issue of model-based testing is conformance checking, where we verify whether a given implementation under test (iut) complies with its corresponding specification according to a certain fault model. here we treat the classical notion of input/output conformance testing (ioco) (tretmans, 2008) and the more recent testing conformance relation based on regular languages (bonifacio and moura, 2019) to define fault models. test generation is also an important task of model-based testing, especially when generating test cases for reactive systems in a black-box setting.
in this work, we present an automatic tool named everest (gomes and bonifacio, 2019) that can check conformance between a given iut and its respective specification. the name stands for conformance verification on testing reactive systems, and the tool's development was supported by capes. everest can also generate test suites based on specifications modeled by ioltss and enable black-box testing over iuts.

we show that everest has a wider range of applications when compared to other testing tools, since it implements not only the classical ioco relation but also the more recent language-based conformance checking. we also describe real-world scenarios that relate both approaches, where the language-based conformance method is able to find faults that cannot be detected using the ioco relation. further, experiments are performed to evaluate our tool when generating and running test suites in a black-box scenario, and also to compare and evaluate everest against a well-known tool from the literature (belinfante, 2010a) w.r.t. the conformance checking task.

the remainder of this paper is organized as follows. we comment on related work in section 2. section 3 describes the conformance checking approaches and the test suite generation method. in section 4 we discuss important aspects comparing everest to another tool from the literature and also present a real-world case study. practical experiments on conformance checking and test suite generation are given in section 5. section 6 offers some concluding remarks and future directions.

2 related works

reactive systems have been properly specified by iolts models to describe their syntax and semantics. hence model-based testing techniques and practical tools have been applied in testing activities to support system development using ioltss. several works have therefore studied aspects related to iolts-based testing, such as test generation, conformance relations, and their checking methods. here we survey some works that are more closely related to our testing tool and its features.

the ioco relation has been proposed by tretmans (2008) for iolts models, where iuts are treated as black boxes, i.e., the tester, seen as an artificial environment that drives the testing activity, has no access to the internal structure of the iuts. however, some restrictions must be guaranteed over the specification, iut, and tester models, such as input-completeness and output-determinism. further, the algorithms therein are more theoretical and may lead to infinite test suites, making it difficult to devise solutions for practical applications.

an ioco-based testing theory has also been proposed by de vries (2001) to obtain e-complete test suites. this approach focuses on specific test purposes that share particular properties related to certain testing goals. only observable behaviors based on specific criteria are considered when testing black-box iuts, so the test purposes somewhat limit the fault coverage spectrum, e.g., producing inconclusive verdicts. the test generation method also produces large, even infinite, test suites, thus requiring test selection criteria to avoid this problem.

simão and petrenko (2014) have described an approach to generate finite ioco-complete test suites for a class of iolts models. however, their approach imposes a number of restrictions on the models.
test purposes must be single-input and output-complete, and specifications and iuts must be input-complete, progressive, and initially-connected. so the class of iolts models that can be tested is very restricted according to their fault model.

roehm et al. (2016) have introduced a conformance test based on safety properties. despite being a weaker relation than trace-inclusion conformance, it allows for tuning a trade-off between accuracy and computational complexity when checking conformance. their approach searches for counterexamples instead of verifying the whole system. however, this approach and the previous ones have a more theoretical leaning, and we are not aware of practical tools implementing their algorithms.

a more recent work has been proposed by bonifacio and moura (2019), where few restrictions are considered and finite sets of test purposes can be generated in practical situations. in some rare cases their algorithm may lead to exponentially sized testers, but the approach allows for a wider class of iolts models, and a low-degree polynomial-time algorithm is devised for efficiently testing ioco-conformance in practical applications.

in this work, we have implemented the more general and recent approach (bonifacio and moura, 2019) with the language-based and ioco conformance relations, as well as test suite generation for white-box and black-box scenarios. in the literature, we have found jtorx (belinfante, 2014, 2010b), a closely related testing tool that implements the ioco relation and the uioco variation for underspecified models. tgv (mark utting, 2007; calamé, 2005; jard and jéron, 2005) is also a testing tool designed for checking ioco conformance, similarly to testor (marsso et al., 2018), an on-the-fly test case generation tool. however, the test generation methods of tgv and testor are only sound, i.e., they are not exhaustive, and so we cannot get complete test suites. further, the soundness property of the generated test suites is only guaranteed by both tools over specific test purposes.

although several other tools have been proposed for model-based testing, many of them move away from the scope of our work. for instance, some tools and approaches implement variations of the ioco theory, e.g., the rioco and sioco relations, such as stg (symbolic test generator) (clarke et al., 2002), torxakis (mostowski et al., 2009), and uppaal-tron (larsen et al., 2005), which deal with symbolic and timed models.

3 a model-based testing method

asynchronous reactive systems are commonly specified by iolts models, a variation of labeled transition systems (ltss) (tretmans, 1993) with a partitioning of input and output labels.

definition 1. an input/output labeled transition system (iolts) is a tuple s = (s, s0, li, lu, t) where:
• s is a finite set of states;
• s0 ∈ s is the initial state;
• li is a set of input labels;
• lu is a set of output labels;
• l = li ∪ lu and li ∩ lu = ∅;
• t ⊆ s × (l ∪ {τ}) × s is a finite set of transitions, where the internal action τ ∉ l; and
• (s, s0, l, t) is the underlying lts associated with s.

we indicate a transition by (s, l, r) ∈ t, where s ∈ s is the source state, r ∈ s is the target state, and l ∈ (l ∪ {τ}) is the label. a transition (s, τ, r) ∈ t indicates an internal action, which means that an external observer cannot see the movement from s to r in the model.
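definition 1 maps naturally onto a small data structure. the sketch below, in python, is a minimal illustration under our own naming (IOLTS, enabled, and add_quiescence are illustrative, not everest's actual api); it stores the transition relation explicitly as triples, and it anticipates the quiescence construction discussed next.

```python
# a minimal sketch of definition 1, assuming finite, explicitly enumerated
# models; all names here are illustrative, not everest's actual api.
from dataclasses import dataclass, field

TAU = "tau"      # internal action, not part of L
DELTA = "delta"  # quiescence label (see below), not part of L either

@dataclass
class IOLTS:
    states: set          # S
    s0: str              # initial state
    L_in: set            # input labels Li
    L_out: set           # output labels Lu
    T: set = field(default_factory=set)  # transitions (source, label, target)

    def enabled(self, s):
        """labels of all transitions leaving state s."""
        return {l for (p, l, q) in self.T if p == s}

    def add_quiescence(self):
        """add a delta self-loop at every quiescent state, i.e., at states
        where no output of Lu and no internal action tau is defined."""
        for s in self.states:
            if not self.enabled(s) & (self.L_out | {TAU}):
                self.T.add((s, DELTA, s))
```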
an iolts may also have quiescent states. a state s is quiescent if no output x ∈ lu and no internal action τ is defined on it (tretmans, 2008). when a state s is quiescent, a transition (s, δ, s) is added to t, where δ ∉ lτ. note that l ∪ {τ} is denoted by lτ to ease the notation. we also note that, in a real black-box testing scenario where an iut sends messages to the tester and receives back responses, quiescence indicates that the iut can no longer respond to the tester, has timed out, or is slow (bonifacio and moura, 2019).

in what follows we define the semantics of iolts/lts models, but first we introduce the notion of paths.

definition 2. let s = (s, s0, l, t) be an lts and p, q ∈ s. let σ = l1 · · · ln be a word in lτ⋆. we say that σ is a path from p to q in s if there are states ri ∈ s and labels li ∈ lτ, 1 ≤ i ≤ n, such that (ri−1, li, ri) ∈ t, with r0 = p and rn = q. we say that α is an observable path from p to q in s if α is obtained by removing all internal actions τ from such a σ.

a path can also be denoted by s −σ→ s′, where the behavior σ ∈ lτ⋆ starts in the state s ∈ s and reaches the state s′ ∈ s. an observable path σ from s to s′ is denoted by s =σ⇒ s′. we can also write s −σ→ or s =σ⇒ when the target state is not important. all paths starting at a state s are called paths of s. now we give the semantics of iolts/lts models.

definition 3. let s = (s, s0, l, t) be an lts and s ∈ s:
1. the set of all paths from s is denoted by tr(s) = {σ | s −σ→}, and the set of all observable paths from s is denoted by otr(s) = {σ | s =σ⇒}.
2. the semantics of s is given by tr(s0), or tr(s), and the observable semantics of s is denoted by otr(s0), or otr(s).

the semantics of an iolts is defined by the semantics of its underlying lts.

3.1 checking conformance on reactive systems

given an iolts specification, a conformance checking task can determine whether an iut complies with the corresponding specification according to a specific fault model. the classical ioco relation (tretmans, 2008) establishes a notion of conformance where input stimuli are applied to both the iut and the specification models, to observe whether the outputs produced by the iut are also defined in the specification (bonifacio and moura, 2019).

definition 4. let s = (s, s0, li, lu, t) be a specification and let i = (q, q0, li, lu, r) be an iut. we say that i ioco s if, and only if, out(q0 after σ) ⊆ out(s0 after σ) for all σ ∈ otr(s), where s after σ = {q | s =σ⇒ q} for every s ∈ s. otherwise, i ioco s does not hold.

a more recent conformance relation (bonifacio and moura, 2019) has also been proposed using regular languages. given an iut i, a specification s, and regular languages d and f, we say that i complies with s according to (d, f), i.e., i confd,f s if, and only if, no undesirable behavior of f observed in i is specified in s, and every desirable behavior of d observed in i is also specified in s.

definition 5. given an alphabet l = li ∪ lu and languages d, f ⊆ l⋆ over l, let s and i be iolts models over l. we have that i confd,f s if, and only if, (i) if σ ∈ otr(i) ∩ f, then σ ∉ otr(s); and (ii) if σ ∈ otr(i) ∩ d, then σ ∈ otr(s).

this new notion, with a wider fault coverage, is characterized by proposition 1, where desirable and undesirable behaviors can be specified by regular languages.

proposition 1 (bonifacio and moura, 2019). let s and i be iolts models over an alphabet l = li ∪ lu, and let d, f ⊆ l⋆ be regular languages over l. then i confd,f s if, and only if, otr(i) ∩ [(d ∩ (l⋆ − otr(s))) ∪ (f ∩ otr(s))] = ∅, where l⋆ − otr(s) is the complement of otr(s).
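for intuition, both relations can be decided naively on trace sets bounded by a length k, continuing the hypothetical IOLTS sketch above; this enumeration is purely illustrative and exponential in k, unlike the automata-based algorithms the paper relies on.

```python
# naive, k-bounded renderings of definition 4 and proposition 1.
def tau_closure(m, states):
    """states reachable from `states` via internal tau-steps only."""
    todo, seen = list(states), set(states)
    while todo:
        s = todo.pop()
        for (p, l, q) in m.T:
            if p == s and l == TAU and q not in seen:
                seen.add(q)
                todo.append(q)
    return seen

def otr_k(m, k):
    """observable traces of m (words over L ∪ {delta}) up to length k."""
    traces, seen = {()}, {((), m.s0)}
    stack = [((), m.s0)]
    while stack:
        w, s = stack.pop()
        for (p, l, q) in m.T:
            if p != s:
                continue
            nxt = (w, q) if l == TAU else (w + (l,), q) if len(w) < k else None
            if nxt and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
                traces.add(nxt[0])
    return traces

def after(m, sigma):
    """the set `m.s0 after sigma` of definition 4."""
    cur = tau_closure(m, {m.s0})
    for l in sigma:
        cur = tau_closure(m, {q for s in cur
                              for (p, ll, q) in m.T if p == s and ll == l})
    return cur

def out_of(m, states):
    """outputs (including delta) enabled at any of the given states."""
    return {l for s in states for (p, l, q) in m.T
            if p == s and l in m.L_out | {DELTA}}

def ioco_k(iut, spec, k):
    """definition 4 restricted to traces of length at most k."""
    return all(out_of(iut, after(iut, s)) <= out_of(spec, after(spec, s))
               for s in otr_k(spec, k))

def conf_k(iut, spec, D, F, k):
    """proposition 1 on k-bounded traces, with D and F as finite trace sets."""
    otr_s = otr_k(spec, k)
    bad = {w for w in D if w not in otr_s} | {w for w in F if w in otr_s}
    return not (otr_k(iut, k) & bad)
```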
both notions of conformance can be related by the following lemma, which shows that the language-based conformance relation given in definition 5 subsumes the classical ioco relation given by definition 4.

lemma 1 (bonifacio and moura, 2019). let i = (q, q0, li, lu, r) be an iut and let s = (s, s0, li, lu, t) be a specification. then i ioco s if, and only if, i confd,f s when d = otr(s)·lu and f = ∅.

bonifacio and moura (2019) have proposed the language-based conformance checking using the theory of automata (sipser, 2006). lts/iolts models are transformed into finite state automata (fsa), where the semantics of an fsa is given by the language it accepts. so r ⊆ l⋆ is regular if there exists an fsa m such that l(m) = r, where l is an alphabet. therefore we can effectively construct automata ad and af for the regular languages d and f such that d = l(ad) and f = l(af). now we define test cases and test suites in terms of regular languages.

definition 6. let l be a set of symbols. a test suite t over l is a language t ⊆ l⋆, where each σ ∈ t is a test case.

since a test suite is a regular language, there is always an fsa a that accepts it, where the final states are fault states. thus the set of undesirable behaviors, the so-called fault model of s (bonifacio and moura, 2019), is defined by the fault states.

therefore we can obtain a complete test suite for an iolts specification s and a pair of languages (d, f) using proposition 1. that is, using the test suite t = (d ∩ (l⋆ − otr(s))) ∪ (f ∩ otr(s)) we can detect the absence of desirable behaviors specified by d and the presence of undesirable behaviors specified by f. an iut i is then declared in compliance with a specification s if no test case of the test suite t is also a behavior of i (bonifacio and moura, 2019).

the testing process first obtains an automaton a1 induced by the iolts specification s. since l(a1) = otr(s), we can effectively construct an fsa a2 such that l(a2) = l(af) ∩ l(a1) = f ∩ otr(s). also, consider the fsa b1 obtained from a1 by reversing its set of final states, that is, a state s is final in b1 if, and only if, s is not final in a1. clearly, l(b1) = l⋆ − l(a1) = l⋆ − otr(s). we can now get an fsa b2 such that l(b2) = l(ad) ∩ l(b1) = d ∩ (l⋆ − otr(s)). since a2 and b2 are fsas, we can construct an fsa c such that l(c) = l(a2) ∪ l(b2), where l(c) = t. we conclude that, when d and f are regular languages and s is a deterministic specification, a complete fsa t can be constructed such that l(t) = t.
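in the k-bounded setting of the earlier sketch, the same suite t collapses to one line of set algebra; the hedged fragment below mirrors the a2/b2/c construction, with python sets standing in for the fsa intersection, complement, and union.

```python
# bounded-word stand-in for the fsa construction of the test suite
# t = (d ∩ (L* − otr(s))) ∪ (f ∩ otr(s)); illustrative only.
def test_suite_k(spec, D, F, k):
    """every word returned is a test case; an iut passes iff none of its
    k-bounded observable traces belongs to the suite (cf. conf_k above)."""
    otr_s = otr_k(spec, k)
    return ({w for w in D if w not in otr_s}    # plays the role of b2
            | {w for w in F if w in otr_s})     # plays the role of a2
```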
next we state an algorithm with polynomial time complexity using the language-based conformance relation.

proposition 2 (bonifacio and moura, 2019). let s and i be deterministic specification and implementation ioltss over l with ns and ni states, respectively, and let |l| = nl. let ad and af be deterministic fsas over l with nd and nf states, respectively, such that l(ad) = d and l(af) = f. then we can effectively construct a complete fsa t with (ns + 1)² · nd · nf states such that l(t) is a complete test suite for s and (d, f). moreover, there is an algorithm with polynomial time complexity θ(ns² · ni · nd · nf · nl) that effectively checks whether i confd,f s holds.

now, using lemma 1, we establish a relationship between the ioco and language-based relations in theorem 1.

theorem 1 (bonifacio and moura, 2019). let s and i be deterministic ioltss over l with ns and ni states, respectively. let l = li ∪ lu and |l| = nl. then we can effectively construct an algorithm with polynomial time complexity θ(ns · ni · nl) that checks whether i ioco s holds.

3.2 complete test suite generation

in this work, we also provide test suite generation in a black-box testing setting using the notion of test purposes (tretmans, 2008). a test purpose (tp) is formally defined as an iolts with two special states, pass and fail, and, in practice, it represents an external tester that interacts with an iut. a fault model is then composed of tps derived from a given specification. to ease the notation, from now on we denote by io(li, lu) the class of all ioltss over l = li ∪ lu.

definition 7. let li and lu be the input and output alphabets, respectively, with l = li ∪ lu. a test purpose (tp) over l is an iolts t ∈ io(lu, li) such that, for all σ ∈ l⋆, neither fail =σ⇒ pass nor pass =σ⇒ fail holds. a fault model over l is a finite set of tps over l.

the test case generation proposed by tretmans (2008), based on the ioco relation, imposes some restrictions on the formal models. all tps must be acyclic, with finite runs, and input-enabled, since the tester cannot predict the output produced by a black-box iut; therefore, all output actions that can be produced by the iut must be enabled in the respective tp. moreover, tps must be output-deterministic, i.e., each state can send only one output symbol to the iut, in order to avoid arbitrary and nondeterministic choices. in the pass and fail states only self-loop transitions are allowed, since verdicts are obtained in these states.

definition 8. let s ∈ io(li, lu). we say that s is output-deterministic if |out(s)| = 1 for all s ∈ s, and that s is input-enabled if inp(s) = li for all s ∈ s, where out(s) and inp(s) give the outputs and inputs, respectively, defined at state s.

hence all restrictions imposed by tretmans (2008) are satisfied when a tp is input-enabled, output-deterministic, and acyclic except for the pass and fail states. however, a bound on the number of states of the iuts must be imposed to keep the tps acyclic: completeness can only be guaranteed relative to a class of implementations, in the sense that the test suite must declare i ioco s exactly for those iuts i in the class that conform to s. therefore we define such a class of implementations by establishing an upper bound on the number of states of the iuts, which guarantees the ioco-completeness property when generating test suites.

now we are in a position to construct a complete test suite using the notion of tps. first we generate a multigraph structure as proposed by bonifacio and moura (2019). given an iut i and a specification s, let m be the bound on the number of states considered for the iuts and let n be the number of states of s. the multigraph then has m·n + 1 levels and, at each level, any transition of s that would give rise to a cycle is redirected onto the states of the next level of the multigraph. a fail state is also added, and new transitions from every state of the multigraph to fail are defined, labeled by each l ∈ lu that is not defined at that state.
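the construction can be pictured with a few lines of code. the sketch below, again over the hypothetical IOLTS class, redirects cycle-closing transitions to the next level using an arbitrary fixed order on states; this ordering choice is our simplification, the construction only requires the result to be acyclic.

```python
# a sketch of the acyclic multigraph of section 3.2 for a deterministic
# specification; node (s, lvl) is state s replicated at level lvl, and any
# transition that would not move strictly forward inside a level is sent
# one level down. FAIL collects every undefined output.
def build_multigraph(spec, m):
    order = {s: i for i, s in enumerate(sorted(spec.states))}
    levels = m * len(spec.states) + 1          # m·n + 1 levels
    FAIL, edges = "fail", set()
    for lvl in range(levels):
        for (s, l, q) in spec.T:
            dst_lvl = lvl if order[q] > order[s] else lvl + 1
            if dst_lvl < levels:
                edges.add(((s, lvl), l, (q, dst_lvl)))
        for s in spec.states:                  # missing outputs are faults
            defined = {l for (p, l, q) in spec.T if p == s}
            for l in spec.L_out - defined:
                edges.add(((s, lvl), l, FAIL))
    return edges, FAIL
```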
having an acyclic multigraph at hand, we can extract tps using a simple breadth-first search from the initial state to fail. we guarantee the input-enabledness property by adding the pass state to the tp and, for every output of lu and every state where that output is not defined, adding a transition to pass. self-loops labeled by each l ∈ lu are also added to the pass and fail states. the output-determinism property is likewise guaranteed by adding a transition to pass from every state where no input of li is defined. note that we always refer to an input symbol of lu or an output symbol of li from the perspective of the iut, as commonly done in the literature (tretmans, 2008; bonifacio and moura, 2019).

the test run is then defined by the synchronous product between a tp t and an iut i, denoted by t × i. the tp interacts with the iut by producing outputs that are sent to i as inputs. likewise, the iut receives actions from the tp and produces outputs that are sent to t as inputs. so the output alphabet of t corresponds to li, the input alphabet of the iut, and the input alphabet of t corresponds to lu, the output alphabet of the iut.

definition 9. let i = (si, q0, li, lu, ti) ∈ io(li, lu) be an implementation and t = (st, t0, lu, li, tt) ∈ io(lu, li) be a tp. we say that i passes t if, for any σ ∈ (li ∪ lu)⋆ and any state q ∈ si, we do not have (t0, q0) =σ⇒ (fail, q) in t × i.

a path can be denoted by q0 =σ⇒ q, where the behavior σ starts in the state q0 and reaches the state q. let m be a fault model; we say that i passes m if i passes all tps in m. then, given an iolts s and a set imp ⊆ io(li, lu)[m], we say that m is m-ioco-complete for s with respect to imp if, for all iuts i ∈ imp, i ioco s if, and only if, i passes m.

the verdicts are obtained when tps reach the special states. the fail verdict reveals a faulty behavior, whereas the pass verdict denotes a desirable behavior. further details can be found in (bonifacio and moura, 2019; gomes and bonifacio, 2019).

finally, the next proposition determines a fault model composed of tps obtained from a multigraph which, in turn, is constructed from the corresponding specification.

proposition 3. let s ∈ io(li, lu) be a deterministic iolts and m ≥ 1. then there is a fault model m that is m-ioco-complete for s relative to io(li, lu)[m], the ioltss with at most m states, whose tps are deterministic, output-deterministic, input-enabled, and acyclic except for the self-loops on the pass and fail states.
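a test run itself is easy to picture: explore the product of the tp and the iut and report whether a fail-state pair is reachable. the following hedged sketch assumes label-level synchronization (the tp's outputs are the iut's inputs and vice versa, so both sides simply share labels) and finite models; a fault model m is then handled by calling it once per tp and conjoining the verdicts.

```python
# a sketch of the test run of definition 9 over the hypothetical IOLTS
# class: breadth-first reachability in the synchronous product t × i.
def passes(tp, iut):
    """True iff no product state (fail, q) is reachable, i.e., i passes t."""
    frontier = {(tp.s0, iut.s0)}
    seen = set(frontier)
    while frontier:
        nxt = set()
        for (t, q) in frontier:
            if t == "fail":
                return False
            for (p1, l1, r1) in tp.T:
                if p1 != t:
                    continue
                for (p2, l2, r2) in iut.T:
                    # complementary actions carry the same label here
                    if p2 == q and l2 == l1:
                        nxt.add((r1, r2))
        frontier = nxt - seen
        seen |= nxt
    return True
```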
4 a testing tool for reactive systems

everest (gomes and bonifacio, 2019) has been developed to check conformance, generate test suites, and run tests over reactive systems specified by lts/iolts models. we have organized the tool's architecture into four modules: configuration; ioco conformance; language-based conformance; and test generation & run. the configuration module allows us to set up the testing scenario, and the conformance checking modules yield the testing verdicts. when an iut does not conform to the specification, our tool yields the verdict along with the paths induced by the test cases that detected the corresponding faults. the test generation & run module enables the multigraph and test purpose generation and also allows for running test suites over the iuts.

in this section, we look over the conformance checking and test suite generation processes. first we present some general examples to compare the conformance checking processes of everest and jtorx. next we show how our test suite generation method using the language-based conformance stands out from the classical approach. finally we describe a real-world case study of an automatic teller machine (atm) to explore some realistic scenarios, and then give a comparative analysis of the practical tools.

4.1 conformance checking process

we apply some examples to explore characteristics of both the everest and jtorx tools when checking conformance. let s be the specification depicted in figure 1a and let r and q be the iuts depicted in figures 1b and 1c, respectively, with li = {a, b} and lu = {x}.

[figure 1. iolts models: (a) specification s; (b) iut r; (c) iut q]

in the first checking run we verified whether the iut r conforms to the specification s. our tool yielded a verdict of non-conformance and generated the test suite t1 = {b, aa, ba, aaa, ab, ax, abb, axb}. all test cases were induced by different paths that reach a fault and were extracted using a transition cover strategy over the specification. jtorx yielded the same verdict for this first run, as expected, but it generated the test suite t2 = {b, ax, ab}. we can see that t2 ⊆ t1, i.e., jtorx generated only one test case per fault, in contrast to everest, which produced several test cases using transition coverage. hence everest provides a wider range of coverage, which can be more useful in a fault mitigation process.

in a second scenario, we checked the iut q against the specification s. this time no fault was detected by either tool using the classical ioco relation. however, everest could find a fault using the language-based conformance relation, where the set of desirable behaviors was specified by the regular language d = (a|b)⋆ax and no undesirable behavior was defined, so f = ∅. the set d denotes behaviors induced by paths finishing with an input action a followed by an output x produced in response. a verdict of non-conformance was obtained by our tool, revealing a fault detected by the test suite t = {ababax, abaabax}. we remark that jtorx, using the classical ioco relation, was not able to detect this fault. so everest is more general in this sense and can be applied to a wider range of scenarios when compared to jtorx.

4.2 everest test suite generation

we have seen that a conformance check is run over an iut against a given specification to yield test verdicts. if the verdict is positive, i.e., faults are detected, then an associated test suite is generated with the test cases that reveal such faults. in addition, everest can also generate complete test suites relative to a given specification.

to illustrate everest's test suite generation process, we again take s as the specification depicted in figure 1a. in the first step a directed acyclic multigraph is constructed according to the specification, as described in section 3.

[figure 2. a directed acyclic multigraph d for specification s]

figure 2 partially depicts the multigraph, with four states at each level, since the specification s has four states (n = 4). every transition in the multigraph must go either to the next level or, within the same level, from left to right in the figure, to secure the acyclic property. in this case we have considered iuts with at most four states,
i.e., the same number of states as found in the specification (m = n = 4). therefore the multigraph has m·n + 1 = 17 levels. figure 2 shows the first two levels and also the last two levels of the multigraph. note that the fail state is replicated in order not to clutter the figure.

with the multigraph at hand, we can apply a breadth-first search algorithm to extract paths from the initial node s0,0 up to the fail state. we can take, for instance, the sequence α1 = aabbx. we see that α1 induces the path s0,0 → s1,0 → s3,0 → s0,1 → s3,1 → fail in the multigraph. from proposition 3 we can then obtain a deterministic, acyclic, input-enabled, and output-deterministic test purpose t1 over α1, as depicted in figure 3a.

[figure 3. tps from the multigraph of figure 2: (a) tp t1 induced by aabbx; (b) tp t2 induced by aaax]

note that the input-enabledness property is guaranteed by adding a pass state and transitions to pass from states where no output is defined. the construction is completed by adding self-loops to the pass and fail states labeled by all output actions. regarding the output-determinism property, for every state where no input action is defined we also create a new transition from this state to the pass state labeled by any input action.

for the sake of exemplification we take α2 = aaax as a distinct sequence. in the same way we obtain the induced path over the multigraph and construct the corresponding deterministic, acyclic, input-enabled, and output-deterministic test purpose t2, as depicted in figure 3b. in total, everest automatically constructed fifteen test purposes, based on the paths induced by the set {α1, α2, x, aδ, bx, δx, aax, bbx, axδ, abδ, δbx, bδx, aabx, bbbx, aaδx} of sequences.

from the tps of figure 3, everest could generate the test suite t = {α1, α2, b, δ, bδ, bx, ab, aδ, aaδ, aaa, aab, aabδ, aaba, aaaa, aaab, aaaxδ, aaaxx, aabba, aabbb, aabbxδ, aabbxx}. we then applied the test suite t to the iut r, and a fault could be detected. by simple inspection we see that all test cases that lead r from state q0 back to the same state q0 can detect this fault. notice that the output x is produced at state q3 of r, whereas x is not defined at state s3 of s. so everest exhibits a verdict of non-conformance, which means that r does not pass the test suite, declaring that r ioco s does not hold.

4.3 a real-world case study

now we present a real-world system to be put under test using the automatic tools. we specify functionalities of an automatic teller machine (atm) (mark utting, 2007; naik and tripathy, 2018) using an iolts model with the input stimuli li = {ic, pin, acc, tra, sta, wd, amo} and the output responses lu = {cpi, bpi, mon, rec, ins, sho}. the intended meanings of the input actions are: ic denotes the action of the user inserting his/her card into the atm; pin indicates the pin code has been provided by the user; tra requires the transfer amount; acc indicates that a target account has been provided; sta requires an account statement; wd indicates that the user has requested a withdrawal; and amo denotes the amount value of the operation.
we also give the meaning of the output alphabet: cpi says the pin code is correct; bpi says the provided pin is wrong; mon indicates the money has been released; rec indicates the receipt has been provided to the user; ins denotes an insufficient balance on the account; and sho indicates the statement has been shown to the user.

we model the withdrawal operation by the iolts a of figure 4. note that if the requested amount (amo) is greater than the available amount (ins), then the withdrawal cannot be performed and the process reaches state s3, where a new withdrawal operation can be requested.

[figure 4. atm specification a]

some additional functionalities are specified by the iolts b of figure 5. in this case we consider not only the withdrawal (wd) operation but also the transfer (tra) and statement (sta) operations.

[figure 5. atm specification b]

assume the iolts z depicted in figure 6 as an iut that implements the withdrawal (wd) and transfer (tra) operations. we observe that if the requested amount (amo) in a withdrawal is greater than the available amount, the iut reaches the state s7, where the user can choose a new amount.

[figure 6. iut z]

now, as a first testing scenario, we check whether the iut z conforms to the specification a. we run jtorx and everest over these models to obtain conformance verdicts using the ioco relation. both tools returned the same verdict, where z complies with a. in a second round, we ran everest using the language-based conformance relation and, this time, a fault could be detected in the iut z. the set of desirable behaviors was given by d = {ic pin cpi wd amo ins amo}, i.e., a sequence of actions where the account balance is not enough for the requested withdrawal and the user must provide a new value. everest generated the test case ic → pin → cpi → wd → amo → ins → amo because the behavior specified in d is not an observable behavior of the specification model, but the iut z implements it.
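this atm verdict is exactly the situation that the earlier bounded sketches capture; the fragment below is a hypothetical replay of it (spec_a and iut_z stand for encodings of figures 4 and 6 as IOLTS values, which we do not reproduce here).

```python
# hypothetical replay of the section 4.3 verdict with the earlier sketches;
# action names follow the paper, the encodings of the figures are assumed.
sigma = ("ic", "pin", "cpi", "wd", "amo", "ins", "amo")
D, F = {sigma}, set()

# conf_k(iut_z, spec_a, D, F, k=len(sigma)) would return False here,
# since sigma is an observable trace of iut_z but not of spec_a, while
# ioco_k(iut_z, spec_a, k=len(sigma)) would still return True.
```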
in a second scenario we want to verify the reliability of the verdicts obtained by jtorx using the ioco relation. we know that originally underspecified models (see section 4.4) require some extra work: self-loop transitions must be added until all states are completely specified, handing over an input-enabled model. notice that the iut z is underspecified, so jtorx must change the model to guarantee the input-enabledness property on the iut. after changing the model, we check whether z ioco-conforms to the specification b using jtorx. it is easy to see that the original behavior of the iut has been modified and, in this case, a fault is then detected by the test case ic → pin → cpi → sta.

we have also applied this second scenario to everest using the ioco relation. unlike jtorx, everest detected no fault, since the faulty behavior ic → pin → cpi → sta is not specified in the iut z. we see that the fault detected by jtorx is, in fact, a false positive, due to an extra behavior added when changing the iut z to make it input-enabled. we also remark that everest can detect this same fault when checking ioco conformance over the same modified model.

note that ic is the only action defined at state s0 of the iut z. when jtorx turns z into an input-enabled model, all input actions become enabled at all states, which contradicts the real functionality. for instance, the action amo, i.e., the amount value to be withdrawn, becomes enabled at state s0. however, if a transfer operation (tra) is chosen instead of a withdrawal (wd), the amount value to be withdrawn should not be enabled at that moment. hence any change performed on the former model modifies the original behavior of the iut, leading to a conformance checking verdict that is inaccurate with respect to the real functionality of the atm.

in the last scenario we take the iolts y depicted in figure 7 as a new iut.

[figure 7. iut y]

the iut y differs from the specification a depicted in figure 4 in a single transition: where the specification defines (s4, ?amo, s5), the iut defines (s4, !mon, s5). so the iut allows a withdrawal operation with no check of the balance (amo) before releasing the money (mon). by contrast, in the specification model, the balance is checked before releasing the money, which happens only when the account balance is sufficient. the fault model was bounded at six states for the class of iuts, and everest generated eighty tps based on the corresponding specification a. the generated test suite was then submitted to the iut y, and a fault verdict could be obtained via the path ic → pin → cpi → wd using our tool. for the sake of completeness we have also applied this last scenario to jtorx, and a fault was also detected, by the test case ?ic, δ, ?pin, !bpi, ?pin, !cpi, ?wd, !mon.

4.4 a comparative analysis

here we list some main aspects comparing everest and jtorx. we have seen that both tools provide a mechanism to generate test suites, run test cases, and check ioco conformance. everest also provides the more general conformance checking based on regular languages. further, our tool allows complete test generation not only for the ioco relation but also for this more general conformance relation, with a wider range of possibilities for specifying desirable and undesirable behaviors.

jtorx's test generation employs an exhaustive strategy, leading to the state space explosion problem, which makes the process infeasible in practice. in the opposite direction, everest is more flexible and allows for complete test suite generation by setting the maximum number of states of the iuts to be taken into account in the fault model.

jtorx also implements a random approach that chooses transitions to induce paths over the specification when generating test suites. everest, however, only applies a random approach over the language-based conformance relation, when desirable and/or undesirable behaviors are not provided by the tester. in this case, the test run is reduced to the problem of checking isomorphism between the iut and the specification model.

we also note that both tools implement an online testing approach when iuts are provided together with the specification, but only everest provides an offline test generation process using the notion of multigraphs and test purposes.

regarding the conformance checking process, jtorx defines an online strategy where test cases are applied to the iut right after they are generated.
everest follows an offline process where the whole test suite is generated first and then all test cases are applied to the iut. however, everest also has an online alternative for checking conformance, where each test case obtained from the fault model is applied to the iut right after it is generated. table 1 summarizes these aspects.

table 1. methods and features
                                            jtorx   everest
conformance    ioco theory                    √        √
checking       language-based                 x        √
generation     test suite generation          √        √
test           online/offline                 √        √
strategy       test purpose                   √        √
               random approach                √        √

we also probe some properties over the specification and iut models, test verdicts, and testing strategies; see table 2.

table 2. properties and tools
                                            jtorx   everest
properties     underspecified models          √*       √
               non-input-enabledness          x        √
               quiescence                     √        √
verdicts       test run                       √        √
               conformance                    √        √
test mode      white/black-box testing        √        √
* but the internal structure of the models must be changed.

some restrictions are naturally imposed on the models when checking ioco conformance. underspecified models, for instance, are not allowed on the iut side, and their internal structure must be changed to guarantee input-enabledness. the language-based conformance relation does not require any restriction; that is, the more general method can deal with underspecified iuts and specification models. therefore everest can handle underspecified models with no change to the models when checking conformance and also when generating test suites using the language-based relation. jtorx, on the other hand, must completely explore the model's structure to add new transitions that guarantee the input-enabledness property. we see that both tools can deal with quiescent models, where self-loops with δ actions are added at the quiescent states, and both give conformance verdicts and run test cases in a similar way.

5 practical evaluation

in this section, we present the results of practical experiments that we have run to evaluate the tools' performance. first, we provide experiments to compare the conformance checking methods of everest and jtorx in subsection 5.1. given an iut and a specification, both tools can check whether the iut is in conformance with the specification under the ioco relation. secondly, subsection 5.2 assays the additional feature of everest of generating and running test suites. in this case, given a specification model, we can generate test suites for a certain class of iuts and then apply them to the iuts.

the experiments are classified into different groups according to the parameters under evaluation. each group of experiments represents a different scenario, where the specification and iut models are varied to capture different situations of conformance checking, test suite generation, or test runs, e.g., the models must have a certain number of states and transitions. all experiments were performed using randomly generated models, both for specifications and for iuts, satisfying all required properties, if any, to avoid bias in the results. in some groups, we have taken submachines of the specification models as the basis to generate iuts with a certain percentage of modification. we have organized all experiments by research questions (rqs) to get the desired analyses for the different groups of scenarios. our experiments were performed on an intel core i5 1.8 ghz cpu with 8 gb of ram, on windows 10.
5.1 conformance checking of the everest and jtorx tools

here we report on experiments comparing everest and jtorx when checking conformance between an iut and a given specification using their respective implementations of the ioco relation. a single conformance checking run is defined by a pair of models, an iut and a specification, where the result can be positive or negative. a verdict is said to be positive (ioco-conformance) when the iut complies with the specification, and negative (non-ioco-conformance) when the iut does not comply with the specification according to the ioco relation.

we evaluate several parameters related to the specifications and iuts, such as the number of states and the number of input/output actions of the models. in addition, we also consider experiments that derive verdicts of conformance and of non-conformance in separate scenarios. we remark that only input-enabled and deterministic models were generated in our experiments, to comply with the restrictions imposed by jtorx.

each group of experiments on checking conformance is defined by ten specifications with ten iut models each. every run is settled by a pair of models, one iut against the corresponding specification, so each group amounts to one hundred runs.

experiments with verdicts of ioco-conformance were run over iut models obtained as submachines of their respective specifications, while iut models with verdicts of non-ioco-conformance were randomly constructed by changing a certain percentage of the transitions of their corresponding specifications. for iut models with more states than the corresponding specifications, new states and new transitions were randomly added to the models.

to illustrate, let s be a specification model with 20 states and 120 transitions. in order to get positive verdicts we randomly take iut models as submachines of s, choosing subsets of states of s and their respective transitions. when we want to guarantee negative verdicts for iuts with 4% of modification from the respective specification, we randomly choose 5 transitions to be modified, i.e., for each of these 5 transitions we change the source state, the target state, or the action symbol. we decided to generate iuts using the specification models as the basis, instead of completely randomized iut models, because in practical situations developers make mistakes but, in general, they minimally implement the specified model; that is, real-world iuts usually do not differ much from the corresponding specification.
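the mutation step just described is easy to sketch; the fragment below is an illustrative generator in the spirit of that setup (the random choices and the IOLTS class are our assumptions, not everest's actual generator).

```python
# a sketch of deriving a non-conforming iut by mutating a fixed fraction
# of a specification's transitions, as in the experimental setup above.
import random

def mutate(spec, rate, seed=0):
    rng = random.Random(seed)
    trans = sorted(spec.T)
    picked = rng.sample(trans, max(1, round(rate * len(trans))))
    mutated = set(trans)
    for (s, l, q) in picked:
        mutated.discard((s, l, q))
        part = rng.choice(("source", "label", "target"))
        if part == "source":
            s = rng.choice(sorted(spec.states))
        elif part == "label":
            l = rng.choice(sorted(spec.L_in | spec.L_out))
        else:
            q = rng.choice(sorted(spec.states))
        mutated.add((s, l, q))
    return IOLTS(set(spec.states), spec.s0, set(spec.L_in),
                 set(spec.L_out), mutated)

# e.g., a 4% modification of a 120-transition specification rewires
# round(0.04 * 120) = 5 transitions, matching the example above.
```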
to answer this question we have run experiments where the size of input and output alphabets have been reversed on the models. first, we take a group of iolts models with 2 symbols in the input alphabet and 10 symbols in the output alphabet. in a second group of models we reserve the size of alphabets, so taking input alphabets with 10 symbols and output alphabets with only 2 symbols. we vary the number of states by 15, 25 and 35 on iuts and get the specification with a fixed number of 10 states, both for verdicts of conformance and nonconformance. from the results, we note only a small variation on the pro cessing time when running experiments with verdicts of con formance either when the input alphabet is larger than the out put alphabet or when we reverse them in size. see figure 8. our tool is only 2.56% faster when running models with 2 inputs and 10 outputs (see figure 8(a)) than when checking models whose size of their alphabets are reversed with 10 in puts and 2 outputs (see figure 8(b)). similarly, jtorx is only 3.51% faster when reversing the size of the alphabets. however, we can notice that everest is faster than jtorx in both scenarios, where the size of the input alphabet is larger than the output alphabet, and viceversa. 15 20 25 30 35 0.4 0.45 0.5 0.55 everest jtorx specifications with 10 states number of states on iuts t im e ( se co n d s) (a) 2 inputs, 10 outputs 15 20 25 30 35 0.4 0.45 0.5 0.55 everest jtorx specifications with 10 states number of states on iuts t im e ( se co n d s) (b) 10 inputs, 2 outputs figure 8. reversing i/o alphabets with ioco verdicts in contrast, regarding verdicts of nonconformance, we see an expressive impact on the verification time when running experiments with the same scenarios where the size of the input and output alphabets are reversed in size. in this case, both tools have taken less processing time for models with 2 inputs and 10 outputs. everest is 12.73% to 42.86% faster, according to the number of states on iuts, for models with 2 inputs and 10 outputs (see figure 9(a)) than when running models with 10 inputs and 2 outputs (see figure 9(b)). jtorx is around 200% to 352% faster, depending on the iut size, for the same scenarios. we can observe that the impact over the processing time is very expressive in jtorx for verdicts of nonconformance when we reverse the size of the alphabets. we remark that in practical applications we usually need more input actions than output actions to specify realworld systems. that is, input alphabets with a large number of ac an iolts modelbased testing tool bonifacio and gomes 2020 15 20 25 30 35 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 everest jtorx specifications with 10 states number of states on iuts t im e ( se co n d s) (a) 2 inputs, 10 outputs 15 20 25 30 35 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3 3.3 everest jtorx specifications with 10 states number of states on iuts t im e ( se co n d s) (b) 10 inputs, 2 outputs figure 9. reversing i/o alphabets and nonioco verdicts tions can weigh down the performance of jtorx tool. further, notice that everest always outperforms jtorx for all scenar ios as depicted in all figures. 5.1.2 varying the number of states we also performed some experiments varying the number of states (and transitions) to evaluate the tools’ scalability. in this case the rq is: “how does the number of states in spec ifications and iuts impact the processing time on checking conformance?”. 
we answer this question by running three groups of experiments: (i) specifications with 10 states and iuts ranging from 20 to 200 states; (ii) specifications with 50 states and iuts ranging from 60 to 200 states; and (iii) specifications with 100 states and iuts ranging from 110 to 200 states. in all groups the number of states of the iut models was increased by 10 at each step.

in the experiments with verdicts of conformance, specifications with 10 states, and iuts with up to 120 states, everest attains a better performance than jtorx; jtorx is just slightly better when the iut models have more than 120 states (figure 10a). in the experiments with verdicts of conformance, specifications with 50 and 100 states, and groups of iuts with up to 200 states, everest always outperformed jtorx (figures 10b and 10c).

[figure 10. varying the number of states and ioco verdicts: (a) specifications with 10 states; (b) specifications with 50 states; (c) specifications with 100 states]

now we turn to the experiments with verdicts of non-conformance. for specifications with 10 and 50 states, figures 11a and 11b show that everest always outperformed jtorx for every group of iuts. jtorx achieves a better performance only for specifications with 100 states and the largest iut models (figure 11c).

[figure 11. varying the number of states and non-ioco verdicts: (a) specifications with 10 states; (b) specifications with 50 states; (c) specifications with 100 states]

5.2 everest test suite generation

now we evaluate our tool for test suite generation by running experiments using the more recent approach (bonifacio and moura, 2019), where multigraphs are first constructed in order to generate test purposes. we vary the number of states of the specification models and also the bound considered on the number of states of the iuts. we set out to construct iuts that differ from their respective specifications by a certain percentage of modified transitions, in order to assess different scenarios.

we remark that the experiments on generating test suites were performed using solely everest, for two main reasons: (i) jtorx implements an online strategy where an iut is always required to run the test generation mechanism; and (ii) jtorx's test generation process finishes at the very first detected fault, so it cannot generate complete test suites.
in the first group of experiments we vary the number of states of the specification models together with the number of states considered for the iuts; in the second group we generate test purposes over the multigraphs obtained in the first group; and in the third group we run the test suites extracted from the test purposes of the second group over iuts generated by modifying the corresponding specification models by a certain percentage.

5.2.1 multigraph generation step

we define the rq for the multigraph generation step as follows: "what is the impact on the processing time when generating multigraphs?".

to answer this question we vary the number of states of the specification models and also the bound m associated with the maximum number of states considered for the iut models. we consider specifications with 5 to 35 states and construct the corresponding multigraphs to get fault models for iuts with 5 to 55 states. the alphabets were fixed at 5 inputs and 5 outputs, and we increase the number of states by 10 for each group of iuts. transitions were randomly generated to ensure unbiased results.

next we briefly describe the scenarios taken into account in the multigraph generation step: (i) specifications with 5 states and m from 5 to 55; (ii) specifications with 15 states and m from 15 to 55; (iii) specifications with 25 states and m from 25 to 55; and (iv) specifications with 35 states and m from 35 to 55.

figure 12 shows that the processing time for generating multigraphs grows, in general, as the number of states of the specification and iut models grows. we first notice that the median values lie in the middle of the boxes, which means that as the size of the models grows we also observe a well-behaved growth of the processing time.

we see in figure 12a that the multigraph construction for specifications with 5 states takes 0.038 seconds with m = 35 and 0.047 seconds with m = 55, so the processing time rose by 23.68%. similarly, the construction process for specifications with 15 states takes 0.186 seconds with m = 35 and 0.253 seconds with m = 55; in this case, the processing time rose by 36.02%. taking specifications with 25 states, the time consumption of the multigraph construction is 0.428 seconds with m = 35 and 0.676 seconds with m = 55, as we can see in figure 12b; the processing time rose by 57.94%. in the last group, specifications with 35 states result in a time consumption of 0.994 seconds with m = 35 and 1.867 seconds with m = 55; here the processing time rose by 87.82%.

notice that, with m = 35, the multigraph generation is about 26 times faster for specification models with 5 states than for specifications with 35 states.

[figure 12. multigraph generation: (a) specifications with 5 and 15 states; (b) specifications with 25 and 35 states]
therefore we can conclude that the performance of the multigraph generation decreases as the number of states on specifications and m increase. but the most important issue is that the processing time is not meaningfully affected, i.e., the processing time does not substantially increase as the number of states rises. 5.2.2 test purpose generation process now we turn into the tp generation step based on multi graphs that have been generated in the previous experiments. the associated rq, in this case, is: “how the tp generation is impacted w.r.t. the processing time when we take multi graphs that have been generated by varying the number of states from the corresponding specifications and also vary ing the number of states on iut models?”. here we get multigraphs associated to specifications with 5 to 35 states, and vary m from 5 to 55. we fixed the number of tps to be generated at 1000. figure 13 shows that the test generation process takes much more time compared to the multigraph generation step. 20 25 30 35 40 45 5 15 25 35 45 55 max iut states t im e ( s e c o n d s ) spec states 5 15 1000 tps generation from multigraph of specification with 5 and 15 states (a) specifications with 5 and 15 states 30 50 70 90 110 25 35 45 55 max iut states t im e ( s e c o n d s ) spec states 25 35 1000 tps generation from multigraph of specification with 25 and 35 states (b) specifications with 25 and 35 states figure 13. tp generation from figure 13a we see that the processing time is more uniform for specifications with 5 states no matter we vary m. when the number of states grows figure 13b shows that the processing time of the tp generation grows fast as m in creases. the processing time for specifications with 35 states and m = 35 takes 66.82 seconds whereas using m = 55 it takes 91.67 seconds. so we see that the rate rose by 37.19%. considering specifications with 15 states and, respectively, an iolts modelbased testing tool bonifacio and gomes 2020 m = 35 and m = 55, the rate rose by 33.75%, whereas for specifications with 25 states and, m = 35 and m = 55, respectively, the rate rose by 30.48%. 5.2.3 running test suites in the last group of experiments we evaluate the processing time on running test suites. here the rq is given as follows: “what is the impact on the processing time when running test suites over iuts with 1%, 2% and 4% of modification w.r.t. the specifications which were used to generate the cor responding multigraphs?”. to answer this question we have taken test suites from tps that were generated for specifications with 15 and 25 states. we fixed m = n, that is, the number of states to be consid ered on iut models is the same to the number of states in the specification models. figure 14 shows the processing time according to the mod ification rate over the iuts. the time consumption of the test 1% 2% 4% 15 20 25 15 20 25 15 20 25 80 90 100 110 120 max iut states t im e ( se co n d s) max iut states 15 25 1000 tps generation from multigraph figure 14. test run run over iuts with 1% of modification takes 83.03 seconds with m = 15 and 89.78 seconds with m = 25. regarding iuts with 2% of modification, the process takes 86.19 sec onds with m = 15 and 80.83 seconds with m = 25. finally, for iuts with 4% of modification, the test run takes 87.54 seconds with m = 15 and 77.83 seconds with m = 25. we see that the processing time of test runs over iuts with m = 15 and 1% of modification is 5.15% faster than the test run over iuts with 4% of modification. 
if we consider m = 25, then the test run over iuts with 4% of modification is 13.31% faster than over iuts with 1% of modification.

5.3 threats to validity

we list some aspects that may arise as threats to the validity of the experiments. first, we have to report a substantial difficulty in obtaining the jtorx tool. several libraries were missing and we did not have full access to the source code. we had access only to a binary, to which we could make some small amendments in order to adapt it and run it from the command line. had we compiled and configured both tools ourselves, they could have been set up under the same conditions, and the time consumption could have been more easily and precisely obtained when running the experiments.

the computational resource where the experiments were run may also be a threat. we ran all experiments on a general-purpose machine, whose results might be biased in some way. we remark, however, that both tools ran all experiments under the same conditions.

another threat is related to the random generation of the models. although we have randomly generated all models in order to avoid biases in the process, we had to guarantee some properties on specific classes of experiments. for instance, in some groups of experiments we had to construct iuts that were in conformance with their corresponding specification, while in other groups we had to guarantee a certain rate of modification over the iuts to get verdicts of non-conformance. so the results might have somehow been biased by all these extra checking tasks.

we also list as a threat those properties that must be guaranteed over the models following restrictions imposed by jtorx. the size of the alphabets and the number of states and transitions of the specification and iut models are modified from the original models to secure such properties. so we cannot make any claim about the similarity between these modified models and the original ones w.r.t. their behaviors.

6 conclusion

conformance checking and test suite generation are important activities to improve the reliability of developing reactive systems. in this work we have presented an automatic testing tool for checking conformance and generating test suites for iolts models.

we have implemented the classical ioco relation and the more general approach based on regular languages. the latter, and consequently the everest tool, imposes few, if any, restrictions over the models and allows a wider range of fault models described by regular languages when checking conformance. several works have dealt with ioco theory and its variations; however, we are not aware of any other tool that implements a different notion of conformance, such as the language-based conformance. further, our tool has implemented a complete black-box test suite generation using the notion of test purposes for certain classes of fault models.

we described some case studies to probe both tools and their functionalities in practice. we could then observe from a comparative analysis that everest provides a wider range of testing scenarios, since it was able to detect faults, using the language-based approach, that were not detected by jtorx using the ioco theory. the effectiveness of our test suite generation method is also evaluated in black-box scenarios. we also offered practical experiments of conformance checking to compare the performance of everest against jtorx.
we can see that everest outperforms jtorx in most scenarios, except for those where the structure of the iut models is quite different from the corresponding specifications. hence we remark that although everest implements a more general conformance relation, the time consumption has not been impacted on checking runs. we also observed from the results that everest has a more stable behavior w.r.t. the processing time, even for iut models with quite a different number of states. we also performed experiments of test suite generation and test run using the everest tool. our tool was able to handle specifications and implementation candidates with a reasonable number of states, as seen in the experiments.

the main contribution of this work is our practical tool that can check conformance based on different relations and can generate test suites in a black-box setting. moreover, we have presented some case studies, a comparative analysis, and also practical experiments to evaluate and compare our tool. an extension of the current version of everest is underway with a new module to allow conformance checking, test suite generation and test run in a batch mode, i.e., it will be able to automatically test several iut models at once. as future directions, we intend to improve our strategies and algorithms to generate test suites and run test cases more efficiently.

references

belinfante, a. (2010a). jtorx: a tool for on-line model-driven test derivation and execution. in esparza, j. and majumdar, r., editors, tools and algorithms for the construction and analysis of systems, 16th international conference, tacas 2010, lecture notes in computer science, pages 266–270. springer.

belinfante, a. (2010b). jtorx: a tool for on-line model-driven test derivation and execution. in esparza, j. and majumdar, r., editors, tools and algorithms for the construction and analysis of systems, pages 266–270, berlin, heidelberg. springer berlin heidelberg.

belinfante, a. (2014). jtorx: exploring model-based testing. centre for telematics and information technology (ctit), netherlands. ipa dissertation series no. 2014-09.

bonifacio, a. l. and moura, a. v. (2019). complete test suites for input/output systems. corr, abs/1902.10278. accessed on: 2019-06.

calamé, j. (2005). specification-based test generation with tgv. software engineering notes.

clarke, d., jéron, t., rusu, v., and zinovieva, e. (2002). stg: a symbolic test generation tool. in katoen, j.-p. and stevens, p., editors, tools and algorithms for the construction and analysis of systems, pages 470–475, berlin, heidelberg. springer berlin heidelberg.

de vries, r. (2001). towards formal test purposes. in tretmans, g. and brinksma, h., editors, formal approaches to testing of software 2001 (fates'01), volume ns-01-4 of brics notes series, pages 61–76, aarhus, denmark.

gomes, c. s. and bonifacio, a. l. (2019). automatically checking conformance on asynchronous reactive systems. in the fourteenth international conference on software engineering advances, pages 17–23.

jard, c. and jéron, t. (2005). tgv: theory, principles and algorithms. international journal on software tools for technology transfer, 7(4):297–315. accessed on: 2019-08.

larsen, k. g., mikucionis, m., nielsen, b., and skou, a. (2005). testing real-time embedded software using uppaal-tron: an industrial case study. in proceedings of the 5th acm international conference on embedded software, emsoft '05, pages 299–306. acm.

utting, m. and legeard, b. (2007).
practical model-based testing: a tools approach. elsevier, 1st edition.

marsso, l., mateescu, r., and serwe, w. (2018). testor: a modular tool for on-the-fly conformance test case generation. in beyer, d. and huisman, m., editors, tools and algorithms for the construction and analysis of systems, pages 211–228, cham. springer international publishing.

mostowski, w., poll, e., schmaltz, j., tretmans, j., and wichers schreur, r. (2009). model-based testing of electronic passports. in alpuente, m., cook, b., and joubert, c., editors, formal methods for industrial critical systems, pages 207–209, berlin, heidelberg. springer berlin heidelberg.

naik, k. and tripathy, p. (2018). software testing and quality assurance: theory and practice. wiley publishing, 2nd edition.

roehm, h., oehlerking, j., woehrle, m., and althoff, m. (2016). reachset conformance testing of hybrid automata. in proceedings of the 19th international conference on hybrid systems: computation and control, hscc '16, pages 277–286, new york, ny, usa. acm.

simão, a. d. s. and petrenko, a. (2014). generating complete and finite test suite for ioco: is it possible? in proceedings ninth workshop on model-based testing, mbt 2014, grenoble, france, 6 april 2014, pages 56–70. accessed on: 2019-07.

sipser, m. (2006). introduction to the theory of computation. course technology, second edition.

tretmans, j. (1993). a formal approach to conformance testing. in rafiq, o., editor, protocol test systems, vi, proceedings of the ifip tc6/wg6.1 sixth international workshop on protocol test systems, pau, france, 28-30 september, 1993, volume c-19 of ifip transactions, pages 257–276. north-holland.

tretmans, j. (2008). model based testing with labelled transition systems. in hierons, r. m., bowen, j. p., and harman, m., editors, formal methods and testing, an outcome of the fortest network, revised selected papers, volume 4949 of lecture notes in computer science, pages 1–38. springer.

journal of software engineering research and development, 2021, 9:14, doi: 10.5753/jserd.2021.1911 this work is licensed under a creative commons attribution 4.0 international license.

attributes that may raise the occurrence of merge conflicts

josé william menezes [ universidade federal do acre | jose.william@sou.ufac.br ]
bruno trindade [ universidade federal do acre | bruno.trindade@sou.ufac.br ]
joão felipe pimentel [ universidade federal fluminense | jpimentel@ic.uff.br ]
alexandre plastino [ universidade federal fluminense | plastino@ic.uff.br ]
leonardo murta [ universidade federal fluminense | leomurta@ic.uff.br ]
catarina costa [ universidade federal do acre | catarina.costa@ufac.br ]

abstract

collaborative software development typically involves the use of branches. the changes made in different branches are usually merged, and direct and indirect conflicts may arise. some studies are concerned with investigating ways to deal with merge conflicts and measuring the effort that this activity may require.
however, the investigation of factors that may reduce the occurrence of conflicts needs more and deeper attention. this paper aims at identifying and analyzing attributes of past merges with and without conflicts to understand what may induce direct conflicts. we analyzed 182,273 merge scenarios from 80 projects written in eight different programming languages to find characteristics that increase the chances of a merge having a conflict. we found that attributes such as the number of changed files, the number of commits, the number of changed lines, and the number of committers have the strongest influence on the occurrence of merge conflicts. moreover, attributes of the branch that is being integrated seem to be more influential than the same attributes of the receiving branch. additionally, we discovered positive correlations between the occurrence of conflicts and both the duration of the branch and the intersection of developers in both branches. finally, we observed that php, javascript, and java are more prone to conflicts.

keywords: version control, merge conflicts, conflict prediction

1 introduction

software development normally involves collaboration among members of the project team. this collaborative development is supported by a version control system (vcs). often, when there is a need to develop new features or fix bugs, developers choose to create a branch, which is a separate development line. this separate development line helps teams to focus on their tasks, without prematurely worrying about how it affects other parts of the software (bird et al., 2011). however, the use of branches can cause problems, as changes made in different branches are usually merged, and direct and indirect conflicts may arise (brindescu et al., 2020b; costa et al., 2016; sarma et al., 2011; brun et al., 2011). according to bird et al. (2011), the effort involved in the merge process depends on how much work went on in the branches.

some studies investigate ways to deal with merge conflicts by proactively detecting changes that can lead to conflicts (brun et al., 2011; sarma et al., 2011), identifying merge characteristics (accioly et al., 2018; ghiotto et al., 2018; vale et al., 2020), investigating the characteristics of difficult merge conflicts (brindescu et al., 2020b), and examining the decisions usually made to resolve conflicts (accioly et al., 2018; ghiotto et al., 2018). however, only recently have some studies started to investigate factors that may induce the occurrence of conflicts. dias et al. (2020) verify how seven factors related to modularity, size, and timing of developers' contributions affect conflict occurrence. leßenich et al. (2018) analyze the predictive power of seven indicators, such as the number, size, and scattering degree of commits in each branch, to forecast the number of merge conflicts. in the same direction, owhadi-kareshk et al. (2019) investigate the predictive power of nine lightweight git feature sets, such as the number of changed files in both branches, the number of commits and developers, and the duration of the development of the branch. finally, vale et al. (2020) investigate the role of communication activity and the number of modified lines, chunks, files, developers, commits, and days that a merge scenario lasts in the increase or reduction of merge conflicts. similar to dias et al. (2020), leßenich et al. (2018), owhadi-kareshk et al. (2019), and vale et al.
(2020), we assume that by analyzing attributes of past merges, it is possible to identify characteristics that may increase the chances of having a merge conflict. however, in addition to the attributes investigated by those authors (e.g., isolation, number of changed files, changed lines, commits, commit density, and developers), we analyzed some other attributes, such as the programming language, the frequency of one or more developers committing in both branches, and the existence of self-conflicts, i.e., conflicts among changes committed by the same developer (zimmermann, 2007). as mentioned by brindescu et al. (2020a), the changes in conflict are generally authored by two different developers, but merge conflicts can also happen between the edits of the same developer in two different branches. besides, in terms of the number of analyzed merges, our corpus is representative (it is only smaller than the corpus of owhadi-kareshk et al. (2019)), and our analysis of the number of commits, commit density, committers, and changed lines and files is performed per branch, not using averages. finally, it is important to mention that some metrics with similar names in the related work are calculated differently. thus, our work aims at providing a more in-depth analysis of how a set of merge attributes can influence the occurrence of conflicts. to do so, we mined association rules from 182,273 merge scenarios extracted from 80 software projects hosted on github, written in eight different programming languages. the following eight research questions guided the analysis:

• rq1. how is the isolation of a branch related to the occurrence of merge conflicts? our intuition is that the longer the isolation time of the branches, the greater the likelihood of having conflicts.
• rq2. how is the number of commits related to the occurrence of merge conflicts? our intuition is that the greater the number of contributions in terms of commits in the branches, the greater the likelihood of having conflicts.
• rq3. how is the number of developers that performed commits related to the occurrence of merge conflicts? our intuition is that the greater the number of contributors in the branches, the greater the likelihood of having conflicts.
• rq4. how is the number of changed files related to the occurrence of merge conflicts? our intuition is that the greater the number of contributions in terms of changed files in the branches, the greater the likelihood of having conflicts.
• rq5. how is the number of changed lines related to the occurrence of merge conflicts? our intuition is that the greater the number of contributions in terms of changed lines in the branches, the greater the likelihood of having conflicts.
• rq6. how is the programming language related to the occurrence of merge conflicts? we had no intuition about the programming language, but we would like to know if any language is more prone to conflicts.
• rq7.
how is the intersection of developers in both branches related to the occurrence of merge conflicts? our intuition is that the greater the number of contributors in both branches, the lesser the chances of having conflicts, because these developers are aware of the parallel changes.
• rq8. how prevalent is the occurrence of merge self-conflicts? we had no intuition about the proportion of self-conflicts, but we would like to know if it is common in projects.

the answers to these questions can provide insights on how software project teams' work may affect the occurrence or avoidance of merge conflicts. we found that the investigated attributes have a positive correlation with merges with conflicts. notably, in the integrated branch, the number of changed files, the number of commits, the number of changed lines, and the number of committers have the strongest influence on the occurrence of conflicts among all attributes we analyzed. surprisingly, having some developers committing in both branches also increases the chance of conflicts, but having no common developer or having exactly the same developers committing in both branches decreases the chance of conflicts. we also verified that three programming languages (php, javascript, and java) are more prone to conflicts.

this paper is an extended version of a conference paper (menezes et al., 2020) in which we answered six research questions, focused on the impact of the attributes time, commits, committers, changed files, intersection, and self-conflicts on the occurrence of merge conflicts. this work complements our previous work by adding two new research questions and three new attributes: the number of changed lines, the commit density, and the programming language. additionally, we detail the investigation of developer intersection and replace the old "some intersection" category with percentages of intersection in the association rules. we also deepen the analysis of self-conflicts, obtaining the number of self-conflicts per chunk instead of per file; after all, a file can have several pieces of conflicting code that are from the same or different developers, so the analysis became more precise. in addition, we mine rules to verify the relation of the attributes to the occurrence of self-conflicts.

besides this introduction, this paper is organized into 7 sections. in section 2, we present the research steps followed. in section 3, we present the results of our statistical analysis, the association rules, and the discussion about the self-conflicts. in section 4, we present the answers to our research questions. in section 5, we discuss threats to validity. in section 6, we discuss the related work. finally, in section 7 we present the conclusion.

2 materials and methods

to answer the research questions presented in the introduction, we performed an exploratory study. the following steps, detailed in the next subsections, compose our exploratory study: merge attributes definition, projects and merges selection, merges and attributes extraction, and data mining.

2.1 merge attributes definition

the attributes were mainly derived from our research questions and are defined in table 1. we divided the attributes into project attributes, merge attributes, and branch attributes. the project attributes are the predominant programming language, number of merges, number of analyzed merges (non-fast-forward), merges with conflicts, merges without conflicts, and self-conflicts.
the merge attributes are the information collected from the merge scenario, using the information present in both branches: the merge conflict occurrence (yes or no), timing metrics, and information about changes and developers in both branches. the branch attributes are collected and presented for each branch (b1 and b2); when referring to the identification of branches in merges, i.e., the distinction between branch 1 (b1) and branch 2 (b2), we borrow the reasoning of chacon and hamano (2009): "the first parent is the branch you were on when you merged, and the second is the commit on the branch that you merged in". we do not adopt any aggregation of values.

table 1. attributes

project attributes:
a) programming language: predominant programming language.
b) total of merges: number of merges in total.
c) analyzed merges: number of three-way merges, not considering fast-forward merges.
d) merges with conflicts: number of analyzed merges with conflicts.
e) merges without conflicts: number of analyzed merges without conflicts.
f) merges with self-conflict: number of analyzed merges with the same developer authoring both sides of at least one conflicting chunk.

merge attributes:
g) merge conflict occurrence: binary attribute (yes or no) indicating if the merge has conflicts.
h) branching-duration: the effective duration of development in the branches, from the first branch commit (min(b1,b2)) to the last branch commit (max(b1,b2)), in days.
i) total-duration: the total duration of isolation, from the common ancestor (base commit) to the merge commit, in days.
j) committers in both branches: percentage of developers in both branches.
k) conflicting chunk: number of conflicting chunks.
l) conflicting chunk by the same developers: number of conflicting chunks authored by the same developer on both sides.

branch attributes (b1 and b2):
m) commit density: number of commits in the branching-duration.
n) loc-churn: number of lines changed (added + deleted) in each branch.
o) changed files: number of changed files in each branch.
p) commits: number of commits in each branch.
q) committers (commit authors): number of developers that authored commits in each branch.

the branch attributes are the number of commits, number of committers (commit authors), number of changed files, and loc-churn (number of lines changed: added + deleted). the attributes are collected for merges with conflicts and merges without conflicts. they allow us to compare different characteristics between merges with and without conflicts. some of these attributes are also mentioned in related work, such as the timing metrics, merge conflict occurrence, number of merge conflicts, number of commits, commit density, committers, and changed lines and files (dias et al. (2020); leßenich et al. (2018); vale et al. (2020)).

2.2 projects and merges selection

first, we decided to select projects developed in different and popular programming languages. thus, we identified the top-8 programming languages present in the following surveys: the github top active languages survey 2019 (https://githut.info/), the stack overflow developer survey results 2019 (https://insights.stackoverflow.com/survey/2019#technology), and the tiobe index 2019 (https://www.tiobe.com/tiobe-index/). the top-8 programming languages present in the three surveys were: javascript, python, java, php, c#, c++, c, and ruby. we selected the projects using the github api.
we used the following criteria: (1) popular projects (projects with more than 1,000 stars), (2) software projects, (3) number of merges greater than 100, (4) projects with a wiki or some documentation, and (5) a balanced amount of merges per project. after applying the first criterion, we initially selected 461 projects. after applying criteria 2 to 4, 279 projects remained. to obtain a balanced corpus in terms of the number of analyzed merges per project, we selected ten projects per programming language, where the number of analyzed merges was less than 5% of the total number of analyzed merges in the dataset (table 2). for example, the graal project is the project with the highest number of analyzed merges (7,064); however, it represents just 3.9% of our final dataset (182,273).

table 2. general information of our dataset.

programming language   total merges   analyzed merges   conflicting merges
c                      31,013         21,948            981
c#                     31,468         21,148            2,003
c++                    32,463         24,155            2,290
java                   32,989         24,109            2,519
javascript             31,542         21,803            2,613
php                    31,208         22,371            3,376
python                 32,591         22,585            1,923
ruby                   37,001         24,154            2,114
total                  260,275        182,273           17,819

it is important to mention that although the total number of merges was initially 260,275 (table 2), we removed 78,002 merges from the analysis: 74,293 fast-forward merges (i.e., merges with no changes in a branch, in which git would be able to just move the pointer forward, but due to the option --no-ff a merge commit was created (chacon and hamano, 2009)), 37 merges with negative total-duration (i.e., merges in which the date of the common ancestor is more recent than the date of the merge, probably due to some clock misconfiguration in the developer's computer), and 3,672 merges with only merge commits (merges in which all commits from both branches are merge commits).
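as an illustration, the fast-forward and negative-duration cases above can be recognized with standard git commands; the python helper below is a simplified sketch of these checks, not the extraction tool itself (which is described in the next section).

import subprocess

def git(repo, *args):
    """run a git command in `repo` and return its stdout as text."""
    out = subprocess.run(["git", "-C", repo, *args],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def classify_merge(repo, merge_sha):
    """mirror the filtering of section 2.2 for one merge commit:
    keep only three-way merges whose total-duration is non-negative."""
    # `git rev-list --parents -n 1 <sha>` prints: <sha> <parent1> <parent2> ...
    parents = git(repo, "rev-list", "--parents", "-n", "1", merge_sha).split()[1:]
    if len(parents) < 2:
        return "not a merge commit"
    base = git(repo, "merge-base", parents[0], parents[1])  # common ancestor
    if base in parents:
        # one branch has no commits of its own: a --no-ff fast-forward case
        return "fast-forward merge (discarded)"
    timestamp = lambda sha: int(git(repo, "show", "-s", "--format=%ct", sha))
    if timestamp(merge_sha) < timestamp(base):
        return "negative total-duration (discarded)"
    return "analyzed"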
2.3 merges and attributes extraction

we have implemented a tool to extract the attributes. the tool and the dataset are publicly available on github (https://github.com/catarinacosta/mactool/). the tool was developed in java to parse the log provided by git, retrieving all merge commits. then, it identifies the parent commits that were merged and navigates until the common ancestor, just before the forking of the history. our tool also checks whether the merge resulted in conflicts. figure 1 shows a merge example composed of a merge commit (c57), the two parent commits that were merged (c55 and c56), and the common ancestor (c50).

figure 1. merge example.

from these commits, it is possible to identify the commits within each branch. these commits are located between the common ancestor and each of the merged parent commits (including the parent commits). the "feature" branch in the example has three commits (c51, c54, and c56), and the "master" branch also has three commits (c52, c53, and c55). by identifying all commits from the branches of each merge, our tool was able to collect all the attributes listed in section 2.1. in our example, to calculate the branching-duration, we check the date of the first branch commit (min(b1,b2)), "08 aug 2020", and the date of the last branch commit (max(b1,b2)), "22 aug 2020"; so, the branching-duration was 14 days. in the verification of the committers in both branches attribute, the tool would identify that ana made changes to both branches. in the verification of the conflicting chunk by the same developers attribute, ana could also have been the author of a self-conflict. in the committers attribute verification, the "feature" branch has two committers (lisa and ana), and the "master" branch also has two committers (ana and tom). three files were changed in the "feature" branch (a, b, and c), and two files (a and b) were changed in the "master" branch.

with the attributes of the 182,273 merge cases, we could conduct statistical analysis to understand the difference between the distributions of merges with and without conflicts. additionally, we plotted graphs representing the probability of having a conflict in a merge (y axis) given that an attribute is higher than a value (x axis). we calculated this probability according to the bayes theorem: p(conflict | attribute > value) = p(conflict ∩ attribute > value) / p(attribute > value).

2.4 data mining

in this step, we adopted a data mining technique called association rule extraction. in summary, an association rule r is a pair (x, y) of two disjoint entity sets, x and y. in the notation x → y, x is called the antecedent and y is called the consequent (han et al. (2012)). the rules aim at finding associations or correlations but, as said by zimmermann et al. (2004), rules do not tell an absolute truth. they have a probabilistic interpretation based on the amount of evidence, determined by two metrics (agrawal et al., 1994): (a) support, the joint probability of having both antecedent and consequent, and (b) confidence, the conditional probability of having the consequent when the antecedent is present. another measure of interest used is the (c) lift, which indicates how much the occurrence of y increases given the occurrence of x. han et al. (2012) explain that lift(x → y) = confidence(x → y) / support(y), where lift = 1 indicates that the antecedent (x) does not interfere with the occurrence of the consequent (y), lift > 1 indicates that the occurrence of x increases the chances of the occurrence of y, and lift < 1 indicates that the occurrence of x decreases the chances of the occurrence of y.

we adopted the knowledge discovery in databases (kdd) process (fayyad et al. (1996)) to extract the association rules from our dataset: (a) data selection, (b) preprocessing, (c) transformation and data enrichment, (d) association rule extraction, and (e) results interpretation and evaluation. after we selected and collected the projects and the attributes using our tool (step a), we removed instances (merge cases) with inconsistent values (step b), for example, merge cases with negative total-duration. these two initial steps were described in sections 2.1 to 2.3. the discretization (step c) was performed through the supervised algorithm proposed by fayyad and irani (1992), available in the weka tool (https://www.cs.waikato.ac.nz/ml/weka/). this algorithm transforms numerical attributes into categorical ones, aiming at reducing the entropy of the original class distribution by finding ranges that maximize their class-related purity. in this study, the class attribute indicates the merge conflict occurrence. for the association rule extraction (step d), we used r (https://cran.r-project.org/bin/windows/rtools/) with the apriori algorithm (agrawal et al. (1994)) and the rattle tool (https://rattle.togaware.com/). in this study, our focus was on finding rules with the occurrence of conflict in the consequent (conflict=yes). however, as the presence of conflicts is only approximately 10% of our dataset, we lowered the support and confidence measures of interest considerably, to 0.01%. finally, we looked at all the extracted association rules that would help us answer the research questions (step e). in this step, we performed the analysis of the results.
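to make these measures of interest concrete, the sketch below computes support, confidence, and lift directly from a boolean table of merges; the table and its column names are illustrative stand-ins, not the actual schema produced by our tool or by apriori.

import pandas as pd

# toy table: one row per merge scenario (columns are illustrative)
merges = pd.DataFrame({
    "conflict":       [True, False, False, True, False, False],
    "files_b2_gt_30": [True, False, True,  True, False, False],
})

def rule_metrics(df, antecedent, consequent):
    """support, confidence and lift of {antecedent} -> {consequent},
    following agrawal et al. (1994) and han et al. (2012):
    lift = confidence(x -> y) / support(y)."""
    support = (df[antecedent] & df[consequent]).mean()
    confidence = support / df[antecedent].mean()
    lift = confidence / df[consequent].mean()
    return support, confidence, lift

print(rule_metrics(merges, "files_b2_gt_30", "conflict"))

# the probabilities of section 2.3 follow the same counting idea:
# p(conflict | attribute > x) = p(conflict and attribute > x) / p(attribute > x)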
3 results

this section answers the research questions posed in section 1 according to the research process described in section 2. section 3.1 presents a statistical analysis of the merge attributes. section 3.2 analyzes the extracted association rules. section 3.3 presents the number of self-conflicts.

3.1 statistical analysis

in this section, we analyze the distribution of each merge attribute for merges with and without conflicts to understand which attributes act more as indicators of conflict. hence, we divided the dataset of 182,273 merges into two subsets: one with 164,454 merges without conflicts and the other with 17,819 merges with conflicts. table 3 presents the comparison of the distributions.

table 3. comparison of merge distributions with (wc) and without conflicts (wo); the anderson-darling and mann-whitney columns give p-values (the boxplots of the original table are omitted).

attribute            a-d (wo)   a-d (wc)   mann-whitney   avg. wo    avg. wc     cliff's delta (magnitude)
branching-duration   < 10^-15   < 10^-15   < 10^-15       6.26       18.53       -0.3 (small)
total-duration       < 10^-15   < 10^-15   < 10^-15       7.27       19.71       -0.27 (small)
commits b1           < 10^-15   < 10^-15   < 10^-15       73.51      252.91      -0.24 (small)
commits b2           < 10^-15   < 10^-15   < 10^-15       9.76       81.07       -0.53 (large)
committers b1        < 10^-15   < 10^-15   0.32           8.52       21.31       -
committers b2        < 10^-15   < 10^-15   < 10^-15       2.01       9.4         -0.48 (large)
changed files b1     < 10^-15   < 10^-15   < 10^-15       100.8      425.44      -0.29 (small)
changed files b2     < 10^-15   < 10^-15   < 10^-15       21.43      166.3       -0.57 (large)
loc-churn b1         < 10^-15   < 10^-15   < 10^-15       6666.95    32934.05    -0.33 (small)
loc-churn b2         < 10^-15   < 10^-15   < 10^-15       1623.7     13734.43    -0.51 (large)
density b1           < 10^-15   < 10^-15   < 10^-15       545.98     1074.17     0.07 (negligible)
density b2           < 10^-15   < 10^-15   < 10^-15       35.21      51.53       -0.11 (negligible)

for comparing the distributions of the attributes, we first analyzed their normality using the anderson-darling test (anderson and darling, 1954). we chose this test due to the size of the distributions. we observed non-normality in all distributions at 95% confidence. then, we applied the mann-whitney test (mann and whitney, 1947) for each pair of subsets, and we found statistically significant differences for all the distributions, except for the number of committers in b1 (p-value = 0.323). after calculating the mean of the statistically different distributions, we observed that merges with conflicts have higher values than merges without conflicts for all the attributes. given these results and the non-normality of the distributions, we used cliff's delta (macbeth et al., 2011) to calculate the effect size of these differences. we found four attributes with a large effect size (the ones related to b2) and five with a small effect size (the ones related to time and most of the ones associated with b1).

for clarification, let us analyze the distributions of changed files in b2 from table 3 as an example. we started the analysis by applying the anderson-darling test to the distribution of changed files in b2 for merges without conflicts and obtained a p-value < 10^-15, rejecting the null hypothesis at 95% confidence (i.e., we found that these data are not from a population with a normal distribution). then, we applied the same test to the distribution related to merges with conflicts, and we also observed a p-value < 10^-15. since both distributions are neither normal nor paired, we compared them with a non-parametric test for unpaired data: mann-whitney. once again, we observed a p-value < 10^-15, indicating that the distributions are statistically different from each other. note in table 3 that both the average and the boxplot of the number of changed files in b2 for merges with conflicts (wc) are higher than the ones for merges without conflicts (wo). finally, we used cliff's delta to calculate the effect size of the difference between these distributions, and we obtained a magnitude of -0.57, which is classified as large (romano et al., 2006).
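this pipeline (normality check, unpaired comparison, effect size) can be sketched compactly in python: scipy provides the first two tests, and cliff's delta is small enough to implement directly. the samples below are synthetic stand-ins for our distributions, not the dataset itself.

import numpy as np
from scipy.stats import anderson, mannwhitneyu

def cliffs_delta(a, b):
    """cliff's delta: p(a > b) - p(a < b) over all pairs of observations."""
    a, b = np.asarray(a), np.asarray(b)
    greater = (a[:, None] > b[None, :]).sum()
    less = (a[:, None] < b[None, :]).sum()
    return (greater - less) / (len(a) * len(b))

# synthetic stand-ins for "changed files in b2" (means taken from table 3)
wo = np.random.default_rng(0).poisson(21, 500)   # merges without conflicts
wc = np.random.default_rng(1).poisson(166, 500)  # merges with conflicts

print(anderson(wo).statistic, anderson(wc).statistic)  # normality (anderson-darling)
print(mannwhitneyu(wo, wc).pvalue)                     # unpaired non-parametric test
print(cliffs_delta(wo, wc))                            # effect size (negative: wc higher)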
after analyzing the distributions and observing a significant statistical difference in most of them, we applied the bayes theorem to calculate the probability p(conflict | attribute ≥ x), and we varied x within the range of the boxplots presented in table 3 (i.e., between max(q1 - 1.5 × iqr, minimum) and min(q3 + 1.5 × iqr, maximum)). figure 2 presents the distribution of probabilities for each numeric attribute. as expected, all probabilities start at around 10%, which represents the percentage of merge conflicts, but they grow at different rates. figure 2 highlights the probabilities at the medians of the distributions with and without merge conflicts and the probability at the last value of the interval.

figure 2. probability of conflicts given that the attribute is greater than the value on the x axis, for (a) branching-duration, (b) total-duration, (c) commits b1, (d) commits b2, (e) committers b1, (f) committers b2, (g) changed files b1, (h) changed files b2, (i) loc-churn b1, (j) loc-churn b2, (k) density b1, and (l) density b2. green stars represent the probability at the median of the distributions without conflicts, red triangles the probability at the median of the distributions with conflicts, and blue squares the probability at the maximum value.

note in figure 2(e) that p(conflict | committers b1 ≥ x) is 9% for x = 2 (the median for both merges with and without conflicts), indicating that the probability had a small decrease in comparison to the starting point (x = 1). continuing our example using the number of changed files in b2, note that p(conflict | changed files b2 ≥ x) in figure 2(h) starts at 9.8% when x = 0. then, when x reaches the median number of changed files in b2 for merges without conflicts (x = 2), the probability is 13.5%. when x reaches the median number of changed files in b2 for merges with conflicts (x = 19), the probability is 29.5%. finally, at the end of the interval (x = q3 + 1.5 × iqr = 205), the probability is 47.9%. as expected, in figure 2, attributes with a large effect size (changed files in b2, commits in b2, changed lines in b2, and committers in b2) grow faster than attributes with a smaller effect size (changed lines in b1, branching-duration, changed files in b1, total-duration, and commits in b1).

3.2 association rules

we used data mining to enrich our analyses with association rules. table 4 presents the extracted association rules, in which the antecedent is the range value of each attribute, obtained by the discretization process, and the consequent is the presence of conflicts. it also presents the three measures of interest used: support (sup.), confidence (conf.), and lift.
table 4. measures of interest for the rules {attribute = range value} → {conflict=yes}

attribute                   range value    sup. (%)   conf. (%)   lift
branching-duration (days)   < 1            2.85       5.84        0.60
                            1 – 7          3.84       10.87       1.11
                            8 – 15         1.13       16.31       1.67
                            16 – 30        0.77       17.93       1.84
                            > 30           1.17       25.45       2.61
total-duration (days)       < 1            2.22       6.01        0.62
                            1 – 7          4.19       9.43        0.97
                            8 – 15         1.27       15.06       1.54
                            16 – 30        0.83       16.85       1.73
                            > 30           1.24       24.20       2.48
commits in b1               1              1.27       5.99        0.61
                            2 – 5          1.98       7.60        0.78
                            6 – 20         2.38       17.82       1.82
                            > 20           4.37       15.01       1.54
commits in b2               1              1.60       3.39        0.35
                            2 – 5          2.33       7.76        0.79
                            6 – 20         2.38       17.82       1.82
                            > 20           3.46       36.92       3.78
committers in b1            1 – 3          6.14       9.71        1.00
                            4 – 10         1.51       6.88        0.70
                            11 – 30        1.08       10.91       1.12
                            > 30           1.04       21.00       2.15
committers in b2            1 – 3          5.84       6.57        0.67
                            4 – 10         2.19       29.51       3.02
                            11 – 30        1.15       43.00       4.40
                            > 30           0.59       59.12       6.05
changed files in b1         1 file         0.42       3.18        0.33
                            2 – 5          1.55       6.85        0.70
                            6 – 30         2.94       9.34        0.96
                            > 30           4.85       14.90       1.53
changed files in b2         1 file         0.60       1.89        0.19
                            2 – 5          1.99       5.99        0.61
                            6 – 30         3.07       13.55       1.39
                            > 30           4.10       33.47       3.43
loc-churn in b1             0 – 10         0.41       3.03        0.31
                            11 – 100       1.44       9.25        0.64
                            101 – 1000     2.94       9.32        0.95
                            1001 – 10000   2.83       12.81       1.31
                            > 10000        2.15       22.27       2.28
loc-churn in b2             0 – 10         0.82       3.04        0.31
                            11 – 100       1.95       5.52        0.57
                            101 – 1000     3.04       12.13       1.24
                            1001 – 10000   2.59       26.81       2.74
                            > 10000        1.36       46.25       4.73
density b1                  0 – 5          4.33       10.82       1.11
                            > 5 – 20       2.07       7.44        0.76
                            > 20 – 40      0.84       8.72        0.89
                            > 40           2.83       11.30       1.16
density b2                  0 – 5          4.95       8.39        0.86
                            > 5 – 20       2.69       13.73       1.40
                            > 20 – 40      0.82       11.44       1.16
                            > 40           1.32       9.26        0.95

in general, smaller attribute values make the chance of merge conflicts decrease (lift < 1), while higher values make the chance of merge conflicts increase (lift > 1). for instance, for just one changed file in b2, the probability of merge conflicts decreases by 81% (lift = 0.19); however, for more than 30 changed files, the probability of merge conflicts increases by 243% (lift = 3.43). through all these analyses, we observed that attributes related to b2 (i.e., the branch being integrated into b1) influence the probability of merge conflicts more than the other attributes, with the number of changed files, number of commits, number of changed lines, and number of committers in b2 being the attributes that influence the most, in this order. we also observed that the attributes branching-duration and total-duration have a similar impact on the probability of merge conflicts and could be used interchangeably in most situations.

we also verified the eight programming languages we selected regarding their influence on the occurrence of conflicts. three languages (php, javascript, and java) have shown a positive conflict dependency (lift > 1), which increases the chances of conflicts occurring (table 5). we observe that, when using php, the probability of conflict occurrence increases by 53% (lift = 1.53). on the other hand, when programming in c, the probability of having conflicts decreases by 54% (lift = 0.46).

table 5. measures of interest for the rules {language} → {conflict=yes}

language     sup. (%)   conf. (%)   lift
c            0.55       4.47        0.46
c#           1.11       9.62        0.97
c++          1.27       9.49        0.96
java         1.42       10.77       1.09
javascript   1.45       12.18       1.23
php          1.87       15.18       1.53
python       1.04       8.43        0.85
ruby         1.18       8.88        0.90

finally, we evaluated the intersection of developers, i.e., the number of developers working in both branches. some studies have already mentioned that developers may work in both branches (costa et al., 2014, 2016; zimmermann, 2007). according to zimmermann (2007), many developers work at different places (e.g., home and office) or on different branches and, at some point, they need to synchronize their changes. costa et al. (2014) analyzed the number of merges in repositories according to three scenarios: the presence of the same developers in both branches, disjoint sets of developers, or some intersection of the developers. they found a significant number of merges with developers working in both branches. we also performed this analysis in our dataset, but we compared the numbers of merges with and without conflicts. figure 3 shows the number of merge cases with no intersection, with some intersection, and with all developers in common, for merges with and without conflicts. since the number of merges with conflicts is much smaller than the number of merges without conflicts, we normalized both groups according to the total number of merges.

figure 3. intersection of developers in branches (merges normalized by group): without conflicts, 61% no intersection (100,046), 32% some intersection (52,143), and 7% all developers in common (12,265); with conflicts, 34% no intersection (6,036), 60% some intersection (10,687), and 6% all developers in common (1,096).

then, we mined association rules to find the increase or decrease in the probability of merge conflicts. table 6 presents the results, which indicate that having some intersection (67% to 99%) increases the chance of conflicts by 265% (lift = 3.65), and having no intersection reduces the chances of conflict by 41% (lift = 0.59).

table 6. measures of interest for the rules related to the intersection of developers {intersection} → {conflict=yes}

% intersection   sup. (%)   conf. (%)   lift
0%               3.39       5.78        0.59
1% – 33%         4.91       17.83       1.83
34% – 66%        0.74       11.94       1.22
67% – 99%        0.13       35.68       3.65
100%             0.60       8.19        0.84

after extracting rules with only one attribute in the antecedent, and considering the multidimensional characteristics of an association rule (lu et al. (2000)), we decided to analyze the combination of rules and understand if the combination of factors increases some measures of interest in the occurrence of conflict.
the algorithm that brought the best results in the selection of attributes was infogainattributeeval (weka.attributeselection.infogainattributeeval). six attributes (branching-duration, committers in b2, intersection, commits in b2, and changed files and lines in b2) with the best classification were selected, and ten combinations of attributes with the rules with the best measures of interest are presented in table 7.
table 7. measures of interest for the combined rules {antecedent} → {conflict=yes}

antecedent (sup. (%) | conf. (%) | lift):
1. branching-duration = 16 – 30 ∧ committers in b2 = 4 – 10 ∧ intersection = 26% – 50% ∧ commits in b2 > 20 (0.01 | 92.86 | 9.50)
2. branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ intersection = 1% – 25% ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 (0.02 | 90.91 | 9.30)
3. branching-duration > 30 ∧ intersection = 1% – 25% ∧ commits in b2 = 6 – 20 ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 (0.01 | 89.47 | 9.16)
4. branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ intersection = 1% – 25% ∧ commits in b2 = 6 – 20 ∧ changed files in b2 > 30 (0.02 | 89.29 | 9.14)
5. branching-duration = 16 – 30 ∧ committers in b2 = 4 – 10 ∧ intersection = 26% – 50% ∧ changed files in b2 > 30 (0.01 | 87.50 | 8.96)
6. branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ commits in b2 = 6 – 20 ∧ changed files in b2 > 30 (0.02 | 87.10 | 8.91)
7. branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ intersection = 1% – 25% ∧ commits in b2 = 6 – 20 ∧ loc-churn = 101 – 1000 (0.03 | 86.27 | 8.83)
8. branching-duration = 16 – 30 ∧ committers in b2 = 4 – 10 ∧ intersection = 26% – 50% (0.01 | 85.00 | 8.70)
9. branching-duration > 30 ∧ committers in b2 = 11 – 30 ∧ intersection = 1% – 25% ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 (0.01 | 82.35 | 8.43)
10. branching-duration > 30 ∧ commits in b2 = 6 – 20 ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 (0.01 | 81.82 | 8.37)

considering the first rule in table 7, when the branching-duration is 16 – 30 days, the number of committers in b2 is 4 – 10, the intersection of developers is 26% – 50%, and the number of commits in b2 is greater than 20, the probability of conflict occurrence increases by 850% (lift = 9.50). please note that the confidence and lift of this rule are greater than those of the individual rules of each attribute.

3.3 self-conflicts

we observed a significant number of developer intersections in figure 3. so, we investigated conflicting chunks and commits that have been made by the same developer. we noticed something interesting: in some cases, a developer made parallel changes that resulted in a merge conflict. zimmermann (2007) named this phenomenon self-conflicts. we identified self-conflict cases in all 80 investigated projects. figure 4 summarizes the comparison between self-conflicts and conflicts inserted by different committers in each of the 80 projects, grouped by programming language, for merges with conflicts.

figure 4. conflicting chunks in projects grouped by programming language: per-project percentage of self-conflicts vs. percentage of chunks by different committers.

in this analysis, we divided the number of conflicting chunks by the same developer by the total number of conflicting chunks.
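counting conflicting chunks amounts to scanning the files left by a failed merge for the standard git conflict markers. the sketch below illustrates the idea; it is not the mactool implementation, and the authorship of each side, which our tool derives from the commit history, is taken here as a given input.

import re

# a conflicting chunk as left by `git merge` (diff3-style base sections not handled)
CHUNK_RE = re.compile(
    r"<<<<<<< .*?\n(.*?)^=======\n(.*?)^>>>>>>> .*?$",
    re.DOTALL | re.MULTILINE,
)

def conflicting_chunks(file_text):
    """return (ours, theirs) text pairs, i.e., the b1 and b2 side of each chunk."""
    return CHUNK_RE.findall(file_text)

def self_conflict_ratio(chunk_authors):
    """chunk_authors: one (authors_of_b1_side, authors_of_b2_side) pair per chunk.
    returns the fraction of chunks whose two sides share at least one author."""
    same = sum(1 for b1, b2 in chunk_authors if set(b1) & set(b2))
    return same / len(chunk_authors) if chunk_authors else 0.0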
we also decided to mine association rules about the attributes investigated in this study and their effect on the occurrence of self-conflicts. when looking at attributes such as time, the number of commits, committers, changed lines and files, intersection, commit density, and the programming language, only the existence of developer intersection showed a strong influence on the occurrence of self-conflicts. a self-conflict logically only exists when a developer works in both branches. however, it is important to note that there is a tendency for the chances of self-conflict to increase as the percentage of intersection increases (with a slight exception in the range of 67% – 99%, which reduces the chances by 1% compared to the range of 34% – 66%), as shown in table 8.

table 8. measures of interest for the rules related to the intersection of developers {intersection} → {self-conflict=yes}

% intersection   sup. (%)   conf. (%)   lift
0%               2.40       6.98        0.19
1% – 33%         23.06      45.93       1.27
34% – 66%        4.67       59.33       1.65
67% – 99%        0.70       59.18       1.64
100%             5.22       81.48       2.26

4 discussions

in this section, we answer the research questions presented in section 1 based on the results described in sections 3.1, 3.2, and 3.3. in general, the results for b2 (i.e., the branch that is integrated into b1 during the merge) demonstrated a greater impact on the occurrence of conflicts, mainly for the number of changed files, commits, changed lines, and committers. as the identification of b1 and b2 is based on the merge direction, it depends on the strategy adopted by the software project.

4.1 how is the isolation of a branch related to the occurrence of merge conflicts? (rq1)

the isolation of the branches is mentioned by some studies (bird et al. (2011); costa et al. (2014); dias et al. (2020); leßenich et al.
(2018)) as a factor that may contribute to the occurrence of conflicts. in our study, we measured the isolation of branches using two attributes related to time: the branching-duration and the total-duration. we calculated these attributes for each merge case (with conflicts and without conflicts), in days. in section 3.1, we observed that both attributes have a very similar distribution, and they both present some impact on the occurrence of merge conflicts (effect sizes of -0.3 and -0.27 for branching-duration and total-duration, respectively). after mining association rules, we noted that the probability of conflicts occurring decreases when the duration is very short (less than a day): 40% less for branching-duration (lift = 0.60) and 38% less for total-duration (lift = 0.62). when the duration is medium (8 – 15 days), the chances of having a conflict increase by 67% (lift = 1.67) for branching-duration and by 54% (lift = 1.54) for total-duration. so, the results indicate a positive dependence between the duration increase and the chances of having a conflict. the lift of very long durations (more than 30 days) suggests that the chances of having a conflict increase by 161% (lift = 2.61) for branching-duration and by 148% (lift = 2.48) for total-duration.

answer to rq1: the branching-duration and total-duration have a small impact on the occurrence of merge conflicts (effect sizes of -0.3 and -0.27, respectively). despite the small impact, the association rules indicate that the occurrence of conflicts increases when time increases (lift close to 1 for durations of 1 – 7 days and lift > 2.4 for durations longer than 30 days).

4.2 how is the number of commits related to the occurrence of merge conflicts? (rq2)

to answer this question, we checked the amount of work done in terms of commits in each branch. in section 3.1, we observed that both the number of commits in b1 and the number of commits in b2 have a positive impact on the occurrence of merge conflicts. however, the impact of commits in b2 is larger (effect size of -0.53) than the impact of commits in b1 (effect size of -0.24), indicating that the number of commits in b2 (i.e., the branch that is being integrated into b1) is a better predictor of conflicts than the number of commits in b1. we analyzed how much more frequent conflicts become as the number of commits in both branches increases in figure 2 and table 4. we can see that contributions with few commits in b1 and b2 have a negative dependency on the occurrence of conflicts. when the branch has only one commit, the occurrence of conflict decreases by 39% (lift = 0.61) for b1 and by 65% (lift = 0.35) for b2. having few commits (2 – 5) shows a decrease of 22% (lift = 0.78) for b1 and 21% (lift = 0.79) for b2. the lifts of 1.54 for b1 and 3.78 for b2 when there are more than 20 commits indicate that the chances of having a conflict increase by 54% for b1 and by 278% for b2. by looking at the probability of having a conflict given the number of commits in figures 2(c) and 2(d), it is possible to see that this probability grows faster according to the number of commits in b2, reaching around 40% for 30 commits, while reaching just around 16% for 30 commits in b1. we also verified the commit density, i.e., the number of commits in b1 and b2 in relation to the branching-duration. we noticed a significant difference between the impact of the number of commits and the impact of the number of commits divided by the branching-duration, the commit density.
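for clarity, commit density (table 1, item m) is simply the commit count normalized by the branching-duration; a minimal sketch follows (flooring the duration at one day for sub-day branches is our assumption here, not necessarily the exact handling in the tool).

def commit_density(n_commits, branching_duration_days):
    """commits per day of effective branch development; e.g., 30 commits
    over a 15-day branching-duration gives a density of 2.0."""
    return n_commits / max(branching_duration_days, 1.0)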
the impact of commit density in b1 (0.05) and b2 (-0.12) is negligible. when looking at the density association rules, we observed that, unlike for the other attributes, there is no pattern of evolution when the value of the attribute increases, that is, when the number of commits in b1 or b2 divided by the branching-duration is greater. when the density in b1 and b2 is between 0 and 5, the chance of having a conflict increases by 11% for b1 (lift = 1.11) and decreases by 14% for b2 (lift = 0.86). when the density is greater than 40, the chance of having a conflict increases by 16% for b1 (lift = 1.16) and decreases by 5% for b2 (lift = 0.95).

answer to rq2: the number of commits has a small impact for b1 (effect size of -0.24) and a large impact for b2 (effect size of -0.53) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of commits increases (according to the ranges of commits, lifts in b1 range from 0.61 to 1.54 and lifts in b2 range from 0.35 to 3.78).

4.3 how is the number of developers that performed commits related to the occurrence of merge conflicts? (rq3)

for this question, we checked the number of committers in each branch. we observed that the number of committers in b1 (i.e., the branch that receives the integration) does not seem to have a statistically significant impact on the probability of merge conflicts. on the other hand, we observed that the number of committers in b2 has a large impact on the occurrence of merge conflicts (effect size of -0.48). these differences can also be observed in figures 2(e) and 2(f), which present the probabilities of conflicts. while the probability barely grows according to the number of committers in b1 (from 10% for one committer to 11% for six committers), it grows considerably with the number of committers in b2 (from 10% for one committer to 40% for six committers). hence, the number of committers in the branch that is being integrated (b2) seems to be a good indication of the possibility of merge conflicts. comparing the distributions of committers in b2 for merges with and without conflicts in table 3, we noted that while merges without conflicts usually have a single committer in b2, conflicting merges seem to have more committers. the association rules in table 4 also indicate that when the number of committers is large, the chances of conflicts are higher. first, having few committers (1 – 3) in b1 does not imply more or fewer conflicts (lift = 1.00). however, there is a negative dependency when considering b2; in this case, the occurrence of conflict decreases by 33% (lift = 0.67). for a very large number of committers (i.e., more than 30 committers), we observed an increase in the chances of having a conflict of 115% for b1 (lift = 2.15) and 505% for b2 (lift = 6.05).

answer to rq3: the number of committers has no impact for b1 (p-value of 0.32) and a large impact for b2 (effect size of -0.48) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of committers increases, especially for b2 (lift goes from 0.67 for 1 – 3 committers to 6.05 for > 30 committers).

4.4 how is the number of changed files related to the occurrence of merge conflicts? (rq4)

for this question, we checked the amount of work done in terms of changed files in each branch.
4.4 how is the number of changed files related to the occurrence of merge conflicts? (rq4) for this question, we checked the amount of work done in terms of changed files in each branch. the results are similar to the ones related to the number of commits in b1 and b2, with changes in b2 influencing the probability of merge conflicts (effect size of -0.57) more than changes in b1 (effect size of -0.29). figure 2(g) and figure 2(h) present the distributions and the probabilities of conflicts according to the number of changed files in b1 and b2, respectively. the probability of a merge conflict after changes in 40 or more files in b1 is around 16%. on the other hand, the probability after changes in the same number of files in b2 is approximately 36%. for the number of changed files, as expected, the association rules also confirmed that fewer changed files are less likely to cause conflicts. as shown in table 4, a single changed file indicates lower chances of conflicts: 67% less for b1 (lift = 0.33) and 81% less for b2 (lift = 0.19). however, for many changed files (i.e., more than 30), we observed an increase of 53% for b1 (lift = 1.53) and 243% for b2 (lift = 3.43). answer to rq4: the number of changed files has a small impact for b1 (effect size of -0.29) and a large impact for b2 (effect size of -0.57) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of changed files increases (>30 files in b1 has lift 1.53; >6 files in b2 has lift 1.39; >30 files in b2 has lift 3.43). 4.5 how is the number of changed lines related to the occurrence of merge conflicts? (rq5) for this question, we checked the loc-churn, the total number of lines of code added and removed in each branch (gousios and zaidman, 2014; nagappan and ball, 2005; da silva et al., 2020). we verified that changed lines in b2 influence the probability of merge conflicts (effect size of -0.51) more than changed lines in b1 (effect size of -0.33). this result is similar to the ones related to the number of changed files and commits. we also verified that association rules involving changed lines of code have a negative conflict dependency for values of less than 100 changed lines. rules with values of 0-10 changed lines have their chances of conflict reduced by 69% (lift = 0.31) for both b1 and b2. for changes involving 11-100 lines, the chances are reduced by 36% (lift = 0.64) for b1 and 43% (lift = 0.57) for b2. for modifications involving many changed lines, the chances of a conflict occurring increase. for more than ten thousand lines of code, the chances increase by 128% (lift = 2.28) for b1 and 373% (lift = 4.73) for b2. answer to rq5: the number of changed lines has a small impact for b1 (effect size of -0.33) and a large impact for b2 (effect size of -0.51) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of changed lines increases (lift goes from 0.31 for 0-10 loc to 2.28 for >10000 loc in b1, and from 0.31 for 0-10 loc to 4.73 for >10000 loc in b2).
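before rules such as the ones above can be mined, each continuous attribute has to be discretized into ranges (e.g., 0-10, 11-100, >10000 loc). a minimal sketch of such binning using pandas; the cut points and data below are illustrative only and may not match the ones used in the study:

import pandas as pd

# hypothetical merge data: one row per merge scenario
merges = pd.DataFrame({"commits_b2": [1, 3, 7, 25, 40],
                       "conflict": [False, False, False, True, True]})

# discretize the attribute into labeled ranges like the ones in table 4
bins = [0, 1, 5, 20, float("inf")]
labels = ["1", "2-5", "6-20", ">20"]
merges["commits_b2_range"] = pd.cut(merges["commits_b2"], bins=bins, labels=labels)

# the (range, conflict) pairs can then be fed to an association-rule miner;
# here we just print the conflict rate per range
print(merges.groupby("commits_b2_range", observed=True)["conflict"].mean())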
4.6 how is the programming language related to the occurrence of merge conflicts? (rq6) for this question, we observed the eight programming languages adopted in the selected projects. as shown in table 5, for five programming languages (c, c#, c++, python, and ruby), the chances of conflict occurrences decrease. the language c reduces the chances of conflicts by 54% (lift = 0.46). python and ruby also decrease the chances of conflicts, by 15% (lift = 0.85) and 10% (lift = 0.90), respectively. on the other hand, the three remaining programming languages (php, javascript, and java) present a positive dependency on conflict occurrence. we observed that php increases the chances of conflict by 53% (lift = 1.53) and javascript by 23% (lift = 1.23). for projects written in java, there is an increase of 9% (lift = 1.09) in the chances of a merge conflict. answer to rq6: the association rules indicate that the chances of conflict increase when the project is written in php (53%), javascript (23%), or java (9%). 4.7 how is the intersection of developers in both branches related to the occurrence of merge conflicts? (rq7) for this question, we checked the frequency of the committers in both branches and divided the merges into three groups: merges with no intersection, merges with some intersection, and merges with all developers in common. contrary to our expectations, as presented in figure 3, the intersection of developers does not decrease the chance of merge conflicts. when we mined association rules related to the intersection of developers, we divided the merges into five groups: 0% (merges with no intersection), 1%-33%, 34%-66%, 67%-99%, and 100% (merges with all developers in common) (table 6). we observed that having some intersection (67%-99%) increases the chance of conflict by 265% (lift = 3.65), while having no intersection decreases the probability of conflict by 41% (lift = 0.59). however, when the merge has all developers in common, the chance of conflicts also decreases, by 16% (lift = 0.84). so, having all the developers or no developers in common seems to be better than having just a subset of developers in common. answer to rq7: the association rules indicate that having some intersection of developers increases the chances of conflict (67%-99% by 265%, 1%-33% by 83%, and 34%-66% by 22%). 4.8 how prevalent is the occurrence of merge self-conflicts? (rq8) conflicts caused between commits of the same developer seem more common than we anticipated. note that the percentage of self-conflicts in figure 4 ranges from 5.46% (of 3,152 conflicting chunks) in the yii2 project to 66.23% (of 835 conflicting chunks) in the vert.x project. note also that ten projects had more than 50% of self-conflicts. when considering projects with more than 40% of self-conflict cases, 22 projects are listed. we then decided to analyze a merge case (commit 456424) from the elasticsearch project and observed two examples of self-conflicts, one in a source-code file and one in a debug file. regarding the source-code file, in b1, the developer created an instance of a searchresponse object with a parameter (commit 3a6429), and in b2, the developer performed validation and also created an instance of a searchresponse object, but without parameters (commit d82faf). regarding the debug file, the developer added several lines in both branches (commits 3a6429 and d82faf), possibly during execution in a test environment. when we mined association rules related to the occurrence of self-conflicts, we verified that when the merge involves all the developers in common, the chances of a self-conflict occurring increase by 126% (lift = 2.26), as shown in table 8. we analyzed other attributes, but none showed a strong influence (> 27%), with the exception of the intersection of developers. answer to rq8: we identified self-conflicts in all 80 projects. the percentage of self-conflicts ranges from 5.46% (of 3,152 conflicting chunks) in the yii2 project to 66.23% (of 835 conflicting chunks) in the vert.x project.
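a self-conflict, as counted above, is a conflicting chunk whose two sides were produced by the same developer. a minimal sketch of this check, assuming the committer identities behind each side of a chunk have already been extracted; the normalization mirrors the alias-handling strategy described in section 5:

def normalize(identity):
    # reduce alias noise in git ids: uppercase all letters and drop spaces
    # (the mitigation strategy discussed in the threats to validity)
    return identity.upper().replace(" ", "")

def is_self_conflict(authors_b1, authors_b2):
    # true when a single developer produced both sides of the chunk
    a1 = {normalize(a) for a in authors_b1}
    a2 = {normalize(a) for a in authors_b2}
    return len(a1) == 1 and a1 == a2

# hypothetical chunk: both sides authored by the same person under two aliases
print(is_self_conflict(["Ana Silva"], ["ana silva"]))  # True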
5 threats to validity as in any study, ours also has limitations. our approach uses the committers' git ids (names and/or email addresses) to identify developers who committed in both branches. developers may use multiple aliases, eventually generating inconsistencies (i.e., false negatives) in the results. to reduce this threat, we adopted the strategy of turning all letters into uppercase and removing all existing spaces. we may have missed some cases in which the aliases are lexically different, but in that case, the number of committers in both branches and the number of self-conflicts would be even higher. we believe that a branch's isolation time is relative: someone can create a branch and not commit to it for a while, or someone can perform the branch's last commit and not merge it for a while. therefore, the measurement of the duration of a branch has limitations. we used two time metrics to mitigate this threat: considering just the commits performed within the branches (branching-duration) and considering the merge commit (total-duration). we are investigating only three-way merge scenarios integrating two branches, so we found and excluded 74,293 fast-forward merges. different merge strategies may not have been considered, for example, git rebase, as it flattens the rich information of parallel development into a linear history. we also excluded 37 merge cases in which the time metrics were negative. since the timestamp of each commit is generated on the developer's computer, if the computer's clock is wrong, the timestamp is recorded incorrectly. in a merge case (merge commit 1da7521) from the elasticsearch project, for example, while the merge was committed on 2/8/2017, the common ancestor (commit 5ee82e4) of the parents' commits (commits 1ba5f8f and e761b76) was committed on 4/20/2018. finally, we excluded 3,672 merges with only merge commits. for example, in a merge commit (197f57c) of the osu project, we found just one commit in each branch, and these commits are also merge commits (commits 660afb4 and 436e155).
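the selection steps described above (keeping only three-way merges of two branches and discarding merge cases without parallel development) can be sketched with standard git commands. the repository path and revision ids below are placeholders, and the classification labels are ours, not the paper's terminology:

import subprocess

def git(repo, *args):
    # run a git command in the given repository and return its stdout
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout.strip()

def classify_merge(repo, merge_sha):
    parents = git(repo, "log", "-1", "--pretty=%P", merge_sha).split()
    if len(parents) != 2:
        return "not-a-two-branch-merge"  # octopus merges or plain commits: excluded
    base = git(repo, "merge-base", parents[0], parents[1])
    if base in parents:
        return "no-parallel-development"  # one parent is the common ancestor: excluded
    # commits exclusive to each branch since the common ancestor
    commits_b1 = int(git(repo, "rev-list", "--count", f"{base}..{parents[0]}"))
    commits_b2 = int(git(repo, "rev-list", "--count", f"{base}..{parents[1]}"))
    return f"three-way merge ({commits_b1} commits in b1, {commits_b2} in b2)"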
6 related work vale et al. (2020) investigated the role of communication activity in the increase or reduction of merge conflicts. they analyzed the history of 30 popular open-source projects involving 19,000 merge scenarios. the authors mined and linked contributions from git and communication from github data. they used bivariate and multivariate analyses to evaluate the correlations. in the bivariate analysis, they found a weak positive correlation between github communication activity and the number of merge conflicts. in the multivariate analysis, they discovered that github communication activity does not correlate with the occurrence of merge conflicts. thus, they investigated whether it depends on the merge scenarios' characteristics, such as the number of modified lines, chunks, files, developers, commits, and days a merge scenario lasts. these variables were calculated by merge scenario (both branches); for example, the authors considered the sum of the number of developers in both branches. they found that there is no relation between the communication measures and the number of merge conflicts when considering these factors. they concluded that: (1) longer merge scenarios with more developers involve more github communication, but not necessarily more merge conflicts; and (2) the size of the changes of merge scenarios (in terms of the numbers of files, chunks, and lines of code involved) is not sufficient to predict the occurrence of merge conflicts. leßenich et al. (2018) surveyed 41 developers and extracted a set of seven indicators (the number of commits, commit density, number of files changed by both branches, larger changes, fragmentation of changes, scattered changes across classes or methods, and the granularity of changes above or within class declarations) for predicting the number of conflicts in merge scenarios. they also checked additional indicators mentioned in the survey, i.e., whether the more developers contribute to a merge scenario, the more likely conflicts are to happen, and whether branches that are developed over a long time without a merge are more likely to lead to merge conflicts. after determining the respective value for each branch, they computed the geometric mean of these values. to evaluate the indicators, the authors performed an empirical study on 163 open-source java projects, involving 21,488 merge scenarios. they found that none of the indicators can predict the number of merge conflicts, as suggested by the developer survey. hence, they assumed that these indicators are not useful for predicting the number of merge conflicts. owhadi-kareshk et al. (2019) also investigated whether conflict prediction is feasible. they verified nine indicators (the number of changed files in both branches, number of changed lines, number of commits and developers, commit density, keywords in the commit messages, modifications, and the duration of the development of the branch) for predicting whether a merge scenario is safe or conflicting. they adopted norm-1 as the combination operator to combine the indicators extracted for each branch into a single value. to evaluate the predictor, they performed an empirical study on 744 github repositories in seven programming languages, involving 267,657 merge scenarios. similar to related work, they did not find a correlation between the chosen indicators and conflicts, but using the same indicators, they designed a classifier that was able to detect safe merge scenarios (without conflicts) with high precision (0.97 to 0.98) using the random forest classifier. dias et al. (2020) also conducted a study to better understand how conflict occurrence is affected by technical and organizational factors. they investigated seven factors related to modularity, size, and timing of developers' contributions. they computed the geometric mean of the branch values for each factor. the authors analyzed 125 projects, involving 73,504 merge scenarios in github repositories of ruby (100) and python (25) mvc projects. they found that merge conflict occurrence significantly increases when contributions to be merged are not modular in the sense that they involve files from the same mvc slice (related model, view, and controller files). as previously discussed, vale et al. (2020) and owhadi-kareshk et al. (2019) tried to predict the occurrence of merge conflicts. complementarily, leßenich et al. (2018) tried to predict the number of merge conflicts. vale et al. (2020) and leßenich et al. (2018) did not find a strong correlation between the analyzed attributes and the occurrence and number of conflicts.
owhadi-kareshk et al. (2019) also found no correlation between the indicators and conflicts, but were able to design a classifier for merge conflicts. our study investigated some attributes similar to the ones evaluated by vale et al. (2020) and owhadi-kareshk et al. (2019) (time metrics, number of commits, committers, changed lines and files) and by leßenich et al. (2018) (number of commits, commit density, and files in both branches); however, in our results the investigated attributes seem to have a positive correlation with merges with conflicts. similar to our results, dias et al. (2020) found that more developers, commits, changed files, and contributions developed over long periods are more likely associated with merge conflicts. however, no evaluated attributes showed predictive power concerning the number of merge conflicts. they also investigated some similar attributes, such as timing metrics, number of commits, committers, changed lines, and files. although we did not check whether the contributions were modular or not, we added some attributes, such as the frequency of one or more committers in both branches and the verification of conflicting chunks and commits made by the same developer. the extraction of association rules also showed us a tendency toward merge conflicts when there is a longer duration and more commits, committers, and changed files. it is worth mentioning that the attributes evaluated by the previous studies might not be computed in the same way, despite the similarity of the attributes' names. for example, the number of commits is present in all the related work. leßenich et al. (2018) reported the number of commits between the common ancestor and the merge as the geometric mean of both branches. vale et al. (2020) report this number as the sum of commits performed in the two branches. owhadi-kareshk et al. (2019) used norm-1 (also a sum of absolute values) as the combination operator for the number of commits between the ancestor and the last commit in a branch. dias et al. (2020) also used the geometric mean of the number of commits in each contribution. in our work, we decided to keep the information by branch, using no aggregate measure.
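to illustrate how these combination operators differ, the sketch below computes the three alternatives (geometric mean, norm-1, and per-branch values) for a hypothetical pair of branch measurements:

import math

b1_commits, b2_commits = 4, 25  # hypothetical per-branch values

geometric_mean = math.sqrt(b1_commits * b2_commits)  # lessenich/dias-style aggregation
norm_1 = abs(b1_commits) + abs(b2_commits)           # vale/owhadi-kareshk-style aggregation
per_branch = (b1_commits, b2_commits)                # this study: no aggregation

# aggregation hides the asymmetry between branches: the same norm-1 or
# geometric mean can come from (4, 25) or (25, 4), while the per-branch
# representation preserves which side carried most of the work
print(geometric_mean, norm_1, per_branch)  # 10.0 29 (4, 25)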
7 conclusion in this work, we analyzed 182,273 merge scenarios from 80 projects written in eight programming languages to understand which attributes impact the occurrence of merge conflicts. while all attributes seem to have a positive influence on the probability of merge conflicts, some appear to have a more significant impact than others. the attributes that presented a higher relation to the occurrence of merge conflicts are changed files, commits, changed lines, and committers in branch b2 (i.e., the branch that is integrated into b1 during the merge). these attributes in branch b1 have a smaller impact (changed lines, changed files, and commits) or even no statistically significant difference (committers) on the occurrence of conflicts. both the branching-duration and the total-duration seem to have an impact comparable to the impact of the attributes in b1. despite some attributes presenting a smaller impact on merge conflicts when we compare the whole distributions, the association rules indicate that higher values of these attributes increase the chances of conflicts by over 53%. in addition to these attributes, we analyzed the impact of the selected programming language and of the intersection of developers between branches on the occurrence of conflicts. among the eight programming languages verified, php, javascript, and java have a positive conflict dependency, with php increasing the chances of conflicts by 53%. regarding the intersection of developers, we noticed that merges with one or more committers acting in both branches do not seem to reduce the chances of merge conflicts. instead, having some intersection of developers increases the chance of conflicts (1%-33% by 83%, 34%-66% by 22%, and 67%-99% by 265%). however, having all the developers or no developers in common reduces the chances of conflicts (by 41% and 16%, respectively). finally, we analyzed how common it is for a single developer to cause self-conflicts. we observed that all projects have self-conflicts, with a huge variation in the proportion: while some projects have only 5.46% of self-conflicts, other projects have up to 66.23%. while some attributes have a large impact on the occurrence of merge conflicts, they may not be usable as predictive attributes, since the probability of having a conflict given the value of these attributes is relatively small. nonetheless, these attributes can be used to elaborate policies and best practices to reduce the chances of merge conflicts. the adoption of recognized best practices such as frequent commits, small changes, and continuous integration, among others, can be reinforced with attention to the number of developers involved and to conflicting changes by the same developer. as future work, we intend to increase the number of attributes and further investigate some of them by conducting a qualitative study on the programming language (what actually influences a language to have greater chances of conflicts, such as verbosity, developer freedom, among other aspects) and on self-conflicts (whether self-conflicts are evenly distributed among the project's committers or whether some committers concentrate the majority of self-conflicts). we also would like to verify our results with some of the analyzed project communities. finally, we intend to develop a tool that analyzes the project's history and measures these metrics from time to time to warn the project team. acknowledgements this work was partially supported by capes (88882.464250/201901), cnpq, and faperj. references accioly, p., borba, p., and cavalcanti, g. (2018). understanding semi-structured merge conflict characteristics in open-source java projects. empirical software engineering, 121:2051 – 2085. agrawal, r., srikant, r., et al. (1994). fast algorithms for mining association rules. in 20th international conference on very large data bases (vldb), pages 487 – 499, san francisco, ca, usa. anderson, t. w. and darling, d. a. (1954). a test of goodness of fit. journal of the american statistical association, 49:765 – 769. bird, c., zimmermann, t., and teterev, a. (2011). a theory of branches as goals and virtual teams. in 4th international workshop on cooperative and human aspects of software engineering (chase), pages 53 – 56, waikiki, honolulu, hi, usa. brindescu, c., ahmed, i., jensen, c., and sarma, a. (2020a). an empirical investigation into merge conflicts and their effect on software quality. empirical software engineering, 25:562 – 590. brindescu, c., ahmed, i., leano, r., and sarma, a. (2020b). planning for untangling: predicting the difficulty of merge conflicts. in 42nd ieee/acm international conference on software engineering (icse), pages 801 – 811, seoul, south korea. brun, y., holmes, r., ernst, m. d., and notkin, d. (2011).
proactive detection of collaboration conflicts. in 19th acm special interest group on software engineering (sigsoft) symposium and the 13th european conference on foundations of software engineering (esec), pages 168 – 178, szeged, hungary. chacon, s. and hamano, j. (2009). pro git. berkeley, ca, 1:509. costa, c., figueiredo, j., murta, l., and sarma, a. (2016). tipmerge: recommending experts for integrating changes across branches. in 24th international symposium on foundations of software engineering (fse), pages 523 – 534, seattle, wa, usa. costa, c., figueiredo, j. j., ghiotto, g., and murta, l. (2014). characterizing the problem of developers' assignment for merging branches. international journal of software engineering and knowledge engineering, 24:1489 – 1508. da silva, d. a. n., soares, d. m., and gonçalves, s. a. (2020). measuring unique changes: how do distinct changes affect the size and lifetime of pull requests? in 14th brazilian symposium on software components, architectures, and reuse (sbcars), pages 121 – 130, natal, brazil. dias, k., borba, p., and barreto, m. (2020). understanding predictive factors for merge conflicts. information and software technology, 121:106256. fayyad, u., piatetsky-shapiro, g., and smyth, p. (1996). from data mining to knowledge discovery in databases. ai magazine, 17:37 – 54. fayyad, u. m. and irani, k. b. (1992). on the handling of continuous-valued attributes in decision tree generation. machine learning, 8:87 – 102. ghiotto, g., murta, l., barros, m., and hoek, a. v. d. (2018). on the nature of merge conflicts: a study of 2,731 open source java projects hosted by github. ieee transactions on software engineering, 48:892 – 915. gousios, g. and zaidman, a. (2014). a dataset for pull-based development research. in 11th working conference on mining software repositories (msr), pages 368 – 371, hyderabad, india. han, j., kamber, m., and pei, j. (2012). data mining concepts and techniques (3rd edition). leßenich, o., siegmund, j., apel, s., kästner, c., and hunsen, c. (2018). indicators for merge conflicts in the wild: survey and empirical study. automated software engineering, 25:279 – 313. lu, h., feng, l., and han, j. (2000). beyond intratransaction association analysis: mining multidimensional intertransaction association rules. acm transactions on information systems (tois), 18:423 – 454. macbeth, g., razumiejczyk, e., and ledesma, r. d. (2011). cliff's delta calculator: a non-parametric effect size program for two groups of observations. universitas psychologica, 10:545 – 555. mann, h. b. and whitney, d. r. (1947). on a test of whether one of two random variables is stochastically larger than the other. the annals of mathematical statistics, 18:50 – 60. menezes, j. w., trindade, b., pimentel, j. f., moura, t., plastino, a., murta, l., and costa, c. (2020). what causes merge conflicts? in 34th brazilian symposium on software engineering (sbes), pages 203 – 212, natal, brazil. nagappan, n. and ball, t. (2005). use of relative code churn measures to predict system defect density. in 27th international conference on software engineering (icse), pages 284 – 292, st. louis, mo, usa. owhadi-kareshk, m., nadi, s., and rubin, j. (2019). predicting merge conflicts in collaborative software development. in 13th acm/ieee international symposium on empirical software engineering and measurement (esem), pages 1 – 11, porto de galinhas, brazil. romano, j., kromrey, j.
d., coraggio, j., and skowronek, j. (2006). appropriate statistics for ordinal level data: should we really be using t-test and cohen's d for evaluating group differences on the nsse and other surveys. in 10th annual meeting of the florida association of institutional research (fair), pages 1 – 3, florida, usa. sarma, a., redmiles, d. f., and hoek, a. v. d. (2011). palantir: early detection of development conflicts arising from parallel code changes. ieee transactions on software engineering, 38:889 – 908. vale, g., schmid, a., santos, a. r., almeida, e. s. d., and apel, s. (2020). on the relation between github communication activity and merge conflicts. empirical software engineering, 25:402 – 433. zimmermann, t. (2007). mining workspace updates in cvs. in 4th international workshop on mining software repositories (msr), page 11, washington, dc, usa. zimmermann, t., weisgerber, p., diehl, s., and zeller, a. (2004). mining version histories to guide software changes. in 26th international conference on software engineering (icse), pages 563 – 572, usa. journal of software engineering research and development, 2019, 7:1, doi: 10.5753/jserd.2019.19 this work is licensed under a creative commons attribution 4.0 international license. a taste of the software industry perception of the technical debt and its management in brazil victor machado da silva [ universidade federal do rio de janeiro | victor0machado@gmail.com ] helvio jeronimo junior [ universidade federal do rio de janeiro | jeronimohjr@gmail.com ] guilherme horta travassos [ universidade federal do rio de janeiro | ght@cos.ufrj.br ] abstract background: the technical debt (td) metaphor has been an exciting topic of investigation for the software industry and academia in the last years. despite the increasing attention of practitioners and researchers, td studies indicate that its management (tdm) is still incipient. particularly in brazilian software organizations (bsos), there is still a lack of information regarding how software practitioners perceive and manage td in their projects. objective: to characterize td and its management under the perspective of bsos, using their practitioners as proxies, and to extend the discussions presented at the 2018 ibero-american conference on software engineering. methods: a survey was performed with 62 practitioners, representing about 12 organizations and 30 software projects.
results: the analysis of 40 valid questionnaires indicates that td is still unknown to a considerable fraction of the participants, and only a small group of organizations adopts td management activities in their projects. besides, it was possible to obtain a set of technologies that can be used to support tdm activities and to make available a survey package to study td and its management. conclusions: although the results provide an initial and representative landscape of the tdm scenario in bsos, further research will help to observe how effective and efficient tdm activities can be in different software project contexts. keywords: technical debt, software quality, survey, experimental software engineering 1 introduction software evolution is essential for the survival of a software product in the market, since the environment in which it is immersed continually changes. as argued by boehm (2008), in the face of an increasingly dynamic and competitive market, software development organizations need to support continuous and fast delivery of value to the customer in both the short and long terms. in this scenario, many software organizations introduce agility practices into their development processes to handle the frequent requirements changes and the continuous delivery demand (de frança et al. 2016). this context reflects the challenges faced by software practitioners regarding the many decisions they take in their projects over time. at the same time, software practitioners should build high-quality, low-cost, on-time, and useful software products. this working environment brings challenges to practitioners regarding decision-making, setting up trade-offs that can lead to the intentional or unintentional creation of "technical debt" in software projects over time. as argued by tom et al. (2013) and avgeriou et al. (2016), most, if not all, software projects face some td. td refers to technical decisions taken in the software development scenario involving intertemporal choices (becker et al. 2018), which contribute positively (intentional and managed) or negatively (unintentional and not managed) to the software project ecosystem and the quality of its software products. when td is perceived and managed in software projects, it has the potential to support deliveries of value to customers in a short time. on the other hand, in the long term, some risks to internal software quality increase when the debt is not perceived and managed in the projects, hindering the maintenance and evolution of the software products (avgeriou et al. 2016). currently, interest in and use of the td metaphor have grown over the years (li et al. 2015). many studies have been discussing different knowledge areas of td and supporting solutions for software engineers to achieve better results in their projects. using an ad-hoc literature review, we observed some studies discussing the concept of td and technologies to support technical debt management (tdm). as mentioned in li et al. (2015) and alves et al. (2016), only a few studies deal directly with the question of how software organizations perceive and apply the td metaphor in their working environment. also, the software development process is influenced by the country's culture, language, and beliefs (prikladnicki et al. 2007), which can influence how td emerges, is perceived, and is managed.
particularly in brazil, there is a more latent gap regarding how brazilian software organizations (bsos) perceive td and how their practitioners handle it in their projects. assuncao et al. (2015) reported that tdm is a topic of interest at brazilian federal administration departments. however, there is scarce information on whether td is adequately managed in bsos. this information is useful, since it can provide initial insights so that bsos can improve their software processes to minimize the risks that td can bring to the software project ecosystem and the quality of their software products. this context motivated us to investigate how bsos (represented by their practitioners) adopt and manage td. also, it is in our interest to observe whether the perception of bsos' practitioners on td and its management matches the findings of other td investigations. our study intends to raise the level of knowledge of td and its management in bsos. therefore, a survey was designed and conducted with software practitioners engaged in brazilian software organizations. this paper presents the results of this survey, intending to provide the following initial contributions: • to get an initial perception of the td metaphor and its management in bsos, using their engaged professionals as proxies; • to make available a survey package with empirically evaluated instruments to support the gathering and aggregation of information regarding td perception and tdm activities, tailorable to other localities. this paper is an extension of a previous publication at cibse 2018 (silva et al. 2018b). it details the theoretical background on td regarding its concepts, classification, tdm activities, and approaches for td management. it offers a comparison between the obtained results and those of the related works. the survey's design, analysis, and discussion of results are comprehensively presented, including the answers from three new survey participants. the remainder of this paper is structured as follows: section 2 provides a background on td; section 3 summarizes the works related to our research; section 4 presents the survey design; section 5 explains the survey results; section 6 presents the discussion about the main findings, the works related to our research, and the threats to validity; and section 7 presents the final considerations. 2 theoretical background ward cunningham (1993) first coined the term "technical debt" when discussing with stakeholders the consequences of releasing a poorly written piece of code to accelerate the development process. although the code attends to the core system requirements in the current release, in case of future changes, the consequences might spread over other software areas, affecting its evolvability. since then, the use of the td metaphor spread to allow better communication with non-technical stakeholders (e.g., corporate managers, clients, among others). moreover, it has been used as a quality improvement instrument, bringing to the software development context terms such as "principal" (used to refer to the effort required to eliminate the td source) and "interest" (the additional effort needed on software maintenance due to the presence of td) (alves et al. 2016).
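a small worked example of the metaphor, with entirely hypothetical effort figures, may help: if removing a td item (the principal) costs 5 person-days and leaving it in place adds 1 person-day of extra maintenance (the interest) per release, the accumulated interest alone exceeds the principal after a handful of releases:

principal = 5   # person-days to remove the td item now (hypothetical)
interest = 1    # extra person-days of maintenance per release (hypothetical)
releases = 8

cost_if_repaid_now = principal
cost_if_deferred = releases * interest  # interest accumulates; principal is still unpaid

# after 8 releases the accumulated interest alone already exceeds the principal,
# which is why unmanaged debt tends to dominate maintenance cost over time
print(cost_if_repaid_now, cost_if_deferred)  # 5 8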
although reasonably disseminated, up until 2016 there was no standard definition of the td concept, creating several inconsistencies in the technical literature (tom et al. 2013). some definitions of td over the years are "a way to characterize the gap between the current state of a software system and some hypothesized 'ideal' state in which the system is optimally successful in a particular environment" (brown et al. 2010), "any side of the current system that is considered suboptimal from a technical perspective" (ktata and lévesque 2010), and "a tradeoff between implementing some piece of software in a robust and mature way (the 'right' way) and taking a shortcut which may provide short-term benefits, but which has long-term effects that may impede evolution and maintainability" (klinger et al. 2011). this imprecision in the td definition could cause several misinterpretations and even misuse of the td metaphor and damage to the concept. tom et al. (2013) affirmed that "it is evident that the boundaries of technical debt, as reflected in academic literature, are fuzzy – they lack clarity and definition – and represent a barrier to efforts to model, quantify and manage technical debt." the lack of consensus on the td definition was brought to attention during the dagstuhl seminar 16162, "managing technical debt in software engineering" (avgeriou et al. 2016). this seminar gathered members from academia and industry to discuss many relevant points regarding the td concept. at the end of the seminar, the participants came up with a td definition: "in software-intensive systems, technical debt is a collection of design or implementation constructs that are expedient in the short term, but set up a technical context that can make future changes more costly or impossible." td is also acknowledged as being restricted to internal software quality issues, like maintainability or evolvability. as can be observed, some differences between the definition of td and its first association with financial debt appeared along the years. the primary divergence is the optionality of repaying the td item (guo et al. 2016). however, some similarities remain. similar to financial debt, strategic, controlled decisions that opt to postpone some tasks to obtain short-term gains, such as shortening time-to-delivery, can be decisive for a product's success (yli-huumo et al. 2016). nowadays, the definition proposed in the dagstuhl seminar is the most accepted among researchers, and it is adopted throughout this paper. this definition contradicts, though, some previous concepts of what should and should not be considered td. for instance, unfinished tasks in the development process are considered a type of non-td, as reported by li et al.'s (2015) secondary study. however, they fit dagstuhl's td definition and are considered as such in this paper. in other words, td can be associated with technical decisions about the shortcuts and workarounds taken in software development. such decisions can influence the project positively (strategic and managed) or negatively (unintentional and not managed). depending on the perspective, the presence of td can positively or negatively influence a software project and the quality of its software products. strategic, controlled decisions that opt to postpone some tasks to obtain short-term gains, such as shortening time-to-delivery, can be decisive for a product's success. however, td can cause damage to the project, since it might be incurred unintentionally throughout the software development cycle.
td items of this nature can be incurred due to many factors, such as a lack of knowledge of team members, who write source code without following a specific programming style. therefore, it is crucial that software organizations perceive and manage td in their projects. 2.1 classification even before the academic interest in the td metaphor, the industry had already presented alternatives to classify it. mcconnell (2007) divided td into two different types: unintentional debt (in which td is not incurred due to a strategic purpose, like a badly written piece of code created by an inexperienced programmer) and intentional debt (usually incurred with a strategic approach, when the team or the organization decides to achieve a short-term gain at the cost of a long-term effort). an example of intentional td is the decision to develop a simplified architecture solution for the software, knowing that it might not attend to the project's future needs. martin fowler (2009) expanded the classification created by mcconnell (2007), considering that, beyond td being intentional (deliberate) or unintentional (inadvertent), it can also be reckless or prudent. the td quadrants structure these classifications, as shown in figure 1. a reckless debt (either deliberate or inadvertent) incurs in the project without being adequately planned, creating unnecessary risks. on the other hand, prudent debt items receive attention from the development team, which assesses their risks and makes a plan to repay them. another perspective is to observe the original artifacts in which the td item incurred, or the td item's nature. tom et al. (2013) named this classification scheme the dimensions of td, naming five types of td. later, li et al. (2015) conducted a systematic mapping study, expanding the td dimensions into ten types. the most recent study attempting to classify td according to its nature or origin artifact, to our knowledge, is by alves et al. (2016). in this study, the authors provide a classification of td into fifteen types, like design debt (associated with violations of the principles of good object-oriented design), documentation debt (issues observed in the software documentation), and code debt (problems found in the source code that can make it harder to maintain, usually related to inadequate coding practices). figure 1. td quadrants (adapted from fowler (2009)) 2.2 td management li et al. (2015) state that tdm includes activities that prevent potential td (both intentional and unintentional) from being incurred, as well as activities that deal with the accumulated td to make it visible and controllable and to keep a balance between the cost and value of the software project.
to our knowledge, their mapping study is the most recent on tdm activities, listing eight activities and the main approaches collected from the studies: • td identification: detects td caused by technical decisions in software, either intentional or unintentional; • td measurement: evaluates the cost/benefit relationship of known td items in software or estimates the overall td; • td prioritization: adopts predefined rules to rank known td items, to support the decision-making process; • td prevention: establishes practices to avoid potential td from being incurred; • td monitoring: observes the evolution of known td items over time; • td repayment: eliminates or reduces the impact of td (principal and interest) in a software system; • td representation/documentation: represents and records td in a predefined standard, to address the stakeholders' concerns; • td communication: discloses the identified td to the stakeholders. while searching for technologies to support tdm (silva et al. 2018a), it was possible to observe that some studies discuss and propose different technologies, either approaches, tools, or techniques. table 1 presents some of the leading technologies identified in the literature to support the management of td, grouped by tdm activity. 3 related works klinger et al. (2011) interviewed four software architects at ibm to obtain insights on how the organization perceives and manages td. all four architects stated that the debt can be incurred unintentionally, showing up in the projects through, for example, acquisition, new alignment requirements, or changes in the market ecosystem. they claimed that unintentional debt is usually more problematic than intentional debt. they also affirmed that the decision-making process on tdm is often informal and ad-hoc. finally, the interviewees claimed that there was a gap between executive and technical stakeholders, indicating the lack of a channel or common vocabulary to explain td to non-technical stakeholders. lim et al. (2012) conducted interviews with 35 practitioners with diverse industry experiences from the usa. the authors aimed to understand how td manifests in software projects and to determine which td types practitioners adopted in the industry. they also investigated the causes, symptoms, and effects of td, and finally, they questioned how practitioners deal with td. seventy-five percent of the interviewees were not familiar with the td metaphor. the participants described td as tradeoffs between a short-term gain and an additional long-term effort. they affirmed that the effects of td were not all negative, as the tradeoff depended on the product's value. although they wanted a way to measure td, they claimed that measuring td might not be that easy, as its impact is not uniform. besides, they claimed the key to measuring td is to evaluate the cumulative effect over time. finally, the authors suggested starting to manage td in an organization by "conducting audits with the entire development team to make technical debt visible and explicit; track it using a wiki, backlog, or task board." ernst et al. (2015) executed a survey with 1,837 participants in three organizations in the united states and europe.
the authors found a "widespread agreement on high-level aspects of the technical debt metaphor, including some popular financial extensions of the metaphor." they also observed that the project context dramatically affects how practitioners perceive td. as they stated, only the software architecture was commonly seen as a source of td, regardless of context. sixty-five percent of the respondents in this survey reported that they adopted only ad-hoc tdm practices in their projects. however, many respondents affirmed that they manage td through existing practices, such as risk processes or the product backlog. forty-one percent of the participants affirmed not using any tool for managing td, while only 16% use tools to identify td. ampatzoglou et al. (2016) conducted a study to understand how practitioners in organizations from the embedded systems domain perceive td. they performed an exploratory case study in seven organizations from four different countries. among other research questions, the authors wanted to find which td types occur more frequently in embedded systems. their findings about the most frequent td types from the practitioners' point of view coincide with the taxonomy proposed by alves et al. (2016), except regarding design debt, which seems to be more relevant to researchers than to practitioners, and test debt and code debt, which seem to be more relevant to practitioners. the study did not identify the defect, people, process, service, and usability debts. rocha et al. (2017) conducted a survey with practitioners from bsos to understand how td is dealt with in practice, at the code level only. among their research questions, they investigated which factors lead developers to create td at the code level and which practices can prevent developers from creating td at the code level. seventy-four practitioners answered the survey, of which almost 72% affirmed having low, very low, or medium knowledge about the td metaphor. the participants affirmed that developers should follow the best programming practices to help prevent td, despite admitting that they indeed contribute to creating td in their projects. among the main reasons to incur td, the participants mentioned management pressure, tight schedules, developer inexperience, and work overload. code review was pointed out as the most relevant practice to prevent the occurrence of td. holvitie et al. (2018) conducted a multi-national survey to observe td in practice, including practitioners from finland, brazil, and new zealand. the authors opted to focus on practitioners managing td in organizations adopting agile practices and methodologies. one hundred eighty-four practitioners answered the survey. approximately 20% of the participants had little to no knowledge of the td definition. thirty-five percent of the brazilian participants were able to provide an example of a td instance. according to the study, the six leading causes of td, selected by more than 50% of the participants, are inadequate architecture, structure, tests, and documentation, software complexity, and violation of best practices or style guides. finally, most of the participants perceived refactoring, coding standards, continuous integration, and collective code ownership as having a positive effect on reducing td in software projects.
regarding agile software development processes and process artifacts, iteration reviews/retrospectives, the iteration backlog, daily meetings, the product backlog, iteration planning meetings, and iterations were all assigned as having a positive impact on reducing td.

table 1. some technologies to support the management of td (tdm activity: technologies and strategies)
td identification: manual code inspection, sonarqube, checkstyle, findbugs (yli-huumo et al. 2016); codevizard (zazworka et al. 2013); sonarqube, understand, cppcheck, findbugs, sloccount (ernst et al. 2015)
td documentation/representation: td template (seaman and guo 2011); td backlog/list, documentation practice, jira, wiki, td template (yli-huumo et al. 2016)
td communication: td meetings (yli-huumo et al. 2016); td board (santos et al. 2013); trello (oliveira et al. 2015)
td measurement: sonarqube, jira, wiki, td evaluation template (yli-huumo et al. 2016)
td prioritization: cost/benefit model, issue rating (yli-huumo et al. 2016)
td repayment: redesigning, refactoring, and rewriting (yli-huumo et al. 2016)
td monitoring: sonarqube, jira, wiki (yli-huumo et al. 2016); vtiger and jira (oliveira et al. 2015)
td prevention: coding standards, code reviews, definition of done (yli-huumo et al. 2016)

4 survey design 4.1 research objectives using the goal-question-metric (gqm) paradigm (van solingen et al. 2002), the objective of this study is to analyze td and its management, with the purpose of characterizing, with respect to the level of knowledge and the adopted strategies, activities, and technologies, from the point of view of software practitioners, in the context of brazilian software organizations. 4.1.1 research questions the research questions are explained as follows: • rq1: is there a consensus on the perception of td among software practitioners in bsos? it intends to determine whether the perception of td is homogeneous among professionals in bsos. if so, it can support the observation of the existence of a common perspective on td between industry and academia (a positive side effect of this survey). • rq2: do the practitioners in the bsos perceive td in their software projects? before characterizing the tdm activities, it is essential to confirm that the software organizations (through their practitioners) perceive, i.e., observe the presence of, td in their projects. o rq2.1: do bsos manage their td? if td is perceived, it is essential to know whether bsos manage the td in their software projects. rq2.1.1: which tdm activities are most relevant to software projects? the goal of this question is to identify, among professionals, which tdm activities, among those proposed by li et al. (2015), are more relevant, or at least more considered, during software projects. rq2.1.2: which technologies and strategies are adopted for each tdm activity? for all eight tdm activities proposed by li et al. (2015), which strategies and technologies are used to support them. even though this survey was designed to identify the most common technologies used by practitioners in their bsos to support tdm activities, it is not possible to make any judgment regarding their efficiency and effectiveness. furthermore, this survey did not look for the benefits of applying such technologies in bsos. 4.2 questionnaire design the questionnaire was designed according to the guidelines presented in linåker et al. (2015).
we performed an ad-hoc literature review to gather specific information about the perception of td and tdm. concerning tdm, we organized the activities as proposed by li et al. (2015), as these activities cover the ones mentioned in different studies. moreover, we accepted them as consistent, since no disagreements were observed during the pilot trials. for each activity, we identified a set of specific strategies and technologies used to conduct the activity, as well as a list of possible roles for each activity. from this information, a questionnaire was designed, and specific questions for each activity were included. for instance, on td identification, a set of questions involving td classification was included, as by alves et al. (2016). regarding the questionnaire structure, it is divided into fourteen sections, described in table 2. it is composed mostly of closed-ended questions. a small number of open-ended questions was necessary to get further information from the participants. it also contains partially closed-ended questions to deal with issues related to tools and strategies for each tdm activity when the given options do not cover all the possibilities for the participant's answers. table 3 presents an extract of our questionnaire translated into english, with some questions on td identification. each section starts with a brief explanation of its content and specific instructions. the limesurvey platform available in the experimental software engineering (ese) group at coppe/ufrj (http://lens-ese.cos.ufrj.br/ese/) supported the questionnaire implementation and survey execution. the questionnaire was configured to ensure the participants' anonymity. a welcome message describes the survey structure and explains its importance for bsos. the participants are asked to answer the questions based on their current (or most recent) software project and organization. each set of questions related to a specific tdm activity was conditionally presented to the participants only if they had some experience with that activity, to minimize the problem of lengthy survey questionnaires. other conditional breakpoints in the questionnaire were set to end the survey if the participant was not familiar with the td concept or if the organization or the project did not apply any tdm activity. 4.2.1 characterization sections the three sections related to the participant's characterization include questions regarding their role in the projects, academic formation, working experience in software projects, and their organization's field, size, and any maturity model certificate in software processes. to assess the size of the organizations, we adopted the sebrae/ibge classification of organizations, consisting of micro (fewer than ten employees), small (between 10 and 49), medium (between 50 and 99), and large (more than 100) organizations. although this grouping does not constitute a world-level standard, it meets the first need of this study, which is a means to estimate the total number of organizations represented by the participants who answered the survey. finally, the projects in which the participants work are also characterized, through their problem domain and their lifecycle model. in this last question, the agile software development method was included for simplification purposes, even though it does not characterize a lifecycle model.
4.2.2 td perception section this section aims to gather the participants' general understanding regarding the td definition and its overall aspects. its first question regards the participant's understanding of td. it was not our purpose to inquire of participants who did not know the meaning of td, as they could provide wrong answers in the tdm sections. thus, the participants without td knowledge finish their questionnaire at this question. for the participants who claim to know td, a follow-up question was designed to assess which common issues in software development should be considered td. we presented to the participants a list with items not considered td (according to the mapping study conducted by li et al. (2015)) and items considered td (obtained through an ad-hoc literature review). following the general understanding of td by the participants, they were asked whether td was perceived in their most recent project, i.e., whether they could notice any issues that could be associated with td. an affirmative answer to this question allows the participants to answer two follow-up questions: whether their organization adopts any tdm activity and whether their manager (or themselves) adopts any tdm activity, regardless of their organization adopting any. answering "yes" to either of these two questions allows the answering of the remaining questionnaire.

table 2. questionnaire sections (section: topic; description)
1: participant characterization; obtain personal information regarding the participant, such as professional experience and academic degrees.
2: organization characterization; gather information about the organization the participant works for or has worked for before.
3: project characterization; obtain information about the project considered by the participant in the survey.
4: td perception; collect information on the participant's knowledge regarding td, including what can be considered td. also, determine whether the organization or the project the participant works at has strategies for tdm.
5: tdm (general); ask the participant which tdm activities are adopted in the working project. obtain information about the responsibilities and importance associated with each activity from the participant's point of view.
6-13: tdm (activities); gather information on several aspects regarding each of the tdm activities proposed in li et al. (2015).
14: tdm (other); provide space for the participant to describe other activities that are executed in the organization.

table 3. survey – td identification section (question: answer options)
is there a formal strategy to identify td? ( ) yes, we have a formal procedure to identify the td. ( ) no, the td identification is executed only informally.
are all the stakeholders required to apply the td identification strategy? ( ) yes, the strategy is mandatory for all stakeholders. ( ) no, the strategy is considered only a suggestion.
at what point in the project is the td identified? ( ) there is no defined period; we identify the td whenever we perceive some issue. ( ) we always identify the td at the end of each iteration/sprint. ( ) the td identification is continuous, i.e., occurs throughout the development process.
mark below all tools or techniques that are used to identify td. [ ] manual coding inspection [ ] dependency analysis [ ] checklist [ ] sonarqube/sqale [ ] checkstyle [ ] findbugs [ ] codevizard [ ] clio [ ] other (cite which)
4.2.3 td management section the purpose of this section is to identify the adoption and relevance of tdm practices in the bsos' projects. the participants were told that, by "technical debt management," they should consider all activities that organize, monitor, and control the td and its impacts on software projects. the participants were asked to select which tdm activities were conducted in their projects, based on the list of tdm activities provided by li et al. (2015). an additional option was included to provide space for the participants to mention other tdm activities not discussed by li et al. (2015). for each tdm activity selected by the participants, an additional question was created to ask which roles were responsible for conducting that activity. the leading roles offered as answers were obtained from yli-huumo et al.'s study (2016), but the questionnaire provided an open-ended question so that the participants could elaborate in case another role should be considered responsible for that activity. 4.2.4 tdm activities sections eight sections follow in the questionnaire, asking for information regarding each one of the tdm activities proposed by li et al. (2015). they are only available to the participants who selected those activities in the previous section. at the beginning of each section, the tdm activity is described, to improve the participants' knowledge and reduce the probability of misunderstanding the proposed questions. those sections follow the structure briefly presented below, for each activity. all activity subsections included a question to obtain which tools or techniques were adopted to conduct that particular activity, based mainly on a list obtained from li et al. (2015) and yli-huumo et al. (2016). • td identification: the participants were asked about the use of any formal approach to identify td, as well as whether its use was optional. next, they were asked when the td was identified. a subsection was created to assess if and how the td is classified after it has been identified. • td documentation/representation: the participants were asked if there was a standard to follow when documenting td, and whether it was mandatory for all stakeholders. then they were asked how the td items are documented or cataloged. • td communication: the participants were only asked how the unresolved td items were communicated between the project stakeholders. • td measurement: the participants were asked if there was any strategy previously defined to measure td, and how it was measured. they were asked which information or variables were used to measure the td items. • td prioritization: the participants were asked how the td is prioritized. finally, they were asked which criteria are used to support the td prioritization. • td repayment: the participants were asked if there is any planning to repay td. • td monitoring: like td repayment, the participants were asked how the td is monitored. • td prevention: for this section, the participants were asked if there are any formal practices conducted to prevent td and whether they are mandatory or optional for the stakeholders. • other tdm activities: one last section is provided to gather information regarding other tdm activities used in the participant's software organization, presenting questions similar to the previous sections.
4.3 pilot execution a pilot trial was conducted using the same artifacts and procedures designed for the final survey, including the survey questionnaire and the execution method, but with a small number of participants from the target population (linåker et al., 2015). seven practitioners were invited to the pilot trials. five of them work on software projects and come from the ese group at coppe/ufrj, which conducts this research. the other two participants also work on software projects but are from outside the research group. all of them have some prior experience with td and/or tdm, mostly in the industry. an e-mail invitation included the main instructions and the questionnaire link. the invitees were asked to answer the questionnaire and return feedback regarding response time, proper understanding, completeness, and other aspects. all pilot participants answered the pilot survey within a week. the average answering time was 15.2 minutes. the relevant comments concerned usability issues, the clarity of questions, and suggestions to improve some details and definitions throughout the questionnaire. these were later discussed internally, and modifications were applied to the final questionnaire. overall, we did not observe negative comments or doubts about either the answer options or the question descriptions, suggesting that the questionnaire was suitable for use in the study. 4.4 target population and sampling to achieve the research objectives and to answer the research questions, practitioners from bsos were selected as the target audience. the sampling design adopted is accidental, a non-probabilistic type of sampling, i.e., randomness cannot be observed in the selection of units from the population. this decision incurs a threat to validity, which is further discussed in section 6.3. an invitation to answer the survey was sent to a series of renowned software development groups in the country. other invitations were sent through the linkedin professional social network. finally, the survey was disclosed to the participants at three software-related events: rioinfo (practitioner oriented) and sbqs (high participation from practitioners), in rio de janeiro/brazil, and cbsoft (some practitioner participation), in fortaleza/brazil. 4.5 final revisions and survey release after the pilot trial, the final survey was released in june 2017. a lab package with the research plan and the survey questionnaire is available in english and portuguese at https://doi.org/10.6084/m9.figshare.5923969. 5 survey results the survey was conducted between june 2017 and april 2018. in total, 62 participants answered the survey, with 36 complete answers. four participants did not complete the survey but reached the questionnaire's section 4 (td perception), so they were included in this initial analysis, totaling 40 valid answers. the remaining 22 incomplete responses were not included in the analysis. figure 2 summarizes the survey responses. 5.1 participants' characterization the respondents have an average work experience of 14.15 years in software projects. only four respondents reported having an incomplete undergraduate degree, while the remaining 36 respondents hold at least an undergraduate degree. twenty-six participants reported holding a graduate degree (master's or doctorate).
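the validity rule applied to the 62 responses in section 5 can be made explicit with a short tally. the sketch below is illustrative only: the 36/4/22 split reproduces the figures reported above, and since the text does not state how far the 22 discarded respondents progressed, a placeholder section number is used for them:

# a response is valid if it is complete or at least reached section 4 (td perception)
responses = ([{"complete": True, "last_section": 14}] * 36
             + [{"complete": False, "last_section": 4}] * 4
             + [{"complete": False, "last_section": 2}] * 22)  # 2 is a placeholder

valid = [r for r in responses
         if r["complete"] or r["last_section"] >= 4]
print(len(responses), len(valid))  # prints: 62 40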
regarding the bsos in which the respondents work, most of them (23) are from the it sector. referring to project development, most of the projects (35) adopt agile or incremental lifecycle models. two projects adopt the spiral model, while three adopt the waterfall model. due to the questionnaire's anonymity, it is not possible to determine the precise number of organizations represented in the survey. however, it is possible to estimate it roughly, based on the information provided by the participants in this section. therefore, we estimate that around 12 organizations and 30 projects are included in the survey. only one organization characterized by the participants adopts the mps.br maturity model to evaluate its software processes, at level g, whereas two participants affirmed the organizations they work for have cmmi level 5 and two others have cmmi level 2. 5.2 td awareness and td perception regarding the perception of td, from the 40 valid answers, we found that 16 respondents (40%) claimed not to be aware of the td metaphor. the 24 remaining participants (60%) were asked to select the options that best matched the td definition. two of them did not answer this question. as can be observed in table 4, seven issues out of 12 were marked by 50% or more of the 22 participants: low internal quality; poorly written code that violates code rules; "shortcuts" taken during design; the presence of known defects that were not eliminated; architectural problems; planned but unfinished or unplanned tasks; and issues associated with low external quality. regarding td perception, of the 24 participants who were aware of the td meaning, 17 reported perceiving issues associated with the td concept in their projects, whereas four did not perceive the occurrence of td, and one participant did not answer this particular question. of the 17 participants who reported perceiving the occurrence of td in their projects, ten answered that their organizations or the project managers adopt tdm activities. table 5 presents the distribution of these answers, grouped by organization size. table 6 presents the same results among organizations that adopt any model to evaluate their maturity level in software processes. 5.3 td management of the eight tdm activities proposed by li et al. (2015), as shown in figure 2, only td monitoring was not marked by the participants when asked about which tdm activities were conducted in their projects. td identification and td documentation are conducted in projects according to six participants each, while five participants each marked td prioritization, td communication, and td repayment. td measurement and td prevention are conducted in projects according to two participants. one participant did not mention any tdm activities. no participants mentioned any tdm activity besides those proposed by li et al. (2015). table 7 presents the grouping of the results according to the sizes of the organizations. 5.3.1 tdm responsibilities there was no consensus among participants on which roles should be responsible for each tdm activity. moreover, some distinctions were observed between the participants' responses and the tdm framework presented in yli-huumo et al.'s study (2016). for instance, the tdm framework presented in yli-huumo et al. (2016) states that software architects and the team leader are the roles responsible for the td measurement activity. however, in our survey, no respondent selected software architects as responsible for this activity.
on the other hand, it was also identified that some roles responsible for performing certain tdm activities are similar to the results reported in yli-huumo et al. (2016), for example, software architect and development team performing td identification. therefore, we consider the findings concerning tdm responsibilities coherent with and complementary to those presented in yli-huumo et al. (2016). table 8 presents our results, compared with the ones from yli-huumo et al. (2016). 5.3.2 td identification two out of six participants answered that there is a mandatory strategy to conduct the td identification activities, while one participant answered that there is a formal strategy, albeit not mandatory. three participants claimed to adopt only simple strategies. three out of six answers suggested that td identification was conducted continuously throughout the project. regarding td classification, one participant affirmed that the td was classified as design debt or documentation debt in the project, while one participant claimed to use the artifact that initially incurred the td to classify it.
figure 2. summary of the survey responses
5.3.3 td documentation of the total of six participants, two answered that they have a standard for documenting the td that should be followed by all stakeholders. one participant answered that his/her project has a td documentation standard, but it is not mandatory for the stakeholders. two participants answered that the td documentation is conducted only informally. when asked how the td is documented, four participants answered that they use a general task backlog, with no specific details, while one affirmed using a specific backlog of td items. one participant did not provide any details on td documentation, despite informing that it is conducted in his/her project. 5.3.4 td communication of the five participants who answered the td communication section, four affirmed that the td was discussed during project meetings, but with the participation of only a few of the necessary stakeholders. one participant said that the td was only discussed informally. 5.3.5 td measurement of the two participants who answered the td measurement section, one affirmed that the td measurement was conducted informally, through the analysis of metrics and indicators based on specific information regarding the td item. the other participant indicated that there is a mandatory strategy to measure td, based on direct information, such as person-hours to repay the td item or the item's loc. 5.3.6 td prioritization regarding td prioritization, three of five participants answered that the td items were prioritized according to "guesses" or simplified estimates based on previous experience, while one used the td item's criticality to prioritize it. four participants affirmed that they tend to prioritize the td items that most impact the client, and three answered that they prioritize the td items that could cause the most impact on the project. one did not provide any details on td prioritization, despite adopting it in his/her project.
table 4. issues related to td, according to the participants
issue | % of participants
low internal quality aspects, such as maintainability and reusability | 77%
poorly written code that violates code rules | 68%
"shortcuts" taken during design | 68%
presence of known defects that were not corrected | 68%
architectural problems (like modularity violation) | 55%
low external quality aspects, such as usability and efficiency | 50%
planned, but not performed, or unfinished, tasks (e.g., models, test plans, etc.) | 50%
trivial code that does not violate code rules | 45%
code smells | 45%
defects | 36%
lack of support processes to the project activities | 23%
required, but unimplemented, features | 18%

table 5. td perception, grouped by organization size
organization size (total) | knows td? | perceived td? | adopts any tdm activity?
fewer than 10 employees (3) | yes: 1, no: 2 | yes: 1, no: 0, no answer: 2 | yes: 0, no: 1, no answer: 2
between 10 and 49 employees (11) | yes: 5, no: 6 | yes: 3, no: 0, no answer: 8 | yes: 3, no: 0, no answer: 8
more than 100 employees (26) | yes: 18, no: 8 | yes: 13, no: 4, no answer: 9 | yes: 7, no: 6, no answer: 13
total | 40 | 40 | 40

5.3.7 td repayment of the five participants who answered the td repayment section, two answered that the td repayment is planned according to current project needs, while one answered that the td repayment is planned continuously, with specific periods during the development process dedicated to this activity. one participant answered that the td is only repaid when it is no longer possible to avoid it. one participant did not provide any details on td repayment, despite informing that it is conducted in his/her project. 5.3.8 td prevention both respondents answering the td prevention section mentioned that it is an activity conducted only by each member of the team individually. 5.3.9 technologies and strategies for tdm table 9 presents a list of practices, techniques, and tools used in each tdm activity. the numbers in parentheses represent the number of participants answering that specific section (column "tdm activity") and the number of participants that affirmed using that tool or technique (column "technologies and strategies"). we can observe that different technologies support tdm, and there is no consensus about which one to use. most of such technologies are similar to those identified in the technical literature (see table 1).

table 9. tdm activities – technologies and strategies
tdm activity | technologies and strategies
td identification (6) | manual code inspection (4), dependency analysis (1), checklist (2), sonarqube¹/sqale² (3), checkstyle (1), findbugs³ (1)
td documentation/representation (6) | td backlog (3), specific artifacts for td documentation (1), jira⁴ (1), other: trello (1)
td communication (5) | discussion forums (3), specific meetings about td (1), other: gitlab (1), trello⁵ (1)
td measurement (2) | manual measurement (1), sonarqube (2), jira (1)
td prioritization (5) | cost/benefit analysis (1), classification of issues (3)
td repayment (5) | refactoring (3), redesign (1), code rewriting (4), meetings/workshops/training (1)
td monitoring (0) | n/a
td prevention (2) | guidelines (2), coding standards (2), code revisions (1), retrospective meetings (1), definition of done (2)
¹ https://www.sonarqube.org ² http://www.sqale.org/ ³ http://findbugs.sourceforge.net/ ⁴ https://br.atlassian.com/software/jira ⁵ https://trello.com/

6 discussion 6.1 revisiting the findings the analysis of the survey's results, presented in section 5, allowed us to answer the rqs reasonably, which we discuss next. 6.1.1 rq1: consensus on the perception of td we did not observe consensus in the overall td perception. each participant was asked to select which of the 12 suggested issues should be associated with the td concept, as presented in table 4. out of those options, only one issue was evaluated by 75% or more of the 22 respondents. from the issues associated with the td concept by 50% or more of the participants, only one ("issues associated with low external quality") is not considered td. only one issue associated with the td concept was marked by less than 50% of the 22 answers, which is "code smells" (42%). therefore, although we could not find consensus between the industry and academia, we consider that there is some agreement among participants about what should be considered td, since 17 out of 22 participants identified that td should be related to internal quality issues. however, 50% of the participants believe that td should also be associated with external quality issues, which is worrisome and contradicts the definition asserted at the dagstuhl seminar (avgeriou et al., 2016). it could indicate that there is a misconception of what should be considered td, associating its definition with any issue occurring during software development. we could observe some alignment in the views on td between the participants and academia, since most of the issues selected by more than half of the participants are also in agreement with the definition indicated in the technical literature. we believe that, despite the reasonable understanding of the td definition by some software practitioners, it is vital to better disseminate the distinction between issues related to internal quality (td) and those related to external quality (defects).

table 6. td perception, grouped by the adoption of maturity models to evaluate software processes
maturity model (total) | knows td? | perceived td? | adopts any tdm activity?
mps.br level g (1) | yes: 1, no: 0 | yes: 0, no: 1, no answer: 0 | yes: 0, no: 0, no answer: 1
cmmi level 2 (2) | yes: 2, no: 0 | yes: 2, no: 0, no answer: 0 | yes: 0, no: 2, no answer: 0
cmmi level 5 (2) | yes: 0, no: 2 | yes: 0, no: 0, no answer: 2 | yes: 0, no: 0, no answer: 2
others (4) | yes: 2, no: 2 | yes: 1, no: 0, no answer: 3 | yes: 1, no: 0, no answer: 3
total | 9 | 9 | 9

table 7. tdm activities conducted in the participants' projects, grouped by organization size
tdm activity | between 10 and 49 employees | more than 100 employees | total
identification | 2 | 4 | 6
documentation/representation | 3 | 3 | 6
communication | 1 | 4 | 5
measurement | 0 | 2 | 2
prioritization | 2 | 3 | 5
repayment | 2 | 3 | 5
monitoring | 0 | 0 | 0
prevention | 1 | 1 | 2

table 8. tdm responsibilities
tdm activity | our study | yli-huumo et al. (2016)
identification | team leader; software architect; development team | software architect; development team
documentation/representation | project manager; team leader; software architect; development team | software architect; development team
communication | team leader; software architect; development team | project manager; software architect; development team
measurement | team leader; development team | software architect; development team
prioritization | project manager; team leader; software architect; development team | project manager; software architect
repayment | team leader; software architect; development team | software architect; development team
prevention | project manager; team leader; software architect; development team | software architect; development team

6.1.2 rq2: bsos' practitioners' perception of td only 43% of the 40 participants claimed to perceive td in their software projects, which could be considered low, given the importance of the topic. moreover, only 25% of the 40 participants adopt tdm activities, possibly indicating the existence of a severe gap in the overall product quality perspective. however, we did not assess whether other internal quality assurance methods are adopted in place of tdm, which could explain the low perception of the td presence in the organizations. grouping the td perception by the size of the organizations (table 5), we could observe that a higher percentage of participants from larger organizations know about td when compared to smaller organizations. most of the participants from these companies also answered that they perceive td in their projects. this could indicate that larger organizations (generally active for a more extended period and having more solid processes to manage software development projects) have a broader perspective on td and tdm. unfortunately, due to the low number of responses, we could not analyze the correlation of the adoption of maturity models to evaluate software processes with any aspect of the study. out of the nine participants who answered that their organizations adopt maturity models, six did not know what td is, or did not perceive it in their last projects. of the remaining three who did perceive td in their projects, only one conducted any activities to manage it. this gap in the results can be used to further develop the research on td in bsos. 6.1.3 rq2.1.1: most relevant tdm activities the results of our survey show that there was no consensus on which tdm activities are more relevant to the surveyed software projects. however, almost half of the participants who answered this question mentioned that td prevention is relevant to a project. this is a possible research gap for future works, since most of the studies regarding tdm focus on td identification, measurement, and prioritization. regarding the main tdm activities conducted by the participants, our results are mostly in line with yli-huumo et al.
(2016), which indicates that td communication is most commonly adopted by the development teams, followed by td identification, documentation, prioritization, repayment, and prevention. the rarely managed td activities described in yli-huumo et al.'s study (2016) are td measurement and td monitoring, as also observed in our study. despite the number of participants indicating the importance of td prevention, only two reported performing td prevention activities. 6.1.4 rq2.1.2: technologies and strategies as presented in section 5.3.9, a list of tools and technologies used to manage td activities (see table 9) can be used in further studies looking for evidence on their effectiveness and efficiency in managing td. 6.2 comparison with results from related works most of the studies previously described in section 3 have distinct populations, as they are research from other countries. however, we could observe that their results are coherent with and complementary to the findings of our survey, as discussed in sections 5.3.1 and 5.3.9. when analyzing the results concerning td understanding or td perception in the projects, it is possible to observe that most of the surveyed software practitioners reported having a low level of knowledge about td. regarding the management of td, we could identify that it still seems to be incipient in the software organizations surveyed in such studies. most of the studies reported that tdm activities are performed in an informal and ad-hoc way. although some strategies and technologies identified by holvitie et al. (2018) to support the tdm activities are coherent with and complementary to those identified in our survey (see section 5.3.9), the evidence on their effectiveness and efficiency in managing td must be further investigated. besides, as previously mentioned, some distinctions and similarities were identified regarding which roles should be responsible for each tdm activity (see section 5.3.1). 6.3 threats to validity like any other empirical study, this research has some threats to validity. next, we report them together with some of the adopted mitigation actions, relying on the classification proposed by wohlin et al. (2012) and linåker et al. (2015). a potential internal threat comes from participants who might have misunderstood some terms and concepts of the questionnaire. there is also a construct threat of a biased survey, arising from the researchers' perspectives and from the information collected from the technical literature, such as the tdm activities organized by li et al. (2015). to mitigate this threat, we conducted three revision cycles during the survey development with two researchers. furthermore, two pilot trials were executed, followed by a final revision by all the pilot survey participants, aiming to ensure the modifications were aligned with their perspectives. we also observed a potential threat in the way the main topic of the survey was disclosed to the potential participants in the invitations. if the participants did not have previous knowledge regarding the topic, this could have driven them away from the survey, which could have biased the results. we observed an external validity threat concerning the representativeness and high mortality of the survey's respondents.
as part of our disclosure strategy involved presenting the research at software engineering research events, some of our results might present some bias. there is a high rate of mortality among respondents, since a substantial number of responses were discarded. only 65% of the 62 responses were valid to the point that we could obtain some information from them. the discarded responses refer to 22 incomplete questionnaires whose respondents did not reach the questionnaire's section 4 (td perception), so they could not be included in the analysis. the reason for incomplete questionnaires might be associated with the survey length and the response time. overall, the survey has 52 questions (although no participant had to answer the complete survey), distributed over 23 pages. studies report that every additional question can reduce the response rate by 0.5%, and every additional page, by 5% (linåker et al., 2015). however, since we do not have data on this possibility, we cannot draw firm conclusions. another possible reason for the low number of responses is that the concept of td is still incipient in bsos and, since the topic was explicitly mentioned in the invitations, it could have kept away some practitioners who are not familiar with the term. if this is indeed the case, the results would be even more worrisome, as the percentage of practitioners who know what td is could drop drastically. considering the initial number of survey participants, only ten of them reported adopting tdm activities in their projects. however, it is essential to highlight that the participants might not be representative of those who manage td in the bsos. on the other hand, the practitioners surveyed may be a good sample of how the td concept is perceived in the bsos, which is also an interesting result. this result may indicate that the td concept still needs to be further disseminated in the software industry. in this sense, the dissemination of the td concept and of aspects concerning its management could occur at the level of university courses or even of professional training. even among those practitioners who responded to the complete survey, we could observe a level of misconception, indicating that the td perception is not in line with the dagstuhl definition. finally, the main threat to validity is the generalization of the results. since the target sampling is non-probabilistic, it is not possible to determine a priori the population size and the expected total number of participants. therefore, the results' confidence level might be low, making it hard to generalize the results to the entire population (bsos). as argued by mello et al. (2015), establishing representative samples for se surveys is considered a challenge, and the specialized literature often presents limitations regarding the interpretation of surveys' results, mainly due to the use of sampling frames established by convenience and non-probabilistic criteria for sampling from them. as previously mentioned, methodological procedures were applied from the planning stage of our study through its execution, aiming to mitigate this threat. despite that, the conclusions can provide td research with initial indications of the level of knowledge of bsos regarding the td concept and tdm activities.
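the response-rate rule of thumb quoted in this section can be illustrated with a back-of-the-envelope calculation. the sketch below is not from the original study: it assumes, as one possible interpretation, that the per-question and per-page reductions compound multiplicatively from some baseline response rate, which the source does not specify:

# rough illustration of the linåker et al. (2015) rule of thumb for this survey
questions, pages = 52, 23
retention = (1 - 0.005) ** questions * (1 - 0.05) ** pages
print(f"{retention:.0%} of the baseline response rate")  # roughly 24%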
7 concluding remarks this paper presented background on the td definition and the results of a survey conducted with practitioners in bsos. the results provide initial observations regarding how bsos (represented by their software practitioners) perceive and manage td in their projects. from the analysis of the survey results, some observations can be made. first, we obtained a considerably low number of responses and an even lower number of complete responses. notwithstanding, the results were enough to provide an initial and representative picture of the perception of td and its management in the scenario of the bsos. regarding the td perception, our results indicate no unanimity concerning how brazilian software practitioners perceive td. regarding tdm, it was observed that only a few bsos report managing td in their software projects, indicating that tdm still seems to be incipient in bsos. four out of the nine practitioners reporting tdm activities claimed that td prevention is the most critical activity in their projects, despite only two participants indicating that they perform it. we believe that the results of this study provide the following contributions to both industry and academia: • to the bsos (industry): the initial results indicate that software practitioners and their organizations need to better understand the concept of td. this is necessary to achieve better results in their projects, since the perception of td and its management in this scenario is still incipient. the findings also present a list of technologies that can be used to support tdm activities, as long as software engineers evaluate their usage based on the organizations' needs at the time. moreover, the findings indicate that tdm activities usually involve distinct roles throughout the projects. in general, we consider that the bsos need to systematize some actions (e.g., training) to enable their teams to perceive and manage existing td. • to the researchers: our results indicate that there is a need for more investigations aiming to disseminate td knowledge to practitioners in bsos, as well as to provide strategies and software technologies to support tdm in these organizations. besides, we believe the sharing of this study package can support the development of investigations on td and its management that are more connected to the needs of software organizations in brazil and other regions. following this study, we conducted two other works that compose a research framework on td and its management. both were reported in silva et al. (2018a). the first work consisted of a quasi-systematic literature review to gather the technologies available in the technical literature to support tdm. the second work organized evidence briefings (cartaxo et al., 2016) in both english and portuguese, combining the survey and the literature review results. the evidence briefings intend to address critical points observed in the survey research, primarily the practitioners' general lack of knowledge or misconceptions concerning td. they are available online at https://doi.org/10.6084/m9.figshare.7011281. overall, we believe this study offered a new perspective on td research in bsos. to the best of our knowledge, only one other survey analyzed td specifically in bsos (rocha et al.
2017), but the authors focused mainly on td located at the code level, rather than the broader software engineering perspective adopted in our study. moreover, our survey package provides materials that can be used by other software engineering researchers to study this topic in other organizations and software communities, facilitating a better understanding, enabling future comparisons, and providing indications to evolve tdm activities. acknowledgements the authors thank all the professionals that took part in this survey and the researchers that have collaborated with their feedback on the pilot trials. this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior – brasil (capes) – finance code 001. prof. travassos is a cnpq researcher and an isern member. references alves nsr, mendes ts, de mendonça mg, et al. (2016) identification and management of technical debt: a systematic mapping study. inf softw technol 70:100–121. doi: 10.1016/j.infsof.2015.10.008. ampatzoglou a, ampatzoglou a, chatzigeorgiou a, et al. (2016) the perception of technical debt in the embedded systems domain: an industrial case study. proc 2016 ieee 8th int work manag tech debt, mtd 2016 9–16. doi: 10.1109/mtd.2016.8. assuncao tr de, rodrigues i, venson e, et al. (2015) technical debt management in the brazilian federal administration. 2015 6th brazilian work agil methods 6–9. doi: 10.1109/wbma.2015.11. avgeriou p, kruchten p, ozkaya i, et al. (2016) managing technical debt in software engineering. dagstuhl reports 6:110–138. doi: 10.4230/dagrep.6.4.110. becker c, chitchyan r, betz s, mccord c (2018) trade-off decisions across time in technical debt management. 85–94. doi: 10.1145/3194164.3194171. boehm b (2008) making a software century. brown n, cai y, guo y, et al. (2010) managing technical debt in software-reliant systems. in: proceedings of the fse/sdp workshop on the future of software engineering research, foser 2010. pp 47–51. cartaxo b, pinto g, vieira e, soares s (2016) evidence briefings: towards a medium to transfer knowledge from systematic reviews to practitioners. pp 1–10. cunningham w (1993) the wycash portfolio management system. acm sigplan oops messenger 4:29–30. doi: 10.1145/157710.157715. de frança bbn, jeronimo h, travassos gh (2016) characterizing devops by hearing multiple voices. proc 30th brazilian symp softw eng sbes ’16 53–62. doi: 10.1145/2973839.2973845. ernst na, bellomo s, ozkaya i, et al. (2015) measure it? manage it? ignore it? software practitioners and technical debt. in: 2015 10th joint meeting of the european software engineering conference and the acm sigsoft symposium on the foundations of software engineering, esec/fse 2015 proceedings. pp 50–60. fowler m (2009) technical debt quadrant. in: martinfowler.com. https://martinfowler.com/bliki/technicaldebtquadrant.html. guo y, spínola ro, seaman c (2016) exploring the costs of technical debt management – a case study. empir softw eng 21:159–182. doi: 10.1007/s10664-014-9351-7. holvitie j, licorish s, spinola r, et al. (2018) technical debt and agile software development practices and processes: an industry practitioner survey. inf softw technol 96:141–160. klinger t, tarr p, wagstrom p, williams c (2011) an enterprise perspective on technical debt. in: proceedings international conference on software engineering. pp 35–38. ktata o, lévesque g (2010) designing and implementing a measurement program for scrum teams: what do agile developers really need and want?
in: acm international conference proceeding series. pp 101–107. li z, avgeriou p, liang p (2015) a systematic mapping study on technical debt and its management. j syst softw 101:193–220. doi: 10.1016/j.jss.2014.12.027. lim e, taksande n, seaman c (2012) a balancing act: what software practitioners have to say about technical debt. ieee softw 29:22–27. doi: 10.1109/ms.2012.130. linåker j, sulaman s, maiani r, höst m (2015) guidelines for conducting surveys in software engineering. mcconnell s (2007) technical debt. 10x software development. oliveira f, goldman a, santos v (2015) managing technical debt in software projects using scrum: an action research. in: proceedings 2015 agile conference, agile 2015. pp 50–59. prikladnicki r, audy jln, damian d, de oliveira tc (2007) distributed software development: practices and challenges in different business strategies of offshoring and onshoring. proc int conf glob softw eng icgse 2007 262–274. doi: 10.1109/icgse.2007.19. ribeiro lf, de farias maf, mendonça m, spínola ro (2016) decision criteria for the payment of technical debt in software projects: a systematic mapping study. in: iceis 2016 proceedings of the 18th international conference on enterprise information systems. pp 572–579. rocha jc, zapalowski v, nunes i (2017) understanding technical debt at the code level from the perspective of software developers. proc 31st brazilian symp softw eng sbes’17 64–73. doi: 10.1145/3131151.3131164. santos psm, varella a, dantas c (2013) visualizing and managing technical debt in agile development: an experience report. seaman c, guo y (2011) measuring and monitoring technical debt. adv comput 82:25–46. doi: 10.1016/b978-012-385512-1.00002-5. silva vm, junior hj, travassos gh (2018a) technical debt management in brazilian software organizations: a need, an expectation, or a fact? in: brazilian symposium on software quality (sbqs). curitiba. silva vm, junior hj, travassos gh (2018b) a taste of the software industry perception of technical debt and its management in brazil. av en ing softw a niv iberoam cibse 2018 1–14. spínola ro, vetrò a, zazworka n, et al. (2013) investigating technical debt folklore: shedding some light on technical debt opinion. in: 2013 4th international workshop on managing technical debt, mtd 2013 proceedings. pp 1–7. tom e, aurum a, vidgen r (2013) an exploration of technical debt. j syst softw 86:1498–1516. doi: 10.1016/j.jss.2012.12.052. van solingen r, basili v, caldiera g, rombach hd (2002) goal question metric (gqm) approach. encycl. softw. eng. wohlin c, runeson p, höst m, et al. (2012) experimentation in software engineering. springer science & business media, heidelberg, berlin. yli-huumo j, maglyas a, smolander k (2016) how do software development teams manage technical debt? an empirical study. j syst softw. doi: 10.1016/j.jss.2016.05.018. zazworka n, spínola ro, vetro a, et al. (2013) a case study on effectively identifying technical debt. in: acm international conference proceeding series. pp 42–47. journal of software engineering research and development, 2023, 11:1, doi: 10.5753/jserd.2023.2417 this work is licensed under a creative commons attribution 4.0 international license.
technical debt guild: managing technical debt from code up to build thober detofeno [pontifícia universidade católica do paraná | thober@gmail.com ] andreia malucelli [pontifícia universidade católica do paraná | malu@ppgia.pucpr.br ] sheila reinehr [pontifícia universidade católica do paraná | sheila.reinehr@pucpr.br ] abstract efficient technical debt management (tdm) requires specialized guidance so that the decisions taken are oriented toward adding value to the business. because it is a complex problem that involves several variables, tdm requires a systemic view that considers the experiences of professionals from different specialties. guilds have been a means by which technology companies unite specialized professionals around a common interest, especially companies using the spotify methodology. this paper presents the experience of implementing a guild to support tdm activities in a software development organization, using the action research method. the project lasted three years, and approximately 120 developers were involved in updating about 63,300 source-code files, 2,314 test cases, 2,097 automated test scripts, and the build pipeline. the actions resulting from the tdm guild's efforts impacted the company's culture by introducing new software development practices and standards. besides, they positively influenced the quality of the artifacts delivered by the developers. this study shows that, as the company acquires maturity in tdm, the need for professionals dedicated to tdm activities increases. keywords: technical debt, technical debt management, community of practice, technical debt guild 1 introduction technical debt (td) is a metaphor that expresses software artifacts' immaturity and its impacts on software maintenance and evolution activities. according to brown et al. (2010), this metaphor characterizes the difference between a software system's current state and its hypothetical ideal state. a theoretical ideal state is understood as the one established by the context in which the software is inserted (brown et al., 2010). td negatively affects productivity and feasibility in software development. in many cases, developers are forced to introduce more td because of prior debts (besker et al., 2019). it is estimated that between 25% and 37% of all development time is wasted due to td, most of it spent understanding or managing td (ampatzoglou et al., 2017; besker et al., 2017; martini et al., 2018). if unmanaged, td can result in significant cost overruns, serious quality problems, reduced developer morale (ghanbari et al., 2017), and a limited ability to add new features (seaman et al., 2012). it can even reach a crisis point when a vast and expensive refactoring or a complete system replacement is needed (martini et al., 2014). the efficient management of td is a little-explored area, although it seems to help with quality and productivity during software development (guo et al., 2016; rios et al., 2018). works investigating aspects of td management in the software development process are isolated initiatives (rios et al., 2018). decision-making in td management is hard to standardize because, in most cases, it depends on the organization's context (guo et al., 2016). one way to face this problem is to build a team focused on solving problems. this approach can be a practical way of solving a wide range of issues and offering suggestions on processes and working methods that need improvement (connolly, 1992).
such groups can be implemented using the concepts of communities of practice (cop) (smite et al., 2019). a community of practice (cop) is a group of individuals who meet periodically due to a common interest in learning and applying what has been learned, sharing knowledge, exchanging experiences, bringing their problems, and finding solutions. one of the best-known examples of a cop is the concept used by the music streaming technology company spotify, named a guild (kniberg, 2014). in a context where tdm should be incorporated into the software development process, bringing together people who have knowledge of and interest in the subject can contribute to finding solutions and generating value for the business. this article presents an experience report on establishing and using a td guild in a software development organization throughout an action research process. the paper describes the experiences, results, success factors, and challenges. the actions promoted by the td guild contributed to the identification, monitoring, prevention, prioritization, and payment activities in tdm. the guild helped align td payment efforts with the organization's goals. due to the several strategies that can be adopted and the difficulties in measuring the results, implementing a tdm process is not a trivial task. this work is expected to support other companies in the challenge of tdm using a guild approach. this study is structured as follows: section 2 presents a literature review; section 3 presents the research method; section 4 describes the context and an overview of the company in which the study was conducted; section 5 describes the three cycles of action research; section 6 presents the results, lessons learned, challenges, related work, and threats to validity. finally, section 7 concludes the paper. 2 background 2.1 guild or communities of practice (cop) in the middle ages, guilds played an essential role in economic sustainability. a guild was formed hierarchically by masters, officers, and apprentices and had experienced and renowned specialists in its field of craftsmanship. these specialists were called master artisans. there was an exchange of knowledge in these guilds to make the work more efficient and productive (wolek, 1999). using these older phenomena as a reference, lave and wenger (1991) coined the term community of practice (cop). in the most current concept, approached by wenger and wenger-trayner (2015), cops are formed by people who share a concern or passion for something they do and engage in collective learning in a shared domain of human endeavor, learning to do it better as they interact regularly. for wenger, mcdermott, and snyder (2002), domain, community, and practice are the three essential elements that characterize a cop. the domain builds the community and identity and corresponds to the area of interest that attracts and keeps the members. the community, in turn, is the central element, composed of individuals and their interactions based on joint learning. cops stand out for managing knowledge assets in organizations, creating value for members and the organization as a competitiveness tool. they can develop new skills and generate strategic opportunities through innovations (wenger et al., 2002).
it is believed that cops are used in organizations of different natures, under other terminologies, such as learning networks, thematic groups, technology clubs, and guilds. the professional literature on how to scale up agile software development suggests cops as a possible solution for learning and knowledge sharing among individuals with similar functions, such as testers or scrum masters (larman and vodde, 2010). experience from four cops at ericsson shows that success factors include a good topic, a passionate leader, a proper schedule, decision-making authority, openness, tool support, a suitable rhythm, and cross-site participation when needed. the cops at ericsson had three leading roles: to support the agile transformation, to be part of the large-scale scrum implementation, and to support continuous improvement. cops became a central mechanism behind the success of the large-scale agile implementation in the case organization and helped mitigate some of the most pressing problems of the agile transformation (paasivaara and lassenius, 2014). for smite et al. (2019), implementing well-functioning communities is not easy. experiences from oracle corporation, the uk national health service, hewlett-packard, wipro technologies, alcatel, and daimlerchrysler suggest that cultivating a knowledge culture requires organizational attention, support, and sponsorship for cops. inspired by cops, the guilds at spotify are designed beyond formal structures and unite members with common interests, whether related to leisure (cycling, photography, or coffee consumption) or engineering (web development, back-end development, c++ engineering, or agile coaching). figure 1 presents the five types of members identified by smite et al. (2020) in the guilds at spotify, based on the numbers of members registered in the communication channels and engaged in the activities. similar to wenger et al. (2002), smite et al. (2020) identified a group of core members (sponsors and coordinators), active members, and peripheral members (passive members and subscribers). the latter group forms the majority of community members (smite et al., 2020). smite et al. (2020) noticed that individual members' activity levels change over time for several reasons: the coordinator role rotates, some active members become passive and vice versa, and those who change specialization turn into inactive users who merely subscribe to the latest news. figure 1. different types of members in a guild (smite et al., 2020). some guilds arise from shared interests, while others are structured or sponsored and can even have a specific budget. maintaining a guild and generating value for the organization is a challenge. 2.2 technical debt management (tdm) as previously stated, td represents the effects of immature artifacts in software evolution that bring short-term benefits but have to be adjusted later. the concept, whose scope was initially limited to source code and related artifacts, was expanded to consider different software development stages and work products (alves et al., 2016). rios, mendonça, and spínola (2018) provide a taxonomy with 15 types of td, as described below: • architecture debt – "refers to the problems found in product architecture, which can affect architectural requirements. usually, architectural debt could result from sub-optimal upfront solutions or sub-optimal solutions as technologies and patterns become superseded, compromising some internal quality aspects, such as maintainability."
• automation test debt – "refers to the work involved in automating tests of previously developed functionality to support continuous integration and faster development cycles." • build debt – "refers to issues that make the build task harder and unnecessarily time-consuming." • code debt – "refers to the problems found in the source code (poorly written code that violates best coding practices or coding rules) that can negatively affect the legibility of the code, making it more challenging to maintain." • defect debt – "refers to known defects, usually identified by testing activities or by the user and reported on bug tracking systems, that the development team agrees should be fixed but, due to competing priorities and limited resources, have to be deferred to a later time." • design debt – "refers to debt discovered by analyzing the source code and identifying violations of sound object-oriented design principles." • documentation debt – "refers to the problems found in the software project documentation." • infrastructure debt – "refers to infrastructure issues that can delay or hinder some development activities if present in the software organization. such issues negatively affect the team's ability to produce a quality product." • people debt – "refers to issues that can delay or hinder some development activities if present in the software organization." this is represented by a late hire, for example. • process debt – "refers to inefficient processes, e.g., what the process was designed to handle may no longer be appropriate." • requirements debt – "refers to tradeoffs made concerning what requirements the development team needs to implement or how to implement them. in other words, it refers to the distance between the optimal requirements specification and the actual system implementation." • service debt – "refers to the inappropriate selection and substitution of web services that lead to a mismatch between the service features and the applications' requirements. this kind of debt is relevant for systems with service-oriented architectures." • test debt – "refers to issues found in testing activities that can affect the quality of those activities." • usability debt – "refers to inappropriate usability decisions that must be adjusted later." • versioning debt – "refers to problems in source code versioning, such as unnecessary code forks." design, code, and architecture debts are the most studied td types. this is probably because several source code analysis tools help identify problems such as complex code, code smells, duplicate code, and others, which often serve as indicators of technical debt. the authors also define debt types and a list of situations in which td items can be found in the software (rios et al., 2018). a td item represents an instance of td and has several causes (factors that lead to the occurrence of the td item) and consequences to the project. a td item can be caused by inappropriate processes, decisions, schedule pressure, etc. on the other hand, td items can cause several consequences that affect software features and are usually related to cost, value, schedule, and quality. a td item can be associated with one or more artifacts of the software development process (rios et al., 2018).
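to make the notion of a td item concrete, the sketch below shows one way such an item could be recorded, mirroring the elements just described (type, causes, consequences, and associated artifacts). the dataclass, the principal/interest fields (standard terms of the td metaphor for repayment effort and recurring extra effort), and the naive priority score are illustrative assumptions, not a structure defined by rios et al. (2018):

from dataclasses import dataclass, field

@dataclass
class TDItem:
    # fields mirror the description above: an instance of td with its
    # type, causes, consequences, and the artifacts it is associated with
    debt_type: str                 # e.g. "code", "design", "architecture"
    causes: list = field(default_factory=list)
    consequences: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)
    principal: float = 0.0         # estimated effort to repay (e.g., person-hours)
    interest: float = 0.0          # recurring extra effort while the item is unpaid

    def priority(self) -> float:
        """naive cost/benefit score: items whose recurring interest is high
        relative to the repayment effort float to the top of the backlog."""
        return self.interest / self.principal if self.principal else float("inf")

# illustrative usage with a hypothetical item
item = TDItem("code", causes=["schedule pressure"],
              consequences=["harder maintenance"],
              artifacts=["billing/invoice.php"], principal=16, interest=4)
print(round(item.priority(), 2))  # 0.25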
if td items are not managed, they can cause financial and technical problems, increase software maintenance and evolution costs, and lead to a crisis point where the entire future of the software can be compromised (martini and bosch, 2016; spínola et al., 2013; nord et al., 2012). it is not enough for teams to be only aware of what constitutes td. they must be aligned to manage td so as to add value to the business. simply knowing about td does not necessarily result in value for the software (bavani, 2012). the td metaphor allows thinking about software quality in terms of the organization's business (tom et al., 2013). however, the decision criteria used for the payment of td can differ according to the different scenarios and objectives of an organization (rios et al., 2018). a challenge for development teams is to quantify the maintenance problems of their projects to justify the investment in refactoring the td (mo et al., 2018; sharma et al., 2015). convincing arguments are needed about when and why the td should be removed. a model for tdm should foresee the contexts in which td is identified and evaluated so that decisions can help companies and organizations take advantage of opportunities and anticipate market needs (kruchten et al., 2012). although td affects everyone involved in the project, regardless of the cause, the level of communication regarding td varies. team members generally discuss td among themselves but find it difficult to present evidence of tdm to upper-level management (codabux, 2013). tdm includes identifying, monitoring, and paying td items incurred in a system (griffith et al., 2014). rios, mendonça, and spínola (2018) describe prevention, identification, monitoring, and payment as macro activities, and documentation and communication as activities performed throughout tdm. some activities, such as identification (e.g., td detection by static source code analysis), measurement (td quantification using estimates), and payment (td resolution by techniques such as re-engineering or refactoring), receive more attention, with the support of appropriate tools and approaches (li et al., 2015). the payment activity refers to the activities carried out to support decision-making about the most appropriate time to eliminate td items. at this point, the prioritization of which td item should be eliminated is made (rios et al., 2018). tdm makes it possible to make decisions about eliminating td and about the most appropriate moment to do so (guo et al., 2016). decision-making criteria are the basis for prioritizing the payment of td items. tdm should be based on a rational approach to decision-making, considering planned and potential future development (schmid, 2013). guo et al. (2016) mentioned that aspects of managing td in a software development process have been little explored. up to the literature search conducted for this work, no studies reporting experiences of applying cops or guilds to support tdm had been found. 3 research method considering the characteristics of this study, the research method selected was action research. action research aims to enable research subjects, participants, and researchers to respond to the problems they experience with greater efficiency, based on transformative action.
the characterization of action research varies from one author to another; however, there is a set of common characteristics (dick, 2000):

• to act in an existing situation to improve it and expand knowledge on the subject.
• to have a cyclical nature, repeatedly executing a series of steps. the cycle varies according to the author, but it must include the stages of planning, action, and reflection.
• to possess a reflexive nature, with critical reflection on the research process and the obtained results.
• to be primarily qualitative, although quantification is possible in some situations.

in coughlan and coghlan (2002), the action research cycle comprises three steps, as illustrated in figure 2:

1. a pre-step, to understand context and purpose.
2. six main steps that relate first to the data and then to the action, as follows:
• data gathering: collecting qualitative or quantitative data through observations, interviews, surveys, and reports.
• data feedback: the collected data is submitted to the organization for analysis, through reports or feedback meetings.
• data analysis: each party contributes a critical view of the collected data, internal company issues, the conduct of the research, or its interaction with the researcher's knowledge.
• action planning: defining what will be done and by when.
• implementation: the actions are implemented to promote the planned changes, in collaboration with the stakeholders.
• evaluation: reflecting on the results, expected or not, of executing the action, aiming to improve the next cycle.
3. a meta-step of monitoring that occurs throughout all the cycles.

each cycle leads to another, so continuous planning, implementation, and evaluation occur over time, as illustrated in figure 2.

figure 2. action research cycle (coughlan and coghlan, 2002).

our study was structured based on this approach, as illustrated in figure 3. it began with a stage of understanding the context and proceeded with three cycles of the driving phase, composed of the following steps:

• planning: data analysis was performed with those involved to establish what would be done and when.
• action: the planned activities were implemented to promote the planned changes, in collaboration with those involved and responsible in the organization.
• evaluation: a reflection was performed to analyze the outcomes, aiming to improve the following cycle.

each cycle of this research was conducted as presented below:

• 1st cycle: the guild was created, and the guidelines for the scheduled and unscheduled social interactions were established. the first steps toward td identification were taken, and the teams were guided in td payment and monitoring.
• 2nd cycle: this cycle reviewed the previous one; the tools and td management activities were revised. the guild promoted the standardization of source code development and documentation and guided the teams in prioritizing td.
• 3rd cycle: the review of the tools and of the td identified in the source code was maintained, and the td guild sought to identify and propose actions to pay the td in the continuous integration test artifacts.

the duration of each cycle is linked to the company's annual management cycle, which foresees periods of planning and execution of actions that impact the software development process or the teams' goals.
figure 3. timeline of the research.

4 context

this article describes the experience of implementing and evolving a td guild in a brazilian software development company founded in 1995. the company currently has more than 2,000 customers and 300,000 users worldwide, providing process improvement and compliance management solutions. corporations use its solutions in all kinds of industries: manufacturing, automotive, food and beverage, mining and metals, oil and gas, high-tech and it, energy and utilities, government and public sector, financial services, transportation and logistics, and healthcare. technically, the product is entirely web-based, with documentation and localization for more than ten languages, and is compatible with three database management systems.

the software development area combines some benefits of the agile philosophy with project management. project management and planning follow the scrum method defined by schwaber and sutherland (2020), with two-week development cycles and a quarterly release to the market. thus, the company does not have automated continuous delivery or continuous deployment; however, it has continuous integration, with a standardized and automated development flow/process shared by all teams. during the three years of the study, the area had, on average, 96 professionals split into 12 teams composed of professionals with the following roles: product owners (po), scrum masters (sm), developers (dev), testers, and devops. the teams vary in the number of members, the amount of source code they are responsible for, and the programming languages used. the source code repository is composed of 61% php, 30% javascript, 3% java, 2% html, 2% css, 1% json, and 1% xml. the development area is responsible for approximately 63,300 source code files; in the second and third years of the study, a monthly average of 1,850 change packages (commits) was applied to the repository.

td concepts and tdm activities, presented as an approach to contribute to quality and productivity during software development, were introduced to the product owners and scrum masters in an internal meeting. the area's director proposed to sponsor and support the creation of a td guild to discuss and offer tdm solutions for the company. the invitation to the td guild was extended to all professionals involved in the product's maintenance and evolution activities. each year, three or four experienced professionals were invited to become active members because of their deep knowledge of the product's architecture. in the three years during which the td guild was implemented and evolved, the sponsor and the coordinator were the same professionals, but there were changes among the active members. figure 4 shows the number and type of members per year. in the first year, the active members were a tester, a product owner (po), a scrum master (sm), and three developers (devs); in the second year, two testers, three sms, and two devs; and in the third year, three testers, an sm, and two devs. the guild was thus composed of representatives with technical and business knowledge of the product.

figure 4. td guild members.

most members of the td guild were peripheral members, who do not represent the key practitioners.
peripheral members are those with low involvement in the guild's interactions or those impacted by the guild's actions; they contributed suggestions and criticisms or encouraged the initiatives. in the 2nd cycle, these comments were analyzed through a survey. the td guild emerged within an organizational context, aligned with strategic objectives, and sponsored by the board. besides the exchange of experiences, learning, and best practices on tdm, the td guild was challenged to generate value for the product and add knowledge to the software development teams. the td guild's formation was based on the guidelines for building cops and on the characteristics of autonomy and alignment with strategic objectives given by the spotify approach presented by kniberg (2014). the guild meetings were monthly and face-to-face, but the members met more frequently when specific actions demanded faster deliberation. the primary responsibilities of the td guild coordinator were to organize the subjects and meetings, monitor the execution of tasks, support guild members, and align the needs with the sponsor. the sponsor was responsible for evaluating the proposed actions, approving them, and providing the necessary resources to execute the tasks.

during guild meetings, each member presented the ideas and problems to which the guild should pay attention. for each action approved by the guild members and the sponsor, a task list was created with a guild member as the responsible person, whose objective was to carry the theme forward, performing in-depth studies and practical tests to evaluate the proposal's feasibility. the sponsor approved the tasks so that the person in charge could prioritize them among the team's other demands. the subjects and actions in progress were discussed at the beginning of the guild meetings; specific issues were often discussed in an internal communication channel or by e-mail.

5 research cycles

the td guild's beginning was marked by discussions and alignments about its purposes, objectives, guidelines for conducting the activities, and subjects of interest to the organization and its members. at the beginning of each research cycle, guild members reviewed goals and procedures. the td guild's purpose was to study, and to help implement and monitor, tdm, with proposals and actions to improve internal quality and reduce the costs of software maintenance and evolution. to carry out its duty, the td guild developed the following directives to conduct the meetings and activities aligned with the organization's expectations:

• be aligned with the company's strategy.
• have a well-defined purpose or objective.
• have autonomy to implement solutions.
• clearly communicate the problems and opportunities to the interested parties.
• members must promote td payment actions within their teams.
• allow members of different teams to participate, considering that each member should have knowledge of the work context.
• members influence the teams, helping to direct and prioritize td refactoring tasks.
• maintain the focus on quality and productivity in software development, helping to define the actions of td prioritization and payment.
• guide the teams on best practices and standards of internal development.
• have periodic meetings to monitor the actions and propose changes.

5.1 first cycle

in the first year of this study, the actions of the td guild focused on two initiatives related to the php source code: (1) identifying, measuring, and monitoring the main technical debts in the php source code; and (2) identifying and proposing actions to improve the php source code. based on the guidelines and the deployment of the initiatives, the td guild defined the following actions:

• deploy tools to support tdm.
• identify td in the context.
• guide teams on td payment.
• monitor td payment.

the guild contributed to disseminating tdm within the company, identifying the td most relevant to the company's objectives, and selecting the most appropriate td identification and monitoring tools. the priority classification of quality rules performed by the guild members, according to the company's goals and resources, guided td payment.

5.1.1 deploy tools to support tdm

managing td requires tools for implementing and sustaining actions: tools provide support and enable the automation of tdm activities. sonarqube¹ and the sonar-php² plugin were selected to identify and monitor td in the php source code. this choice was mainly due to the number of quality rules available for php source code, the options for configuring the quality rules, and the possibility of developing custom quality rules. the quality rules provided by sonarqube were reviewed and updated according to the organization's context, and sonarqube identified the source code that did not comply with the td guild's coding standard.

¹ https://www.sonarqube.org/
² https://github.com/sonarsource/sonar-php

5.1.2 identify td

in this action, the td guild's objective was to study in detail and select the quality rules provided by sonarqube, considering the company's goals. table 1 presents the priority scale used to select and classify the quality rules. the descriptions of the priorities were defined by the td guild, taking the organization's context into account, and were used to guide the classification of the quality rules. the rationale for using this scale is that it maps directly to the five-point scale used in sonarqube: blocker, critical, major, minor, and info.

table 1. priority scale used to classify the quality rules.

priority   description
blocker    rules considered bugs, system vulnerabilities, or commands that should not be used.
critical   important rules with a high impact on product quality and source code standardization.
major      less important rules, with a low impact on product quality.
minor      good practice rules that should be monitored.

this action resulted in the approval of 93 quality rules, which were activated in sonarqube and classified by priority and td type, as shown in table 2.

table 2. first cycle: rules classified by priority and td type.

priority:  blocker 25, critical 27, major 14, minor 27
td type:   code debt 39, defect debt 18, design debt 36
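since the counts in table 2 come from per-severity issue totals collected in sonarqube, a minimal sketch of how such counts can be retrieved is shown below. it assumes sonarqube's web api endpoint api/issues/search; the server url, project key, and token are hypothetical placeholders, so this is an illustration rather than the guild's actual tooling.

    <?php
    // hedged sketch: count unresolved issues per severity for one project
    // via sonarqube's web api (api/issues/search). server, project key,
    // and token are hypothetical placeholders.

    $server  = 'https://sonar.example.com';
    $project = 'my-php-product';
    $token   = getenv('SONAR_TOKEN'); // classic sonarqube auth: token as user, empty password

    $counts = [];
    foreach (['BLOCKER', 'CRITICAL', 'MAJOR', 'MINOR'] as $severity) {
        // ps=1 keeps the payload small; only the "total" field is needed.
        $url = sprintf(
            '%s/api/issues/search?componentKeys=%s&severities=%s&resolved=false&ps=1',
            $server, urlencode($project), $severity
        );
        $context = stream_context_create(['http' => [
            'header' => 'Authorization: Basic ' . base64_encode($token . ':'),
        ]]);
        $response = json_decode(file_get_contents($url, false, $context), true);
        $counts[$severity] = $response['total'] ?? 0;
    }

    // e.g. [BLOCKER => 25, CRITICAL => 27, MAJOR => 14, MINOR => 27], as in table 2
    print_r($counts);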
5.1.3 guide the teams in td payment

the quality rules were classified by their priority, complexity, and impact to support the teams in prioritizing and paying td. the priority analysis ranked the quality rules considering the objectives and resources available in the research context and was used to prioritize td payment actions. the analysis of the quality rules' complexity and impact helped the teams select the source files for td payment. complexity was understood as the technical difficulty of resolving a quality rule, and the impact of a change was classified by its extent within the system, that is, the change's potential to affect other modules or classes. some members ranked the quality rules separately, following the guild's guidelines, and the classification results were reviewed and aligned during guild meetings.

5.1.4 monitor td payment

td monitoring was intended to expose the current state of the code and motivate the teams to pay td. to support the teams in the periodic monitoring of td, sonarqube was configured per team, and a web portal with the values of the td classifications was made available. this action facilitated management's follow-up on the teams' td payment initiatives.

5.2 second cycle

at the beginning of the second cycle, the guild members discussed and decided to maintain the objectives and guidelines defined in the first cycle, adding the initiative to identify and propose improvement actions in the php source code most relevant to the project. thus, the td guild defined the following steps:

• deploy tools to support tdm.
• define a coding standard in php.
• define a documentation standard in php.
• identify td in the context.
• train the teams on the standards and best practices.
• evaluate the guild's actions with the developers.

the actions to monitor and guide the teams were incorporated into the software development process, so the teams had guidance on how to track and pay td. to define the coding and documentation standards from samples of the php source code, the guild members evaluated the impacts of modifying the product. in this way, besides assessing the impacts, it was possible to estimate the necessary effort and create practical procedures to adopt and maintain the standards.

5.2.1 deploy tools to support tdm

to support decision-making on td prioritization and payment actions, guild members developed two internal systems: one to calculate the effort needed to eliminate a team's td items and another to analyze the dependencies of each php source file. several tools were also evaluated to facilitate large-scale changes, format the source code, and eliminate code smells. although the guild did not approve tools that make changes automatically, without the developers' manual supervision, it recommended the two free tools that presented the best results (visual studio code and scite). the guild expects to revisit this subject in a future cycle.

5.2.2 define a coding standard in php

setting a coding standard for php development aimed to define rules and development standards that improve developers' ability to communicate. the ultimate goal is less disruption when source code is maintained by a developer who did not create it. the option was to use the php standards recommendations (psr), project specifications proposed by the php framework interop group (php-fig). the psrs are currently the primary standards used in php development and have source code verification tools that help with automatic adaptation and monitoring of the source code. this project followed the recommendations of the psr-1 and psr-2 standards. sonarqube, which provides formatting rules according to the psr standards, was used to monitor this td. knowledge transfer was done through the standard's documentation and internal training for developers.
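to make the adopted standard concrete, the sketch below shows the shape of code that follows psr-1 and psr-2: studlycaps class names, camelcase method names, declared visibility, four-space indentation, and opening braces on their own lines for classes and methods. the namespace, class, and method are hypothetical examples, not code from the company's product.

    <?php
    // hypothetical psr-1/psr-2-compliant class, for illustration only.

    namespace App\Billing;

    class InvoiceCalculator
    {
        private $taxRate;

        public function __construct(float $taxRate)
        {
            $this->taxRate = $taxRate;
        }

        public function totalWithTax(float $amount): float
        {
            // psr-2: control structure brace on the same line, one space before it
            if ($amount < 0) {
                throw new \InvalidArgumentException('amount must be non-negative');
            }

            return $amount * (1 + $this->taxRate);
        }
    }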
5.2.3 define a php documentation standard

after the company implemented agile practices, the developers questioned the source code's documentation, especially regarding its value to the product and the teams. reports from developers, especially recently hired ones, made it evident that the current source code did not have a standard terminology that allowed quick understanding. another situation identified by the teams was the difficulty of finding routines already implemented in the system. the php docblock standard was adopted as the reference for documenting functions, source code elements, classes, and methods, and phpdocumentor (a tool that generates documentation from php source code) was used to create the documentation. after the documentation standard was defined, a custom rule in sonarqube helped the teams monitor and identify source code that did not meet the standard. as in the previous action, knowledge transfer was done by documenting the standard and delivering internal training to developers.
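as an illustration of the documentation standard, the sketch below shows a docblock in the phpdocumentor style adopted as reference, with a one-line summary, a longer description, and tags for parameters, return value, and exceptions. the function and its parameters are hypothetical, not part of the company's codebase.

    <?php
    /**
     * calculates the total effort (in hours) to pay a set of td items.
     *
     * the estimate multiplies the remediation effort of the td items by a
     * team-defined context factor, so it can support prioritizing payment
     * actions during release planning.
     *
     * @param float[] $remediationHours effort per td item, in hours
     * @param float   $contextFactor    team-specific adjustment factor
     *
     * @return float total estimated effort in hours
     *
     * @throws \InvalidArgumentException if the context factor is not positive
     */
    function estimateTdPaymentEffort(array $remediationHours, float $contextFactor): float
    {
        if ($contextFactor <= 0) {
            throw new \InvalidArgumentException('context factor must be positive');
        }

        return array_sum($remediationHours) * $contextFactor;
    }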
5.2.4 identify td

after defining the coding and documentation standards, it was necessary to review the rules approved in the first cycle, since the guild members had a deeper knowledge of the quality rules in this cycle. in total, 125 rules were approved and classified by priority and td type, as shown in table 3.

table 3. second cycle: rules classified by priority and td type.

priority:  blocker 33, critical 19, major 34, minor 39
td type:   code debt 44, defect debt 41, design debt 37, documentation debt 3

5.2.5 train the teams

the training was conducted to qualify and guide developers on the changes in programming procedures and source code releases in php. the td guild promoted three courses for eight groups during these first two years. each training session lasted four hours, with one developer from each team per group and groups of 12 to 14 participants; in total, the guild delivered 96 training hours to almost 90 participants. the first course covered the use of sonarqube and the procedures to monitor the team's source code; the other two courses covered the coding and documentation standards for the php source code. the developers who attended the training were responsible for passing the knowledge on to the other developers, and the training material was documented and published in the company's internal knowledge base.

5.2.6 evaluate the actions with the developers

at the end of the second year, a survey was applied to the development teams to assess the impact of the td guild's actions. the objective of the survey was to capture the developers' perceptions of the actions taken by the guild. we received 83 responses from the 89 peripheral members. the survey had two questions: a closed question in likert scale format and an open-ended question:

1. the td guild's actions to improve the source code contributed to the developers' productivity or quality.
2. in your opinion, what were the impacts of the actions promoted by the td guild on projects and teams?

because the open-ended question was not mandatory, 52 of the 83 respondents answered it. the survey indicated that approximately 94% agreed that the td guild's actions helped improve product quality and team productivity. since the 52 answers to the open-ended question were written in natural language, a thematic analysis approach was used to analyze them. thematic analysis is an effective method for identifying, analyzing, and reporting patterns and themes within the data under study (braun and clarke, 2006). by analyzing and coding the answers, we identified patterns among them, and five themes emerged: compliance (31 quotes), maintainability (16 quotes), refactoring (11 quotes), understandability (8 quotes), and reusability (3 quotes). the most cited characteristic was that the actions improved the standardization of the source code and the use of best programming practices; these actions helped improve the source code's understandability, refactoring, and maintenance (35 responses). three developers quoted source code reuse as a side benefit: the standardization of the source code documentation helped developers locate existing source code from other projects. as reported in the open-ended question, the most significant impacts on the teams were the changes in developers' behavior in development activities and code review. developers were motivated to write cleaner, standards-compliant code and sought to interact to improve the source code during code review. in this context, much of the legacy source code had been developed under an architecture with several anomalies, such as difficulty of reuse, strong coupling, and lack of separation of responsibilities among software architecture layers. developers understood td's impacts and were concerned with refactoring the source code; still, the pressure to deliver new features, the lack of resources, and the source code's architecture hindered td payment.

5.3 third cycle

in the first two research cycles, the guild's actions focused on source code artifacts. however, the guild understood that tdm efforts should expand to other types of immature, incomplete, or inadequate artifacts in the software development lifecycle that cause higher costs and lower quality in the long term. in this cycle, the guild kept the actions for improving the php source code and created measures to manage test, automated test, and build technical debts, which were the most significant tds after source code. the issues discussed by the td guild in this cycle revolved around two main questions: (i) is the current state of functional test planning, documentation, and execution optimal for this context? (ii) are build issues affecting the teams' productivity? thus, the td guild defined the following actions:

• identify td in the context.
• review test case documentation.
• define an automated test development standard.
• monitor automated test execution.
• identify build debt.

5.3.1 identify td

the guild members understand that the sonarqube version should be updated and the priority of the source code quality rules reviewed every year. in this cycle, 189 quality rules were reviewed: 181 provided by sonarqube and eight tailored by the guild members. of these, 149 rules were approved and classified by priority and td type, as shown in table 4.

table 4. third cycle: rules classified by priority and td type.

priority:  blocker 69, critical 29, major 32, minor 19
td type:   code debt 46, defect debt 26, design debt 45, documentation debt 3, vulnerability debt 29

5.3.2 review test case documentation

this action aimed to review the descriptions of the test cases executed manually or automatically.
the action was performed by 12 pos and 13 testers, who reviewed 2,314 test cases, in which 2,097 steps are performed automatically every day and 4,295 steps are performed manually on each new product release. the testlink¹ tool was used to register and review the test cases; the company already used testlink to record test cases, and the tool suited this action. following the td guild's guidance, reviewers were to answer the following checklist questions about their project's test cases:

• are the documented test cases suitable for the project?
• do the most critical project requirements have planned test cases?
• are the test cases updated in the software test management tool (testlink)?
• do all test cases have a title, objective, action, steps, and expected results?
• do the test cases have desired results that can be validated?
• do the test cases have a well-defined objective?

¹ https://testlink.org/

5.3.3 define an automated test standard

the need to define a standard for developing automated test scripts was identified from the developers' demotivation in creating automated tests. the guild discussed this issue with some developers and team leaders, and together they identified the following causes:

• automated test scripts with too many lines.
• outdated and redundant code.
• many failures due to outdated test execution environments, databases, and test scripts.
• lack of visibility into test automation results.

we emphasize that the tools used to develop and execute automated tests, and their integration with the product, met the company's needs. from the identified causes, the td guild ran a pilot project with one development team, which produced 96 automated tests to serve as the standard for developing new automated tests. the guild also proposed and implemented the practice of code review for automated tests, supported by a checklist with the following questions:

• is the automated test documented, and does it reference the test case and its steps?
• does the script contain outdated code (e.g., xpath, sleep, non-standard selectors)?
• in image comparison tests, does the script correctly reference the model image?
• is the automated test independent?
• is the data kept in the test base as evidence in case of failure?
• is an evidence image of the correctly executed test generated at the end of the test run?

5.3.4 monitor automated test execution

the teams highlighted the lack of a tool to control the execution of the automated tests. the proposal implemented by the td guild was a data analytics dashboard through which the teams monitor the status of the automated tests. monitoring automated tests provides all developers and stakeholders with a view of the test cycles' progress, the results achieved, the identified failures, and test execution metrics. the td guild made two metrics available to monitor test execution. the first presents the test execution status, grouped by day; with it, the teams can monitor the execution of the automated tests. the second represents the number of documented test cases with automatic or manual execution; it helps the teams plan the resources needed to develop new automated tests.

5.3.5 identify build debt

the company has tens of millions of lines of source code at a change rate of 1,850 commits per month.
we highlight, as the main advantages, that the company has a guide for standardizing development, a single source code repository, a single build system for all projects, and a single testing infrastructure. in this action, we sought to identify build times, the builds' success rate, and which services fail most often in the build system. the data was extracted from the version control system, from which we highlight the following information, grouped by month:

• an average of 1,010 merge requests.
• an average of 3,100 requested builds.
• the build success rate, i.e., the percentage of successfully executed builds, decreased from ~60% to ~30% (presented in figure 7).
• the average build time increased from ~10 to ~35 minutes (presented in figure 8).
• of the 43 services executed in the build, the five that failed most were identified.
• in the last month of the cycle, build failures consumed 107,036 minutes of processing.

6 discussion

in this section, we discuss the results obtained in td payment, the main success factors, and the challenges faced. we also describe guidelines to support the creation of a td guild, related work, and the main threats to the validity of our work.

6.1 results

the guild was present in all tdm activities, and its involvement in td categorization and prioritization gave the teams confidence in td payment. the classification of the quality rules by priority was performed in all three research cycles, so it was possible to evaluate the results obtained in the payment of td items throughout the cycles. td guild meetings were held monthly, with a pre-defined duration of one to two hours; when necessary, the guild held extra meetings, for example, to anticipate decision-making in the planning phase. it is estimated that each active member spent 8 hours per month participating in the meetings, contributing to decision-making, participating in the communication group, and supporting the implementation of actions.

6.1.1 source code debt

table 5 shows the number of td items at the beginning and end of each cycle, summarized by priority; the td items reduction column shows the amount of td paid. the table focuses on source code debts. in the first cycle, there was a reduction of approximately 62% of the total td: 64% for blocker priority, 34% for critical, 70% for major, and 11% for minor. the developers' primary explanation for not paying 100% of the blocker-priority td was the difficulty of prioritizing the refactoring of source code with low importance to the project; at that time, this criterion was not yet taken into consideration. in the second cycle, there was a decrease of approximately 48% of the total td: 67% for blocker priority, 15% for critical, 53% for major, and 13% for minor. in this cycle, the teams prioritized td payment in the files with more defects, increasing the gain in product quality. in the third cycle, the reduction percentages were lower than in the previous cycles, even though the same prioritization guidelines were followed. however, the quantity of paid td items with blocker and critical priority was higher than in the second cycle: 20% for blocker priority, 46% for critical, 28% for major, and 6% for minor.
table 5. td items payment results by cycle.

1st cycle
priority   cycle start   cycle end   reduction   % reduction
blocker        8,992        3,189       5,803      64.54%
critical      37,441       24,574      12,867      34.37%
major        476,572      139,057     337,515      70.82%
minor         64,341       56,696       7,645      11.88%
total        587,346      223,516     363,830      61.94%

2nd cycle
priority   cycle start   cycle end   reduction   % reduction
blocker        2,066          666       1,400      67.76%
critical       9,026        7,640       1,386      15.36%
major        650,533      299,642     350,891      53.94%
minor         98,664       85,712      12,952      13.13%
total        760,289      393,660     366,629      48.22%

3rd cycle
priority   cycle start   cycle end   reduction   % reduction
blocker       12,089        9,597       2,492      20.61%
critical      22,517       12,017      10,500      46.63%
major        211,440      150,305      61,135      28.91%
minor         48,037       44,984       3,053       6.35%
total        294,083      216,903      77,180      26.24%

the larger decreases in the first two cycles were possible because those td items were mainly related to source code formatting: the td guild suggested using automated source code editor tools to support the payment of td items of this nature, accelerating their payment. in the third cycle, the formatting tools were no longer applicable, and no other tool was identified that could have accelerated the payment of the td remaining in the source code. in the second and third cycles, guild members reviewed and reclassified the priorities to be more precise in paying the td most relevant to the teams; thus, it is recommended to analyze the results in table 5 per research cycle. from the analysis of most of the php source code, an unexpected effect showed up for the guild: the discovery of dead code, source code that is not executed by any of the product's routines. this subject will be dealt with in the next improvement cycle; the guild's recommendation was to record a task for the possible dead code identified, to be evaluated by the teams at the beginning of each product release. the payment of td items identified in the source code improved with each research cycle; table 5 thus shows that the teams have incorporated td prevention and payment into their development activities.

6.1.2 test debt

in the third cycle, the td guild went beyond the source code boundary to seek solutions to improve the product's internal quality and reduce maintenance costs in other product artifacts, such as test cases, automated test scripts, and the build pipeline. the test cases were reviewed according to the guidelines provided by the td guild; as previously stated, this action involved 12 pos and 13 testers, who reviewed 2,314 test cases and 6,342 test steps. ten professionals defined a standard for developing the automated tests, rewriting ~1,160 test scripts to comply with the new standard. a dashboard with automated test execution metrics was developed to help the teams monitor the results. figure 5 shows the test execution status for all teams, although the dashboard also presents the data per team. this dashboard reflects the moment after the standard for developing automated tests was defined (action 5.3.3). the chart illustrates the test scripts executed daily over nine days; for example, on day 1, 1,070 test cases passed and 78 failed. the chart shows that ~6.5% of the executed tests have failures that the responsible teams should analyze.

figure 5. test execution status per day.

figure 6 shows the percentage of automated tests for all teams, although the organizational dashboard also presents data per team. after the test case review, this data was used to track the number of test steps with automated and manual execution.
during the third cycle, the automated test scripts corresponded to 2,106 (33.21%) test steps, and there were still 4,236 (66.79%) automated test steps to be created.

figure 6. percentage of automated tests.

6.1.3 build debt

in the build procedure, the td guild identified the existence of build debt: the build time, the builds' success rate, and the failures in the build system were not meeting the needs of the context and were causing rework for testers and developers. in this action, the guild's goal was to present quantitative evidence of the existence of build debt. the data in the charts presented in figure 7 and figure 8 were extracted from the last 22 months and grouped by month. this period was chosen because the number of builds requested per month never fell more than 10% below the previous six months' average of 1,037 builds; that is, months were selected while the monthly build count remained higher than 933 builds. this procedure mitigated the risk of the number of builds influencing the results. each time a build is not successfully executed, the developer needs to request the build again, wasting resources. figure 7 presents the percentage of requested builds completed successfully (e.g., if 125 of 200 requested builds executed successfully, the build success rate is 62.5%). the figure shows that the percentage of successfully executed builds fluctuated until month 17, with ups and downs; in the last five months, it dropped below the previous periods, staying at ~30% of the total builds requested.

figure 7. percentage of build success rate.

figure 8 shows the average build time per month (in minutes) over the 22 months. it is visually apparent that the average time to execute a build has increased: in the first five months, the average was ~10 minutes, while in the last five months it was over 35 minutes.

figure 8. average build time per month.

by analyzing the charts in figure 7 and figure 8, it is possible to observe the presence of build debt, as they present evidence that developer rework is increasing even without an increase in build demand. the build results are below the goal desired by the company: those responsible agreed that the optimal values would be a build success rate higher than 60% and an average build time per month lower than 15 minutes. the devops team will be responsible for implementing measures to pay down the build debt.
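for illustration, the sketch below shows one way the two build-debt indicators in figures 7 and 8 (monthly build success rate and average build time) could be computed from a log of build records. the data layout is a hypothetical simplification of what the version control and build systems export, not the company's actual schema.

    <?php
    // hedged sketch: derive monthly build success rate (%) and average
    // build time (minutes) from a list of build records.

    $builds = [
        // hypothetical records: one entry per requested build
        ['month' => '2021-01', 'success' => true,  'minutes' => 9.5],
        ['month' => '2021-01', 'success' => false, 'minutes' => 12.0],
        ['month' => '2021-02', 'success' => true,  'minutes' => 11.2],
    ];

    $byMonth = [];
    foreach ($builds as $b) {
        $m = $b['month'];
        $byMonth[$m]['requested'] = ($byMonth[$m]['requested'] ?? 0) + 1;
        $byMonth[$m]['succeeded'] = ($byMonth[$m]['succeeded'] ?? 0) + ($b['success'] ? 1 : 0);
        $byMonth[$m]['minutes'][] = $b['minutes'];
    }

    foreach ($byMonth as $month => $data) {
        // e.g. 125 successful out of 200 requested builds -> 62.5%
        $rate    = 100 * $data['succeeded'] / $data['requested'];
        $avgTime = array_sum($data['minutes']) / count($data['minutes']);
        printf("%s success rate: %.1f%%, avg build time: %.1f min\n", $month, $rate, $avgTime);
    }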
6.2 success factors

this section highlights the main elements that contributed to the successful implementation and continuity of the td guild's actions. they were analyzed during retrospective meetings held by the guild team:

• sponsorship of top management: the area director's sponsorship contributed positively to the engagement of members and teams. the members felt motivated to participate, knowing that the suggested actions had organizational support, and the teams adhered to the changes because they knew the activities were aligned with the top executives' view.
• support tools: the tools used were fundamental to giving td visibility and transparency. for example, the data provided by sonarqube was used in data analytics tools for monitoring and tracking td payment actions.
• well-defined objectives and guidance: the goals and directions established in the guild's first meetings delimited the scope of subjects and tasks, aligned with the organization's needs.
• qualified team: the td guild was trustworthy to the teams and stakeholders because it was composed of experienced technical professionals who were references in the company.
• alignment with the board of directors: all decisions were aligned with the company's board of directors.
• visible results: guild engagement was mainly due to seeing the suggested actions generate value for the organization and knowledge for the members.

6.3 main challenges

during the study period, the td guild was created and obtained recognition from the organization but, at the same time, faced several challenges:

• aligning the members' issues with the organization's needs, generating value for both, is a constant challenge in the guild. it was mitigated by aligning the guild's objectives and guidelines with the sponsor and the members early on.
• in suggesting the change actions, guild members faced a complex context, in which the size of the source code base and the rate at which it changes were significant. the guild had the technical skills to analyze the environment in detail and propose viable solutions to overcome this challenge, and it sought to communicate the purposes and expected results of the changes clearly and permanently to achieve engagement.
• the standards suggested by the guild affected individual characteristics related to software development: the developers had their own coding habits and standards before the guild started, so the changes required a shift in the development culture.
• in an organizational environment where professionals have several activities and commitments, finding time to devote to the guild's tasks was another challenge. it was mitigated by the area director's sponsorship of the guild: the guild tasks were prioritized and executed during regular working hours.
• the actions that involved data analysis required permission to view data from teams in other areas of the organization. these accesses were granted only to a guild member responsible for extracting and disseminating the data.
• the sponsor and the teams empirically recognized that the actions promoted by the guild contributed to software development. however, it was not possible to quantitatively evaluate the results in the software's maintenance and evolution.

6.4 resulting guidelines

after analyzing the results obtained in the three research cycles, table 6 presents guidelines to support the creation of a td guild. we have split it into three parts: general recommendations, guild meetings, and guild actions. the first part covers guild planning and setup, the second presents the guidelines for the meetings, and the third shows the recommendations on the guild's actions.

table 6. guidelines for building a td guild.

part i: general recommendations

context: the td guild should emerge within an organizational context, aligned with the strategic objectives and the needs of the software development teams.
purpose: to improve internal quality and reduce the costs of software maintenance and evolution.
challenge: to generate value for the product and add knowledge to the software development teams.
guidelines: develop the purpose, objectives, and guidelines to conduct the meetings and actions, aligned with the organization's expectations.
invitation: the invitation to the td guild should go to all professionals involved in the product's maintenance and evolution activities and should be sent by the guild sponsor.
sponsor: responsible for evaluating the proposed actions, approving them, and providing the necessary resources to execute the tasks.
coordinator: responsible for organizing the subjects and meetings, monitoring the execution of tasks, supporting guild members, and aligning needs with the sponsor.
active member: a motivated person who participates in the meetings and leads the actions proposed by the td guild.
review: the guild's objectives and needs may change over time; thus, guild members should review goals and procedures periodically. we recommend a yearly td guild review.

part ii: guild meetings

meeting schedule: guild meetings can be monthly with a duration of two hours or biweekly with a duration of one hour.
ideas discussion: each member presents the ideas and problems to which the td guild should pay attention.
actions selection: actions are selected, and for each action a guild member is assigned as responsible for approving and implementing it.
action goal definition: the goal of each action must be aligned during the meetings.
monitoring: progress is presented and discussed during guild meetings.

part iii: guild actions

approval: the sponsor approves the actions so that the person in charge can prioritize the task among the team's other demands.
build/execution of actions: list the action steps needed to achieve the established goals. for the execution of an action, desjardins (2011) suggests considering ownership, action steps, responsibility, support, informed parties, metrics and budget, milestone date, and completion date.
monitoring: the actions in progress are discussed during guild meetings.
communication: specific issues about the actions can be discussed in an internal communication channel or by e-mail after the meeting.

6.5 related work

the td guild has the essential elements of domain, community, and practice that characterize a cop, as described by wenger, mcdermott, and snyder (2002). the guild implemented in our research project has different types of members, as identified by smite et al. (2019) and previously presented in figure 4. the td guild followed the recommendations of smite et al. (2019), establishing a straightforward practice and a well-defined scope and having regular interactions, with tasks and responsibilities that showed signs of member engagement with the results. we confirmed the statement of smite et al. (2019) that the sponsor's authority and attention contribute to achieving the guild's objectives; as already pointed out, the director played an essential role in sponsoring the guild. the guild's formation followed the guidelines of several studies in the area. still, the td guild differentiates itself by supporting the deployment and continuity of tdm in software development. despite the lack of studies on strategies to implement and monitor tdm in a business context, several studies present suggestions and challenges to which a td guild can contribute:

• it considers the context in identifying and evaluating td, as suggested by kruchten et al. (2012).
• it helps the teams quantify, prioritize, and justify the payment of td, challenges cited in the studies by sharma et al. (2015), fernández-sánchez et al. (2017), and cai and kazman (2018) and also observed in our study.
• it is applicable to providing transparent communication about the expected returns on td payments (fernández-sánchez et al., 2017).
• it involves all stakeholders in decision-making for tdm, as suggested in (fernández-sánchez et al., 2017; rios et al., 2018).

the studies included in the tertiary study by rios et al. (2018) did not point out strategies that help prevent td. our study achieved td prevention through source code standardization, team training, standardization of test script development, and code review for automated tests. thus, a td guild can also be used as a strategy to prevent td.

6.6 generalization and threats to validity

the guild participants were invited by the sponsor or the guild coordinator. being invited by them may have made it embarrassing to decline the invitation and may have pressured peripheral members into participating more actively in the guild; it can also be seen as a positive factor, given that the director sponsored the project. for the actions proposed by the td guild that were aligned with the company's goals and approved by the sponsor, the td guild obtained the necessary resources to continue the actions; because of this, the td guild can, at some points, be interpreted as a working group. this study aimed to present the results and challenges of a td guild obtained throughout three action research cycles; it does not detail the calculations, resources, strategies, and tools adopted to support the tdm activities.

7 conclusion

from the results obtained, it is possible to conclude that a guild can contribute to technical debt management in an organization. the td guild was present in all tdm activities related to the td identified in the source code and was responsible for preventing td by creating standards and guidelines for the teams. the guild also contributed to determining the tds most aligned with the company's objectives. td is often incurred because people do not know about it; the guild disseminated knowledge about td and guided developers in best practices and development standards. besides, it helped deploy tools to verify and monitor the source code, making incorrect development harder. in the first two years, the td guild focused on the td identified in the php source code; in the third year, its actions reached other software artifacts, such as test cases, automated test scripts, and the build pipeline. overall, the td guild promoted actions on different td types: automation test debt, build debt, code debt, defect debt, design debt, documentation debt, and test debt. these experiences can be helpful to other professionals and provide practical knowledge to support research on guilds, cops, and tdm. setting up a guild with periodic meetings was the proposal most adherent to the company's context. the continuity and maintenance of the tdm tools were handed over to two company professionals; as the company evolves in tdm, the need for a professional, or an allocated team, responsible for tdm also increases. this work raises the question of the lack of a professional trained in and dedicated to tdm in organizations: the td manager.
we are now working to define and implement an incremental and evolutionary tdm process aligned with empirical evidence of use in the software industry.

acknowledgments

the authors would like to thank the company and the professionals who participated in this research.

references

alves, n., mendes, t., de mendonça, m., spínola, r., shull, f., & seaman, c. (2016). identification and management of technical debt. information and software technology, 70(c).
ampatzoglou, a., michailidis, a., sarikyriakidis, c., ampatzoglou, a., chatzigeorgiou, a., & avgeriou, p. (2018). a framework for managing interest in technical debt: an industrial validation. 2018 ieee/acm international conference on technical debt (techdebt).
bavani, r. (2012). distributed agile, agile testing, and technical debt. ieee software, 29(6), pp. 28-33. doi: 10.1109/ms.2012.155
besker, t., martini, a., & bosch, j. (2017). the pricey bill of technical debt: when and by whom will it be paid? 2017 ieee international conference on software maintenance and evolution (icsme). doi: 10.1109/icsme.2017.42
besker, t., martini, a., & bosch, j. (2019). software developer productivity loss due to technical debt: a replication and extension study examining developers' development work. the journal of systems and software, pp. 41-61. doi: 10.1016/j.jss.2019.06.004
braun, v., & clarke, v. (2006). using thematic analysis in psychology. qualitative research in psychology, 3, pp. 77-101.
brown, n., cai, y., guo, y., kazman, r., kim, m., kruchten, p., ..., & zazworka, n. (2010). managing technical debt in software-reliant systems. proceedings of the fse/sdp workshop on future of software engineering research (foser '10), pp. 47-52.
codabux, z., williams, b., bradshaw, g., & cantor, m. (2017). an empirical assessment of technical debt practices in industry. journal of software: evolution and process. doi: 10.1002/smr.1894
connolly, c. (1992). team-oriented problem solving. iee seminar on team based techniques design to manufacture.
coughlan, p., & coghlan, d. (2002). action research for operations management. international journal of operations & production management, 22, pp. 220-240. doi: 10.1108/01443570210417515
desjardins, m. (2011). how to execute corporate action plans effectively. business in vancouver. archived from the original on 22 march 2014.
dick, b. (2000). a beginner's guide to action research. accessed on: sep/03/2019, available: http://www.aral.com.au/resources/guide.html
fernández-sánchez, c., garbajosa, j., yagüe, a., & pérez, j. (2017). identification and analysis of the elements required to manage technical debt by means of a systematic mapping study. journal of systems and software, 124, pp. 22-38. doi: 10.1016/j.jss.2016.10.018
ghanbari, h., besker, t., martini, a., & bosch, j. (2017). looking for peace of mind? manage your (technical) debt: an exploratory field study. 2017 acm/ieee international symposium on empirical software engineering and measurement (esem). doi: 10.1109/esem.2017.53
griffith, i., taffahi, h., izurieta, c., & claudio, d. (2015). a simulation study of practical methods for technical debt management in agile software development. proceedings of the winter simulation conference 2014. doi: 10.1109/wsc.2014.7019961
guo, y., spínola, r., & seaman, c. (2016). exploring the costs of technical debt management: a case study. empirical software engineering, 21(1), pp. 159-182. doi: 10.1007/s10664-014-9351-7
kniberg, h. (2014). spotify engineering culture. (spotify) accessed on: oct/30/2020, available: https://engineering.atspotify.com/2014/03/27/spotifyengineering-culture-part1/
kruchten, p., nord, r., & ozkaya, i. (2012). technical debt: from metaphor to theory and practice. ieee software, 29(6), pp. 18-21. doi: 10.1109/ms.2012.167
larman, c., & vodde, b. (2010). practices for scaling lean & agile development: large, multisite, and offshore product development with large-scale scrum. addison-wesley professional.
lave, j., & wenger, e. (1991). situated learning: legitimate peripheral participation. cambridge university press.
martini, a., & bosch, j. (2016). an empirically developed method to aid decisions on architectural technical debt refactoring: anacondebt. 2016 ieee/acm 38th international conference on software engineering companion (icse-c).
martini, a., bosch, j., & chaudron, m. (2014). architecture technical debt: understanding causes and a qualitative model. 2014 40th euromicro conference on software engineering and advanced applications.
martini, a., fontana, f. a., biaggi, a., & roveda, r. (2018). identifying and prioritizing architectural debt through architectural smells: a case study in a large software company. springer international publishing. doi: 10.1007/978-3-030-00761-4_21
mo, r., snipes, w., cai, y., ramaswamy, s., kazman, r., & naedele, m. (2018). experiences applying automated architecture analysis tool suites. acm/ieee international conference on automated software engineering (ase 2018), pp. 779-789. doi: 10.1145/3238147.3240467
nord, r., ozkaya, i., kruchten, p., & gonzalez-rojas, m. (2012). in search of a metric for managing architectural technical debt. 2012 joint working ieee/ifip conference on software architecture and european conference on software architecture, pp. 20-24. doi: 10.1109/wicsa-ecsa.212.17
paasivaara, m., & lassenius, c. (2014). deepening our understanding of communities of practice in large-scale agile development. doi: 10.1109/agile.2014.18
rios, n., mendonça, m., & spínola, r. (2018). a tertiary study on technical debt: types, management strategies, research trends, and base information for practitioners. information and software technology, 102, pp. 117-145. doi: 10.1016/j.infsof.2018.05.010
schmid, k. (2013). a formal approach to technical debt decision making. proceedings of the 9th international acm sigsoft conference on quality of software architectures (qosa '13), pp. 153-162. doi: 10.1145/2465478.2465492
seaman, c., guo, y., zazworka, n., shull, f., izurieta, c., cai, y., & vetrò, a. (2012). using technical debt data in decision making: potential decision approaches. 2012 third international workshop on managing technical debt (mtd). doi: 10.1109/mtd.2012.6225999
sharma, t., suryanarayana, g., & samarthyam, g. (2015). challenges to and solutions for refactoring adoption. ieee software, 32(6), pp. 44-51.
smite, d., moe, n. b., floryan, m., levinta, g., & chatzipetrou, p. (2020). spotify guilds. communications of the acm, 63(3), pp. 56-61. doi: 10.1145/3343146
smite, d., moe, n. b., levinta, g., & floryan, m. (2019). spotify guilds: how to succeed with knowledge sharing in large-scale agile organizations. ieee software, 36(2), pp. 51-57. doi: 10.1109/ms.2018.2886178
spínola, r., vetrò, a., zazworka, n., seaman, c., & shull, f. (2013). investigating technical debt folklore: shedding some light on technical debt opinion. 2013 4th international workshop on managing technical debt (mtd). doi: 10.1109/mtd.2013.6608671
tom, e., aurum, a., & vidgen, r. (2013). an exploration of technical debt. journal of systems and software, 86(6), pp. 1498-1516. doi: 10.1016/j.jss.2012.12.052
wenger, e., & wenger-trayner, b. (2015). introduction to communities of practice: a brief overview of the concept and its uses. accessed on: oct/30/2020, available: http://wenger-trayner.com/wp-content/uploads/2015/04/07-brief-introduction-to-communities-of-practice.pdf
wenger, é., mcdermott, r. a., & snyder, w. (2002). cultivating communities of practice: a guide to managing knowledge. harvard business press.
wolek, f. (1999). the managerial principles behind guild craftsmanship. 5(7). doi: 10.1108/13552529910297460
cai, y., & kazman, r. (2019). dv8: automated architecture analysis tool suites. ieee/acm international conference on technical debt (techdebt), pp. 53-54. doi: 10.1109/techdebt.2019.00015
li, z., avgeriou, p., & liang, p. (2015). a systematic mapping study on technical debt and its management. journal of systems and software, pp. 193-220. doi: 10.1016/j.jss.2014.12.027

journal of software engineering research and development, 2021, 9:13, doi: 10.5753/jserd.2021.1942
this work is licensed under a creative commons attribution 4.0 international license.

towards to transfer the directives of communicability to software projects: qualitative studies

adriana lopes damian [ federal university of amazonas | adriana@icomp.ufam.edu.br ]
edna dias canedo [ university of brasília | ednacanedo@unb.br ]
clarisse sieckenius de souza [ pontifical catholic university of rio de janeiro | clarisse@inf.puc-rio.br ]
tayana conte [ federal university of amazonas | tayana@icomp.ufam.edu.br ]

abstract

the software artifacts developed in the early stages of the development process describe the proposed solutions for the software. for this reason, these artifacts are commonly used to support communication among members of the development team. miscommunication through software artifacts occurs because practitioners typically focus on their modeling without reflecting on how other members of the development team interpret it. in this context, we proposed the directives of communicability (dcs) to support practitioners in analyzing characteristics of an artifact's content that affect communication via the artifact. we conducted preliminary studies of our proposal in a controlled environment.
however, we noticed that new studies are necessary to evaluate the dcs concerning practitioners' perceptions before transferring them to the industry. in this paper, we present two studies performed aiming to transfer the dcs to the software industry. in the first study, we evaluated the practitioners' perception of the dcs. in the second study, we evaluated the feasibility of the dcs in a software development team. the studies' results indicated that the dcs have the potential to support improvements in artifacts' content to reduce miscommunication via artifact. to facilitate the use of our proposal in the software industry, we created procedures that support the adoption of the dcs and checklists for the application of each directive in the software artifacts. we noticed positive perceptions of practitioners about the application of the dcs in software artifacts. we hope that our contribution supports software development teams that use artifacts in their projects.

keywords: communication via software artifacts, human-centered computing, semiotic engineering

1 introduction

artifacts developed in the early stages of the software development process, such as the different diagrams of the unified modeling language (uml) (freire et al., 2018; omg, 2015), assist practitioners in understanding the problem for which software was required. as the proposed solutions for software development are in artifacts, these artifacts also support team communication (petre, 2013). communication is considered an important factor in software development, since miscommunication in software teams causes low productivity and software failures (käfer, 2017). miscommunication via artifact occurs, for example, when consumers (who take the information they see in the models for the development of another artifact) have interpretations different from the ones intended by the producers (who conceive the modeling of the software). as much as consumers know the modeling notation, the way the modeling has been expressed by its producer can affect these practitioners' mutual understanding. in order to mitigate miscommunication via artifact, we proposed the directives of communicability (dcs), presented in lopes et al. (2019a); communicability in this context refers to the artifact's ability to convey to its consumers the solution conceived by its producers. the dcs can support producers in reflecting on how they can create a software solution via artifacts aimed at reaching a mutual understanding among development team members. practitioners can use our proposal mainly in the artifacts developed in the initial stages of the development process, such as uml diagrams, mockups and others. we conducted preliminary studies to evaluate our proposal to reduce miscommunication (lopes et al., 2019a; lopes et al., 2019b). however, we noticed that new studies are necessary to evaluate the dcs concerning practitioners' perceptions before transferring them to the industry. given the context above, we conducted an exploratory study (lopes et al., 2020) to evaluate practitioners' perceptions of the dcs. fifteen practitioners participated in this study by modeling uml use cases (omg, 2015) with the support of the dcs. the results demonstrated that the uml use cases developed with the support of the dcs had few risks of miscommunication.
besides, participants' perceptions about the dcs indicated that such directives can support better communication via artifact, contributing to software quality. however, it is also important to evaluate how software engineers apply the dcs in artifacts used in software projects to identify their feasibility. this paper extends our previous work (lopes et al., 2020), presenting a study carried out to analyze communication via artifacts in a software development team. we conducted this study in a software team with fourteen practitioners who worked on a cooperation project between the university of brasilia (unb) and the brazilian army. the results of this study showed the potential of the dcs to indicate improvements in the artifact's content regarding communication via artifacts. in addition, we present proposals that facilitate the adoption of the dcs by practitioners, such as procedures that direct the adoption of the dcs in software artifacts and checklists that support the use of each directive in common scenarios of two specific artifacts. through both studies, we noticed the contribution of the dcs to: (i) fewer risks of miscommunication via artifacts, allowing better communication via artifact; and (ii) improvements in the quality of artifacts, since miscommunication caused incorrect information in software artifacts. we hope that our contribution helps software development teams reduce miscommunication via artifacts.

2 theoretical foundations and related works

this section begins by presenting both the semiotic engineering theory (de souza, 2005; de souza et al., 2016) and grice's cooperative principle (grice, 1975), which we used to understand communication via artifacts and to propose the dcs. additionally, we present works related to this type of communication.

2.1 theoretical foundations

semiotic engineering theory (de souza, 2005; de souza et al., 2016) characterizes user-system interaction as a particular case of human-mediated systems communication. systems are considered metacommunication artifacts in semiotic engineering, i.e., artifacts that communicate a message from the designer to users about how they can or should communicate with the system to do what they want. the content of the metacommunication message, or metamessage, can be paraphrased in the following template: "here is my understanding of who you are, what i've learned you want or need to do, in which preferred ways, and why. this is the system that i have therefore designed for you, and this is the way you can or should use it in order to fulfill a range of purposes that fall within this vision". semiotic engineering uses the communication space model proposed by jakobson (1960), which is structured in terms of context, sender, receiver, message, code, and channel, where: "a sender transmits a message to a receiver through a channel. the message is expressed in code and refers to a context".
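to make the six elements of this model concrete, the sketch below encodes them as a small python structure and instantiates it for communication via artifact, as described in the next paragraph. the class name, field names, and the example values are ours, for illustration only, and do not come from the paper.

from dataclasses import dataclass

@dataclass
class CommunicationSpace:
    """jakobson's (1960) communication space: a sender transmits a
    message to a receiver through a channel; the message is expressed
    in a code and refers to a context."""
    sender: str
    receiver: str
    message: str
    code: str
    channel: str
    context: str

# hypothetical instantiation for communication via artifact:
use_case_exchange = CommunicationSpace(
    sender="producer (e.g., the systems analyst who models the artifact)",
    receiver="consumer (e.g., the developer who implements from it)",
    message="informational content of the artifact",
    code="the artifact's notation (e.g., uml use case notation)",
    channel="how the artifact is made available (e.g., a repository)",
    context="problem domain of the software",
)
print(use_case_exchange)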
based on the communication space model proposed by jakobson (1960), we can structure the communication elements via artifact in terms of the problem domain (context), how the artifact is made available (channel), and the informational content of the artifact (message), composed of the artifact's notations (code), for the communication between the producer (sender) and the consumer (receiver) of an artifact, where: "a producer transmits the informational content of the artifact to a consumer through a channel. the informational content of the artifact is expressed by the artifact's notations and refers to the problem domain". figure 1 presents a characterization of these elements.

figure 1. communication space of jakobson (1960).

semiotic engineering proposed evaluation methods to support designer-user communication in order to understand how the user receives the metamessage. the categorization of communication failures presented by semiotic engineering comprises three categories:

• complete failures – when the intention of the communication and its effect are inconsistent;
• partial failures – when part of the intended effect of the communication is not reached; and
• temporary failures – when, in the intention of a communicative act between user and system, the user has momentary difficulty continuing to talk with the system.

semiotic engineering extended its original perspective to a human-centered computing perspective, a research field that aims to understand human behavior by integrating technologies in social and cultural contexts (sebe, 2010). this contribution is related to the set of conceptual and methodological tools called signifyi (signs for your interpretation) (de souza et al., 2016). the signifyi suite helps investigate meanings in software during the development process and the communication between software producers and consumers. among them, the signifyi message tool (sfyi message) is the operational version of the metacommunication template. this operational version can stand on its own as a powerful evaluation resource to identify communicability issues (which refer to the quality of the transmission of the solution designed by producers to consumers). de souza et al. (2016) report the use of a principle of reciprocal cooperation related to effective and efficient communication, called grice's cooperative principle (grice, 1975). this principle is expressed by four maxims, and breaking one or more of them may lead to a communication failure. grice's four maxims are:

quality – try to make your contribution a true one. do not say what you believe to be false and do not say something without adequate evidence. in software development, for example, the software engineer must communicate to the team only information that is related to the problem domain.

quantity – make your contribution as informative as is required. do not make your contribution more informative than is required. following the previous example, when communicating with the team, the software engineer must try to use only sufficient content to clarify the information to be developed.

relation – be relevant, that is, do not introduce points that do not come under discussion.
in the case of systems developed in different cycles, each cycle must contain only information relevant to such development.

manner – be perspicuous, avoiding obscurity of expression and ambiguity; be brief and be orderly. the software engineer must use descriptions that the team easily interprets, avoiding ambiguity.

2.2 related works

for communication to be efficient, the sender must carefully choose an expression for the content he wishes to communicate, using a code that the receiver is able to interpret (de souza, 2005; de souza et al., 2016). in this sense, we identified works related to artifacts' comprehensibility, which refers to the receiver's interpretation of what the sender said in his communicative act. on communication via artifact, bordin and de angeli (2016) point out that software engineers stated that documentation keeps a software development team aligned, especially in scenarios of distributed teams or with the introduction of new members to the team. schoonewille et al. (2011) present a contribution related to cognitive aspects in the understanding of software design documentation. they investigated, in one study, the participants' ability to extract information from diagrams and texts (grammatically and syntactically correct). the authors noticed that self-assessment could be problematic. they observed that developers were satisfied to "fill in" information missing from the documentation without the same understanding as the documentation producers. this can cause incorrect interpretations regarding the software. nakamura et al. (2011) proposed three metrics related to the comprehensibility of uml class diagrams in the following aspects: (1) class structure, (2) package structure, and (3) attributes and operations. the authors claim that the metrics help in estimating the time needed to understand a class diagram. cruz-lemus et al. (2010) present a predictive model of comprehensibility for uml state machine diagrams, analyzing their structural complexity. the authors' goal was to reduce the impact of understanding this diagram. tilley (2009) presents a work that summarizes 15 years of research on the use of graphical notation as documentation for understanding the system. according to the author, graphical notation can help to understand the system and support communication; however, technical 'communicators' are not usually involved in this process. still, according to the author, the result is that the engineers, who have the best of intentions, do not have the necessary background to explore the resources of the graphical notation to support end users' tasks. the author therefore reports a lesson learned, "we need to know how to talk", which highlights the importance of the producer thinking about the consumers. lange and chaudron (2006) present a work that investigated the effects of defects in uml diagrams in relation to different interpretations. they conducted two controlled experiments with a large group of students and practitioners. the two main contributions of this work are investigations on defect detection and on different interpretations caused by undetected defects. the authors state that the results are generalizable for modeling with uml diagrams.
these works deal with topics related to communication with the support of artifacts developed in the early stages of software development. schoonewille et al. (2011) and tilley (2009) show the importance of artifact producers reflecting on consumers. thus, it is important to have a proposal that leads artifact producers to reflect on the consumers. the dcs can help with this, as their goal is to support communication via artifact. this can be achieved when practitioners make improvements to the artifacts to obtain a mutual understanding among team members.

3 directives of communicability

for the dcs proposal, we have appropriated the communication space of jakobson (1960) for communication via artifact, as follows: the artifact is made available with the support of a tool (the channel) with information from the problem domain (the context) to support communication between artifact producers (the senders) and consumers (the receivers). the producer, in his message, must consider how the content is expressed (the use of the code) in such artifacts. figure 2 shows this appropriation.

figure 2. appropriation of the communication space of jakobson (1960) for communication via artifact.

besides, we have appropriated semiotic engineering to define the following concepts related to communication via artifact:

• communicability of software artifacts – refers to the artifact's ability to transmit to its consumers the proposed solutions for software development.
• communicability issues in software artifacts – refers to the expressions or features of the artifact that can be directly associated with an incompatibility between the meanings associated with them by their producers and consumers.
• risks of miscommunication via artifacts – the likelihood of a communicability issue causing communication failures between producers and consumers.
• miscommunication via artifacts – interpretations by artifact consumers that are incompatible with the producer's perspective for the software modeling.

3.1 proposal of the dcs

we elaborated the dcs based on semiotic engineering (de souza, 2005; de souza et al., 2016) and grice's cooperative principle (grice, 1975). we adapted the original semiotic engineering metacommunication template as follows: "here is my understanding as a producer of the model, of who you are, as its consumer (to whom the producer is designing the model), what i have learned about what you need to do in system development (about what should be addressed in the model). this is the solution of the system that i designed for you to carry out your activities". based on this, we created the following questions to help producers reflect on artifact consumers: (i) can the consumer understand the artifacts' content? can the consumer achieve its goals? – to support producers in reflecting on whether everyone involved can understand the information in the model, such as developers and managers, or only developers; and (ii) what content should be addressed about the domain of the problem/solution of the system in the artifact? – in order to encourage the producer to reflect on the content that she wishes to be comprehended from the model, such as the tasks that a user can perform on the system. these questions are used before the use of the dcs.
regarding the information related to the models' content, the dcs use the four maxims of grice's cooperative principle. the directives allow producers to reflect on the models' content before they send it to the consumer, so that there is mutual comprehension in software development teams. with this, the dcs can improve the model's ability to convey to its consumers the solution conceived by its producers. below we present each dc, based on grice's maxims:

"say the truth!" – dc1: use true information. do not use information that affects the content quality in the model (maxim of quality). in the uml use case diagram, for instance, do not insert use cases that are outside the problem domain;

"say what is needed and no more than necessary" – dc2: use the necessary content in the model. do not use unnecessary content in the model (maxim of quantity). analyze, for instance, the amount of information in the specification of all use cases;

"say it logically" – dc3: organize the information in the model consistently (maxim of relation). for example, organize the use cases in the diagram so that they present a logical sequence for the producers;

"say it clearly" – dc4: organize the information in the model clearly (maxim of manner). describe the names of the use cases so that they are easily understood and differentiated from each other.

3.2 how can software engineers apply the dcs?

we designed the dcs to be employed by software engineers in artifacts that represent aspects of the software developed from their perspectives, such as uml diagrams, bpmn diagrams, and prototypes. in the study presented in subsection 4.2, a team adopted uml use cases and prototypes to represent their software development decisions. the dcs can reduce the risks of miscommunication via these artifacts. figure 3 presents a schematic of how software engineers can apply the dcs to uml use cases.

figure 3. directives of communicability for software artifacts.

in step 1 of figure 3, the producer begins his process of reflection on the consumers of the artifact produced, based on the proposed questions. in step 2 of figure 3, the producer is able to make better use of the directives in modeling, so that mutual understanding occurs. for example, consider a practitioner modeling a use case for a system that supports users in the use of medicines. by using dc2, based on the practitioner's reflection: if the producer knows that the consumer can recognize the difference between the 'reminders' and 'notices' elements that will be used in the system, there is no need to detail the difference between them; if the consumer does not know such a difference, it is important that the producer describes it. dc2 will support this producer in producing use case specifications with the amount of information needed for those elements. regarding the use of the dcs, producers can use them in digital format, available in a technical report (lopes et al., 2021), or print them to put on their workstations. we just emphasize that it is important for producers to have access to the directives during the development of the artifacts. in section 5, we present proposals that can help software engineers adopt the dcs in software projects.
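before moving on, note that the two reflection questions of subsection 3.1 and the four directives above are compact enough to be captured as a plain checklist. the python sketch below is one possible encoding; the structure and the helper function are ours, for illustration only, and only the wording of the questions and directives comes from the paper.

PRE_QUESTIONS = [
    "can the consumer understand the artifacts' content? can the consumer achieve its goals?",
    "what content should be addressed about the domain of the problem/solution of the system in the artifact?",
]

DIRECTIVES = {
    "DC1": ("maxim of quality", "use true information; do not use information that affects the content quality in the model"),
    "DC2": ("maxim of quantity", "use the necessary content in the model; do not use unnecessary content"),
    "DC3": ("maxim of relation", "organize the information in the model consistently"),
    "DC4": ("maxim of manner", "organize the information in the model clearly"),
}

def reflection_checklist() -> list[str]:
    """returns the prompts a producer walks through before sending an artifact to its consumers."""
    return PRE_QUESTIONS + [
        f"{dc} ({maxim}): {text}" for dc, (maxim, text) in DIRECTIVES.items()
    ]

for prompt in reflection_checklist():
    print("-", prompt)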
about the users of our proposal, we created the dcs to be used by both beginner and experienced software engineers, provided they know the modeling notation. we emphasize that the dcs support producers in reflecting on the artifact's content to achieve a mutual understanding among the members of a software development team; they do not target modeling errors.

3.3 preliminary studies with the dcs

in lopes et al. (2019a), two software engineers with the same level of experience in modeling produced artifacts. one of them used the dcs and the other did not. then, 30 participants were invited to create mockups based on the artifacts produced by the software engineers. we divided the participants into two groups. the experimental group created the mockups based on the artifacts produced with the dcs and the control group created the mockups based on the artifacts developed without the dcs. we noticed that the experimental group had a lower number of miscommunications. in lopes et al. (2019b), the dcs were also analyzed for reducing the risk of miscommunication in software artifacts such as uml class diagrams, bpmn (business process model and notation) diagrams (omg, 2011) and ifml (interaction flow modeling language) diagrams (brambilla and fraternali, 2014). we chose these diagrams because they serve different communication purposes during software development. twenty-four participants, divided into two groups, produced such diagrams based on a modeling scenario. the experimental group used the dcs and the control group did not use the directives. the experimental group created artifacts with a lower number of risks of miscommunication compared to the control group. in lopes et al. (2019a) and lopes et al. (2019b), we presented studies with quantitative analyses. however, it is important to carry out qualitative studies on the dcs before transferring them to the industry. for this reason, we planned new studies that aim to analyze practitioners' perceptions about the directives. figure 4 presents a timeline of the studies carried out and our planning of the new studies, which aim to answer the research questions (rqs) below.

rq1 – do practitioners perceive the dcs as support in improving the quality of artifacts?
rq2 – is the application of the dcs by producers feasible in development teams?

figure 4. timeline of preliminary studies with the dcs and planning of new studies in the software industry.

4 experimental studies

this section presents the studies carried out with practitioners before transferring the dcs to the industry. in the first study (study 1), fifteen practitioners participated. they created uml use cases with the support of the dcs. our main goal in this study was to analyze the communication intention of artifact producers. after the study, the participants provided their perceptions about the dcs through a questionnaire. in the second study (study 2), we carried out a study in the context of a software project. we evaluated whether the dcs can support the identification of risks in software artifacts that cause miscommunication. in addition, producers and consumers provided their perceptions about communication via artifacts through interviews and an online questionnaire.
4.1 study 1: evaluation of the dcs from the practitioners' perception

we conducted a first study that evaluated the perception of 15 practitioners regarding the support of the dcs during artifact development (lopes et al., 2020). in this study, the participants applied the dcs in uml use cases, that is, in the use case diagram and specification. after that, we sent questionnaires to collect the practitioners' perceptions. as in this study we evaluated the practitioners' perception of the dcs during the modeling of use cases, we did not investigate the communication between producers and consumers. therefore, the researchers analyzed only the possibility of a risk of miscommunication in the use cases. in addition, we analyzed the impact on quality caused by the risks of miscommunication and the qualitative data obtained from the practitioners' answers.

4.1.1 study 1: planning

we selected 15 practitioners to produce uml use cases with the support of the dcs. all practitioners had a college degree and were taking the fundamentals of software engineering class in a software engineering postgraduate course at northern university center (uninorte). table 1 presents a summary of the participants' experience. most of them did not work creating artifacts in software projects related to our research. however, we consider practitioners who are consumers in software projects able to participate in the study, because they can provide their perception of our proposal for communication via artifacts. accordingly, we planned training so that participants could execute the study activities. we planned this study to take place in a single day, during the morning and afternoon. in the morning, before we carried out the study, the participants received approximately two hours of training for exercising use case modeling. it is noteworthy that all participants had prior knowledge of uml use cases. in the afternoon, we reserved a laboratory for the execution of this study, which had notebooks for the participants to use. we planned to run this study in approximately three hours.

table 1. participants' experience in the software industry
1–3 years – p1 (developer), p2 (software tester/developer), p3 (software analyst), p4 (developer), p6 (process engineer), p7 (developer), p10 (developer), p12 (developer), p13 (developer)
4–8 years – p8 (software tester/developer), p9 (developer), p15 (developer)
more than 9 years – p5 (developer), p11 (developer), p14 (project manager)

in order to observe the participants' discussion regarding the development of use cases in different modeling scenarios, we randomly defined four groups. each modeling scenario had simple content, so that the participants could complete the study activities in the planned time. we present the description of the modeling scenarios for each group below:

group 1 scenario – to support students who want private lessons in basic classes such as mathematics, a system must be developed. the system should provide teachers with private lessons. additionally, evaluations of these teachers by students/other teachers should be displayed. the system should allow managing the teachers' agendas for the classes, so that students can enroll in them. thus, it is possible to include and cancel classes.

group 2 scenario – to support small events, a system must be developed.
in this system, the organizers will be able to create their accounts and, from this, register events such as birthday parties, guest lists, and gift lists. they will also be able to send invitations via e-mail, control expenses, and generate reports for both guests and expenses. the system provides communication between organizers and guests. guests may or may not confirm their presence at the event and may consult the gift list.

group 3 scenario – to support sales professionals in their orders, such as delivery control and customer management (retailers and wholesalers), a system must be developed. the system will support professionals who want to computerize and innovate the service, minimizing errors and constraints from the lack of systematic control. the system should allow users to register their customers and to manage the stock of their products. after the payment record, the order is sent to the customer with the delivery invoice.

group 4 scenario – to support residents of the state of amazonas in brazil, who have difficulty accessing information on river routes for purchasing tickets, a system must be developed. the system will support passengers of different vessels, embarking/disembarking times, the vessels' capacity, the number of available spaces, prices, and information on river routes. concerning the vessels' owners, they will be able to register the number of employees available for passenger assistance.

before the execution of the study, we planned training with the dcs to be applied in use case modeling. figure 5 summarizes the planned study activities.

figure 5. study 1 activities planned.

to analyze the defects related to the risks of miscommunication, we used the types of defects presented by granda et al. (2015). table 2 shows these defects.

table 2. types of defects (adapted from granda et al. (2015))
omission – the required information has been omitted.
incorrect fact – some information in the model contradicts the list of requirements or general knowledge of the system domain.
inconsistency – information in one part of the model is inconsistent with information in other parts of the model.
ambiguity – the information in the model is ambiguous. this can lead to different interpretations of information.
extraneous information – the information that is provided is not required in the model.
redundant – information is repeated in the model.

4.1.2 study 1: execution

we asked the participants to position themselves according to the groups defined to carry out the study activity. the participants were in the same laboratory, but the groups were far from each other. after that, we delivered the modeling scenarios and the printed dcs to the groups. the main researcher asked the participants in each group to draw up the use case diagram together, discussing relevant aspects of the system. after that, the researcher requested each participant to specify only one use case. the participants created the use cases from the modeling scenarios. regarding the use of the dcs, the participants should, for example, create use cases in the context of the problem domain (use of dc1) and analyze the amount of information needed to understand these use cases correctly (use of dc2). during the study, a researcher took notes of the directives most used by the participants. the participants used the astah tool (https://astah.net/) to model the use cases.
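as an aside, the defect taxonomy of table 2 is simple enough to encode directly for labelling the risks found during analysis. the sketch below is a minimal python rendering of it; the enum and the example labelling are ours, not part of granda et al. (2015) or of the study materials.

from enum import Enum

class DefectType(Enum):
    """defect types adapted from granda et al. (2015), as in table 2."""
    OMISSION = "required information has been omitted"
    INCORRECT_FACT = "information contradicts the requirements or domain knowledge"
    INCONSISTENCY = "information in one part of the model conflicts with another part"
    AMBIGUITY = "information allows different interpretations"
    EXTRANEOUS_INFORMATION = "information provided is not required in the model"
    REDUNDANT = "information is repeated in the model"

# hypothetical labelling of one risk of miscommunication found in a review
# (the same mapping the paper later reports in table 5):
risk = "lack of information in business rules"
label = DefectType.OMISSION  # required detail is missing from the artifact
print(f"{risk!r} -> {label.name}")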
table 3 presents the four groups defined with the participants.

table 3. groups defined in this study
group 1 – p4, p5, p6 and p7
group 2 – p8, p9, p10 and p11
group 3 – p12, p13, p14 and p15
group 4 – p1, p2 and p3

regarding the use of the dcs, the main researcher informed the participants that the directives could be applied in whatever way was most appropriate for them, such as using the directives during the modeling or after making a modeling proposal. the main researcher noticed that all groups made a modeling proposal and then applied the dcs. at the end of the study, all participants answered a post-study questionnaire to provide their perceptions about the dcs, including each participant's experience in the industry. regarding the duration of the study, it was completed ahead of our planning.

4.1.3 study 1: results

we analyzed the use cases produced by the groups regarding the risks of miscommunication, and this analysis was discussed with the other authors of this paper. the risks of miscommunication identified in the use cases of each group are shown in table 4, including the total number of occurrences of each risk.

table 4. risks of miscommunication in the use cases developed by the groups
group 1 – lack of relationship in the use case diagram (1); different standards in the organization of the use case specification (3); lack of information in business rules (5)
group 2 – use case specification inconsistent with the use case diagram (1); lack of relationship in the use case diagram (1); lack of information in business rules (4)
group 3 – lack of information in business rules (2); lack of steps in the main flow of the use case specification (2)
group 4 – lack of steps in the main flow of the use case specification (2); lack of information in business rules (5)

in this analysis, for example, we noticed that the participants in group 1 did not provide all the necessary information in the business rules, such as the fields in the system for a student to evaluate the teachers. the evaluation of the artifacts produced by the groups showed few risks of miscommunication compared to the number of risks of miscommunication identified in other software artifacts in a preliminary study (lopes et al., 2019b). however, such risks can still cause miscommunication. regarding the application of the dcs by the participants, based on the researcher's notes during the study, the majority of them used the following directives: dc2, to evaluate the amount of information that should be represented, and dc3, for the logical organization of information in use cases.

analysis of software defects related to risks of communication failure. we grouped the risks of miscommunication by group in table 5; the defects related to the risks are also described in this table. regarding the risks of miscommunication, we noticed a lack of information in the business rules in all four modeling groups. besides, there was a lack of information on the relationships between use cases in the diagrams produced by two groups, and a lack of specification of steps in the main flow of the use cases of two groups. these risks would be mitigated if the participants had reflected better on the amount of information, related to dc2.
regarding the risks related to the lack of standardization of the use case specification itself and the inconsistency between the use case specifications and the use case diagram, these would be mitigated with dc4 and dc1, respectively.

table 5. defects related to the risks of miscommunication in the use cases developed by the groups
groups 1 and 2 – lack of relationship in the use case diagram – omission
group 1 – different standards in the organization of the use case specification – ambiguity
groups 1, 2, 3 and 4 – lack of information in business rules – omission
group 2 – use case specification inconsistent with the use case diagram – inconsistency
groups 3 and 4 – lack of steps in the main flow of the use case specification – omission

regarding the risks related to the lack of information in (i) the business rules, (ii) the main flow steps in the use case specification, and (iii) the relationships between use cases in the diagram, we considered them to be 'omission' defects. different standards in the specification of use cases may allow different interpretations by consumers, which we considered an 'ambiguity' defect. finally, we considered inconsistent information between the use case diagram and the use case specification to be an 'inconsistency' defect.

analysis of the participants' perception. regarding the post-study questionnaire, the participants answered the following question: "what is your perception of the directives of communicability?" we defined this question in a general way to collect different opinions from the participants on the dcs. to analyze the qualitative data obtained in the study, we followed strauss and corbin (1998), according to whom researchers can use coding procedures to achieve their research objectives. we used open coding to understand the participants' perceptions.
with that, we observed the following codes:

dcs contribute to the quality of software artifacts – "the directives help to reflect on what should be developed, avoiding inconsistencies" (p5); "the directives help to understand the system and support the identification of possible errors" (p8); "facilitates the identification of problems in modeling" (p13).

dcs promote the organization of information in artifacts – "the directives help to organize and improve the information required to create a system" (p3); "dcs assist in organizing ideas together with the development team" (p10); "the directives help to organize thoughts when designing the system" (p2).

dcs support the understanding of the system – "dcs assist in obtaining relevant information for the project" (p4); "the directives provide great support for the production of the software" (p7); "helps to considerably improve the general understanding of a system" (p15).

dcs can promote effective communication via artifact – "dcs are a type of roadmap for organizing ideas in communication through a logical way" (p11); "they help to think about how to communicate with colleagues" (p6); "help in communicating correctly in software development" (p12).

dcs promote the reduction of different interpretations – "the directives help reduce the multiple interpretations of the same idea, as the ideas must be conveyed so that everyone understands" (p14).

difficulties with the use of the dcs – "it is not easy to understand the directives; it required more of my mental effort" (p2); "it is not easy to apply the directives; i believe it depends on the user's experience" (p5); "directives demand time for understanding" (p6).

through the participants' perceptions, we observed that the dcs contribute to the improvement of the quality of the artifacts. such perceptions are represented by the codes 'dcs contribute to the quality of software artifacts' and 'dcs promote the organization of information in artifacts'. most of the participants' responses showed that they perceived the purpose of the dcs, as we noticed in the codes 'dcs can promote effective communication via artifact' and 'dcs promote the reduction of different interpretations'. some participants also reported 'difficulties with the use of the dcs', which may be related to their reflection on whether or not they are correctly applying the main concept of each directive to what each producer wants to communicate. however, this is part of the reflection process by producers regarding their communication through artifacts.

analysis of acceptance. we applied the technology acceptance model (tam) (venkatesh and bala, 2008) to analyze the participants' perception of the dcs in the post-study questionnaire. tam is one of the most adopted models for collecting information about the decision to accept or reject technologies (marangunić et al., 2013). this model is based on two constructs: perceived ease of use – the degree to which a user believes that they can use a specific technology with little effort; and perceived usefulness – the degree to which a user believes that using a specific technology would improve their performance at work. the user's behavioral intention to use a technology, the intention to use, is determined by the perceived ease of use and the perceived usefulness.
the statements contained in the post-study questionnaire to assess the constructs of ease of use, usefulness, and intention to use the dcs, adapted from venkatesh and bala (2008), are presented below:

perceived ease of use
e1. my interaction with the directives of communicability is clear and understandable.
e2. interacting with the directives of communicability does not require a lot of my mental effort.
e3. i find the directives of communicability easy to use.
e4. i find it easy to get the directives of communicability to do what i want them to do.

perceived usefulness
u1. using the directives of communicability improves my performance in understanding aspects of the software.
u2. using the directives of communicability in my job has improved my productivity, since i will not have to correct information that is not understood by colleagues.
u3. using the directives of communicability enhances my effectiveness in communicating with the team based on the artifacts.
u4. i consider the directives of communicability useful for software design.

intention to use
i1. assuming i had enough time to design software, i intend to use the directives of communicability.
i2. considering that i could choose any tool, i predict that i would use the directives of communicability.
i3. i plan to use the directives of communicability in my next project.

regarding the adapted tam statements, participants provided their answers on a seven-point likert scale (likert, 1932). the possible answers were "totally agree, strongly agree, partially agree, neutral, partially disagree, strongly disagree, and totally disagree". the participants answered their degree of agreement on the usefulness, ease of use, and intention to use the dcs in the production of artifacts. figure 6 summarizes the participants' answers.

figure 6. degree of participants' acceptance regarding the use of the dcs in the production of artifacts.

regarding the disagreements related to e2, e3 and e4 for ease of use, as shown in figure 6, we noticed that five participants answered this way, including p2, p5 and p6, who stated that it is not easy to employ the dcs, as represented by the 'difficulties with the use of the dcs' code presented earlier in this subsection. the other participants did not provide answers explaining why they disagreed with e2. in summary, such answers may indicate that it is important to provide material that helps in the producer's reflection based on the dcs. about the disagreement and neutral answers to the statements that measure usefulness, we noticed that p3, p6, and p11 answered this way. however, all participants who consume information from artifacts, i.e., developers, agreed that our proposal is useful for communication via artifact. overall, most of the participants' answers showed agreement regarding ease of use, usefulness, and intention to use the dcs. with this research (lopes et al., 2020), we observed that the dcs promoted the participants' reflection on their communication with the others involved in the development of a software product. the dcs also made it possible to reduce the introduction of defects, because we perceived a consistent mapping between the risks of miscommunication and software defects. additionally, most of the participants' answers about the dcs were positive regarding their use. with this, it is possible to infer that the software industry would consider the directives useful.
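as a side note on the analysis itself, seven-point likert answers like those summarized in figure 6 are commonly collapsed into agree/neutral/disagree counts. the python sketch below shows one way to do that; the answer list is invented for illustration and does not reproduce the study's data.

from collections import Counter

# hypothetical answers of 15 participants to one tam statement;
# the scale labels follow the paper, the counts are invented.
answers = [
    "totally agree", "strongly agree", "partially agree", "neutral",
    "partially disagree", "partially agree", "strongly agree",
    "totally agree", "partially agree", "neutral", "partially disagree",
    "strongly agree", "partially agree", "totally agree", "partially agree",
]

def collapse(answer: str) -> str:
    """collapses the seven-point scale into agree/neutral/disagree."""
    if "disagree" in answer:
        return "disagree"
    if "agree" in answer:
        return "agree"
    return "neutral"

summary = Counter(collapse(a) for a in answers)
print(summary)  # Counter({'agree': 11, 'neutral': 2, 'disagree': 2})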
based on the results obtained in this study, we decided to carry out a feasibility study in a software development team. this study may strengthen the indications for the transfer of the dcs to the industry.

4.1.4 threats to validity of study 1

in all experimental studies, there are threats that can affect the validity of the results. the threats related to this study are discussed below following the classification of threats to validity presented by wohlin et al. (2012):

internal validity. training effect – it would be interesting if there was no need for training. however, the short training time allowed the dcs to be used by practitioners during the production of the uml use cases. in addition, the training on use case modeling also enabled participants to execute the study activities, as most of them did not work creating artifacts that represent software decisions in projects. time used for the study – despite the time allotted being considered long for the use case modeling, all participants completed the study activities before the expected time.

external validity. validity of the artifacts – we carried out only the modeling of uml use cases in this study. it is not possible to claim that uml use cases represent all the artifacts that support communication. besides, the use cases were modeled for four software projects; it is not possible to claim that these represent all types of software.

construct validity. indicators for miscommunication – the measures adopted to analyze miscommunication were based on the semiotic engineering theory (de souza, 2005; de souza et al., 2016), which has different methods to assess communication during the development process.

conclusion validity. there is a limitation in the representativeness of the results, a known problem in experimental studies of software engineering (fernandez et al., 2012). the results obtained in this study may not be reproduced in other software artifacts that support the understanding of members of a team. analysis of artifacts – about the risks of miscommunication in use cases, there is a threat regarding the researcher who carried out such analysis. to mitigate this threat, we added another researcher to discuss this analysis.

4.2 study 2: feasibility study

in study 1, although we perceived positive answers about the dcs, we did not carry out the study in the context of a software project. we therefore carried out another study, our second study, in a software development team. we investigated the use of the dcs in the artifacts used by the team to identify risks that caused miscommunication in the development of the bulletin system (sisbol). sisbol is a web system with a client-server architecture, following the representational state transfer (rest) standard, with the purpose of automating the process of generating official bulletins and managing the members' personal history (changes of the military) of the brazilian army (eb). a bulletin represents an instrument by which the commander, chief or director of the eb disseminates the orders of the higher authorities and the facts that must be known by the military organizations in which the members participate.
sisbol is composed of entities associated with the military (such as qualification, graduation, subunit/division/section, military organization, function and alteration), entities associated with the bulletin structure (type of bulletin, section, part, general subject, specific subject, note), and entities associated with system users. notes are documents proposed by a competent authority to be approved by the commander, chief or director, for publication in their bulletin. the system has a certain degree of configurability, allowing the approval workflows for notes and bulletins to be customized for each military organization.

4.2.1 study 2: planning

we initially designed the study to analyze how the team conducted its activities and how software artifacts support communication. then, we planned the analysis of the artifacts with the support of the dcs to identify opportunities for improvement towards better communication. finally, we planned the collection of the team members' perception of the support of the artifacts. the team selected for the study was composed of 14 practitioners who developed the sisbol. table 6 shows the characterization of the team.

table 6. characterization of the team
systems analyst (product owner – po) – 20 years of experience
designer – 9 years
developer 1 – 7 years
developer 2 – 20 years
developer 3 – 5 years
developer 4 – 16 years
developer 5 – 4 years
developer 6 – 4 years
developer 7 – 19 years
developer 8 – 12 years
developer 9 – 12 years
developer 10 – 3 years
developer 11 – 10 years
developer 12 – 3 years

the scope of the new sisbol involves 30 functionalities, which were divided into legacy features (23) and new features (7). the development team used the agile scrum methodology. the artifact elaboration process was collaborative and involved different project stakeholders. the team used uml use cases and prototypes as the artifacts that contain the solution designed for software development. the practitioners did not create a domain model, just the use cases and mockups. about the experience of the producers, the systems analyst had twenty years of experience with uml and the designer had nine years of experience with prototypes in projects. regarding the system developed by the team, it was already in its final phase, as the team was only making corrections to some features.

4.2.2 study 2: execution

we carried out the following steps in this study:
(i) a meeting between the po and the main researcher in order to obtain an overview of the activities and the artifacts used as a means of communication;
(ii) a meeting between the main researcher and the team's producers to analyze the artifacts' content based on the dcs;
(iii) after that, we prepared an electronic questionnaire for producers to answer with their perceptions of the artifacts as a support for communication;
(iv) a meeting of the main researcher with an artifact consumer to understand how they used the artifacts;
(v) we also sent an electronic questionnaire for consumers to answer with their perceptions of the artifacts, with questions based on the dcs.
because some participants were unavailable for individual meetings, this questionnaire facilitated the collection of team members' perceptions. in relation to step 2, the main researcher should be present to support the producers and to collect their perception of the artifacts based on the dcs. the material used in these steps is available in a technical report (lopes et al., 2021). regarding step 3, the participants answered the following questions in the electronic questionnaire:

1. what is your perception of this artifact as a means of communication?
2. tell us about your perception regarding communication via artifact.

about step 5, we used the following questions to collect the consumers' perceptions of the software artifacts:

1. during the software development, did you notice any information inconsistent with the team's knowledge about the software? – based on dc1;
2. about the quantity of information, is there a lack of information or excessive information? – based on dc2;
3. is all information in the artifacts relevant to software development? please tell us your perception – based on dc3;
4. was it difficult to understand any information in the artifacts? – based on dc4.

with the execution of these steps, the dcs can help practitioners to understand aspects that need improvement in the informational content of the artifacts. these improvements can lead to better communication via artifacts.

4.2.3 study 2: results

firstly, the producers analyzed the artifacts with the support of the dcs, and we analyzed the types of defects related to the risks identified. after this, we analyzed the participants' answers.

analysis of software defects related to risks of communication failure. the main risks of miscommunication are in the use cases. such risks are related to the lack of updating of some information, identified with the support of dc1, and to the excess of information, identified with dc2. figure 7 presents a characterization of the identified risks.

figure 7. analysis of artifacts based on the dcs.

regarding defects related to the risks of miscommunication, we identified:
• lack of updating of some information in the use cases – inconsistency defect: the lack of updating led to inconsistent information in the artifact; and extraneous information defect: information not needed in the artifact.
• excess of information – ambiguity defect: the excess of information promotes different interpretations.

analysis of communication via the team's artifacts. we used open coding (strauss and corbin, 1998) to understand the team's communication via artifact and how the dcs can support the improvement of this type of communication. we applied the coding to the answers of producers and consumers about the artifacts' content. when analyzing the team's communication through the artifacts, we identified characteristics in the informational content that affected the communication. we noted that consumers had adopted the mockups more than the use cases to support their activities. the team's po, one of the producers of the artifacts, and the consumers mentioned:

"perhaps i put more information in the mockups than necessary and it led the team to not consult the use cases (systems analyst)".

"there was an excess of information in the documentation.
so many details generated several differences in the documentation for implementation and other minimal details that did not affect the system's functionality itself... with the use of the mockups, it was easier to understand the user's needs, and so the doubts that i had about the functioning of the system were resolved (developer 11)".

"with the mockups, half of the system's functionalities were well defined, with only the business rules missing, which could not be modeled visually (developer 12)".

the dcs indicated that consumers adopted the mockups as support in their activities more than the use cases due to the excess of information in the use cases (identified with the support of dc2), as also cited by developer 11. additionally, there was an outdated use case (identified with the support of dc1), generating a negative impact on communication via artifact, as observed by one of the consumers:

"throughout the development, i believe that the artifacts have become outdated in relation to the needs of users and the implementation of the system (developer 4)".

regarding communication via this team's artifacts, one of the producers reflected on their communication based on the dcs and believes there was a lack of another artifact to support the understanding of the user's interaction with the system (dc2):

"the artifacts contain the necessary information that the team needs to understand the problem. however, there are some limitations and information that cannot be transmitted in the artifacts. for example, the 'disposable mockup' presents only an idea of what the interface with the possible fields of the system will look like, but it does not present how it will be done, or even the user's interaction with the system (designer)".

with the results of this study, we noticed miscommunication via artifact identified with the support of the dcs. the dcs were able to support the producers in making improvements in the artifacts, enabling better communication via artifact.

4.2.4 threats to validity of study 2

the threats related to this study are discussed below following the classification of threats to validity presented by wohlin et al. (2012):

internal validity. the main threat to internal validity was the sharing of developers' perceptions of the artifacts. to mitigate this threat, we sent an electronic questionnaire to each participant so that they could answer individually. however, this does not eliminate the possibility of communication between the participants.

external validity. regarding the artifacts evaluated in this study, it is not possible to state that they represent all the artifacts that support communication. additionally, these artifacts were modeled for just one software project.

construct validity. we identified the threat of participants providing answers that do not reflect reality but rather personal expectations regarding the artifacts. to mitigate this threat, we informed the participants that the experiment did not involve any kind of personal or project assessment, but rather an assessment of the use of artifacts in support of communication.

conclusion validity. there is a limitation in the representativeness of the results, this being a known problem in experimental studies of software engineering (fernandez et al., 2012).
The results obtained in this study may not be reproducible with other software artifacts that support the understanding of those involved in the production of systems.

4.3 Lessons learned

These studies helped us understand different aspects of the DCs from the practitioners' perception. For Study 1, we describe our lessons learned below.

• Disagreements about the ease of use of the DCs show the need to create material that supports the application of each directive. Although most participants agreed that the DCs are easy to use, we noticed some disagreement about this. The DCs are general instructions that support the producers' reflection on their communication via artifacts, and there are no specific steps for that. However, to help producers employ the directives, material that indicates some reflection points would be useful. Such material can be created based on common scenarios observed in both studies presented in this paper.

• The usefulness perceived by practitioners who act as consumers indicates that our proposal can support communication via artifacts. The usefulness perceived by practitioners who work as developers indicated that our proposal supports mutual understanding between producers and consumers, since such participants may have experienced such a scenario.

For Study 2, we noticed that the DCs supported practitioners in the evaluation of artifacts already used by a software team. We drew the following lessons from our proposal in this study:

• Consumers' perceptions during the evaluation of artifacts improve this type of communication. Regarding the use of the DCs in the evaluation of artifacts already used by software teams, both producers and consumers can perform the evaluation, providing contrasting views of communication via artifacts within a team. Such practice supports continuous improvement of this type of communication.

• Material is needed to support producers in adopting the DCs in software projects. We designed the initial proposal of the DCs to apply them during the production of artifacts, but we noticed the potential of the directives for evaluating artifacts already used by teams. To help software engineers adopt the DCs in their projects, it would be worthwhile to develop procedures that indicate the main steps to apply the DCs.

The next section presents the proposal of the materials prepared to support software engineers in adopting the DCs in their projects. We created such proposals based on our lessons learned.

5 Proposal to support the application of the DCs in software projects

Each directive aims to provide a general indication of how artifacts can be expressed by their producers with respect to their communication, so that the risks of miscommunication are mitigated. Regarding the adoption of the DCs to support improving the communicability of a software artifact, we observed two contexts in which artifacts are used: (1) when they are already being used by a team during the execution of a project, and (2) before being used by the team, when the project is in its initial stages. For these contexts, we created procedures to help practitioners who wish to adopt the DCs in their projects.
Figure 8 presents a procedure to be followed by practitioners who wish to adopt the DCs to identify opportunities for improvement in the artifacts. This procedure is suggested for teams that started creating artifacts without the support of our proposal but would like to adopt it for those artifacts, as noted in the second study presented in this paper:

1. Communication intent. Practitioners should reflect on their communication intent based on the questions "Can the consumer (e.g., developers and testers) understand the artifacts' content? Can the consumer achieve their goals?" and "What content should be addressed about the domain of the problem/solution of the system in the artifact?" (e.g., the tasks that a user can perform in the system).

2. Use of the DCs in the artifacts' content. Use the DCs to identify risks that caused miscommunication. To facilitate the use of the DCs, we prepared a checklist, presented later in the text. At this stage, producers and consumers can carry out the evaluation for a better understanding of the necessary improvements.

3. Availability of artifacts. With the improvements made, producers make the artifacts available to consumers through an appropriate means, such as e-mail or the repository used by the team, as this also affects communication via artifacts.

Figure 8. Use of the DCs during the execution of projects.

For the second context, Figure 9 shows the procedure that practitioners can adopt when using the DCs before the production of the artifacts. Each step of the procedure is described below. With the DCs applied to the artifacts before their consumption, the risks that cause miscommunication can be reduced.

Figure 9. Adoption of the DCs in project planning.

1. Modeling notation. It is important for producers to reflect on the notation that will be adopted when modeling the artifacts to represent aspects of the software. Additionally, it is important for producers to reflect on whether such notation is known to consumers. This step was not considered in the first context because the team already has the artifacts established to represent the solutions modeled for the software.

2. Communication intent. Similarly to the first context, practitioners must reflect on their communication intent based on the questions proposed for use with the DCs.

3. Use of the DCs in the artifacts' content. Use the DCs to reflect on the producers' communication intent. The checklist also supports this reflection.

4. Availability of artifacts. Producers should reflect on the best means of communication through which the artifacts should be made available to consumers, such as e-mail or a repository, as it can affect communication via artifacts.

In addition to the procedures, we also developed checklists that can facilitate the application of the DCs to the artifacts investigated in our research, namely UML use cases and mockups. Table 7 presents the checklist for mockups, and Table 8 presents the checklist for UML use cases.

Table 7. Checklist based on the DCs for mockups.

DC1:
- Is there information in the mockups that is outside the problem domain? If so, remove that information.
- Is there outdated information in the mockups? If so, update it.

DC2:
- Are all requirements represented in the mockups? If not, design mockups with such information.
- Are all alternative paths represented in the mockups? If not, enter this information in the mockups.
- In general, is the amount of information in the mockups sufficient for the team to understand the system? If not, enter the required amount of information.
- Is there an excess of information? If this excess is unnecessary for understanding the system, remove it from the mockups.

DC3:
- Is the order of the screens organized in such a way that the team can better understand them? If not, arrange this sequence.

DC4:
- Are the screen names clear in relation to their purpose? If not, clarify the names of the screens.
- In the mockups, are there any terms that are unknown to consumers? If so, clarify such terms.
- In the mockups, is there any ambiguous information? If so, clarify this information.
- Is information used to obtain an implicit interpretation by the team? If so, reflect on whether such information should be expressed explicitly to avoid multiple interpretations.

Table 8. Checklist based on the DCs for use cases.

DC1:
- Is there information in the use cases that is outside the problem domain? If so, remove that information.
- Is there outdated information in the use cases? If so, update this information.

DC2:
- Are all relationships between use cases represented in the diagram? If not, enter such relationships.
- Are all use cases represented in the diagram? If not, insert such use cases.
- In the use case specifications, are all actors involved represented? If not, insert such actors.
- When specifying a use case, are all flows represented? If not, enter the necessary flows.
- When specifying a use case, are all business rules represented? If not, insert the necessary rules.
- Is there an excess of information? If this excess is unnecessary for understanding the system, remove it from the use cases.

DC3:
- Are the use cases organized logically in the diagram? If not, organize the use cases.
- Are the actors organized with respect to the use cases in the diagram? If not, organize the actors in the diagram.
- Is the sequence of information in each use case specification logically organized? If not, organize this information.

DC4:
- Are the names of the use cases clear with respect to their purpose? If not, clarify the names of the use cases.
- Are the names of the actors clear with respect to their purpose? If not, clarify the actors.
- In the use case specifications, are there any terms that are unknown to consumers? If so, clarify such terms.
- When specifying a use case, is there any ambiguous information? If so, clarify this information.

These checklists contain questions based on common artifact scenarios that carry risks of miscommunication. However, we emphasize that the DCs help practitioners reflect on the artifacts, while the checklists support the identification of specific risks. Therefore, the checklists should be used together with the DCs.

6 Discussion

We carried out studies with the objective of transferring the DCs to the software industry. In the first study, conducted to answer RQ1 (Do practitioners perceive the DCs as support in improving the quality of artifacts?), we noticed that the directives supported the participants' reflection on communication via UML use cases. This allowed reducing possible inconsistencies in the development of the explored artifact. It was possible to obtain evidence that the DCs can contribute to improving the artifacts' quality, since the DCs supported the reduction of incorrect information.
This can reduce costs during software development, as defects discovered during the software development process increase costs due to their correction.

The second study aimed to understand the feasibility of the DCs in supporting improvements in the communicability of software artifacts used by a team, in order to answer RQ2 (Is the application of the DCs by producers feasible in development teams?). The use of the DCs revealed the main aspects that needed improvement, since they negatively affected the communication between the producers and consumers of these artifacts. The results of this study demonstrated the benefit of using the DCs, as the problems identified in the informational content of the artifacts can be fixed.

Both studies provided evidence for transferring the DCs to industry. In addition, such studies helped us obtain insights for the development of proposals that facilitate the adoption of the DCs in software projects.

7 Final considerations

This paper presented research carried out with the aim of transferring the DCs to the software industry. We explored the practitioners' perception of the DCs as support for their reflection as producers, in the first study, and a specific software development team's use of the DCs to identify the risks in software artifacts that caused miscommunication, in the second study.

In the first study, the results showed that the DCs supported participants in reflecting on the system, reducing possible inconsistencies in the development of the explored artifact, a UML use case. The DCs also promoted the participants' reflection on their communication with the others involved in the software development. The reduction of miscommunication also reduces the introduction of defects, as a consistent mapping between risks of miscommunication and software defects was perceived. In the second study, based on the risks identified in the artifacts used by the software development team, producers made improvements in the artifacts. With that, software development teams will be able to adopt the DCs in their projects to improve communication via artifacts. Additionally, most of the participants' perceptions of the DCs were positive.

From the studies' results, we noticed the need to define supporting materials so that practitioners can use our proposal in their projects. We presented in this paper two procedures that facilitate the use of the DCs in software projects. Besides, for the better use of each directive, we proposed checklists. We believe that practitioners interested in adopting our proposal can use them. Regarding the use of the DCs in these studies, it is possible to infer that they were considered feasible for the software industry. New studies in the context of software projects can provide more evidence on the application of the DCs to support producers in their communication, aiming to reduce the risks of miscommunication.

As future work, we intend to carry out an observational study with different software development teams using our proposal. In this future study, the teams will use the artifacts, process, and checklists proposed in this paper, including the evaluation of artifacts developed in the early stages of the software development process, which was not explored in our studies.
In addition, we intend to investigate software engineers' perceptions of including the DCs as part of a company's culture related to the creation of artifacts used as means of communication.

Acknowledgements

We are grateful for the financial support from CAPES (Financing Code 001), CNPq (311494/2017-0 and 204081/2018-1/PDE), and FAPEAM (062.00150/2020).

References

Brambilla, M., & Fraternali, P. (2014). Interaction Flow Modeling Language: Model-Driven UI Engineering of Web and Mobile Apps with IFML. Morgan Kaufmann.

Bordin, S., & De Angeli, A. (2016). Focal points for a more user-centered agile development. In International Conference on Agile Software Development, 3-15.

Corbin, J., & Strauss, A. (2014). Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. Sage Publications.

De Souza, C. S. (2005). The Semiotic Engineering of Human-Computer Interaction. MIT Press.

De Souza, C. S., Cerqueira, R. D. G., Afonso, L. M., Brandão, R. D. M., & Ferreira, J. S. J. (2016). Software Developers as Users. Cham: Springer International Publishing.

Freire, E. S. S., Oliveira, G. C., & de Sousa Gomes, M. E. (2018). Analysis of open-source CASE tools for supporting software modeling process with UML. In Proceedings of the 17th Brazilian Symposium on Software Quality, 51-60.

Granda, M. F., Condori-Fernández, N., Vos, T. E., & Pastor, O. (2015). What do we know about the defect types detected in conceptual models? In 2015 IEEE 9th International Conference on Research Challenges in Information Science (RCIS), 88-99.

Grice, H. P. (1975). Logic and conversation. In Speech Acts, Brill, 41-58.

Jakobson, R. (1960). Linguistics and poetics. In Style in Language. MA: MIT Press, 350-377.

Käfer, V. (2017). Summarizing software engineering communication artifacts from different sources. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 1038-1041.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 144(55), 7-10.

Lopes, A., Oliveira, E., Conte, T., & de Souza, C. S. (2019a). Directives of communicability: Towards better communication through software models. In 2019 IEEE/ACM 12th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), 45-48.

Lopes, A., Conte, T., & de Souza, C. S. (2019b). Reducing the risks of communication failures through software models. In Proceedings of the 18th Brazilian Symposium on Human Factors in Computing Systems, 1-10.

Lopes, A., Conte, T., & de Souza, C. S. (2020). Exploring the directives of communicability for improving the quality of software artifacts. In Proceedings of the XIX Brazilian Symposium on Software Quality (SBQS'20), 10 pages.

Lopes, A., Conte, T., & de Souza, C. S. (2021). Directives of communicability: Towards software development teams. USES Research Group Technical Report, TR-USES-2021-01. https://doi.org/10.6084/m9.figshare.15057984.v2

Marangunić, N., & Granić, A. (2015). Technology acceptance model: A literature review from 1986 to 2013. Universal Access in the Information Society, 14(1), 81-95.

OMG. (2011). Business Process Model and Notation (BPMN) version 2.0. Object Management Group, 1(4).

OMG. (2015). Unified Modeling Language (UML) version 2.5.

Petre, M. (2013). UML in practice.
In Proceedings of the 2013 International Conference on Software Engineering (ICSE 2013), 722-731.

Khare, R., & Taylor, R. N. (2004). Extending the Representational State Transfer (REST) architectural style for decentralized systems. In Proceedings of the 26th International Conference on Software Engineering, 428-437.

Sebe, N. (2010). Human-centered computing. In Handbook of Ambient Intelligence and Smart Environments, Springer, Boston, MA, 349-370.

Schoonewille, H. H., Heijstek, W., Chaudron, M. R., & Kühne, T. (2011). A cognitive perspective on developer comprehension of software design documentation. In Proceedings of the 29th ACM International Conference on Design of Communication, 211-218.

Tilley, S. (2009). Documenting software systems with views VI: Lessons learned from 15 years of research & practice. In Proceedings of the 27th ACM International Conference on Design of Communication, 239-244.

Venkatesh, V., & Bala, H. (2008). Technology Acceptance Model 3 and a research agenda on interventions. Decision Sciences, 39(2), 273-315.

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2012). Experimentation in Software Engineering. Springer Science & Business Media.

Journal of Software Engineering Research and Development, 2021, 9:9, doi: 10.5753/jserd.2021.1802. This work is licensed under a Creative Commons Attribution 4.0 International License.

How are test smells treated in the wild? A tale of two empirical studies

Nildo Silva Junior [ Federal University of Bahia | nildo.silva@ufba.br ]
Luana Martins [ Federal University of Bahia | martins.luana@ufba.br ]
Larissa Rocha [ Federal University of Bahia / State Univ. of Feira de Santana | lrsoares@uefs.br ]
Heitor Costa [ Federal University of Lavras | heitor@ufla.br ]
Ivan Machado [ Federal University of Bahia | ivan.machado@ufba.br ]

Abstract

Developing test code may be a time-consuming process that requires much effort and cost, especially when done manually. In addition, during this process, developers and testers are likely to adopt bad design choices, which may lead to the introduction of the so-called test smells in the test code. As the size of test code with test smells increases, these tests might become more complex and, as a consequence, much more challenging to understand and evolve correctly. Therefore, test smells may harm test code quality and maintenance and undermine the whole set of software testing activities. In this context, this study aims to understand whether software testing practitioners unintentionally insert test smells when they implement test code. We first carried out an expert survey to analyze the usage frequency of a set of test smells and then interviews to reach a deeper understanding of how practitioners deal with test smells. Sixty professionals participated in the survey, and fifty professionals participated in the interviews. The results indicate that experienced professionals introduce test smells during their daily programming tasks, even when using their companies' standardized practices. Additionally, tools support test development and quality improvement, but most interviewees are not aware of the concept of test smells.

Keywords: test smells, survey study, interview study, mixed-method research

1 Introduction

Software projects, both commercial and open-source ones, commonly include a set of automated test suites as one crucial support to verify software quality (Garousi and Felderer, 2016).
However, creating test code may require high effort and cost (Wiederseiner et al., 2010; Yusifoğlu et al., 2015; Garousi and Felderer, 2016). Automated test generation tools, such as Randoop (https://randoop.github.io/randoop/), JWalk (http://staffwww.dcs.shef.ac.uk/people/a.simons/jwalk/), and EvoSuite (http://www.evosuite.org/), emerge as alternatives to facilitate and streamline this activity. If designed with high quality, automated testing offers benefits over manual testing, such as repeatability, predictability, and efficient test runs, requiring less effort and cost (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). Therefore, tests should be concise, repeatable, robust, sufficient, necessary, clear, efficient, specific, independent, maintainable, and traceable (Meszaros et al., 2003).

However, the development of well-designed test code is neither straightforward nor simple. Developers are usually under time pressure and must deal with constrained budgets, which can stimulate anti-patterns in test code, leading to the occurrence of the so-called test smells. Test smells are indicators of poor implementation solutions and problems in test code design (Greiler et al., 2013). The presence of test smells in test code may lead to reduced quality and, consequently, the test code may not reach its expected capability of finding bugs while remaining understandable, maintainable, and so on (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). The literature reports 196 test smell types classified into the following groups (Garousi and Küçük, 2018): behavior, logic, design-related, issues in test steps, mock- and stub-related, association in production code, code-related, and dependencies.

The literature presents studies that aimed to identify and analyze the effect of test smells on software projects in several respects (Greiler et al., 2013; Garousi and Felderer, 2016; Van Rompaey et al., 2006). In those studies, the authors introduce test smells as non-functional quality attributes within the software test code engineering process. In addition, they discuss existing test smell types and their consequences in terms of test code maintenance (Garousi and Felderer, 2016). Some authors attempted to correlate metrics and the presence of test smells (Greiler et al., 2013). However, few discussions about the daily practices and programming styles that may contribute to inserting test smells exist in the literature. Understanding the relationship between development practices and the introduction of test smells may support improving the activity of test creation.

This study extends our previous investigation (Silva Junior et al., 2020), which aimed to understand whether software testing practitioners (for simplicity, "practitioners") unintentionally insert test smells. We used an expert survey with sixty practitioners from Brazilian companies to analyze which practices that might introduce test smells they adopt during test creation and execution, and how often. In this extension, we sought to understand (i) how much the practitioners know about test smells and (ii) how the practitioners deal with test code quality regarding test smells. To identify whether and to what extent the practitioners know about test smells and how they deal with them, we interviewed fifty practitioners.
The results from both studies are complementary. We found that most of the interviewees did not know anything about the concept of test smells. They commonly used practices that introduced test smells, but they hardly ever removed them from the test code.

We mapped which daily programming practices would be associated with each test smell for both test creation and execution. Then, we asked the practitioners whether they used those practices, without the need to name the test smells. We used the interviews to complement the survey and analyze the practitioners' unit test creation, maintenance, and quality verification activities. In addition, we investigated the practitioners' knowledge about test smells and how they treat those smells during unit test creation and maintenance. Our study may provide insights to understand how and which practices may introduce test smells in test code. In addition, we present the practitioners' point of view about activities related to unit test code and their beliefs about the treatment of test smells. Thus, we investigated the following research questions:

RQ1: Do practitioners use test case design practices that might lead to the introduction of test smells? We investigated whether bad design choices may be related to test smells.

RQ2: Which practices present in practitioners' daily activities lead to introducing test smells? We investigated which test smells are associated with the most frequent practitioners' practices.

RQ3: Does the practitioners' experience interfere with the introduction of test smells? We investigated whether, over time, practitioners improve the activity of test creation.

RQ4: How aware of test smells are the practitioners? We investigated the practitioners' knowledge of test smells.

RQ5: What practices have practitioners employed to treat test smells? We investigated how the practitioners deal with test smells in their daily activities.

The remainder of this article is structured as follows: Section 2 introduces the concept of test smells; Section 3 details the research method applied in this study; Section 4 presents the survey's design and results; Section 5 presents the interview's design and results; Section 6 discusses the main findings of this investigation; Section 7 presents the threats to validity; Section 8 discusses related work; and Section 9 draws concluding remarks.

2 Test smells

Automated tests may generate more efficient results when compared to manually executed ones. Due to their repeatability and absence of human interference, automated tests might reduce execution time and effort (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). However, developing test code is not a trivial task, and automated tools may not ensure system quality because they can generate poor designs (Palomba et al., 2016; Virgínio et al., 2019).
In real-world practice, developers are likely to use anti-patterns during test creation and evolution, leading to errors in implementing test code (Van Deursen et al., 2001; Bavota et al., 2012). These anti-patterns may negatively impact test code maintenance (Van Rompaey et al., 2006).

Several studies investigated different types of test smells. Initially, Van Deursen et al. (2001) defined a catalog of 11 test smells and refactorings (to remove test smells from the test code). After that, other authors extended this catalog and analyzed the effects of the smells on production and test code (Van Deursen et al., 2001; Meszaros et al., 2003; Van Rompaey et al., 2006; Bavota et al., 2012; Greiler et al., 2013; Bavota et al., 2015; Garousi and Felderer, 2016; Palomba et al., 2016; Peruma, 2018; Virgínio et al., 2019; Virgínio et al., 2020). For example, Garousi and Küçük (2018) identified more than 190 test smells in a literature review of 166 studies.

In this study, we selected 14 types of test smells frequently studied and implemented in cutting-edge test smell detection tools (Van Deursen et al., 2001; Meszaros et al., 2003; Peruma, 2018). These are described next (a short code sketch illustrating a few of them follows the list):

• Assertion Roulette (AR). A test method that contains assertions without explanation. If one of those assertions fails, it is not possible to identify which one caused the problem (Van Deursen et al., 2001);
• Conditional Test Logic (CTL). A test method with conditional logic (if-else or repeat instructions). Tests with this structure do not guarantee that the same flow is verified, as they might not test a specific code piece (Meszaros et al., 2003);
• Constructor Initialization (CI). A test class that presents a constructor method instead of a setup method to initialize fields (Peruma, 2018);
• Eager Test (ET). A test method that checks many object methods at the same time. This test may be hard to understand and execute (Van Deursen et al., 2001);
• Empty Test (EPT). A test method that does not contain executable assertions (Peruma, 2018);
• For Testers Only (FTO). A production class that has methods only used by test methods (Van Deursen et al., 2001);
• General Fixture (GF). The fields instantiated in the setup method are not used by all test methods of a test class. It may be hard to read and understand and may slow down test execution (Van Deursen et al., 2001);
• Indirect Testing (IT). A test class that has methods performing tests on different objects because there are references to those objects in the test class (Van Deursen et al., 2001);
• Magic Numbers (MN). A test method that contains assertions with literal numbers as test parameters (Meszaros et al., 2003);
• Mystery Guest (MG). A test method that uses an external resource, such as a file with test data. If the external file is removed, the tests may fail (Van Deursen et al., 2001);
• Redundant Print (RP). A test method that contains irrelevant print statements (Peruma, 2018);
• Resource Optimism (RO). A test method that contains optimistic assumptions about the presence or absence of external resources. The test may return a positive result once, but it may fail at other times (Van Deursen et al., 2001);
• Test Code Duplication (TCD). A test method that has undesired duplication (Van Deursen et al., 2001);
• Test Run War (TRW). A test method that fails when several tests run simultaneously and access the same fixtures (Van Deursen et al., 2001).
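To make a few of these definitions concrete, the sketch below shows Assertion Roulette, Conditional Test Logic, and Magic Numbers together in a single JUnit 5 test method. This is our own minimal illustration, not an example taken from the studies; the Cart class and its methods are hypothetical and inlined only to keep the sketch self-contained.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class CartTest {

        // Hypothetical production class, inlined to keep the sketch runnable.
        static class Cart {
            private int items = 0;
            void add(int n) { items += n; }
            int size() { return items; }
            int shippingCost() { return items > 3 ? 0 : 10; }
        }

        @Test
        void testCart() {
            Cart cart = new Cart();
            cart.add(2);
            cart.add(3);

            // Assertion Roulette + Magic Numbers: unexplained assertions
            // whose literal values (5, 0) carry no named domain meaning.
            assertEquals(5, cart.size());
            assertEquals(0, cart.shippingCost());

            // Conditional Test Logic: branching means different runs may
            // exercise different assertions, so the verified flow is unclear.
            if (cart.size() > 3) {
                assertEquals(0, cart.shippingCost());
            }
        }
    }

Splitting this method into one assertion per behavior, removing the branch, and naming the expected values would address all three smells.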
3 Research method

We carried out two empirical studies in this investigation: a survey and an interview study (Miles et al., 2014). Figure 1 shows the methodological steps employed in this study.

Figure 1. Research method overview.

Initially, we designed our study by defining the research questions and the suitable research methods to investigate them (Fig. 1, Design). We used the survey research method to identify which programming practices respondents (practitioners who participated in the survey) adopt that might insert test smells in the test code (Fig. 1, Survey). We next applied the interview study method to identify how the interviewees (practitioners who participated in the interview) deal with test smells during test creation and execution (Fig. 1, Interview). We compared the results obtained from both the survey and the interviews to relate the adoption of practices that might lead to introducing test smells with the practitioners' knowledge about test smells from different perspectives (Fig. 1, Data comparison).

For the survey, we adopted the design of observation by case-control. Case-control is a descriptive design used to investigate previous situations to support understanding a current phenomenon (Pfleeger and Kitchenham, 2001). It encompasses activities for the design, application, and analysis of a survey questionnaire. We designed the questionnaire not to require specific knowledge about test smells. We correlated each test smell to a set of programming practices, which the participants should read and analyze. Section 4 details the survey study.

To complement the findings of the survey questionnaire, we carried out semi-structured interviews (Singer et al., 2008; Gubrium et al., 2012). The interview's structure aims to capture the interviewees' perception of test smells. As we needed the interviewees to know the definition of test smells in order to elaborate on how they deal with them, we first introduced them to the concept of test smells. Section 5 details the interview study. The survey and interview instruments were written and applied in the Portuguese language with Brazilian practitioners. Finally, the data comparison summarizes the survey and interview results to answer the research questions (Creswell and Clark, 2018). Section 6 presents the results.

4 Survey study

We applied the survey research method to investigate how the respondents commonly insert test smells in the test code when designing or implementing their software projects (Melegati and Wang, 2020). Throughout this section, we provide readers with detailed information about the research design and data analysis. All material used in the survey study, including the dataset, is publicly available at (Junior et al., 2021).

4.1 Design

We structured the questionnaire so that the respondents were not required to be aware of test smells beforehand. Thus, we covered a larger number of potential practitioners. We correlated the concepts of test smells to commonly applied test creation and execution practices. Table 1 shows examples of those practices (one of them is illustrated in the code sketch after the table). For instance, the practices associated with Conditional Test Logic (CTL) use loops or conditions in the test code. In this case, the respondents should analyze the practices to determine whether and how often they adopt them. For CTL, the respondents should indicate how often they create tests with those structures or face them during test execution.

Table 1. Examples of practices related to test smells.

Mystery Guest
- Test creation practice: "I often create test cases using some configuration file (or supplementary file) as support."
- Test execution practice: "A test case fails due to the unavailability of access to a configuration file."

Eager Test
- Test creation practice: "I often create tests with a high number of parameters (number of files, database records, etc.)."
- Test execution practice: "I run some tests without understanding what their purpose is."

Assertion Roulette
- Test creation practice: "I pack different test cases into one (i.e., put together tests that could be run separately)."
- Test execution practice: "Some tests fail, and it is not possible to identify the failure cause."
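As an illustration of the first practice in Table 1, the sketch below contrasts a Mystery Guest test, whose expected values hide in an external file, with a smell-free alternative that inlines its fixture. This is our own minimal JUnit 5 example, not material from the survey; the discount logic and the testdata/discounts.csv path are hypothetical.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    import org.junit.jupiter.api.Test;

    class DiscountTest {

        // Mystery Guest: the expected values live in an external file, so
        // the test cannot be understood on its own and fails whenever the
        // file is missing or stale.
        @Test
        void discountFromExternalFile() throws Exception {
            List<String> lines = Files.readAllLines(Path.of("testdata/discounts.csv"));
            String[] fields = lines.get(0).split(",");
            int price = Integer.parseInt(fields[0]);
            int expected = Integer.parseInt(fields[1]);
            assertEquals(expected, applyDiscount(price));
        }

        // Smell-free alternative: the fixture is inlined, so the intent is
        // visible and the test has no external dependency.
        @Test
        void tenPercentDiscountOnOneHundred() {
            assertEquals(90, applyDiscount(100));
        }

        // Hypothetical production logic, inlined to keep the sketch runnable.
        private static int applyDiscount(int price) {
            return price - price / 10;
        }
    }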
For Testers Only
- Test creation practice: "I have already created a test to validate some feature that will not be used in the production environment."
- Test execution practice: "I run some tests to validate features that will not be used in the production environment."

Conditional Test Logic
- Test creation practice: "I have already created conditional or repeating tests."
- Test execution practice: "I run tests with conditional or repeating structures."

Empty Test
- Test creation practice: "I have already created an empty test with no executable statement."
- Test execution practice: "I find empty tests, with no executable statement."

Questionnaire instrument. The questionnaire comprises three blocks of questions. The first block characterizes the respondents (profile) and has thirteen questions to identify their age, gender, education degree, and software testing/programming skills.

The second block has fourteen statements and six complementary questions (four objective and two open-ended questions). The statements describe creation practices related to test smells. We structured those statements on a five-point Likert scale, where the respondents could choose one of the following answers: always, frequently, rarely, never, or not applicable. On this scale, "always" indicates the adoption of bad practices for test creation. For example, the statement "I have already created a test to validate a feature that would not be used in the production environment" corresponds to the For Testers Only test smell. Therefore, the answer "always" means that the respondent usually uses that practice in her daily tasks. As a consequence, it is likely that she unintentionally inserts that test smell in the test code. We designed the six complementary questions to understand how the practitioners deal with the test creation activity.

The third block has fourteen statements and one additional question. Those statements describe execution practices related to test smells. Like the former block, we structured those statements on a five-point Likert scale. The respondents could choose one of the following answers: always, frequently, rarely, never, or not applicable, where "always" indicates that the respondent comes across test smells. We designed the complementary question to understand which problems the respondents deal with when executing the tests. The survey was available from April 3rd, 2019, to June 3rd, 2019. Appendix A includes all the questionnaire statements and questions used in this study.

Pilot application. We ran a pilot survey with four practitioners to identify improvement opportunities. Based on the responses, we improved the questionnaire before running the survey. It is worth mentioning that we did not include data gathered in the pilot application in the research results.

Participants. We sent invitations and one questionnaire copy (C1 to C8) to practitioners from eight Brazilian companies on a convenience sampling basis. The questionnaire's different versions served to control the number of respondents from the companies.
Those companies have 4 to 66 practitioners who perform manual and automated tests (Table 2). In addition, we also sent the questionnaire through direct messages (D1) and posted it in a Facebook group dedicated to discussing software testing (G1). In total, we contacted 305 practitioners, and 60 practitioners participated in the survey (#S1 to #S60).

Table 2. Respondents (source: professionals contacted / answers).
C1: 66 / 14
C2: 30 / 1
C3: 10 / 0
C4: 6 / 0
C5: 5 / 0
C6: 4 / 4
C7: 4 / 4
C8: 4 / 0
D1: 52 / 35
G1: 124 / 2
Total: 305 / 60

Analysis procedure. To answer RQ1, we analyzed the objective questions (statements) on test creation (second block) and execution (third block). To answer RQ2, we grouped the practices by frequency to identify the most commonly used ones. The practices may be associated with test smells according to their characteristics, such as external file usage, conditional structures, and programming style. To answer RQ3, we compared professional experience with the frequency of practices related to test smells. We used the same answer format as for RQ1 but only considered test creation (second block), since during test execution respondents identify test smells instead of creating them.

We analyzed the three open-ended questions through coding and continuous comparison (Kitchenham et al., 2015). The objective was to understand why the respondents use practices that may insert test smells. In addition, we also intended to understand which difficulties they encounter when creating and executing tests. Two researchers performed the coding task and validated it by consensus. We also associated some practices with the test code characteristics defined by Meszaros et al. (2003). We employed open coding on the collected data to identify additional reasons why the respondents may use bad practices in their software testing activities. The obtained codes were peer-reviewed and changed upon agreement with the paper authors. We used coding to complement our results on the open-ended questions because they were optional.

4.2 Results

We received 60 answers (out of 305 potential respondents) from three Brazilian states: 40 respondents from Bahia (66.7%), 19 from São Paulo (31.7%), and one from Paraná (1.6%). The respondents ranged from 22 to 41 years old, and their experience with quality assurance ranged from 0 to 13 years (5.16 on average). Experience as software developers also ranged from 0 to 13 years (1.67 on average). Regarding gender, 35 respondents were male (65%), 19 were female (32%), and two were non-binary (3%).

Most of the respondents hold a degree in computer science-related courses (50 respondents, 83.3%), six respondents (10%) hold a degree in other STEM (science, technology, engineering, and mathematics) courses, and four respondents (6.7%) hold a degree in other areas. Most of the respondents (54 respondents, 90%) pursued higher education degrees, as follows: 40 respondents hold a bachelor's degree (66.7%), 13 hold a graduate degree (21.7%), and one holds a postdoc (1.6%).

Regarding the software testing tasks they commonly perform: (i) 26 respondents reported they create and run tests at the same rate (43.3%); (ii) 13 respondents execute tests more frequently than they create them (21.7%); and (iii) 8 respondents create tests more frequently than they execute them (13.3%). Moreover, 12 respondents only execute test cases (20%), and one respondent only creates test cases (1.7%).
They perform tests on many different platforms; 35 respondents (58%) work with two or more platforms (web: 39 respondents, 65%; Android: 35 respondents, 58%; desktop: 29 respondents, 48%; and Apple: 17 respondents, 28%). They also cited other platforms, such as backend, microservices, API, mainframe, and cable TV (one respondent each, 1.67%).

In terms of domain, 39 respondents claimed they test mobile applications (65%), and 36 respondents test web applications (60%). We also identified the following domains: 14 respondents work with embedded systems (23.33%), 11 work with cloud computing (18.33%), seven test information security (11.67%), and four test Internet of Things systems (6.67%). They also mentioned other domains: big data, retail, artificial intelligence, cable TV, bioinformatics, commercial information, desktop systems, and payment solutions (one respondent each, 1.67%).

4.2.1 Test creation and execution practices

We asked whether the respondents search for test duplication and whether it was a personal or company practice. Twenty-nine respondents (48.3%) answered that it was only an individual activity. Eleven (18.3%) responded that it was only a company practice, and three respondents (5%) claimed that it was both a personal and a company activity. However, seventeen respondents (28.3%) do not perform this activity. Checking tests with the same objective reduces the Test Code Duplication (TCD) test smell.

In addition, we established a relationship between the test creation and execution practices and the occurrence of test smells using the collected data. Figures 2 and 3 show the frequency of test smells during the test creation and execution activities, respectively.

Figure 2. Test smells frequency in test creation.

During test creation, the Conditional Test Logic (CTL) and General Fixture (GF) test smells were the most reported ones. The former obtained 28 (47%) "always" and "frequently" responses, and the latter 27 (45%) (Figure 2). The high rate of those responses may indicate common everyday use of practices related to CTL and GF. We also analyzed why developers create tests with bad practices (one open-ended, non-mandatory question, answered by 27 respondents, 45%). The main reasons were related to company or personally employed standards, limited time, and the attempt to reach better coverage and efficiency.

We also asked whether they modified existing test sets when they came across tests containing any of the problematic test patterns illustrated in the survey. We found that seven respondents (11.7%) always perform test code changes, twenty-three (38.3%) frequently change, sixteen (26.6%) rarely change, seven (11.7%) never edit test code, and seven (11.7%) answered not applicable. Among the reasons to modify the tests, eighteen respondents reported ambiguity reduction (30%), sixteen claimed execution speed improvement (26.7%), fourteen stated adequacy to the company standards (23.3%), eight did not understand the test objective (13.3%), and four stated evolution of the corresponding production class (6.7%).
thirtyone respondents in dicated that some tests depended on third party resources (52%), 29 respondents reported that they were hard to under stand (48%), 24 respondents claimed to contain unnecessary information (40%), 24 respondents said ambiguous informa tion (40%), 20 respondents reported to depend on external files (33%), six respondents pointed to use an external config uration file (10%). one respondent presented resources limi tation (2%). regarding difficulties in creating test cases (one open ended nonmandatory question answered by 23 respondents (38%)), requirement issues were the most frequent ones, re ported by twelve respondents (52%). other problems were related to the difficulties in the test code reuse, lack of knowl edge, production code issues, code coverage, test environ ment problems, and time and resource limitation. the test execution questions also presented a sequence of statements about ordinary situations the developers usually face, in which respondents should answer according to the frequency. the ctl (52%) and gf (47%) test smells were also the most cited during test execution (figure 3). those test smells obtained 31 and 28 answers of always and fre quently frequencies, respectively. figure 3. test smells frequency in test execution. regarding difficulties in running test cases (one open ended nonmandatory question answered by 29 respondents 48%), ten respondents reported test environment as a prob lem related to test execution (34%), such as test environ ment unavailability, demand for thirdparty features, and lowperformance environments. the second most common problem is understanding the test purpose (28%), where eight respondents reported that tests were poorly written and with out a standard, allowing multiple interpretations. the lack of test maintenance was the third problem (24%), which in volves outdated and incomplete tests due to the system code evolution (7 respondents). table 3. answers grouped by experience range experience (in years) number of respondents total 0 2 11 143 > 2 4 12 156 > 4 6 15 195 > 6 8 5 65 > 8 10 9 117 > 10 12 4 52 > 12 14 4 52 4.2.2 professional experience although most respondents from the survey reported they create and execute tests simultaneously, our investigation presented a different scenario as the tester gets more expe rienced. figure 4 shows the daily activities according to the professional experience, with the following highlights: 10 re spondents (16.7%) with experience ranging from 4 to 6, and 5 respondents (8.3%) with 8 to 10 years of experience create and execute tests at the same proportion. eight respondents (13.4%) with less than two years of experience, six respon dents (10%) ranging from 2 to 4 years of experience, and four respondents (6.7%) ranging from 6 to 8 years of expe rience only run tests or run tests with more frequency than create. three respondents (5%) with more than 12 years of experience mostly create rather than run tests. therefore, less experienced respondents run more than creating tests, and re spondents with more experience create more than run tests. figure 4. testing tasks according to professional experience. we also analyzed whether the use of good practices to create tests increases as respondents become more experi enced. we provided the respondents with thirteen statements, with illustrative scenarios of problems with test cases. each scenario relates to a given test smell. the respondents had to answer how often they experienced each scenario. 
Table 3 shows the number of respondents grouped by experience time (in years) and the number of valid responses. Figure 5 presents the frequency of test smells grouped by professional experience.

Figure 5. Test smells frequency in test creation according to professional experience.

When we analyzed the first experience range (0-2 years), in 71 answers (50%) the respondents could not identify the adoption of practices related to test smells (not applicable). Nine answers (6%) indicated that respondents always adopted some practice related to test smells, 16 (11%) corresponded to frequently, 29 (20%) to rarely, and 18 (13%) to never. When we extended that analysis through the next experience ranges, we could not observe any increase in "never" and "rarely" responses with professional experience, indicating that experience might not influence the adoption of practices that lead to the introduction of test smells.

5 Interview study

After carrying out the survey study, we interviewed software engineers to gather further evidence on how practitioners develop unit test code and deal with test smells in test creation and maintenance. The interview dataset, including the interview transcriptions, interviewees' profiles, and coding summary, is publicly available at (Junior et al., 2021).

5.1 Design

We employed a semi-structured interview approach, guided by a set of sixteen questions, as Table 4 shows.

Table 4. Interview questions.
1. How did you start working with software testing?
2. What were your learning sources about test code?
3. Which programming languages do you create tests for?
4. Which programming languages do you use in your current software project?
5. How is your test creation process?
6. Is there any flowchart or template document that standardizes this process?
7. Which support tools are used for test creation and execution?
8. How do you verify the quality of unit tests?
9. Moving to the test code maintenance process, tell me how this process works inside the company.
10. What do you know about test smells?
11. How did you learn about that?
12. Do you have any doubts about test smells?
13. How are test smells handled in the unit test creation process?
14. How are test smells handled in the unit test maintenance process?
15. How would it be possible to avoid the introduction of test smells during test creation?
16. Do you have any questions, additional information, or suggestions to improve this interview?

Interview organization. We organized the interview into three blocks:
• Warm-up block (#1-3). Questions about the professional background, such as the learning resources on software test code the interviewees commonly use, as well as the programming language they often use to implement test code, if any;
• Technical block (#4-9). Questions about how they create, maintain, and assess the quality of developed unit tests;
• Test smell block (#10-15). Questions about the interviewees' awareness of test smells and how they handle them in test case creation and maintenance.

The interviewees could also ask for more information or give additional information and suggestions to increase the interview quality (question #16). Unlike the survey, in the interview we employed the actual term "test smell" in the questions related to the concept, instead of considering a transitive approach through statements containing practices embedded with test smells.
When the participants were not aware of the term or asked for more information on test smells, we presented the concept and two test smell samples, i.e., CTL and EPT (Virgínio et al., 2020). Those test smells were related to the most and the least frequently adopted programming practices in the survey results, respectively. There were no questions about challenges or problems involved in creating and maintaining test code. The interviewees answered the questions in Table 4 according to their experiences, concepts, and the information shared during the meeting. The interviewer and interviewees did not access any test code from the interviewees to analyze the presence of test smells.

At the beginning of the interview, the practitioners answered a professional profile form covering academic background and professional experience. They also provided an e-mail address so that we could resolve eventual doubts or collect more data during data analysis. We conducted the interviews between June 3rd and June 30th. Due to the pandemic period, online meeting tools, such as Skype and Google Meet, were used upon the participants' request. We recorded the interviews with either the Skype conversation recording tool or the Google Meet screen capture feature. Additionally, we used an external voice recorder for every interview.

Participants. Initially, we contacted practitioners from the survey who had agreed to keep contributing to the research. Unlike the survey, we opted only for test code developers whose focus was creating and maintaining unit tests, including the treatment of test smells. Some interviewees had also participated in the survey study, as we applied the snowballing technique (Kitchenham et al., 2015). Next, we used LinkedIn to invite other potential participants, using the "unit testing" expression in the profile ability search LinkedIn provides. A total of 50 practitioners accepted the invitation (#I1 to #I50).

Pilot study. We performed two pilot interviews with practitioners to measure the interview length and analyze whether it would be necessary to modify any part of the predefined instrument. As a result, there was no need to make any changes to the instrument. The average interview length was around 30 minutes.

Analysis procedure. The first author was responsible for transcribing the interviews. From the transcriptions, we performed open coding (Corbin and Strauss, 2014) to answer the research questions. The remaining coauthors analyzed the transcriptions to understand how the practitioners develop tests and deal with test smells. First, we analyzed and validated the coding until we reached a consensus. Following that, two authors individually reviewed the proposed coding. In the end, one expert researcher reviewed the final coding.

5.2 Results

The interviewees could answer the open-ended questions in different ways, according to their reality. Therefore, in presenting the results, some response totals exceed 100% in the quantitative analysis. The respondents' ages ranged from 20 to 48 years old, most of them from 25 to 34 years old (60%). Regarding their education, six respondents had completed high school (12%), 31 had completed an undergraduate degree (62%), and 13 hold a graduate degree (26%). Additionally, 48 respondents either have a degree in or were studying a computer science-related course (96%), one respondent holds a degree in applied business (2%), and one holds a degree in psychology (2%). The respondents worked in companies of different sizes,
as follows: (i) 10 respondents worked in small companies (fewer than 50 employees; 20%); (ii) 5 worked in medium-sized companies (50 to 99 employees; 10%); and (iii) 35 worked in large companies (more than 99 employees; 70%). Additionally, the interviewees were responsible for different tasks within the companies, related to their current roles (Table 5). They created unit tests for mobile, desktop, and web platforms using different programming languages (Table 6).

Table 5. Respondents' roles (role: respondents / %).
Developer: 22 / 44%
Software engineer: 7 / 14%
Systems analyst: 7 / 14%
Software architect: 5 / 10%
Team leader: 3 / 6%
Automation engineer: 2 / 4%
Consultant: 2 / 4%
Project manager: 2 / 4%
Quality specialist: 2 / 4%
Quality engineer: 1 / 2%
Test developer: 1 / 2%
Test analyst: 1 / 2%

Table 6. Programming languages (language: respondents / %).
Java: 25 / 30%
JavaScript: 14 / 17%
C#: 11 / 13%
TypeScript: 8 / 10%
Python: 7 / 9%
Kotlin: 5 / 6%
PHP: 4 / 5%
Swift: 3 / 4%
Ruby: 2 / 2%
C: 1 / 1%
C++: 1 / 1%
Elixir: 1 / 1%
Go: 1 / 1%

Their experience in software development tasks varied from 1 to 20 years, of which more than 50% were in the 1 to 6 years of experience range. Two of them were not working with unit test creation when we interviewed them; in such cases, they were asked to consider their previous experience.

For the open coding analysis, we compared and analyzed the information and grouped it into codes using sentences, paragraphs, or the entire document. For example, when we asked about the unit test creation process, interviewee #I47 answered: "When I worked only with Java [...] if I know the context well, if I have deep knowledge of the context that I will develop, I like to do a little TDD [...], but unfortunately this is not something that can be 100% reality in the business, because you have n situations, n circumstances. So I cannot do TDD; at least I develop the specific feature, [...] the features, methods, etc., and then I will test it; for example, for each method that I know has logic within it, I do the test cases for the n possibilities". From this answer, we identified the following codes: Code A, TDD; Code B, TLD;
for example, the interviewee #i16 stated that he used tdd when he mastered the programming language; otherwise, the functional software code was created first and then tested (tld). the interviewee #i25 claimed that she created unit tests according to the stories from the bdd scenario; when there was no scenario, she used tdd. the method adoption could also depend on whether the software was new or legacy: the interviewee #i32 pointed out that tdd was used on new projects when possible, and that he used a bdd variation before the software code creation.

during the test code creation description, four interviewees (8%) mentioned using mocks to simulate components, and two interviewees (4%) adopted clean code practices. for instance, the interviewee #i22 claimed he creates easy-to-read and understandable, fast, and independent test code, and the interviewee #i36 uses code patterns and creates less verbose tests. additionally, four interviewees (8%) focus on test coverage: the interviewee #i12 claimed that he identifies "interesting features" to test, and according to the interviewee #i43, the test code should cover 80% of the software code. moreover, the interviewee #i10 mentioned the solid principles, and the interviewee #i15 adopts the model-view-viewmodel (mvvm) project pattern as practices during test creation.

when we asked whether there was any document that standardized unit test creation, nine interviewees (18%) indicated the use of templates or some other documentation. the interviewees #i5 and #i9 mentioned a test template in their projects that the team members could adopt. the interviewee #i29 claimed his team followed microsoft's official documentation, but there was no internal document. the interviewee #i39 mentioned using a domain-specific language (dsl) to share project information, as follows: "on project day 0, we create and standardize an official dsl for the code. you have prerogatives, you have the test, and you have the result". in addition, some interviewees answered that there was no documented standard, but they adopted the given-when-then (gwt) pattern and the arrange-act-assert (aaa) programming practices.

furthermore, the interviewees mentioned 90 different tools to create and run tests. those tools are related to (i) code development (junit 42%, jest 14%, and visual studio 20%); (ii) metrics analysis (sonar tools 18%); and (iii) continuous integration (jenkins 10%, azure 2%, and circle ci 2%).

after creating unit test code, test quality assessment was performed through code review (78%) by one or more developers inside the project team. this activity was usually supported by tools, such as pull panda. for example, the interviewee #i2 claimed: "pull panda (https://pullpanda.com/) is a tool used to randomly assign one or more developers to perform the code review. [...]". furthermore, two other interviewees (#i4 and #i16) reported that they performed peer review (4%), and four interviewees claimed they commonly verify test code quality through pair programming (8%). other practices identified were: test coverage (30%), metric analysis tools (24%) (e.g., the sonarqube tool), reviewing by continuous integration tool (16%), test execution (10%), application of programming practices (10%) (reuse, clean code, and libraries), running a mutant test tool (6%), test validation by an external quality assurance team (2%), and static validation (2%).
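the given-when-then (gwt) and arrange-act-assert (aaa) patterns mentioned above both split a test into a setup phase, an action phase, and a verification phase. as a rough illustration only, the junit 5 sketch below marks the three aaa phases with comments; the ShoppingCart class is hypothetical and is not taken from any interviewee's code.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.ArrayList;
    import java.util.List;
    import org.junit.jupiter.api.Test;

    // hypothetical class under test, used only to make the phases concrete
    class ShoppingCart {
        private final List<Double> prices = new ArrayList<>();
        void add(double price) { prices.add(price); }
        double total() { return prices.stream().mapToDouble(Double::doubleValue).sum(); }
    }

    class ShoppingCartTest {
        @Test
        void totalSumsAllItemPrices() {
            // arrange (gwt: "given"): set up the object under test
            ShoppingCart cart = new ShoppingCart();
            cart.add(10.0);
            cart.add(5.5);

            // act (gwt: "when"): exercise the behavior being verified
            double total = cart.total();

            // assert (gwt: "then"): check a single, focused expectation
            assertEquals(15.5, total, 0.0001);
        }
    }

keeping each phase explicit in this way is consistent with the readable, independent test code that interviewees such as #i22 described.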
three interviewees reported that there were "no test quality assurance" activities, either because there were not enough tests to perform this activity or because the company did not support it.

the interviewees adopted various test maintenance types, distributed into corrective (62%), adaptive (36%), preventive (4%), and perfective (4%) maintenance. four interviewees claimed there was no test code maintenance due to: (i) no defined maintenance process (interviewee #i22); (ii) participation in a new project in which no maintenance task was required (interviewee #i24); (iii) absence of maintenance activity because of shortage of time (interviewees #i24 and #i36); and (iv) the project environment (interviewee #i45).

5.2.2 test smells treatment

we asked the interviewees about their knowledge of test smells to understand whether they comprehended the study subject. figure 6 summarizes the results. seven interviewees (14%) demonstrated some knowledge of test smells. for example, the interviewee #i2 answered: "i know a few things. i consider these as bad practices, bad choices that you make in your test code that hinder its maintenance and evolution.". twenty-three interviewees (46%) related test smells to code smells but claimed they had never heard of test smells. the interviewee #i16 mentioned: "test smell, i do not know the concept. the code smell is a problem that the static test analysis tool found in the program. would test smell be that same analysis on top of the test code?". finally, twenty interviewees (40%) did not know test smells and did not relate them to any smell type.

we presented the definition and examples of two test smells (ctl and ept) to the interviewees who did not know about test smells or asked for more information. table 7 shows how the interviewees prevent test smells during test code creation and how they treat test smells during test code creation and maintenance. for example, during the test code creation, the code review practice was the most recommended (38%), followed by tool usage (26%) and programming practices (24%). when developing the test code, the developer should follow the programming practices to prevent test smells; tools and code reviews help to check the test smells insertion at an early stage of development. two interviewees believed there were no test smells in their repositories. for example, the interviewee #i39 said: "i think we do not have this problem (test smells) in the recent project because of its difficulty level, we follow a coding standard. we educate people on how we code it [...]". the interviewee #i11 also said: "as i am the only one working on the project, i coded, understood, and never had this vision of test smells. i do not think i have any problem with that.".

regarding maintenance, we asked how the interviewees treated test smells during test code maintenance. the answers were similar to the previous question (table 7). for test code maintenance, code review was also the most recommended practice (28%), followed by refactoring (20%) and tool usage (18%). as the test code was already developed and might have test smells, they suggested using tools to help detect test smells and refactoring techniques to remove them from the test code. the code review practice can double-check the test code to treat the test smells during maintenance. we also asked the interviewees how to prevent test smells during test code creation (table 7).
for the test smells prevention, tool usage was the most recommended practice (44%), followed by developers' skills (28%) and code review (20%). the developers' skills are related to developing the know-how for tests by following good practices, guidelines, and coding patterns; this should help the developers identify and prevent flaws in designing and implementing test code. tool usage can support the developers when developing test code by identifying possible test smells. code review is a manual analysis of the test code to double-check it for test smells prevention.

at the end of the interviews, the participants could either provide or ask for further information about test smells and test code quality assurance. the interviewee #i29 claimed: "for me, it is a quality guarantee in terms of dependence exemption, in terms of development, cohesion, coupling, and fundamental architecture. from the moment you have unit testing or even tdd, it helps you improve the code and architecture.". the interviewee #i35 demonstrated interest in our study: "i would like to know more about the study, we can talk about it later if you want, [...] i thought the term 'test smell' is complicated, at least it does not seem to be a common industry expression.".

figure 6. prior knowledge about test smells.

table 7. practices to prevent test smells or to treat them during the test code creation and maintenance

#   practice                    creation    maintenance   prevention
1   code analysis               -           -             2 (4%)
2   code removal                -           1 (2%)        -
3   code reuse                  -           -             1 (2%)
4   code review                 19 (38%)    14 (28%)      10 (20%)
5   coding patterns             4 (8%)      5 (10%)       8 (16%)
6   company support             -           -             1 (2%)
7   culture's development       -           -             3 (6%)
8   developer skills            2 (4%)      2 (4%)        14 (28%)
9   documentation               -           -             1 (2%)
10  guidelines                  -           -             3 (6%)
11  individual analysis         2 (4%)      6 (12%)       -
12  mutant testing              1 (2%)      1 (2%)        -
13  no treatment                13 (26%)    13 (26%)      -
14  pair programming            2 (4%)      1 (2%)        4 (8%)
15  peer review                 1 (2%)      1 (2%)        1 (2%)
16  professional experience     -           -             6 (12%)
17  programming practices       12 (24%)    8 (16%)       11 (22%)
18  refactoring                 5 (10%)     10 (20%)      -
19  tdd                         -           -             3 (6%)
20  technical debt              1 (2%)      5 (10%)       -
21  technical meeting           1 (2%)      -             -
22  tool usage                  13 (26%)    9 (18%)       21 (44%)
23  traceability                1 (2%)      1 (2%)        1 (2%)
24  training                    -           -             8 (16%)
25  take breaks                 -           -             1 (2%)
26  software code improvement   -           1 (2%)        -
27  test smell catalog          -           -             1 (2%)

6 discussion

this section discusses the results obtained after conducting the survey and the interviews to answer the research questions. rq1, rq2, and rq3 are related to the survey, and rq4 and rq5 are related to the interview.

6.1 rq1: do practitioners use test case design practices that might lead to the introduction of test smells?

from the results, we observed that each of the 14 practices related to test smells was pointed out by at least one respondent. we analyzed those practices when creating and maintaining tests to identify which types of test smells the participants frequently insert in the test code.

regarding test creation, we observed that every test smell received at least three of the four possible answers (always, frequently, rarely, and never). we classified the data into two groups: the commonly-used practices group (cpg) and the unused practices group (upg). cpg contains test smells that mostly present always and frequently as answers, and upg contains those that mostly present rarely and never as answers.
we considered a test smell as belonging to one group when the difference between the always and frequently rates and the rarely and never rates was greater than 10%. for example, the empty test, for testers only, test run war, constructor initialization, resource optimism, redundant print, magic number, and indirect test smells belong to upg, which means practitioners rarely insert those smells in testing activities. on the other hand, the respondents frequently adopt practices related to the general fixture test smell, the only member of cpg, indicating that they usually create tests with that smell. still, four test smells presented a similar pertinence frequency to both groups (less than 10% of difference); for them, there was no pattern among respondents. for instance, the eager test smell obtained 38% for cpg and 40% for upg.

in test execution, upg contains the empty test, eager test, assertion roulette, redundant print, duplicated test, test run war, for testers only, mystery guest, constructor initialization, and resource optimism test smells, which means that the respondents rarely face those smells during test execution. otherwise, the respondents frequently find practices related to two test smells, general fixture and conditional test logic, which compose the cpg group. in addition, we did not perceive a significant difference among respondents for two other test smells, indirect test and magic number, which presented a similar pertinence frequency for both groups.

we also investigated the reasons that lead the respondents to adopt the practices presented in the survey. thus, we analyzed the open-ended questions and identified 16 different tags. the most common ones were company standard, personal standard, project politics, professional experience, saving time, and improving coverage. for example, the respondent #s26 reported applying company standards when creating tests that may insert smells and commonly using bad practices "to match company development standards." in another situation, respondent #s54 reported using personal standards: "i group tests by modules to execute them sequentially without compromising effectiveness." this behavior suggests that participants may have misunderstood the test smells definition: when grouping tests, it is possible to insert the assertion roulette test smell and compromise test independence. a similar situation occurred with the respondents #s14, #s16, #s27, #s50, and #s59.

in general, our study identified that all test smells appeared in testing activities; they all were cited by respondents, even if rarely.

practitioners adopt practices for test case design which introduce test smells. usually, those practices come from improper personal and company standards.

6.2 rq2: which practices are present in practitioners' daily activities that lead to introducing test smells?

although there are specific tools to support test automation (fraser and arcuri, 2011; smeets and simons, 2011), 62% of the respondents perform more manual than automated tests. besides, 55% have little experience with software development (less than two years of experience), although the lack of knowledge does not influence the adoption of bad practices in the test code.
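the general fixture smell, the only member of the cpg in test creation, arises when a shared setup builds more state than each individual test needs. the junit 5 sketch below is a minimal illustration under assumed names (Database, Report, and ReportTest are invented for this purpose); it is not drawn from the survey responses.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.BeforeEach;
    import org.junit.jupiter.api.Test;

    // hypothetical collaborators, used only for illustration
    class Database {
        void connect() { /* open a (pretend) connection */ }
    }

    class Report {
        private final String title;
        Report(String title) { this.title = title; }
        String title() { return title; }
    }

    class ReportTest {
        private Database db;
        private Report report;

        @BeforeEach
        void setUp() {
            // general fixture smell: this setup prepares state for *all* tests,
            // even though some tests never touch the database
            db = new Database();
            db.connect();
            report = new Report("sales");
        }

        @Test
        void titleIsStored() {
            // this test uses only the report; the database setup is dead weight
            // here, making the test slower and its real dependencies unclear
            assertEquals("sales", report.title());
        }
    }

a narrower fixture per test, or per group of related tests, is the usual refactoring direction for this smell.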
according to the practices explored in the survey, we identified that the respondents usually come across: (i) the use of generic configuration data, which produces the general fixture test smell (the most frequent in the activities of test creation and execution; cpg); and (ii) the use of conditional or repetition structures, directly associated with the conditional test logic test smell (the second most detected in the activity of test execution; cpg).

the respondents indicated they usually face several problems with tests, such as poorly written tests and outdated and incomplete test procedures. according to them, when the tests are associated with generic configuration data, test cases are hard to understand and may produce incorrect results. moreover, the test coverage of the production code is unclear due to the presence of conditional logic in the tests. understanding which practices are most prevalent in the professionals' activities supports improving test quality. other identified problems are related to incompleteness, outdatedness, or lack of documentation; these may hinder traceability, evolution, and maintenance of the testing tasks.

the practices most present in the practitioners' daily life that lead to test smell insertion were conditional or repetition structures and generic configuration data.

6.3 rq3: does the practitioners' experience interfere with the introduction of test smells?

in the survey study, we analyzed the respondents' experience and its influence on adopting practices that might lead to inserting test smells in their projects. as a result, we did not identify any clear cause-effect correlation. for example, the always option indicates that respondents always use harmful practices; when we analyzed the answer frequencies for this option, the usage rate did not reduce over time. instead, we may observe from figure 5 that respondents with 8 to 10 years of experience achieved a higher usage rate for this frequency. we also identified that behavior when we analyzed the other usage frequencies. however, we could not infer that inexperienced practitioners introduce more test smells than experienced ones regarding the activity of test creation. on the one hand, when testers are inexperienced programmers, they may write lower-quality tests; on the other hand, when they are more experienced, they can carry programming biases that may contain bad practices. thus, the absence of a tendency indicates no behavioral change between less and more experienced practitioners.

experienced practitioners may not produce fewer test smells than inexperienced ones.

6.4 rq4: how aware of test smells are the practitioners?

the survey results indicate that the lack of information on test smells is one reason that leads practitioners to adopt programming practices that may introduce test smells. although the test smell concept appeared in 2001 (van deursen et al., 2001), when we asked in the interview what they knew about test smells, only 14% of the interviewees demonstrated having some knowledge. for example, two interviewees mentioned: "i know a little bit about test smells. if i am not mistaken, there are smells like test assertion and duplicated [...]" (#i5) and "test smell? from smells? i know the basics" (#i19). we believe that the industry should explore this topic more through the initiatives proposed in academia (santana et al., 2020).
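#i5's mention of "test assertion" plausibly refers to the assertion roulette smell, in which a test accumulates several assertions without explanation messages, so a failure does not say which expectation broke. the junit 5 sketch below is a hypothetical illustration (the Order class is invented for this purpose), not an excerpt from any interviewee's code.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    // hypothetical class under test, used only for illustration
    class Order {
        int items() { return 3; }
        double total() { return 42.0; }
        String status() { return "open"; }
    }

    class OrderTest {
        // assertion roulette: several assertions with no explanation messages;
        // when this test fails, it is hard to tell which expectation broke
        @Test
        void orderLooksRight() {
            Order order = new Order();
            assertEquals(3, order.items());
            assertEquals(42.0, order.total(), 0.0001);
            assertEquals("open", order.status());
        }
    }

junit's assertEquals overloads accept an optional message argument, which is the usual lightweight mitigation for this smell.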
some interviewees (46%) associated the test smells term with code smells and related test smells detection with tool usage or personal practices. for example, interviewee #i04 mentioned: "although i had never heard the term, it makes sense, because i saw everything as a code smell, but there are some strategies, some guidelines that i follow for unit tests.". this behavior may generate disagreement about how the tools work, as the interviewee #i10 said: "one of the outputs of those software that i mentioned, sonarqube and code climate, are these test smells. they can find some of them, [...] because we cannot publish a project with these types of test structures, tests with commented content, such as empty test, the test with a complexity greater than 1". conversely, in the sonarqube documentation, there is no information about test smells analysis. thus, we considered that those analyses are related to code smells in test code, which is different from test smells detection.

test practitioners do not know what test smells are. they can associate the test smell concept with code smells, but they have no information about test smell types and refactoring.

6.5 rq5: what practices have practitioners employed to treat test smells?

commonly, the interviewees did not know what test smells are. after explaining the concepts to them in the interview, they could understand and explain how they deal with test smells in their daily activities. they reported adopting a set of project activities (e.g., code review, pair programming, and technical debt) and programming practices during the test creation and maintenance processes (e.g., the clean code approach and the given-when-then (gwt) and arrange-act-assert (aaa) patterns) to either prevent or treat test smells.

the interviewees tended to develop unit tests according to their skills. professional abilities also determine the result of code review: interviewees who have not learned about test smells or programming practices may approve a submitted package with these issues. code review was the most reported activity to treat test smells in test creation (38%) and the most common activity performed by the interviewees (78%) during test quality verification. in this activity, one or more practitioners analyze the submitted code; the reviewer's knowledge determines whether the code is good enough to merge into the repository.

each team adopts different strategies to perform code reviews based on the number of reviewers, the number of approvals, and professional experience. although some interviewees reported that only experienced members review software and test code, the review may not keep test smells out of the project repository, mainly because both experienced and inexperienced practitioners adopt practices that introduce test smells.

when we asked about test smell treatment during test maintenance, some interviewees reported creating a technical debt item to refactor the test smell at a later moment (interviewees #i08, #i09, #i22, #i25, and #i50). this behavior may indicate that test smell correction is not a priority. the technical debt creation may also be the reason why test smells remain in the repository. for example, interviewee #i09 said: "there is nearly no treatment for test smells. [...] when removing a feature from the software or when its business rule is changed, the test code is commented out and left there. [...] the developers hardly ever handle commented test code. [...]".
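the two smells presented to the interviewees can be made concrete with a small sketch. the junit 5 code below is a hypothetical illustration, not taken from any interviewee's repository: the first test shows conditional test logic (ctl), and the second shows an empty test (ept) of the kind #i09 describes, where the body was commented out rather than removed.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.List;
    import org.junit.jupiter.api.Test;

    // hypothetical class under test, used only for illustration
    class PriceCalculator {
        double discounted(double price, boolean vip) { return vip ? price * 0.9 : price; }
    }

    class PriceCalculatorTest {

        // conditional test logic (ctl): branching inside the test hides which
        // path was actually exercised and can silently skip assertions
        @Test
        void discountDependsOnCustomerType() {
            PriceCalculator calc = new PriceCalculator();
            for (boolean vip : List.of(true, false)) {
                if (vip) {
                    assertEquals(90.0, calc.discounted(100.0, true), 0.0001);
                } else {
                    assertEquals(100.0, calc.discounted(100.0, false), 0.0001);
                }
            }
        }

        // empty test (ept): the body was commented out instead of removed,
        // so the test passes without verifying anything
        @Test
        void legacyDiscountRule() {
            // assertEquals(80.0, calc.discounted(100.0, true), 0.0001);
        }
    }

a common refactoring for the ctl case is to split the branches into separate tests, or to use parameterized tests, so that each executed path asserts unconditionally.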
the interviewees hardly addressed the technical debt and failing tests because they needed to prioritize other tasks, such as software code development. with less time for testing, test smells would be introduced in the test code during test creation and maintenance and would remain in the repository as maintenance activities were postponed.

we did not know whether practitioners had learned about test smells; thus, we adopted the test smell concepts from the literature, and the validation of those concepts was out of scope. although we did not specifically ask whether the interviewees considered test smells a problem or agreed that the given test smell examples are smells, part of them, during their answers about test smell treatment, described how they treat at least one of the given examples. for example, the interviewee #i07 said: "despite not having worked exactly with this type of concept, sonar itself warned us about these two problems; when the logic was very complex, with a lot of 'if', it warned us to break it into different methods, things like that. moreover, i remember that it identified comments, commented code, and sent a warning". regarding the conditional test logic smell example, #i37 said: "this specific code enters into a specific clean code case. this test may be doing more than it should". according to these comments, the interviewees consider test smells, including the given examples, as structures to fix.

practitioners adopt a set of project activities and programming practices to treat test smells. as they do not know the test smell concepts well, it is impossible to guarantee that those strategies treat test smells appropriately.

7 threats to validity

internal validity. although there are more than 100 test smells, this study only considered 14 of them. however, we selected the most frequent test smells discussed in the literature. in addition, the test smells were presented in the survey as practices. to mitigate ambiguities and comprehension problems, we applied a pilot with four testers from different companies. we used professional social networks to reach as many respondents as possible, demographically distributed across brazilian companies, for the survey and interview execution.

external validity. our survey and interview respondents may not adequately represent the practices adopted by practitioners in the wider software engineering industry. although our results may not generalize, they provide an initial view of the practices adopted by testers. there is agreement among the practitioners' responses, indicating that additional data might not reveal new insights.

construct validity. the survey did not inform the respondents that the questions referred to test smells, in order to investigate whether practitioners unintentionally insert test smells; this prevented partiality when identifying the adopted practices. complementarily, to investigate how the practitioners deal with test smells, we presented the concept to the interviewees who did not know the subject. after learning about test smells, the respondents were interested in finding solutions for this "problem" (test smells). we collected the open-ended question answers and performed a peer-reviewed coding process to avoid biases. the survey and interview instruments were written in portuguese and translated to english by one author and reviewed by the others.

conclusion validity.
the data analysis was an exhaustive process, which depends on the researchers' interpretation of the answers to the open-ended questions. to prevent biases, we performed the data analysis in three steps: i) two researchers analyzed the data in pairs to discuss the identification of the codes, ii) two researchers analyzed the data individually, checking whether new codes could emerge, and iii) all researchers discussed and compiled the results from steps i and ii. additionally, to increase transparency, the raw survey and interview data are available online for other researchers to validate and replicate our study.

8 related work

bavota et al. (2015) presented a case study to investigate the impact of test smells on maintenance activities. in that study, developers and students analyzed test code to compare whether their experience would make a difference in test smell identification. as a result, they found that the intensity of the test smells' impact differs for different levels of experience: the number of impacting test smells is higher for students than for industry professionals. additionally, they found that test smells have a significantly negative impact on maintenance activities. conversely, our survey found that the practitioners' experience does not interfere with test smell introduction during test creation and execution activities. moreover, the interview revealed that the practitioners are not aware of test smells, reinforcing that experience does not influence test smell insertion in the test code.

tufano et al. (2016) proposed an interview study with 19 participants to investigate developers' perception of test smells. they performed an empirical investigation to analyze where test smells occur in the source code. the results showed that developers generally do not recognize test smells and that test smells are present since the first code commit in the repository. similarly, our interview indicated a lack of awareness among developers about the underlying concept of test smells. additionally, we did not find any study investigating how professional practices affect test smell introduction, and therefore we investigated it through a survey.

spadini et al. (2020) surveyed developers to evaluate severity thresholds for detecting test smells and to investigate the perceived impact of test smells on test suite maintainability. the developers had to classify whether a test smell instance was valid and rate the instance regarding its importance to maintainability. the evaluation of test smell instances requires knowledge about the topic; therefore, our survey presented practices that might lead to test smell insertion, and our interview provided information about test smells to level the respondents' knowledge of the topic.

in our previous work (silva junior et al., 2020), we conducted an expert survey to understand whether practitioners unintentionally insert test smells. we surveyed sixty brazilian practitioners regarding fourteen bad practices that might lead to test smell insertion during test code creation and execution. the results indicated that the practitioners' experience might not influence test smell insertion; usually, practices that lead to test smell insertion came from improper personal and company standards. this current study complements the previous one by investigating the practitioners' knowledge about test smells and how they deal with test code quality regarding the presence of test smells.
we conducted interviews with fifty brazilian practitioners to ask them about the test code creation and maintenance processes. as a result, the interviewees indicated a set of practices that might be useful to treat test smells. however, as they do not know the test smell concepts, those practices need further investigation for test smell treatment.

9 conclusion

test smells may decrease test code quality and hinder its maintenance. our study aimed to identify whether practitioners unintentionally insert test smells in the test code and how they treat them. therefore, we applied two complementary research methods: a survey and an interview study.

we surveyed sixty respondents to investigate the unintentional insertion of test smells in the test code. they evaluated a set of practices related to test smell insertion. the results indicated that the respondents adopt bad practices that might lead to inserting test smells, and that the adoption of bad practices is more related to improper company standards than to the respondents' experience with test code development.

to investigate how the practitioners treat test smells, we interviewed fifty respondents. they answered questions on how they prevent and treat test smells during test code development. the results indicated an overall lack of knowledge about test smells; for most of the interviewees, it was their first contact with this subject. however, after we explained one test smell to the respondents, they recognized it in their test code and identified practices that they adopted to deal with it. among the recommended practices, we highlight the adoption of tools, coding patterns, programming practices, code review, and training to improve the developers' skills and expertise.

after analyzing the answers to the survey and the interview, we could identify that practitioners did not know test smells; thus, they insert different test smell types, even the experienced ones. they have tried to treat test smells through some strategies, but as they have not learned about this subject, they have inserted test smells in their test code, and the strategies may not be enough to avoid that. those studies are starting points for research that considers practitioners as agents in the test smell treatment.

as future work, we aim to follow the grounded theory methodology (corbin and strauss, 1990) to build a common understanding of how receptive the software industry is to improving test code quality by taking test smells into consideration. we intend to validate the respondents' practices to prevent and treat test smells and to elaborate a checklist for test code quality development and assurance with an in-depth study.

acknowledgements

we would like to thank the participants in our survey and pilot study. this research was partially funded by ines 2.0; cnpq grants 465614/2014-0 and 408356/2018-9 and fapesb grants jcb0060/2016 and bol0188/2020.

references

bavota, g., qusef, a., oliveto, r., lucia, a., and binkley, d. (2012). an empirical analysis of the distribution of unit test smells and their impact on software maintenance. in 28th ieee international conference on software maintenance (icsm).

bavota, g., qusef, a., oliveto, r., lucia, a., and binkley, d. (2015). are test smells really harmful? an empirical study. empirical software engineering, 20(4).

corbin, j. and strauss, a. (2014). basics of qualitative research: techniques and procedures for developing grounded theory. sage publications.

corbin, j. m. and strauss, a. (1990). grounded theory research: procedures, canons, and evaluative criteria. qualitative sociology, 13(1):3-21.

creswell, j. w. and clark, v. l. p. (2018). designing and conducting mixed methods research. sage publications, third edition.

fraser, g. and arcuri, a. (2011). evosuite: automatic test suite generation for object-oriented software. in 13th european conference on foundations of software engineering, esec/fse, new york, ny, usa. acm.

garousi, v. and felderer, m. (2016). developing, verifying, and maintaining high-quality automated test scripts. ieee software, 33(3).

garousi, v. and küçük, b. (2018). smells in software test code: a survey of knowledge in industry and academia. journal of systems and software, 138.

greiler, m., van deursen, a., and storey, m. (2013). automated detection of test fixture strategies and smells. in 2013 ieee sixth international conference on software testing, verification and validation.

gubrium, j. f., holstein, j. a., marvasti, a. b., and mckinney, k. d. (2012). the sage handbook of interview research: the complexity of the craft. sage publications, 2nd edition.

junior, n. s., martins, l., rocha, l., costa, h., and machado, i. (2021). how are test smells treated in the wild? a tale of two empirical studies [dataset]. available at: https://doi.org/10.5281/zenodo.4548406.

kitchenham, b. a., budgen, d., and brereton, p. (2015). evidence-based software engineering and systematic reviews, volume 4. crc press.

melegati, j. and wang, x. (2020). case survey studies in software engineering research. in proceedings of the 14th acm/ieee international symposium on empirical software engineering and measurement (esem), esem '20, new york, ny, usa. acm.

meszaros, g., smith, s. m., and andrea, j. (2003). the test automation manifesto. in maurer, f. and wells, d., editors, extreme programming and agile methods - xp/agile universe 2003. springer berlin heidelberg.

miles, m. b., huberman, a. m., and saldaña, j. (2014). qualitative data analysis. sage publications, fourth edition.

palomba, f., di nucci, d., panichella, a., oliveto, r., and de lucia, a. (2016). on the diffusion of test smells in automatically generated test code: an empirical study. in 9th international workshop on search-based software testing. acm.

peruma, a. s. a. (2018). what the smell? an empirical investigation on the distribution and severity of test smells in open source android applications. phd thesis, rochester institute of technology.

pfleeger, s. l. and kitchenham, b. a. (2001). principles of survey research: part 1: turning lemons into lemonade. acm sigsoft software engineering notes, 26(6):16-18.

santana, r., martins, l., rocha, l., virgínio, t., cruz, a., costa, h., and machado, i. (2020). raide: a tool for assertion roulette and duplicate assert identification and refactoring. in proceedings of the 34th brazilian symposium on software engineering, sbes '20, pages 374-379, new york, ny, usa. association for computing machinery.

silva junior, n., rocha, l., martins, l. a., and machado, i. (2020). a survey on test practitioners' awareness of test smells. in proceedings of the xxiii ibero-american conference on software engineering, cibse 2020, pages 462-475. curran associates.

singer, j., sim, s. e., and lethbridge, t. c. (2008). software engineering data collection for field studies. in shull, f., singer, j., and sjøberg, d. i. k., editors, guide to advanced empirical software engineering, pages 9-34, london. springer london.
smeets, n. and simons, a. j. (2011). automated unit testing with randoop, jwalk and µjava versus manual junit testing. research report, department of computer science, university of sheffield/university of antwerp, sheffield, antwerp.

spadini, d., schvarcbacher, m., oprescu, a.-m., bruntink, m., and bacchelli, a. (2020). investigating severity thresholds for test smells. in proceedings of the 17th international conference on mining software repositories, msr.

tufano, m., palomba, f., bavota, g., di penta, m., oliveto, r., de lucia, a., and poshyvanyk, d. (2016). an empirical investigation into the nature of test smells. in 31st international conference on automated software engineering. ieee.

van deursen, a., moonen, l., van den bergh, a., and kok, g. (2001). refactoring test code. in proceedings of the 2nd international conference on extreme programming and flexible processes in software engineering (xp).

van rompaey, b., du bois, b., and demeyer, s. (2006). characterizing the relative significance of a test smell. in 22nd international conference on software maintenance, icsm'06. ieee computer society.

virgínio, t., martins, l., rocha, l., santana, r., cruz, a., costa, h., and machado, i. (2020). jnose: java test smell detector. in proceedings of the 34th brazilian symposium on software engineering, sbes '20, pages 564-569, new york, ny, usa. association for computing machinery.

virgínio, t., martins, l. a., soares, l. r., santana, r., costa, h., and machado, i. (2020). an empirical study of automatically-generated tests from the perspective of test smells. in sbes '20: 34th brazilian symposium on software engineering, pages 92-96. acm.

virgínio, t., santana, r., martins, l. a., soares, l. r., costa, h., and machado, i. (2019). on the influence of test smells on test coverage. in proceedings of the xxxiii brazilian symposium on software engineering. acm.

wiederseiner, c., jolly, s. a., garousi, v., and eskandar, m. m. (2010). an open-source tool for automated generation of black-box xunit test code and its industrial evaluation. in bottaci, l. and fraser, g., editors, testing: practice and research techniques. springer berlin heidelberg.

yusifoğlu, v. g., amannejad, y., and can, a. b. (2015). software test-code engineering: a systematic mapping. information and software technology, 58.

a appendix a

block 1: respondents' profile
q1. what is your gender?
q2. what is your age?
q3. which course do you have an academic background in?
q4. what is the highest degree or level of education you have completed?
q5. which brazilian state do you currently work in?
q6. how long have you been working with software testing?
q7. how long have you been working with software development?
q8. which activity do you perform daily?
q9. what are the platforms of the projects that you have worked on?
q10. what is the application domain of the last project that you worked on?
q11. which test technique do you execute?
q12. are the tests executed more often manually or automated?
q13. how do you describe your expertise with coding?

block 2: test creation
q14. what is the source for creating the test cases for the projects in which you work?
q15. is there verification to detect duplicate tests (with the same writing or with different writing and the same objective)? more than one option could be selected.
evaluate the following statements according to your daily activities:
q16. "i usually create test cases using some configuration file (or complementary file) as a backup"
q17. "when creating a test, i analyze whether it can be executed at the same time with others or if it should be executed in isolation, due to the availability of external resources."
q18. "i analyze the possibility of a test failing because it uses a resource that is being used at the same time by another test."
q19. "i have a habit of creating tests with a high number of parameters (number of files, database records, etc.)."
q20. "i group different test cases into one (that is, combine tests that could be run separately)."
q21. "i create tests that depend on resources that may not have their own tests for validation (e.g., a test that involves retrieving information from the database, but there is no test to validate database research)."
q22. "i have already created a test to validate some feature that will not be used in the production environment"
q23. "i have already created a test with a high value for a specific parameter (e.g., number of records in the database, number of files in a folder) even though that makes it difficult to repeat."
q24. "i have already created a test with a conditional or repetitive structure."
q25. "i have already created an empty test, with no executable instructions."
q26. "i usually create tests using some data from a configuration file."
q27. "i usually create tests with printing or displaying results in a redundant way, or without need."
q28. "i have already created a test considering the existence of a resource, without checking its existence or availability."
q29. "i have already changed a test by identifying one of the previous points."
q30. if you answered "always", "frequently" or "rarely" in the previous questions, why were the tests created with these standards?
q31. if you changed any tests according to the design standards above, why were they edited?
q32. what problems in the test structure have you encountered?
q33. what difficulties do you often encounter when creating test cases?

block 3: test execution
evaluate the following statements according to the frequency found in daily activities:
q34. "a test case fails due to unavailability of access to a configuration file."
q35. "repeat a test case because it previously failed due to competition with some other test case that was running at the same time."
q36. "execute tests that could be executed more quickly, when modifying the contents of the configuration file."
q37. "run a test without understanding its purpose."
q38. "some test fails and it is not possible to identify the cause of the failure."
q39. "run a test that depends on an external resource that does not have a test for direct validation."
q40. "a test case fails due to unavailability of access to any external resource."
q41. "run a test with a high value for a specific parameter (e.g., number of records in the database, number of files in a folder) even if it makes it difficult to repeat."
q42. "run a test to validate a feature that will not be used in the production environment."
q43. "find a duplicate test (with the same or different writing)."
q44. "run a test with a conditional or repetitive structure."
q45. "find an empty test, with no executable instruction."
q46. "run a test with printing or display of results in a redundant way, or unnecessarily."
q47. "run a test considering the existence of a resource, without checking the existence or availability of it."
what difficulties do you usually encounter when running test cases?

journal of software engineering research and development, 2022, 10:9, doi: 10.5753/jserd.2022.1897 this work is licensed under a creative commons attribution 4.0 international license.

assessing the credibility of grey literature: a study with brazilian software engineering researchers

fernando kamei [ ufpe, ifal | fernando.kenji@ifal.edu.br ]
igor wiese [ utfpr | igor@utfpr.edu.br ]
gustavo pinto [ zup innovation & ufpa | gustavo.pinto@zup.com.br ]
waldemar ferreira [ unicap | waldemar.neto@unicap.br ]
márcio ribeiro [ ufal | marcio@ic.ufal.br ]
renata souza [ ufpe | rmcrs@cin.ufpe.br ]
sérgio soares [ ufpe | scbs@cin.ufpe.br ]

abstract

in recent years, the use of and investigations into grey literature (gl) have increased, in particular in software engineering (se) research. however, its understanding is still scarce and sometimes controversial, for instance when interpreting gl types and assessing their credibility. this study aimed to understand the credibility aspects that se researchers consider in assessing gl and its types. to achieve this goal, we surveyed 53 se researchers (who answered that they had used gl in our previous investigation), receiving a total of 34 valid responses. our main findings show that: 1) a gl source produced or cited by a renowned source is the main credibility criterion used to assess gl; 2) most gl types tend to have a low to moderate level of control and expertise; 3) there is a positive statistical correlation between the level of control and expertise for most gl types; and 4) the different respondent profiles shared similar opinions about the credibility criteria. our investigation contributes to helping future se researchers that intend to use gl with more credibility. additionally, it shows the need for future studies to better understand the gl types in se research.

keywords: grey literature, credibility, empirical software engineering, evidence-based software engineering.

1 introduction

grey literature (gl) refers to a kind of publication that does not go through a peer-reviewed process before its publication (petticrew and roberts, 2006). some areas of knowledge have used and investigated gl. for instance, in management, adams et al. (2016b) investigated how gl could be used with relevance for management and organization studies. in information science (schöpfel and prost, 2020), there is an investigation about the term and concept of gl in scientific papers. in software engineering (se), many researchers interpret gl as any material that was not formally peer-reviewed and published (garousi et al., 2019).
in recent years, se researchers have increased their interest in investigating gl, motivated by the growth of the social media and communication channels that se practitioners use to communicate and to exchange problems and ideas (storey et al., 2017), including, for instance, code hosting websites such as github (coelho et al., 2020) and communication platforms such as slack (stray and moe, 2020).

in se, several studies investigated and recognized the importance and usefulness of gl. for instance, garousi et al. (2016) explored the benefits of gl for multivocal literature reviews, showing what secondary studies gained when gl was considered and what was missed when it was not. other studies (williams and rainer, 2017; rainer and williams, 2018) investigated the benefits and challenges of using blog content for se research, and how to improve its use by selecting gl content with more credibility.

despite the increase in investigations in this field, there are some misunderstandings about gl and its diverse types (tom et al., 2013; kamei et al., 2021), and about how the set of credibility criteria investigated in previous studies (e.g., williams and rainer (2017)) could be used and interpreted for the diverse types of gl (kamei et al., 2021). according to adams et al. (2016a), the different types of gl can be classified in terms of "shades" of grey, which group gl according to two dimensions: control and expertise. garousi et al. (2019) explained these dimensions as follows: control is the extent to which content is produced, moderated, or edited in conformance with explicit and transparent knowledge creation criteria; expertise is the extent to which we can determine the producer's authority and knowledge.

in this paper, we begin by studying the different perceptions of se researchers about gl. we then focus on studying how gl could be assessed considering its different types. for each study, we surveyed brazilian se researchers. in the first survey, which was published previously (kamei et al., 2020), we investigated how brazilian se researchers use gl, focusing on understanding which criteria they employed to assess its credibility as well as the benefits and challenges they perceived. in the second survey (the novel contribution of this paper), we focused on how brazilian se researchers that previously used gl perceived the criteria to assess the different gl types according to control and expertise.

in the following, we list our main findings (s1 refers to survey 1 and s2 to survey 2):

s1: we identified the main gl sources used by the brazilian se researchers;
s1: we identified several motivations to use (or to avoid) gl;
s1, s2: we identified that the main criteria employed by brazilian se researchers to assess gl credibility are that the gl source is provided by renowned authors, institutions, or companies, or cited by a renowned source;
s2: gl is not widely used as a reference in scientific studies;
s2: we identified different interpretations when assessing gl types, showing the importance of considering each type in particular;
s2: for most gl types, we identified strong to very strong positive correlations (p-value <= 0.05) between the perceptions of the level of control and expertise;
s2: we did not find a significant correlation (p-value <= 0.05) between the perceptions of control and expertise for gl types when considering the respondents' profiles;
s2: we perceived misunderstandings about whether a source type is considered a gl type or not, mainly related to the sources most often classified as high control and high expertise.

this paper is structured as follows: section 2 presents the core concepts of this work. section 3 shows the research questions explored, with their rationales. section 4 exposes the methods employed to conduct the study and to analyze and synthesize the collected data. section 5 summarizes the answers to the research questions (rq1-rq4) of the previous investigation (kamei et al., 2020). section 6 provides the answers to the research questions (rq5-rq6) specific to this investigation. section 7 presents the discussion of the findings, the lessons learned, and the threats to the validity of this research. section 8 provides the description of and comparison with related work. finally, section 9 exposes the conclusions and future work.

2 background

grey literature (gl) has many definitions; however, the best known is the so-called luxembourg definition (garousi et al., 2019), approved at the third international conference on grey literature in 1997, which states: "[gl] is produced on all levels of government, academics, business, and industry in print and electronic formats, but which is not controlled by commercial publishers, i.e., where publishing is not the primary activity of the producing body."

focusing on software engineering (se) research, garousi et al. (2019) recently proposed the following definition: "grey literature can be defined as any material about se that is not formally peer-reviewed nor formally published."

those definitions show how broad the concept of gl is and that it can be produced in different ways; however, this breadth may lead to misunderstanding. for this reason, adams et al. (2016a) introduced some terms to distinguish the different concepts of grey, including grey literature, grey data, and grey information. the term "grey data" describes user-generated web content (e.g., tweets, blogs, videos). the term "grey information" refers to material informally published or not published at all (e.g., meeting notes, emails, personal memories). however, the se literature hardly distinguishes these terms; similarly, we considered all forms of grey data and grey information as gl in our work.

beyond the gl types, adams et al. (2016b) classified gl according to "shades of grey". in se, garousi et al. (2019) adapted these shades into three tiers, as shown in figure 1. in this figure, at the top of the pyramid is the "traditional literature", with scientific articles from conferences and journals; the rest of the pyramid comprises what we call the three tiers of gl. these tiers vary along two dimensions: control and expertise.
the first dimension runs between the extremes "low" and "high", and the second runs between the extremes "unknown" and "known". the darker the color, the less the source is moderated or edited in conformance with explicit and transparent knowledge creation criteria.

figure 1. the "shades" of grey literature, adapted from garousi et al. (2019).

recently, gl was used and investigated in se research for many purposes. primary studies, for instance, explored the gl available on several social media sources used by se practitioners: rainer and williams (2018) assessed the importance of blog posts to se research, and oliveira et al. (2021) investigated several java projects from github to evaluate developers' skills based on their source code activities.

the presence of gl in secondary studies was notable in the investigations conducted by zhang et al. (2020) and kamei et al. (2021), and in the increase in studies based on grey literature reviews (glr) (e.g., raulamo-jurvanen et al. (2017) and soldani et al. (2018)) and multivocal literature reviews (mlr) (e.g., garousi et al. (2017) and saltan (2019)). explaining these types of study, a glr is a secondary study that explores the evidence looking only at gl sources, while an mlr is a secondary study that searches both gl and traditional literature.

even with this increase in interest, the use of gl is recent in se research (zhang et al., 2020; kamei et al., 2021), and there are still gaps and diverging findings about gl in se research. for instance, kamei et al. (2021) identified a lack of understanding of what is considered a gl type, and previous studies provide different criteria to assess gl credibility (kamei et al., 2020; williams and rainer, 2019).

3 research questions

in this section, we state our research questions and the rationale for their purposes.

rq1: why do brazilian se researchers use grey literature?

rationale: recently, se practitioners have relied on social media and communication channels to share and acquire knowledge (storey et al., 2017). on the one hand, some researchers try to take advantage of its use in se research; for instance, rainer and williams (2018) explored the benefits and challenges of blog articles as evidence in se research. on the other hand, some concerns (e.g., lack of detail and lack of empirical methods) related to gl could make se researchers skeptical about its credibility (rainer and williams, 2019). in this broad question, we intend (i) to understand if brazilian se researchers are using gl and, if so, (ii) what motivates them to use it, or, if not, (iii) the reasons that lead them not to use gl.

rq2: what types of grey literature are used by brazilian se researchers?

rationale: according to adams et al. (2016a), gl has many forms, from traditional mediums such as question & answer websites and blogs to more dynamic mediums such as telegram and slack. for this reason, bonato (2018) emphasized the importance of exploring the gl definition and its types for each research area. there is a lack of understanding of gl types, precisely which ones brazilian se researchers use. this research question sought to investigate which gl sources brazilian se researchers often use; a better understanding of the gl types could guide future research in this area.

rq3: what are the criteria brazilian se researchers employ to assess grey literature credibility?
rationale: software engineering research uses gl sources, such as data provided by practitioners retrieved from several social media and communication channels. however, as gl is, by nature, not a peer-reviewed source, se practitioners are free to share their thoughts using social media, for instance, without worrying about methodological concerns. thus, it is essential to assess gl sources to ensure the selected gl is appropriate for the study. answering this question will help us understand the credibility criteria that brazilian se researchers consider.

rq4: what benefits and challenges do brazilian se researchers perceive when using grey literature?

rationale: according to storey et al. (2014), the se research community has increased its interest in gl given the widespread presence of se professionals using social media and communication channels. for instance, exploring stack overflow, zahedi et al. (2020) found some trends and challenges in continuous se that researchers could better explore. in this question, we are interested in understanding the (i) benefits and (ii) challenges that researchers may face when resorting to gl. answering this question is essential to understanding the potential benefits and challenges of using gl more broadly by researchers.

rq5: how do se researchers prioritize a set of criteria to assess grey literature credibility?

rationale: in our first investigation (kamei et al., 2020), we provided a set of criteria used by brazilian se researchers to assess gl credibility. previous literature (williams and rainer, 2019) also identified another set of criteria. in this question, we focus on understanding the importance of those criteria in assessing gl credibility.

rq6: what is the perception of brazilian se researchers about the different types of grey literature according to the perspective of control and expertise?

rationale: due to the diverse nature of gl types, some studies suggested that gl needs to be assessed in different ways (garousi et al., 2019). for this reason, adams et al. (2016b) classified its types according to the shades of grey. this classification is based on two dimensions: control and expertise. control refers to the rigor with which a source is produced; expertise is the extent to which the knowledge and authority of the producer can be determined. nevertheless, this understanding and classification are still confusing. this research question sought to understand how brazilian se researchers commonly perceive the gl types according to (i) control and (ii) expertise.

4 research methods

in this work, we followed linåker et al. (2015), using a survey methodology for data collection; the data were collected from a group of people sampled from a large population. we conducted two surveys. the first (survey 1) aimed to understand the brazilian se researchers' perceptions about gl. the second (survey 2) investigated only the brazilian researchers from the first survey who answered that they used gl. in the following sections, we detail the procedures used to conduct survey 1 with participants of a flagship se conference in brazil (section 4.1). then, we present the procedures used for survey 2, which focused on the researchers that have experience using gl (section 4.2). finally, we provide the methods used for the analysis of both surveys (section 4.3).
4.1 survey 1: initial investigation with the brazilian se researchers

in survey 1, we intended to gather a broad perception of the gl used by brazilian se researchers, focusing on understanding the motivations to use (or avoid) it, the types of gl used, the benefits and challenges, and the criteria used to assess its credibility.

4.1.1 survey design

we conducted our survey with participants of the 10th brazilian conference on software: practice and theory (cbsoft), the largest brazilian software conference, with many se researchers participating. it includes well-established and specialized satellite se conferences in its domain. our population comprises se researchers potentially interested in using gl in their research. we chose our sample using non-probabilistic sampling by convenience (baltes and ralph, 2021).

before sending the final survey version, an experienced researcher (a ph.d. se researcher with more than 15 years of experience in research) reviewed our draft. we also conducted a pilot study by randomly selecting two participants and explicitly asking for their feedback. we received feedback suggesting changing the order of some questions and re-writing others to make them more understandable to the target population.

we obtained the contact of all the 252 participants by asking the conference's general chair whether s/he could share this information with us, which s/he gently provided.1 we used two approaches to invite the researchers to answer our questionnaire. first, we placed posters on the event's walls and tables with a brief description of the work and the link to the online survey. second, we sent the actual survey to the 250 remaining participants of the event. in the invitation email, we briefly introduced ourselves, presented the research's purposes, highlighted that the invitation was addressed to cbsoft participants, and provided the link to the online survey. we also mentioned that the participant was free to withdraw at any moment and that all information stored was confidential. the survey was open for responses from september 26th to october 11th, 2019. we received a total of 76 valid answers (30.4% response rate). we did not consider the pilot survey answers.

1 in the period of this research, the brazilian general data protection law was not yet officially published.

4.1.2 survey respondents

among the survey respondents, 48.7% have a ph.d., 31.6% have a master's, 2.6% have a graduate specialization, 14.5% have a bachelor's degree, and 2.6% are undergraduates. among them, 72.4% are men, and 27.6% are women. table 1 presents the demographic information about the respondents and their experience using gl or not. this table shows that most respondents with ph.d. and master's degrees answered that they were using gl.

table 1. demographic information of the survey 1 respondents.
gender | level of course | used gl | not used gl
woman | doctorate | 5 | 5
man | doctorate | 24 | 3
woman | master | 4 | 2
man | master | 15 | 3
woman | expert | 1 | 1
man | expert | 0 | 0
woman | university graduate | 0 | 2
man | university graduate | 2 | 7
woman | technical education | 0 | 0
man | technical education | 0 | 0
woman | high school | 1 | 0
man | high school | 1 | 0

4.1.3 survey questions

our survey had 11 questions (three required, nine open). we used different question flows for those who used gl (they did not answer question 10) and for those who did not (they answered only questions 1 to 4 and questions 10 and 11). table 2 presents the questions covered in this survey.
4.2 survey 2: investigating brazilian se researchers that use grey literature

in this survey, we intended to conduct a follow-up investigation to collect perceptions only from the brazilian se researchers from survey 1 who answered that they had previously used gl. we focused on the perceptions of the different gl types concerning the dimensions of control and expertise.

4.2.1 survey design

using a non-probability sample by convenience (baltes and ralph, 2021), we invited by email, once again, the 53 researchers that participated in survey 1 and mentioned the use of gl. we first drafted our questionnaire and improved it through three sequential steps: 1) a pilot study with five ph.d. se researchers; 2) an assessment of the questionnaire by another se researcher specialist; and 3) feedback from a participant reporting a problem in the first hours after the survey opened. because of the latter, we closed the survey to stop receiving answers, deleted all answers previously received, and sent a new questionnaire version to the researchers. we opened the survey for answers from february 10th to march 4th, 2021. we received a total of 34 valid answers (64.1% response rate). we did not consider the pilot survey answers.

4.2.2 survey respondents

in this survey, as we retrieved our sample from the previous one, considering those who answered that they had used gl, we did not ask the same demographic questions (e.g., gender, academic degree). instead, we collected information about their experience in se research and in using gl in scientific articles.

table 2. questions covered in survey 1.
# | question | type of question | options of answers (for closed questions) | required? | rq
q1 | what is your e-mail? | open | - | no | -
q2 | what is your gender? | open | - | yes | -
q3 | please list the highest academic degree you have received. | closed | high school, technical education, university graduate, expert, master's degree, doctorate. | yes | -
q4 | have you used grey literature? if you never used it, go to question q10. | closed | yes, no. | yes | rq1
q5 | what sources of grey literature did you use? | open | - | no | rq2
q6 | in which conditions do you use grey literature? | open | - | no | rq1
q7 | in which conditions do you not use grey literature? | open | - | no | rq1
q8 | could you list any benefits in using grey literature? | open | - | no | rq4
q9 | could you list any challenges in using grey literature? | open | - | no | rq4
q10 | if you answered 'no' in question four, please state why you never used or avoided using grey literature. | open | - | no | rq1
q11 | what would be a reliable source of grey literature for you? | open | - | no | rq3

the respondents' profile in this survey was composed of 76.5% professors or researchers and 23.5% students (m.sc. or ph.d.). regarding se research experience, 55.9% of the respondents had more than ten years. considering the experience using gl, 47% had conducted between 2 and 5 scientific studies using gl, although 26.5% were unable to answer.

4.2.3 survey questions

our second survey had ten questions (six required, four open). table 3 presents the questions covered in this survey. before question 4, we produced and included a video2 to summarize and explain the “shades of gl” according to the level of control and expertise.

2 video explaining the “shades of gl” (in portuguese): https://youtu.be/hgmkvxiapr0

4.3 data analysis and synthesis

in both surveys, we employed a mixed-method approach based on both qualitative (section 4.3.1) and quantitative (section 4.3.2) methods to analyze data.
we used a qualitative approach when we were interested in questions about “what” and “how”, and a quantitative analysis using descriptive statistics to discuss frequency and distribution, as well as a correlation analysis between the dimensions of control and expertise for each gl type. we describe these methods in the following.

4.3.1 qualitative analysis

we used a qualitative approach based on the thematic analysis technique (braun and clarke, 2006). this process involved three se researchers with previous qualitative research experience (one ph.d. student (r1) and two ph.d. professors (r2–r3)) for both surveys. for survey 1, we performed an agreement analysis on the codes and categories generated by each researcher using the kappa statistic (viera and garrett, 2005). the kappa value was 0.749, indicating a substantial agreement level according to the kappa reference table (viera and garrett, 2005). for survey 2, we did not calculate kappa because the analysis occurred with the researchers working together. figure 2 presents a general overview of the process employed.

in the following, we detail the procedure used to analyze all the answers (adapted from pinto et al. (2019)) of both surveys, noting the differences between the two surveys:

1. familiarizing with data: the process starts with two independent researchers reading the answers of the survey respondents, as expressed in figure 2-(a).

2. initial coding: then, for survey 1, two independent researchers (r1 and r2) individually analyzed and added codes. for survey 2, the researchers analyzed, discussed, and coded together (r1 and r2, shown in a dotted box in figure 2). we used post-formed codes, labeling portions of text that expressed the meaning of the excerpts without any pre-formed code. the initial codes are temporary, since they still need refinement; we refined the emerging codes throughout the analysis. an example of coding is presented in figure 2-(b).

3. from codes to categories: here, we already had an initial list of codes. for survey 1, two researchers conducted this process individually (r1 and r2); for survey 2, it occurred with the two researchers working together (r1 and r2). this process begins by looking for similar codes in the data. we grouped the codes with similar characteristics into broader categories. eventually, we also had to refine the identified categories, comparing and re-analyzing them in parallel, using an approach similar to axial coding (spencer, 2009). figure 2-(c) presents an example of this process.

4. categories refinement: here, we have a potential set of categories. for both surveys, in a consensus meeting between r1 and r2 (figure 2-(d)), the categories were evaluated and the disagreements of interpretation were solved, looking for evidence that supported or refuted the categories found. we also renamed or regrouped some categories to describe the excerpts better. in cases where disagreements remained, we invited a third researcher (a ph.d. professor) to review and solve them for both surveys.
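to make the agreement analysis described in this section concrete, the following minimal sketch shows how a kappa statistic between two coders could be computed in python (the language we used for part of the analysis, see section 4.3.2); the labels are hypothetical and the scikit-learn library is an assumed dependency, so this is an illustration rather than the script actually used in the study.

# illustrative sketch only: hypothetical codes assigned by r1 and r2
# to the same five answer excerpts during initial coding.
from sklearn.metrics import cohen_kappa_score

codes_r1 = ["reliability", "practical evidence", "reliability", "updated information", "practical evidence"]
codes_r2 = ["reliability", "practical evidence", "scientific value", "updated information", "practical evidence"]

kappa = cohen_kappa_score(codes_r1, codes_r2)
# values between 0.61 and 0.80 read as substantial agreement (viera and garrett, 2005)
print(f"kappa = {kappa:.3f}")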
table 3. questions covered in survey 2.
# | question | type of question | options of answers (for closed questions) | required? | rq
q1 | what is your occupation? | closed | professor/researcher, student (m.sc. or ph.d.), other (open). | yes | -
q2 | how many years of experience do you have conducting se research? | closed | up to 1 year, from 1 to 3 years, from 4 to 6 years, from 7 to 9 years, 10 years or more. | yes | -
q3 | how many scientific studies have you conducted using gl as a source of evidence? | closed | i do not know, none, only one, from 2 to 5, from 6 to 10, more than 10. | yes | -
q4 | we are aware that the level of control varies from source to source. for this reason, we ask you to consider your most frequent experience with each source type in relation to the control dimension of the production. | closed | source types: adapted from maro et al. (2018); level of control: i did not consider it as a gl type, low control, moderate control, high control, no opinion. | yes | rq6
q5 | please explain what you considered to classify each source type with the control criteria presented in question 4. | open | - | no | rq6
q6 | we are aware that the level of expertise varies from source to source. for this reason, we ask you to consider your most frequent experience with each source type in relation to the expertise dimension of the production. | closed | source types: adapted from maro et al. (2018); level of expertise: i did not consider it as a gl type, low expertise, moderate expertise, high expertise, no opinion. | yes | rq6
q7 | please explain what you considered to classify each source type with the expertise criteria presented in question 6. | open | - | yes | rq6
q8 | considering a gl source with important information for your research, would you include the gl source if it is produced by/with... | closed | choices for expertise criteria: be produced by a renowned author, be produced by a renowned institution, be produced by a renowned company, be cited by other renowned sources, describe the methods of collection, cite an academic reference, cite a practitioner source, present information with rigor, present empirical data; choices for answers: no opinion, no, yes. | yes | rq5
q9 | could you cite any additional potential aspect to assess the credibility of a gl source that was not mentioned before? | open | - | no | rq6
q10 | we are planning to conduct future research about quality assessment in grey literature. please, could you inform your e-mail for future contact? | open | - | no | -

4.3.2 quantitative analysis

we based our quantitative investigation on three samples: (i) the answers from 76 se researchers to answer rq1; (ii) the answers from the 53 researchers that mentioned using gl to answer rq2, rq3, and rq4; and (iii) the answers from 34 researchers to answer rq5 and rq6. for the descriptive statistics, we highlight that one answer of a respondent could be related to more than one category found. in the investigations relating the gl types to the dimensions of control and expertise, we present boxplots to show the differences in interpretation of each gl type. we used spearman's rank correlation coefficient for the correlation analysis between the control and expertise perceptions for each gl type. to do so, we transformed the answers related to the level of control and expertise (low, moderate, high) into a numeric scale: low = 0, moderate = 50, and high = 100. for the quantitative data analysis, we used the r language and python, the latter with the support of google colab.3

3 https://colab.research.google.com
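as a concrete illustration of this transformation and correlation step, the sketch below maps hypothetical ordinal answers onto the 0/50/100 scale and computes spearman's coefficient for one gl type using scipy (an assumed dependency; the original colab scripts are not reproduced here). because spearman's correlation is rank-based, any order-preserving coding of low/moderate/high would yield the same coefficient.

# illustrative sketch only: hypothetical paired answers for one gl type,
# after removing "not a gl type" / "no opinion" responses to either dimension.
from scipy.stats import spearmanr

scale = {"low": 0, "moderate": 50, "high": 100}
control   = [scale[a] for a in ["low", "moderate", "high", "moderate", "high"]]
expertise = [scale[a] for a in ["low", "moderate", "high", "high", "high"]]

rho, p_value = spearmanr(control, expertise)
# the resulting coefficient is then interpreted per dancey and reidy (2004)
print(f"rho = {rho:.3f}, p = {p_value:.3f}")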
5 previous results

in this section, we summarize the findings of our first study to present answers to rq1–rq4. for a complete account of these research questions, consider reading the previous study (kamei et al., 2020). for each rq, we summarize the categories in tables, with the total number of occurrences of a given category in the column “#”. two observations are required: 1) the researchers may have reported more than one answer per question, and one answer may be grouped into different categories; and 2) some questions were not required; thus, the overall results might not reach 100% of respondents.

rq1: why do brazilian se researchers use grey literature?

in survey 1, we identified 53 se researchers using gl for research purposes. focusing on understanding better why and how se researchers are using gl or avoiding its use, we asked questions that included the motivations to use gl or the reasons to avoid it. in the following, we present a summary of (i) the motivations to use gl and (ii) the reasons to avoid or never use gl.

(i) motivations to use

table 4 presents the identified se researchers' motivations for using gl. in this table, the first column describes the motivation identified, followed by the number of respondents related to the category and the percentage relative to the total of se researchers that used gl (n=53). in the following, we briefly describe some motivations.

table 4. motivations to use gl.
motivation | # | %
to understand the problems | 28 | 52.8%
to complement research findings | 12 | 22.6%
to answer practical and technical questions | 10 | 18.9%
to prepare classes | 4 | 7.5%
to conduct government studies | 1 | 1.9%

to understand problems was the most cited motivation to use gl: several researchers noted the use of gl to understand or investigate a new topic, to search for something to solve problems, or to acquire specific information to deepen their knowledge. to complement research findings was the second most cited motivation, mentioned when the knowledge gained from the traditional literature is not enough for the investigation; for instance, a researcher noted the use of gl to complement the findings of a mapping study. to answer practical and technical questions was the third most cited motivation, related to the necessity of understanding the state of the practice in se. other motivations were mentioned to a lesser extent, such as to prepare classes and to conduct government studies.

(ii) reasons to avoid/never use

even though several motivations to use gl were identified, 50.9% of the se researchers (27/53) avoid using gl as a reference or to reinforce some claims in scientific studies. we also found some researchers that never used gl in any research situation (23/76 occurrences, 30.3%); we used this value to analyze the extent of each category about reasons to never use gl. of the 23 respondents that never used gl, only 15 answered the reason. table 5 presents the summary of the findings for this question. in the following, we briefly describe the reasons to avoid gl.

figure 2. example of a coding process used to analyze the questionnaire answers.

table 5. reasons to avoid/never use gl.
reason | # | %
lack of reliability | 6 | 26%
lack of scientific value | 3 | 13%
lack of opportunity to use | 3 | 13%

lack of reliability was the main reason that se researchers mentioned for not using gl. this is related to the lack of rigor with which gl sources are written and published, which affects their credibility. lack of scientific value was another category mentioned: the researchers were afraid that the use of gl would weaken a research paper when submitted to the peer-review process.
lack of opportunity to use was related to the nature of the research previously conducted and to the fact that gl is recent in the context of se.

summary of rq1: brazilian se researchers use gl motivated mainly to understand new topics, find information about practical and technical questions, and complement research findings. however, some researchers avoid gl, particularly as references in scientific papers, due to its lack of reliability and scientific value.

rq2: what types of grey literature are used by brazilian se researchers?

in this question, we explored the gl sources used by the 53 se researchers that mentioned using gl. table 6 lists these sources. in the following, we briefly present some of our findings.

q&a websites were the most common source mentioned, used to interact with other users, create content, post comments, and assess the content. examples of q&a websites mentioned were stack overflow and quora. blog posts were the second most common category found; blogs from renowned practitioners and from companies that produce a diversity of material and content for se and software development in general were mentioned. technical reports were mentioned by se researchers that used technical experience reports and surveys derived from industry and from national and international research groups. companies websites, such as those provided by google, facebook, and thoughtworks, containing information regarding their technologies, methods, and practices, were mentioned as sources used; some researchers reported browsing these websites to find news to help decision-making about a specific technology.

table 6. gl sources used by se researchers.
source | # | %
q&a websites | 16 | 30.2%
blog posts | 15 | 28.3%
technical reports | 14 | 26.4%
companies websites | 8 | 15%
preprints | 5 | 9.4%
books/book chapters | 5 | 9.4%
software repositories | 4 | 7.5%
videos | 3 | 5.7%
magazine articles | 3 | 5.7%
news articles | 2 | 3.8%

summary of rq2: brazilian se researchers are using several gl sources. the most common are q&a websites, blog posts, technical reports, and companies websites.

rq3: what are the criteria brazilian se researchers employ to assess grey literature credibility?

in this research question, we explored the answers to one open-ended question about the criteria se researchers use to assess gl credibility. table 7 summarizes our findings. in the following, we briefly describe the criteria identified.

renowned authors was the most cited criterion, in which se researchers considered the author's experience and reputation concerning the topic; for instance, martin fowler was cited as a notorious software engineer with much knowledge. renowned institutions was another crucial criterion, where se researchers assess whether renowned institutions or renowned research groups provided the gl content. cited by others was a criterion mentioned by researchers that considered trustworthy a source cited by others (studies or people). renowned companies was a criterion that considers it relevant when renowned software companies or portals produce the gl source.

table 7. criteria to assess gl credibility.
criteria | # | %
renowned authors | 15 | 28.3%
renowned institutions | 14 | 26.4%
cited by a renowned source | 8 | 15%
renowned companies | 7 | 13.2%

summary of rq3: whoever produces the gl content, whether a person, an institution, or a company, being considered renowned is a significant credibility criterion.
rq4: what benefits and challenges do brazilian se researchers perceive when using grey literature?

in this research question, we explored the benefits and challenges of gl use mentioned by se researchers. table 8 summarizes the benefits and table 9 the challenges. in the following, we briefly describe some of them.

table 8. benefits of the use of gl.
benefit | # | %
easy to access and read | 16 | 30.2%
provide practical evidence | 13 | 24.5%
knowledge acquisition | 13 | 24.5%
updated information | 6 | 11.3%
advance the state of the art/practice | 5 | 9.4%
different results from scientific studies | 3 | 5.7%

table 9. challenges of the use of gl.
challenge | # | %
lack of reliability | 34 | 64.2%
lack of scientific value | 15 | 28.3%
difficult to search/find information | 6 | 11.3%
non-structured information | 6 | 11.3%

(i) benefits

easy to access and read was the most common benefit mentioned, mainly because most gl sources are open access, are quickly recovered by free search engines, and their contents are usually easy to read. practical evidence was another essential benefit mentioned, showing that gl provides evidence from the se industry to understand the state of the practice. knowledge acquisition was mentioned as a benefit, as gl allows expanding knowledge with information different from what is usually obtained in the traditional literature. updated information was mentioned because the production of gl content happens fast compared with the traditional literature, mainly for technical content. advance the state of the art/practice was mentioned due to the importance of gl to better understand the industry and to provide evidence to find relevant gaps in the practice. different results from scientific studies was mentioned because some researchers considered gl essential to provide additional knowledge not yet available in the research area.

(ii) challenges

lack of reliability was the main challenge the researchers perceived: some questioned the reliability of the data retrieved from gl. lack of scientific value was the second most cited category; some researchers mentioned that they did not feel comfortable using gl as a reference in scientific works due to the research community's lack of recognition of this source. difficult to search/find information in gl sources was perceived as a challenge due to the diversity of sources: each source has its own structure and manner of providing access to the content, and it is not easy to replicate a study that used gl. non-structured information was mentioned due to the lack of a writing pattern and the large variety of formats in which gl sources are published, making it difficult to find information, for instance, using an automatic process.

summary of rq4: we found several benefits, the most common being that gl content is easy to access and read, which is important for knowledge acquisition, mainly regarding practical evidence derived from se practitioners. the most cited challenges in using gl in scientific research were its lack of reliability and scientific value.

6 results

in this section, we present answers to rq5 and rq6, both answered by the investigation of survey 2.
table 10. prioritized criteria to assess gl credibility.
criteria | # | %
renowned authors | 30 | 88.2%
renowned institutions | 30 | 88.2%
cited by a renowned source | 27 | 79.4%
cites academic source (a) | 26 | 76.5%
presents empirical data (a) | 26 | 76.5%
renowned companies | 25 | 73.5%
cites practitioner source (a) | 16 | 47.1%
rigor in presenting information (a) | 12 | 35.3%
describes the methods of collection (a) | 6 | 17.6%
(a) proposed in williams and rainer (2019)

rq5: how do se researchers prioritize a set of criteria to assess grey literature credibility?

in our second survey, we asked the 53 researchers to prioritize the importance of a set of criteria to assess gl credibility. these criteria were derived from our first investigation and from the williams and rainer (2019) study. we received answers from 34 se researchers. table 10 presents the result of the ranking prioritization of credibility criteria, revealing that the criteria perceived as most essential by se researchers are that the gl source is provided by renowned authors or renowned institutions, or is cited by a renowned source.

we also investigated whether the se researchers had any additional criteria to assess gl credibility not mentioned in the previous survey questions. by analyzing the answers, we did not find any new criterion unrelated to the criteria presented earlier in table 10. for instance, some researchers mentioned that a detailed description of the publication context is an important criterion; we considered it already contemplated by the rigor in presenting information criterion, previously mentioned by williams and rainer (2019). the author's experience with the topic was another criterion mentioned; we considered it related to the renowned authors criterion identified in our first survey.

summary of rq5: we assessed the prioritization of the credibility criteria identified in our first investigation, in addition to those identified in previous studies. we found that the criteria most valued by se researchers are that the gl is produced by a renowned source, cited by a renowned authority, cites an academic source, and presents empirical data.

rq6: what is the perception of brazilian se researchers about the different types of grey literature according to the perspective of control and expertise?

our last research question explored how the researchers perceived the different types of gl concerning the dimensions of control and expertise. these dimensions are used to classify the tiers of the “shades of gl”, and each dimension could be evaluated at three levels (low, moderate, high). figure 3 presents the results of the classifications according to the level of control, and figure 4 shows the results for the level of expertise.

even though we investigated different dimensions, interestingly, figures 3 and 4 present similar behaviors in some cases. for instance, for some gl types (e.g., blog posts, forums/lists of discussions), the low level was predominant in both dimensions. we also found similarities concerning the other levels for both dimensions; for instance, some types (e.g., materials training, news articles, software repositories, and tutorials) run between low (1st quartile) and moderate (2nd quartile), although, for a diversity of cases, the median behavior varied. we also found differences: for instance, considering the level of control for cases/services descriptions and guidelines, the classifications run between low (1st quartile) and moderate (2nd quartile).
in contrast, for the level of expertise of these gl types, we found outliers at the low level (1st quartile) and at the high level (3rd quartile). other classifications caught our attention: for instance, regarding the control dimension, the opinions about magazine articles are not uniform, as we identified some outliers at both extremes (low and high). we identified a similar classification for guidelines in the expertise dimension.

in addition to classifying the levels (low, moderate, and high) of the dimensions (control and expertise), we offered the researchers the possibility to choose the options “i did not consider it a gl type” or “i have no opinion.” we included these options because, even though previous studies (e.g., maro et al. (2018)) presented the gl types for se research, in our previous investigation (kamei et al., 2021) we identified different interpretations, for instance, some types not being considered as gl. table 11 shows the results of these classifications. comparing the findings presented in table 11 with the information presented in figures 3 and 4, we perceived that most of the gl types classified with high expertise and high control were also, many times, considered not to be a gl type (e.g., thesis, books/book chapters, and patents). moreover, we identified that patents are still unknown to several researchers.

rationale to employ the classification of each dimension (control and expertise)

we asked why the researchers employed the classifications of each gl type according to control and expertise. we identified four main reasons, summarized in table 12 and described in the following.

table 12. reasons to classify gl types according to the level of control and expertise.
reasons | # | %
rigor | 23 | 67.6%
producer reputation | 14 | 41.2%
researcher experience | 13 | 38.2%
peer interaction | 5 | 14.7%

figure 3. classification of each gl source type according to the level of control. each level of control indicates: low = 0; moderate = 50; high = 100.

rigor (23/34 occurrences). researchers considered the rigor (control) of each source's production, for instance, the degree of formality present. in this regard, one researcher pointed out: “technical reports, for instance, present systematic studies with high control (of production).” this category was also related to the credibility dimension, as one researcher affirmed: “i consider that credibility is directly related to the rigor of the publication/availability of an artifact.”

producer reputation (14/34 occurrences). the producer's reputation was considered an essential criterion to assess control and expertise, as one researcher pointed out: “the credibility relates to who is the author of the material and to the platform being conveyed.” another one mentioned: “depending on the publisher, i can consider high (e.g., elsevier) or low (e.g., autonomously published book) control. the same applies to news: the credibility of the source influences the level of control regarding stricter editorial control in favor of the integrity of the information.”

researcher experience (13/34 occurrences). the researchers' own experience was used to employ the classification. in this regard, one researcher pointed out: “i thought of the examples for each type that i have used and classified them according to my experience in dealing with each material.” another one mentioned: “i considered what i have read about grey literature.”

peer interaction (5/34 occurrences).
another criterion considered for assessing gl control and expertise was the users' interaction in gl sources. in this regard, one researcher mentioned: “another point is that if i have a lot of people interacting and building the content (such as q&a websites), i consider that it has a certain control in the final knowledge presented there.” another one pointed out: “in general, i consider the control to be higher when there is a peer review in some way, as in the case of theses and stack overflow.”

correlation analysis between the levels of the dimensions (control and expertise) and each gl type

we conducted our analysis using correlation statistics between the two variables (control and expertise) for each gl type using the spearman coefficient. we interpreted the spearman coefficient according to dancey and reidy (2004). to conduct this analysis, aiming to pair the samples, we removed the answers in which a respondent answered “i did not consider it a gl type” or “i have no opinion” for at least one dimension of the same gl type. based on the results of spearman's rank correlation presented in table 13, we identified 13 gl types (13/19; 68.4%) with correlations that vary from strong to very strong positive (p-value <= 0.05). it indicates that when the level of control increases, the expertise tends to increase. considering only the group of gl types that presented less than 95% significance, we identified six types. among these, 4 out of 6 (forums/lists of discussions, cases/services descriptions, keynote speeches, materials training) had moderate correlations; for the remaining two (books/book chapters and magazine articles), we identified negligible correlations.

figure 4. classification of each gl source type according to the level of expertise. each level of expertise indicates: low = 0; moderate = 50; high = 100.

table 13. types of grey literature: control and expertise correlation test. notes: *correlation is significant (strong) at rho >= 0.4 and p-value <= 0.05; **p-value is not zero (we used three decimal places).
type of grey literature | spearman coefficient | p-value
blog post | .441* | .017
book/book chapter | .106 | .607
case/soft. description | .341 | .082
forum/discussion list | .337 | .069
guideline | .518* | .004
keynote speeches | .305 | .101
magazine article | .167 | .377
manual | .620* | .000**
material training | .308 | .104
news articles | .525* | .003
patent | .550* | .027
q&a websites | .656* | .000**
slide presentation | .593* | .001
soft. repository | .652* | .000**
technical report | .527* | .005
thesis | .546* | .013
tutorial | .688* | .000**
video | .671* | .000**
white paper | .769* | .000**

correlation analysis between the levels of the dimensions (control and expertise) and the respondent profiles

after analyzing our data, a chi-square test of independence was conducted between the respondent profiles and their inclination to answer “i did not consider it a gl type” or “i have no opinion”. that is, we evaluated whether the fact that the respondent is a professor or not has any influence on not considering a source as gl or on not having an opinion. table 14 presents our results.

table 11. the types of gl for which se researchers have no opinion regarding the level of control and expertise, or do not consider as gl (not gl).
type of source | no opinion (control) | no opinion (expertise) | not gl
thesis | 0 | 1 | 12
patents | 7 | 10 | 7
books/book chapters | 2 | 1 | 6
magazine articles | 1 | 2 | 3
case/serv. desc | 1 | 5 | 3
manuals | 1 | 3 | 3
materials training | 0 | 3 | 3
software repositories | 0 | 3 | 3
blog posts | 1 | 3 | 2
forums / lists | 0 | 2 | 2
news articles | 0 | 3 | 2
slide presentations | 0 | 6 | 2
keynote speeches | 0 | 2 | 2
videos | 3 | 4 | 2
technical reports | 3 | 2 | 2
q&a websites | 1 | 3 | 1
guidelines | 1 | 4 | 1
tutorials | 0 | 4 | 1
white papers | 2 | 5 | 1

table 14. chi-square test (p-values) between respondent profiles and (i) not considered as gl, (ii) no opinion on control, and (iii) no opinion on expertise.
type of gl | i | ii | iii
blog post | .769 | .526 | .959
book/book chapter | .925 | .959 | .526
case/soft. description | .959 | .959 | .439
forum/discussion list | .959 | .999 | .959
guideline | .526 | .526 | .579
keynote speeches | .959 | .999 | .959
magazine article | .959 | .526 | .959
manual | .769 | .526 | .769
material training | .959 | .999 | .769
news articles | .959 | .999 | .769
patent | .883 | .393 | .726
q&a websites | .769 | .526 | .959
slide presentation | .959 | .999 | .925
soft. repository | .769 | .999 | .769
technical report | .959 | .769 | .769
thesis | .526 | .999 | .194
tutorial | .526 | .999 | .579
video | .959 | .769 | .579
white paper | .959 | .959 | .711

as we can see in table 14, we did not find a statistically significant association (p < 0.05) between the respondent profile and the inclination to have no opinion regarding the level of control and expertise, or to not consider a source as a gl type. therefore, based on our results, we did not reject any null hypothesis; i.e., the respondent profile did not influence the answers, or our sample is not large enough to show this influence.

we performed another chi-square statistical test to discover whether the respondent profiles affect their opinion on the low, moderate, or high level of control and expertise. for each factor (control or expertise) and gl type (blog posts, books/book chapters, etc.), we populated a 2x3 contingency table composed of row (respondent profile) and column (opinion as low, moderate, or high) variables. table 15 presents the p-value from the chi-square statistical test for each contingency table.

table 15. chi-square test between respondent profiles and (i) expertise level and (ii) control level.
type of gl | expertise | control
blog post | .785 | .100
book/book chapter | .958 | .722
case/soft. description | .632 | .293
forum/discussion list | .720 | .557
guideline | .769 | .853
keynote speeches | .185 | .853
magazine article | .539 | .692
manual | .496 | .069
material training | .316 | .690
news articles | .049 | .205
patent | .651 | .905
q&a websites | .567 | .289
slide presentation | .478 | .157
soft. repository | .387 | .261
technical report | .848 | .743
thesis | .746 | .844
tutorial | .132 | .707
video | .755 | .894
white paper | .925 | .752

table 15 shows the distribution of the p-values per comparison from each chi-square test of independence. as we can see, there is no evidence that different respondent profiles have different opinions; the only exception regards news articles in the expertise dimension. the contingency table (see table 16) summarizes the results of comparing the answers from professors/researchers and students for news articles. analyzing this result, we conclude that students consider news articles more believable.

table 16. contingency table from respondent profiles and the levels of expertise for news articles.
respondent profile | low | moderate | high
professors/researchers | 7 | 1 | 0
students | 8 | 13 | 0
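to illustrate this test, the sketch below runs scipy's chi-square test of independence on the news articles counts from table 16 (scipy is an assumed dependency, and the sketch is a reconstruction, not the original script); since the “high” column contains only zeros, it is dropped so that the expected frequencies remain valid.

# illustrative sketch only: respondent profile vs. expertise level for
# news articles (table 16), with the all-zero "high" column removed.
from scipy.stats import chi2_contingency

observed = [[7, 1],    # professors/researchers: low, moderate
            [8, 13]]   # students: low, moderate

chi2, p_value, dof, expected = chi2_contingency(observed)
# under these assumptions, p comes out near the .049 reported
# for news articles (expertise) in table 15
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")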
summary of rq6: we identified similar behaviors when considering the same gl type concerning the two dimensions, control and expertise: most gl types ran between the low and moderate levels in both dimensions. we also identified some differences, such as the median of answers being at the low level for the control dimension and at a moderate level for the expertise dimension. the production rigor, the producer's reputation, the researcher's experience, and the permission of peer interaction are the criteria employed by the researchers to assess a gl source. moreover, we found some misunderstandings about whether or not to consider some data sources as gl, mainly related to thesis, patents, magazine articles, and books/book chapters. considering the correlation analysis, we identified correlations varying from strong to very strong between the control and expertise dimensions for most gl types, showing that when one dimension increases, the other tends to increase too; the same happens when the level decreases. considering the researcher profile, we did not find evidence that different researcher profiles have different opinions, except for news articles.

7 discussion

in this section, we discuss each research question, relating them to previous studies (section 7.1). then, we discuss some findings out of the scope of the rqs that caught our attention (section 7.2). we also present some advice to se researchers based on the lessons learned in this research and previous knowledge (section 7.3). finally, we discuss some threats to the validity of this work (section 7.4).

7.1 revisiting findings

in this section, we discuss our findings for each rq. even though we addressed rq1–rq4 in our previous study (kamei et al., 2020), in this work we included additional discussions and considered other related works not mentioned before.

(rq1) motivations to use or reasons to avoid gl

(i) even though our first investigation showed several motivations and benefits in using gl, our second investigation shows that most researchers avoid its use as a reference in scientific papers. (ii) we organized the motivations to use gl into five categories, three of which were similar to previous works. for instance, rainer and williams (2019) and zhang et al. (2020) also discussed the motivation to complement research findings. another related motivation was to understand problems, identified in three studies (rainer and williams, 2019; neto et al., 2019; zhang et al., 2020).

(rq2) types of grey literature used

we did not find previous primary studies focusing on this research question. we found tertiary studies that investigated the gl types most found in selected studies. for instance, zhang et al. (2020) identified that the most common gl types used in their list of selected secondary studies were (in order) technical reports, blog posts, books/book chapters, and theses. considering the types of gl used by brazilian se researchers, the most common are q&a websites (e.g., stack overflow), blog posts (e.g., from se firms, such as netflix, uber, facebook), and technical reports (e.g., from sei). our investigation shows that most of these types are related to se practice, mainly retrieved from renowned firms or research institutions.

(rq3) criteria used to assess grey literature credibility

we found several criteria to assess gl credibility, most of them related to the gl producer being renowned (authors, institutions, and companies). these criteria caught our attention because we did not find any criterion mentioning the assessment of the gl content.
however, the challenge of lack of reliability we identified is related to this, and previous work (williams and rainer, 2019) has investigated a set of criteria to assess gl content (e.g., rigor in presenting information, presenting empirical data, describing the methods of data collection).

(rq4) benefits and challenges using grey literature

we identified some contradictory findings between the benefits and challenges of gl use. they are part of the trade-off between the traditional literature and the nature of gl. for instance, on the one hand, se researchers mentioned that it is easy to access and read gl content; on the other hand, they said it is difficult to search for/find information in it. the benefit is related to accessing the gl content without paywall restrictions and to the informal language in which it is usually written; however, these same characteristics hinder automatic data extraction. we identified another trade-off: despite the perceived benefit of advancing the state of the art/practice, several researchers avoid the use of gl due to the challenges of lack of reliability and lack of scientific value. in part, those trade-offs are expected, showing the necessity for further investigations on how to improve the use of gl in se research, as we have done in this research.

even though we confirmed some findings of the literature, the main benefit identified (easy to access and read) was not mentioned by previous studies (williams and rainer, 2017; rainer and williams, 2018, 2019; garousi et al., 2016). something similar occurred with the challenges: for instance, the lack of scientific value was not identified in previous studies, even though it was the second most mentioned challenge in our investigation. we note that the benefits identified in this study are related to the results of our tertiary study (kamei et al., 2021). regarding the challenges, some findings from previous works (zhang et al., 2020; kamei et al., 2021) did not appear here; for instance, the uncertain availability of gl was not identified in our investigation.

(rq5) prioritizing the criteria to assess grey literature credibility

this investigation confirmed some findings of survey 1 (kamei et al., 2020), showing that the most important credibility criteria are related to the gl source being produced by a renowned source. however, the prioritization partly contrasted with these findings because, in the survey 1 results, no criteria were related to assessing the gl content, while in survey 2 several se researchers considered citing academic sources and presenting empirical data to be important criteria. the criteria of citing academic sources, describing the collection methods, and presenting empirical data caught our attention due to the emphasis on applying scientific perspectives to assess gl sources. in our opinion, these criteria are difficult to use, as we discuss in the following: 1) according to williams (2018), online articles and blogs produced by se practitioners rarely mention academic sources; 2) gl sources are produced mainly by practitioners (kamei et al., 2021), and consultants/companies have different manners of expression than academics; and 3) most gl sources do not present empirical data; instead, they are primarily based on opinions and beliefs (rainer, 2017).

(rq6) types of grey literature vs. dimensions of control and expertise
some findings caught our attention because some gl types run between two, and sometimes three, levels in the classification of the dimensions, showing that different interpretations may occur for the same type. nevertheless, the correlation analysis showed a strong correlation between these interpretations for most of the gl types investigated. considering the respondents' profiles, different from what we expected, our statistical analysis based on the chi-square test showed that different respondent profiles shared similar opinions about each source type being considered gl or not, and concerning the levels of control and expertise.

the criteria used by se researchers to classify these dimensions are mostly related to the rigor of the source, the researcher's experience, and the interaction each gl type permits to its users. some of them, though, considered it challenging to classify considering only the source type, without a real example to be assessed in depth, as one researcher pointed out: “(...) the credibility will depend on who produced that content.” moreover, we perceived that sources mainly produced by companies and institutions (e.g., technical reports, books/book chapters, theses) were considered to have a moderate to high level of control and expertise. in contrast, the sources commonly produced by se practitioners (e.g., forums/lists of discussions, blog posts, videos) have a low level of control and expertise. these findings caught our attention because, in the rq2 results, the most used gl sources run between the low and moderate levels. it appears that the benefits and motivations to use gl outweigh the low level of control and expertise presented in these sources.

with these findings, we reinforce the claim of garousi et al. (2019) that it is complicated to assess the dimensions of control and expertise alone. although they could point us in one direction, other essential criteria include identifying gl's producer and content. for this reason, we advocate that se researchers use the concept of the “shades of gl” to classify and assess a gl source, because it recognizes the different perspectives of the nature of gl, although future investigations to set limits between the tiers of the shades are essential. beyond that, we claim the importance of employing objective criteria to assess gl sources and better permit the gl classification according to the shades; as our findings showed, it could even be essential to propose intermediate shades between the tiers.

7.2 other discussions

in this section, we discuss some findings and important discussions unrelated to a specific research question. first, we discuss the relations among the researchers' perceptions of gl. second, we describe the relationship between the credibility criteria and the dimensions investigated. lastly, we discuss our findings on the perceptions of the different gl types.

perceptions of grey literature

we identified relations between the perceptions of gl, as shown in figure 5. for instance, we identified some motivations to use gl related to some of the benefits identified (dashed line) and some reasons to avoid gl related to some challenges of gl use (dotted line). in what follows, we discuss some of them.
the motivation to use gl “to complement research findings” is related to the benefit of using gl to provide “different results from scientific studies”, as some respondents informed that the inclusion of gl could provide evidence not explored or identified in the research area. another one is “to answer practical and technical questions”, related to the benefit of “practical evidence”, which was not perceived using only the traditional literature. the reasons to avoid gl and the challenges identified are almost the same, except for the “lack of reliability” that hinders the replicability of the search for gl, which could be motivated by the “non-structured information” of a gl source.

credibility criteria vs. dimensions of control and expertise

the most important criteria identified to assess gl credibility are related to the “producer reputation” and the “rigor” presented in the gl source. the first is related to the source being produced by a renowned author or institution, or being cited by a renowned source. the second relates to how the information is presented, for instance, whether it describes the methods used to collect the data. figure 6 presents these criteria. we also identified some relations between the credibility criteria and some reasons to classify the control and expertise dimensions, as shown in figure 6. the control dimension (dashed line) is related to “peer interaction”, “producer reputation”, and “rigor”. for the expertise dimension (dotted line), the relations are the same as for control, with the addition of “researcher experience”; the latter refers to researchers using their own experience with gl to assess its credibility.

gl types interpretation

in our second investigation, we found some misunderstandings in interpreting gl types (see table 11), even though those types were recognized as gl in some previous se works (e.g., maro et al. (2018), zhang et al. (2020)). in the following, we present the most common types that were not considered gl: thesis (11/34 occurrences), patents (6/34 occurrences), books/book chapters (6/34 occurrences), and magazine articles (3/34 occurrences). in this regard, for instance, one researcher pointed out: “i understand that theses and dissertations are not grey because external researchers formally assess them.”

figure 5. relationships identified between the motivations to use gl and the benefits, and between the reasons to avoid gl and the challenges.

we also found in previous studies some contradictions in interpreting a source type as a gl type or not. for instance, while hosseinzadeh et al. (2018) considered books/book chapters as a gl type, the study of berg et al. (2018) did not. we identified another conflict: while neto et al. (2019) considered theses a peer-reviewed source, rodríguez-pérez et al. (2018) classified them as gl. these misunderstandings were also identified in the previous investigation with secondary studies (kamei et al., 2021). in our opinion, these misunderstandings reflect on each source's classification regarding control and expertise. for instance, for most researchers, books/book chapters, technical reports, theses, and patents were not considered a gl type and were related to a high level of control and expertise (figures 3 and 4). it shows that the boundary between the peer-review process and grey literature is unclear when considering only the source type.

7.3 lessons learned

with this investigation and the previous one (kamei et al., 2020), we showed how gl could contribute to se research.
however, some advice is important so that this use can be improved. for se researchers, our findings highlight the need for attention when searching, selecting, and using grey literature in se research: 1) explore the gl sources before using them in research, as there are several types of gl source, to understand what evidence each gl source could provide to benefit the research and how to retrieve information from it, given the difficulty of searching for gl; 2) be aware of a set of credibility criteria that could be used to assess gl sources, for instance, by selecting data produced by renowned sources (e.g., authors, institutions) and understanding how each credibility criterion could better fit each type of gl; 3) consider additional criteria to improve gl credibility, given the various interpretations of gl assessment related to the control and expertise aspects; and 4) understand how to improve the search for gl using a systematic approach, with methods and techniques to better deal with the content, aiming to reduce its lack of reliability.

figure 6. relationships identified between the grey literature credibility criteria and the dimensions of control and expertise.

7.4 threats to validity

this section discusses some limitations and threats to validity and what we have done to mitigate them.

construct validity: even with our efforts to improve our questionnaire, we identified two potential threats in our research: 1) in the questions where we asked the participant to classify each source type concerning the control and expertise dimensions, we mitigated the threat by informing the researchers that we know control and expertise vary from source to source, and by asking them to consider their most frequent experience with each data source; however, three researchers reported that assessing these dimensions without considering the content and the producer was difficult, and this difficulty may have introduced some bias; and 2) we used a non-probability sample by convenience (baltes and ralph, 2021) because we intended to investigate only se researchers with previous experience in gl use; thus, we surveyed only the 53 brazilian researchers we knew had this experience.

internal validity: as our investigation used personal interpretation, we may have introduced biases during the data extraction and analysis. we tried to minimize them by using a paired approach with constant discussion between the researchers and by invoking a third researcher to revise the derived codes and categories.

external validity: our first investigation used a sample of the se researchers from the largest se conference in brazil. in this first investigation, our sample was reasonably representative of se research, as we had a 30.4% response rate with a diversity of researchers (1/3 are women, 50% have a ph.d. in se, and 30% a master's). in our second investigation, we conducted our survey with the researchers from the first survey that mentioned they had used gl in se research; we received a 64.1% response rate. of these, almost 60% are professors or researchers with more than ten years of se research experience, and most have used gl in 2 to 5 scientific studies. nevertheless, as we focused on the brazilian se research community in both surveys, the findings may not apply to other populations, although we used the peer review process during all this research, aiming to improve the external validity to draw general conclusions.
conclusion validity: even with 30.4% and 64.1% response rates in the two surveys, we may have lost some important information. for the first investigation, we mitigated this threat by comparing our results with previous studies conducted with different populations, showing that our results were similar. even though we reached a considerable response rate in the second investigation, our sample was small and focused only on the brazilian se researchers' perspective, which limits the generalization of the results. another threat is related to the correlation analysis between the dimensions of control and expertise for each gl type, because we did not explicitly ask the respondents about this correlation.

8 related works

this section groups the related works into studies that explored gl's credibility and quality assessment in se research. for each study presented, we show the differences concerning our work.

the grey literature review (glr) conducted by raulamo-jurvanen et al. (2017) focused on understanding how se practitioners choose a test automation tool by investigating the opinions and experiences of se practitioners available in gl sources. they analyzed the gl sources' credibility during the quality assessment according to the number of readers, number of shares, number of comments, number of google hits for the titles, and a backlinks analysis (a reference comparable to a citation). our work differs because we provide different findings on assessing gl credibility. moreover, we also intend to understand the prioritization of a set of criteria identified in previous investigations (kamei et al., 2020; williams and rainer, 2019).

soldani et al. (2018) conducted another study based on a glr. this study investigated the pains and gains of the use of microservices. they perceived that the traditional literature on the topic is still at an early stage even though companies are working day-by-day with microservices, as witnessed by the considerable amount of gl on the subject. the authors considered a set of control-factor criteria to select gl sources: practical experience of the authors (5+ years), industrial case study, heterogeneity (presenting information about at least 5 top industrial domains), and implementation quantity (presenting detailed information). our work differs from this one because we focused on investigating and providing a set of general criteria that could be used to assess different types of gl sources.

williams and rainer conducted two studies to investigate how to improve the quality and credibility assessment of blog articles in se research. the first study (williams and rainer, 2017) examined some criteria to evaluate blog articles to be used as a source of se research evidence through two pilot studies (a systematic mapping study and preliminary analyses of blog posts). the findings showed some criteria for selecting a blog article's content (e.g., authentic, informative). the second study (williams and rainer, 2019) focused on finding credibility criteria to assess blog posts by selecting 88 candidate credibility criteria from the previous mapping study (williams and rainer, 2017). then, to gather opinions on a blog post and evaluate those credibility criteria, they surveyed 43 se researchers. some criteria were found, for instance, the presence of reasoning, reporting empirical data, and reporting data collection methods.
as discussed, unlike the previous related works, our criteria were not focused on a specific type of gl. moreover, our identified criteria differ from williams and rainer's, and we tried to understand what each se researcher considered when assessing the different types of gl. most recently, we conducted a tertiary study with secondary studies in se (kamei et al., 2021), presenting a critical review of gl use in secondary studies. in total, 446 studies were investigated, and 126 of them searched for or included gl as a primary source. this finding showed that gl was not widely used in the analyzed studies, although gl use increased over the years. the tertiary study explored the benefits, challenges, and motivations to use or avoid gl. our work differs from this previous one because we asked the se researchers directly, unlike investigations of published studies, where these questions were not directly explored, leaving the authors the option to include or omit that information. despite the similarity of these works to ours, there are differences in at least four points: i) we found a different set of credibility criteria: the source needs to be provided by renowned institutions or renowned companies, be cited by others, or be derived from academia; ii) we did not focus on a specific type of gl source; iii) we explored the experience of se researchers to understand their perspectives on the credibility of different gl types and how se researchers assess them; and iv) we investigated a set of prioritization criteria used to assess gl credibility.

9 conclusions and future works

although the use and investigation of grey literature in se research have increased over the last years, they are still recent. in this work, we reported two investigations based on the brazilian se researchers' perspective to present an overview of gl source usage, the potential benefits and challenges of gl use, a set of criteria to assess gl credibility, and the perceptions of gl types concerning the control and expertise criteria. our main findings show:

1. blogs, community websites, and technical experience/reports are the most common gl sources used by se researchers;
2. the main motivations to use gl are that its content can complement research findings by providing results that differ from scientific studies and can answer practical and technical questions;
3. gl use is not widespread as a scientific reference due to some credibility and reliability constraints;
4. the use of the "shades of gl" can help se researchers assess gl and interpret the different gl types, although we identified that se researchers have different interpretations of gl control and expertise;
5. the most relevant criteria used to assess gl credibility are that the gl source is provided by renowned authors, institutions, or companies, or is cited by a renowned source;
6. the most critical criteria to assess the control and expertise of a gl source are related to the producer's reputation and the rigor of the gl content presented;
7. there is a positive correlation between the dimensions of control and expertise for the credibility criteria of each gl type: when the level of control increases, the level of expertise tends to increase too (the sketch after this list illustrates one way such a correlation can be computed);
8. we did not find significant differences between the opinions of graduate students and professors/researchers concerning the control and expertise dimensions analyzed for each gl type.
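since the survey did not ask respondents about this correlation directly (see the threats to validity), the relationship in finding 7 has to be derived from the per-type ratings. the following is a minimal sketch of such an analysis, assuming hypothetical 5-point ratings of control and expertise per gl type; the rating values are illustrative placeholders, not the survey data.

```python
# correlating control and expertise ratings per gl type (illustrative data).
from scipy.stats import spearmanr

# mean 5-point ratings per gl type: (control, expertise); hypothetical values
ratings = {
    "blog": (2.1, 2.8),
    "community website": (2.4, 3.0),
    "white paper": (3.2, 3.5),
    "technical report": (3.6, 3.9),
    "preprint": (3.9, 4.2),
}

control = [c for c, _ in ratings.values()]
expertise = [e for _, e in ratings.values()]

# rank-based correlation, suited to the ordinal nature of likert-style ratings
rho, p_value = spearmanr(control, expertise)
print(f"spearman's rho = {rho:.2f} (p = {p_value:.3f})")
# a positive rho indicates that gl types rated higher on control also
# tend to be rated higher on expertise, as in finding 7.
```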
for replication purposes, all the data used in these investigations are available online at https://doi.org/10.5281/zenodo.5164714. for future work, we plan i) to expand our view by investigating other se research communities; and ii) to understand the gl credibility aspects in more depth, focusing on building an objective quality assessment instrument that covers these several types of gl.

references

adams, j., hillier-brown, f. c., moore, h. j., lake, a. a., araujo-soares, v., and summerbell, m. w. c. (2016a). searching and synthesising 'grey literature' and 'grey information' in public health: critical reflections on three case studies. systematic reviews, 5(1):164.
adams, r. j., smart, p., and huff, a. s. (2016b). shades of grey: guidelines for working with the grey literature in systematic reviews for management and organizational studies. international journal of management reviews, 19(4):432–454.
baltes, s. and ralph, p. (2021). sampling in software engineering research: a critical review and guidelines.
berg, v., birkeland, j., nguyen-duc, a., pappas, i. o., and jaccheri, l. (2018). software startup engineering: a systematic mapping study. journal of systems and software, 144:255–274.
bonato, s. (2018). searching the grey literature. rowman & littlefield.
braun, v. and clarke, v. (2006). using thematic analysis in psychology. qualitative research in psychology, 3(2):77–101.
coelho, j., valente, m. t., milen, l., and silva, l. l. (2020). is this github project maintained? measuring the level of maintenance activity of open-source projects. information and software technology, 1:1–35.
dancey, c. p. and reidy, j. (2004). statistics without maths for psychology: using spss for windows. prentice-hall, inc., usa.
garousi, v., felderer, m., and hacaloğlu, t. (2017). software test maturity assessment and test process improvement: a multivocal literature review. information and software technology, 85:16–42.
garousi, v., felderer, m., and mäntylä, m. v. (2016). the need for multivocal literature reviews in software engineering: complementing systematic literature reviews with grey literature. in proceedings of the 20th international conference on evaluation and assessment in software engineering, ease '16, pages 26:1–26:6, new york, ny, usa. acm.
garousi, v., felderer, m., and mäntylä, m. v. (2019). guidelines for including grey literature and conducting multivocal literature reviews in software engineering. information and software technology, 106:101–121.
hosseinzadeh, s., rauti, s., laurén, s., mäkelä, j.-m., holvitie, j., hyrynsalmi, s., and leppänen, v. (2018). diversification and obfuscation techniques for software security: a systematic literature review. information and software technology, 104:72–93.
kamei, f., wiese, i., lima, c., polato, i., nepomuceno, v., ferreira, w., ribeiro, m., pena, c., cartaxo, b., pinto, g., and soares, s. (2021). grey literature in software engineering: a critical review. information and software technology, page 106609.
kamei, f., wiese, i., pinto, g., ribeiro, m., and soares, s. (2020). on the use of grey literature: a survey with the brazilian software engineering research community. in proceedings of the xxxiv brazilian symposium on software engineering, sbes 2020, new york, ny, usa. association for computing machinery.
linåker, j., sulaman, s., maiani de mello, r., and martin, h. (2015). guidelines for conducting surveys in software engineering. technical report, lund university.
maro, s., steghöfer, j.-p., and staron, m. (2018). software traceability in the automotive domain: challenges and solutions. journal of systems and software, 141:85–110.
neto, g. t. g., santos, w. b., endo, p. t., and fagundes, r. a. a. (2019). multivocal literature reviews in software engineering: preliminary findings from a tertiary study. in proceedings of the acm/ieee international symposium on empirical software engineering and measurement, esem '19, pages 1–6.
oliveira, j. a., viggiato, m., pinheiro, d., and figueiredo, e. (2021). mining experts from source code analysis: an empirical evaluation. journal of software engineering research and development, 9(1):1:1–1:16.
petticrew, m. and roberts, h. (2006). systematic reviews in the social sciences: a practical guide, volume 11. blackwell publishing ltd.
pinto, g., ferreira, c., souza, c., steinmacher, i., and meirelles, p. (2019). training software engineers using open-source software: the students' perspective. in proceedings of the ieee/acm 41st international conference on software engineering: software engineering education and training, icse-seet '19, pages 147–157. institute of electrical and electronics engineers (ieee).
rainer, a. (2017). using argumentation theory to analyse software practitioners' feasible evidence, inference and belief. information and software technology, 87:62–80.
rainer, a. and williams, a. (2018). using blog articles in software engineering research: benefits, challenges and case–survey method. in proceedings of the 25th australasian software engineering conference, aswec '18, pages 201–209.
rainer, a. and williams, a. (2019). using blog-like documents to investigate software practice: benefits, challenges, and research directions. journal of software: evolution and process, 31(11):e2197.
raulamo-jurvanen, p., mäntylä, m., and garousi, v. (2017). choosing the right test automation tool: a grey literature review of practitioner sources. in proceedings of the 21st international conference on evaluation and assessment in software engineering, ease '17, pages 21–30. acm.
rodríguez-pérez, g., robles, g., and gonzález-barahona, j. m. (2018). reproducibility and credibility in empirical software engineering: a case study based on a systematic literature review of the use of the szz algorithm. information and software technology, 99:164–176.
saltan, a. (2019). do we know how to price saas: a multivocal literature review. in proceedings of the 2nd acm sigsoft international workshop on software-intensive business: start-ups, platforms, and ecosystems, iwsib 2019, pages 7–12. acm.
schöpfel, j. and prost, h. (2020). how scientific papers mention grey literature: a scientometric study based on scopus data. collection and curation.
soldani, j., tamburri, d. a., and heuvel, w.-j. v. d. (2018). the pains and gains of microservices: a systematic grey literature review. journal of systems and software, 146:215–232.
spencer, d. (2009). card sorting: designing usable categories. rosenfeld media.
storey, m.-a., singer, l., cleary, b., filho, f. f., and zagalsky, a. (2014). the (r)evolution of social media in software engineering. in proceedings of the on future of software engineering, fose '14. acm press.
storey, m.-a., zagalsky, a., filho, f. f., singer, l., and german, d. m. (2017). how social and communication channels shape and challenge a participatory culture in software development. ieee transactions on software engineering, 43(2):185–204.
stray, v. and moe, n. b. (2020). understanding coordination in global software engineering: a mixed-methods study on the use of meetings and slack. journal of systems and software, 170:110717.
tom, e., aurum, a., and vidgen, r. (2013). an exploration of technical debt. journal of systems and software, 86(6):1498–1516.
viera, a. j. and garrett, j. m. (2005). understanding interobserver agreement: the kappa statistic. family medicine, 37(5):360–363.
williams, a. (2018). using reasoning markers to select the more rigorous software practitioners' online content when searching for grey literature. in proceedings of the 22nd international conference on evaluation and assessment in software engineering, ease '18, pages 46–56. acm.
williams, a. and rainer, a. (2017). toward the use of blog articles as a source of evidence for software engineering research. in proceedings of the 21st international conference on evaluation and assessment in software engineering, ease '17, pages 280–285, new york, ny, usa. acm.
williams, a. and rainer, a. (2019). how do empirical software engineering researchers assess the credibility of practitioner-generated blog posts? in proceedings of the 23rd international conference on evaluation and assessment in software engineering, ease '19, pages 211–220. acm.
zahedi, m., rajapakse, r. n., and babar, m. a. (2020). mining questions asked about continuous software engineering: a case study of stack overflow. in li, j., jaccheri, l., dingsøyr, t., and chitchyan, r., editors, ease '20: evaluation and assessment in software engineering, trondheim, norway, april 15-17, 2020, pages 41–50. acm.
zhang, h., zhou, x., huang, x., huang, h., and babar, m. a. (2020). an evidence-based inquiry into the use of grey literature in software engineering. in proceedings of the 42nd international conference on software engineering, icse '20.

journal of software engineering research and development, 2023, 11:3, doi: 10.5753/jserd.2023.2581 this work is licensed under a creative commons attribution 4.0 international license.
investigating the relationship between technical debt management and software development issues

clara berenguer [ salvador university | claraberenguerledo@gmail.com ] adriano borges [ salvador university | arborges.12@gmail.com ] sávio freire [ federal institute of ceará and federal university of bahia | savio.freire@ifce.edu.br ] nicolli rios [ federal university of rio de janeiro | nicolli@cos.ufrj.br ] robert ramač [ university of novi sad | ramac.robert@uns.ac.rs ] nebojša taušan [ university of novi sad | nebojsa.tausan@ef.uns.ac.rs ] boris pérez [ francisco de paula santander university | br.perez41@uniandes.edu.co ] camilo castellanos [ university of los andes | cc.castellanos87@uniandes.edu.co ] darío correal [ university of los andes | dcorreal@uniandes.edu.co ] alexia pacheco [ university of costa rica | alexia.pacheco@ucr.ac.cr ] gustavo lópez [ university of costa rica | gustavo.lopezherrera@ucr.ac.cr ] manoel mendonça [ federal university of bahia | manoel.mendonca@ufba.br ] davide falessi [ university of rome tor vergata | d.falessi@gmail.com ] carolyn seaman [ university of maryland baltimore county | cseaman@umbc.edu ] vladimir mandić [ university of novi sad | vladman@uns.ac.rs ] clemente izurieta [ montana state university and idaho national laboratories | clemente.izurieta@montana.edu ] rodrigo spínola [ virginia commonwealth university and salvador university | spinolaro@vcu.edu ]

abstract

context: the presence of technical debt (td) brings risks to software projects. managers must continuously find a cost-benefit balance between the benefits of incurring td and the costs of its presence in a software project. much attention has been given to td related to coding issues, but other types of debt can also have impactful consequences on projects. aims: this paper seeks to elaborate on the growing need to expand td research to other areas of software development by analyzing six elements related to td management, namely: causes, effects, preventive practices, reasons for non-prevention, repayment practices, and reasons for non-repayment of td. method: we survey and analyze, quantitatively and qualitatively, the answers of 653 software industry practitioners on td to investigate how the previously mentioned elements are related to coding and non-coding issues of the software development process. results: coding issues are commonly related to the investigated elements but, indeed, they are only part of the td management stage. issues related to project planning and management, human factors, knowledge, quality, process, requirements, verification, validation, and test, design, architecture, and the organization are also common sources of td. we organize the results in a hump diagram and specialize it considering the point of view of practitioners that have used agile, hybrid, and traditional process models in their projects. conclusion: the hump diagram, in combination with the detailed results, provides guidance on what to expect from the presence of td and how to react to it considering several issues of software development. the results shed light on td management of software elements, beyond source code related artifacts.
keywords: technical debt, technical debt management, causes of technical debt, effects of technical debt, process model

1 introduction

technical debt (td) refers to postponed tasks or immature artifacts in software projects that can bring short-term benefits (e.g., higher productivity and lower costs) but may have harmful impacts in the long run (izurieta et al. 2012). by managing td items, software teams can reduce the risks associated with these items, such as unexpected delays in system evolution or difficulty in achieving the quality criteria defined for the project (rios et al. 2020). technical debt management (tdm) is a challenging endeavor. successful tdm is about reaching a balance between the benefits of incurring td and the later impacts of its presence in a software project (lim et al. 2012, guo et al. 2016). tdm must seek to define preventive practices to avoid potential td items and the appropriate actions to repay incurred debt (li et al. 2015, ribeiro et al. 2016, freire et al. 2020a, freire et al. 2020b). tdm requires knowledge of the causes that lead software teams to incur debt items and of the effects of their presence in software projects (rios et al. 2020, besker et al. 2020). knowing the causes of td can support software teams in understanding their project context and in defining preventive practices to avoid the debt. having information on td effects can aid in the prioritization of td items to be paid off, supporting a more precise impact analysis and the identification of corrective actions to minimize possible negative consequences of td items for the project. although it was initially associated with code-level issues, td can impact any type of software artifact and activity (alves et al. 2016, rios et al. 2018). for example, outdated requirement documentation can lead to code that does not meet user requirements. despite the growing number of studies on td, there is a clear concentration of studies investigating it from the perspective of source code and its related artifacts (zazworka et al. 2014, alves et al. 2016, rios et al. 2018). focusing solely on coding is risky, because td can affect many other software activities. but how can one identify and manage td related to different software activities? this paper elaborates on the growing need to expand td research to other areas of software development. it analyzes six elements related to tdm (causes, effects, preventive practices, reasons for non-prevention, repayment practices, and reasons for non-repayment of td) for several types of software artifacts and activities. the paper uses a subset of the data collected by the insightd project, a family of globally distributed surveys on causes, effects, and management of td (rios et al. 2020). this data set consists of data from six countrywide replications of the survey, totaling 653 responses from software practitioners. by investigating how practitioners face td in their projects, we gain insight into the state of practice regarding tdm, which allows us to identify existing gaps in tdm theory. the data are analyzed qualitatively and quantitatively to investigate whether the above listed tdm elements are more related to coding or to non-coding issues (e.g., planning and management, requirements engineering, human factors) of software development. this paper is based on our previous work by berenguer et al.
(2021), extending it by including:

• a more comprehensive analysis of the relation between td and non-coding activities,
• specializations of the hump diagram by process model (agile, hybrid, and traditional), and
• an analysis of the relation between td, coding, and non-coding activities by process model.

our results indicate that both coding and non-coding activities are commonly affected by td, but causes, effects, preventive practices, reasons for non-prevention, and reasons for non-repayment affect non-coding activities more than coding activities. for repayment practices, we found similar behaviors between the two groups (coding and non-coding activities). considering all the investigated tdm elements, some software development issues are more commonly reported by practitioners. planning and management issues and human factors stand out, but there are also several issues related to debt items such as process, knowledge, td management, and requirement engineering issues. concerning the analysis per process model, we found that practitioners following agile, hybrid, or traditional process models shared a similar view on td elements affecting coding activities. on the other hand, practitioners who use traditional process models have a different view from those using agile and hybrid process models on td elements affecting non-coding activities. results are presented with a hump diagram that, in combination with the analyses of each of the investigated td management elements, provides guidance on what to expect from the presence of td and how to react to it considering several issues of the software development process. in addition to this introduction, this paper has seven additional sections. section 2 presents background information on td research and related work. section 3 describes the methodology used. section 4 presents the results of this work. section 5 presents the hump diagram and its specializations by process model. section 6 summarizes the results and discusses their implications for researchers and practitioners. section 7 discusses the threats to validity. lastly, section 8 presents our concluding remarks.

2 background

td can be incurred at any time and in several artifacts throughout the software development process. as such, it has different characteristics depending on the time it was incurred and the activities it is related to, such as testing, code, build, documentation, and so on (alves et al. 2016). although td is a rising research topic, many studies focus solely on its relationship to source code. li et al. (2015) investigated studies on td and its management (tdm), carrying out a classification and thematic analysis to comprehensively understand the concept of td and to present an overview of the current state of research in tdm. their results show that code debt was the most cited type among the analyzed primary studies. alves et al. (2016) also reported a focus on approaches that identify td items from source code. the authors suggested that a possible explanation for this is the plethora of tools that perform source code analysis and can be used to support the detection of td. in another study, rios et al. (2018) presented fifteen types of td. the authors also indicated a concentration of studies focusing on source code and gave some explanations for this phenomenon.
the term td was first coined by cunningham (1992), who directly related it to source code, which may have influenced subsequent studies. furthermore, the types related to code tend to cause effects that can be felt more quickly by development teams. more recently, saraiva et al. (2021) performed a systematic mapping study to investigate the current state of the art of td tools, identifying which activities, functionalities, and types of debt are handled by the existing tools to support td management. the study identified 50 tools: 42 are new tools, and 8 extend an existing one. the main td types addressed by the tools deal with source code (60%, 30/50), architectural issues (40%, 20/50), and design issues (28%, 14/50). the tools were mainly distributed over the categories quantifying code properties, architectural smell detection, pattern matching, cost-benefit analysis, project management, and code smells. the authors also reinforce that this trend is in line with the original definition of td, which is heavily defined by concepts coming from source code and related issues. lenarduzzi et al. (2021) also performed a systematic mapping study to understand which td prioritization approaches have been proposed in research and industry. the results showed that code debt (38%), architecture debt (24%), and design debt (10%) are by far the most frequently investigated types of debt when considering td prioritization, although there is scant evidence on other types like test and requirement debt. thus, the approaches mainly involve models that reduce td by acting on source code, removing or refactoring code smells or improving other code metrics. such a concentration of studies at the code level is a worrying scenario because other types of debt can also have impactful or even worse consequences on projects. we claim that it is necessary to go beyond the source code and investigate other facets of td. we do so under the perspective of td causes, effects, prevention, and repayment, using data collected from the insightd project, presented in the next section.

3 research method

this section presents the insightd project in which this work is contextualized, our research questions, and the data collection and analysis procedures.

3.1 the insightd project

insightd is a family of globally distributed industrial surveys, present in countries such as brazil, chile, colombia, costa rica, the united states, and serbia. it aims to investigate the causes, effects, and management of td in software projects. several results of the project have been disseminated so far, for example: the empirical design of insightd and the results of its brazilian replication on causes and effects of td (rios et al. 2020), probabilistic diagrams of causes and effects of td (rios et al. 2019), the set of causes and effects of td collected from six insightd replications (ramač et al. 2022, freire et al. 2021b), the relation between td and process models (rios et al. 2021), td prevention (freire et al. 2020a, freire et al. 2021a), and practices and impediments to repay td items (freire et al. 2020b, perez et al. 2020, freire et al. 2021a, freire et al. 2021c). other results from the project can be found at http://www.td-survey.com/publication-map/. concerning the relation between td and coding or other development issues, we investigated it in our previous work (berenguer et al.
2021). in this paper, we further investigated it by including:

• a more comprehensive analysis of the relation between td and non-coding activities, as shown in section 4,
• specializations of the hump diagram by process model (agile, hybrid, and traditional), as presented in section 5, and
• an analysis of the relation between td, coding, and non-coding activities by process model, as discussed in subsection 5.2.

3.2 research questions

in this work, we investigate whether td management elements (causes, effects, prevention, and repayment) are more related to coding issues or to other software development issues. to this end, we consider the following research questions:

• rq1: are the causes of td more related to coding issues or other software development issues?
• rq2: are the effects of td more felt in coding issues or other issues in the software development process?
• rq3: is td prevention more related to coding issues or other issues in the software development process?
• rq4: are the reasons for not preventing td more related to coding issues or other development issues?
• rq5: is td repayment more associated with coding issues or other issues in the software development process?
• rq6: are the reasons for not paying td more related to coding issues or other development issues?

3.3 data collection

this study uses a subset of the available data, drawn from 18 questions of the insightd questionnaire. table 1 shows these questions and reports their type and the rq they refer to. questions q1 through q8 document the characteristics of the survey respondents. more specifically, in q8, the respondents inform the process model adopted in their projects, choosing one of the following options: agile (a lightweight process that promotes iterative development, close collaboration between the development team and business side, constant communication, and tightly-knit teams); hybrid (a combination of agile methods with other non-agile techniques, for example, a detailed requirements effort followed by sprints of incremental delivery); and traditional (conventional document-driven software development methods that can be characterized by extensive planning, standardization of development stages, formalized communication, significant documentation, and design up front). more information on the closed questions' options is available in rios et al. (2020). in q13, respondents provide an example of a td item that occurred in their projects. participants discuss causes of td in q16 through q18 and effects in q20. we use the answers given to these questions for answering rq1 (q16-q18) and rq2 (q20). concerning td prevention, participants give their responses in q22 and q23, and they address td repayment in q26 and q27. the answers given to these questions are used for answering rq3-4 (q22 and q23) and rq5-6 (q26 and q27). we invited only software practitioners from the brazilian, chilean, colombian, costa rican, north american, and serbian software industries, through linkedin, industry-affiliated member groups, and industry partners, to answer the survey.

table 1. subset of the insightd survey's questions (adapted from rios et al. (2020)).
rq | no. | question (q) description | type
- | q1 | what is the size of your company? | closed
- | q2 | in which country are you currently working? | closed
- | q3 | what is the size of the system being developed in that project? (loc) | closed
- | q4 | what is the total number of people of this project? | closed
- | q5 | what is the age of this system up to now or to when your involvement ended? | closed
- | q6 | to which project role are you assigned in this project? | closed
- | q7 | how do you rate your experience in this role? | closed
- | q8 | which of the following most closely describes the development process model you follow on this project? | closed
- | q10 | in your words, how would you define td? | open
- | q13 | please give an example of td that had a significant impact on the project that you have chosen to tell us about: | open
rq1 | q16 | what was the immediate, or precipitating, cause of the example of td you just described? | open
rq1 | q17 | what other cause or factor contributed to the immediate cause you described above? | open
rq1 | q18 | what other causes contributed either directly or indirectly to the occurrence of the td example? | open
rq2 | q20 | considering the td item you described in question 13, what were the impacts felt in the project? | open
rq3-4 | q22 | do you think it would be possible to prevent the type of debt you described in question 13? | closed
rq3-4 | q23 | if yes, how? if not, why? | open
rq5-6 | q26 | has the debt item been repaid (eliminated) from the project? | closed
rq5-6 | q27 | if yes, how? if not, why? | open

3.4 data analysis procedures

the analysis procedures are divided into three steps: demographics, preparing data for analysis, and data classification and analysis.

3.4.1 demographics

we calculate the number of respondents choosing each option available in the closed questions of the survey.
subsequently, we summarize the participants' characterization.

3.4.2 preparing data for analysis

for the open-ended questions, we applied a coding process (strauss and corbin 1998). for the answers given to q16 through q18 and q20, we used the coding process described in rios et al. (2020) to identify a set of causes and effects, as well as the number of occurrences of each. to exemplify, let us consider the answers given by two respondents in q16: "poorly developed code" and "low quality code". as these answers are associated with problems in source code, they were unified under the cause sloppy code. we used the coding process described in freire et al. (2020a) to code the responses to q23. from this process, we identified practices for td prevention when q22 received a positive response; otherwise, we identified reasons for td non-prevention. an example of this process is as follows: two respondents provided the following answers in q23 when q22 had a negative answer: "requirements are always going to change during development..." and "because when the client asks for features abruptly, no matter how generalized the architecture is towards the problem, with an outlier there may be, that can mean a refactor of the code, and that could dirty the code, reducing its maintainability". as these answers are associated with requirements change requests, they were unified under the reason for td non-prevention requirements change. finally, we coded the responses to q27 using the coding procedure described in freire et al. (2020b). similarly, if q26 received a positive response, we identified td repayment practices; otherwise, we identified non-repayment reasons. for both prevention and repayment, we also obtained a list of practices and reasons and their corresponding number of occurrences. for example, two respondents provided the following answers in q27 when q26 had a positive answer: "we rewrote the offending code" and "it was fixed, code was refactored and greatly simplified". these answers were unified under the repayment practice code refactoring. at least two researchers from each replication team participated in the coding process. the brazilian replication team created the first codified list of causes, effects, prevention practices, reasons for not preventing, repayment practices, and reasons for not repaying, which was distributed to the other replication teams in order to standardize the nomenclature used. consistency was verified by the brazilian replication team.
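to make the mechanics of this counting step concrete, the following is a minimal sketch of how coded answers can be tallied, assuming a hypothetical, hand-built mapping from raw answers to unified codes (in the study, the mapping was produced manually by the researchers, not by any automated matching):

```python
# tallying coded open-ended answers; the mapping below is a hypothetical
# illustration of codes the researchers would have agreed on manually.
from collections import Counter

# raw q16 answers mapped to their unified cause codes
coded_answers = {
    "poorly developed code": "sloppy code",
    "low quality code": "sloppy code",
    "lack of prioritization of activities": "inappropriate planning",
    "deficiency in project planning (disorganization)": "inappropriate planning",
    "not following planning": "not effective project management",
}

# number of occurrences for each cause code
occurrences = Counter(coded_answers.values())
for cause, count in occurrences.most_common():
    print(f"{cause}: {count}")
```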
3.4.3 data classification and analysis

we began by analyzing the codes of each td management element to determine whether they are related to coding issues or to other software development issues. repayment practices such as bug fixing, code refactoring, and code reuse, for example, were classified as practices related to coding issues. however, the repayment practices prioritizing td items and updating system documentation were linked to other software development issues. this procedure was carried out independently by the first and second authors. the third (prevention and repayment) and fourth (causes and effects) authors reached an agreement. the final classification was also reviewed by the last author. next, we classified the td management elements related to the other software development issues using the grouping process defined by strauss and corbin (1998). the categories show the relationship between software development process issues (for example, requirement engineering issues, planning and management issues, and human factors issues) and each td management element. the names of the categories are derived from the ongoing process of grouping the td management elements around the central concern to which they are related. the causes deadline and inappropriate planning, for example, are part of the category planning and management issues, whereas the effects team demotivation and dissatisfaction of the parties involved are part of the category human factors. this procedure was carried out independently by the first and second authors. the third (prevention and repayment) and fourth (causes and effects) authors reached a consensus, and the final result was reviewed by the last author.
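the category counts reported in section 4 follow directly from this grouping. the sketch below shows the aggregation under a small, illustrative subset of the cause-to-category mapping, with citation counts taken from table 2; the full mapping covers all 96 causes.

```python
# aggregating coded causes into categories and computing citation shares.
# the mapping and counts are a small illustrative subset of the real data.
cause_category = {
    "deadline": "planning and management",
    "inappropriate planning": "planning and management",
    "lack of experience": "human factors",
    "lack of technical knowledge": "knowledge issues",
}
citations = {
    "deadline": 169,
    "inappropriate planning": 83,
    "lack of experience": 58,
    "lack of technical knowledge": 80,
}

totals = {}
for cause, count in citations.items():
    category = cause_category[cause]
    totals[category] = totals.get(category, 0) + count

grand_total = sum(totals.values())
for category, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {count} citations (~{100 * count / grand_total:.0f}%)")
```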
4 results

participants were asked to provide a definition of td (q10) and then an example of a significant td item from their professional experience (q13). as detailed in rios et al. (2020), the answers provided to q13 were used as a criterion for the inclusion of participants. if they did not provide a valid example, their responses were discarded. in total, we considered the responses of 653 professionals from six countries (brazil = 107, chile = 89, colombia = 134, costa rica = 145, serbia = 79, us = 99). next, we present the characterization data of the participants, as well as the answers to the research questions posed in this study.

4.1 demographics

figure 1 presents the demographic information. half of the participants identified themselves as developers, but managers (17%), testers (7%), software architects (13%), and other roles (13%) also answered the questionnaire. besides, the participants described their experience level in their role. the majority of them are competent (good working and background knowledge of the area of practice, 34% of the participants), followed by proficient (depth of expertise in the discipline and area of practice, 31%), expert (authoritative understanding of the discipline and deep tacit knowledge throughout the area of practice, 21%), beginner (working knowledge of key aspects of practice, 12%), and novice (minimal or "textbook" knowledge without connecting it to practice, 2%). the majority of the participants worked in middle-sized companies (39%), followed by small (32%) and large (29%) ones. further, participants normally worked in teams composed of 5-9 people (34%), but participants working in teams with 10-20 people (22%), less than five people (20%), more than 30 people (16%), and 21-30 people (8%) also answered the questionnaire. concerning the process models adopted, the participants followed hybrid (45%), agile (42%), and traditional (13%) process models. regarding the systems, the respondents normally worked with systems with 10-100 kloc (35%), followed by systems with 100 kloc-1 mloc (30%), less than 10 kloc (14%), 1-10 mloc (14%), and more than 10 mloc (7%). lastly, the majority of the systems are 2-5 years old, followed by systems 1-2 years old (23%), less than one year old (17%), 5-10 years old (15%), and more than 10 years old (11%).

figure 1. participants' demographics.

in summary, our data set is composed of answers from practitioners with different organization and team sizes, system ages and sizes, roles, experience levels, and adopted process models. in the following subsections, we present the detailed results for each investigated td management element. we use the same structure when describing the results. for example, for the element td cause, we initially (i) present the overall result. next, we (ii) discuss the causes related to coding issues. then, we (iii) present the causes related to the other software development issues, and (iv) analyze the types of those issues (e.g., planning and management, human factors, knowledge issues).

4.2 rq1: are the causes of td more related to coding issues or other software development issues?

in total, 96 causes that lead to the occurrence of td were identified (see footnote 1), totaling 1695 citations. of this total, ~92% were related to other development issues, while only ~8% were related to code. this indicates a significant difference between the two subsets, representing a tendency of other software development issues to influence the occurrence of td items. there are 13 causes related to coding. the ten most commonly cited are presented in the second column of table 2. the complete list is available at https://bit.ly/37bopif. the causes non-adoption of good practices, sloppy code, and lack of refactoring stand out. all of them indicate issues that compromise the internal quality of the product. alternatively, we identified 83 causes related to other software development issues. the three most commonly cited (third column of table 2) reflect concerns focused on project management and planning: deadline, not effective project management, and inappropriate planning. other issues related to the team's lack of technical knowledge and experience, pressure, and processes were also commonly mentioned.

table 2. the 10 most cited causes related to coding and other software development issues.
rank | coding: cause | # | other development issues: cause | #
1st | non-adoption of good practices | 54 | deadline | 169
2nd | sloppy code | 21 | not effective project management | 98
3rd | lack of refactoring | 17 | inappropriate planning | 83
4th | external component dependency | 12 | lack of technical knowledge | 80
5th | adoption of contour solutions as definitive | 11 | producing more at the expense of quality | 67
6th | lack of reuse practices | 5 | inappropriate / poorly planned / poorly executed test | 59
7th | lack of automated testing | 5 | lack of experience | 58
8th | discontinued component | 4 | inaccurate time estimate | 56
9th | concern with just back-end development | 4 | lack of qualified professional | 54
10th | inadequate data model | 3 | pressure | 53

footnote 1: some causes seem to overlap. for example, non-adoption of good practices could cover the causes lack of refactoring or lack of reuse practices. however, the cause non-adoption of good practices refers to the non-use of good practices that would facilitate the accomplishment and maintenance of activities in the project, as can be observed in the following responses from participants: "employment of bad design practices" and "lack of use of good software development practices". on the other hand, lack of refactoring refers to situations in which the team does not improve the internal structure of the code without changing its external behavior, as exemplified in "lack of code refactoring" and "there was no code refactoring at the beginning of the problem". in turn, lack of reuse practices occurs when existing software components or software component knowledge are not used in the construction of new software, for example, "need to create the culture of reusability". another example of overlap encompasses the causes not effective project management and inappropriate planning. the cause not effective project management refers to inadequate management during project development, as reported in: "not following planning" and "lack of understanding of managers". differently, the cause inappropriate planning refers to issues in project planning, for example, "lack of prioritization of activities" and "deficiency in project planning (disorganization)".
we observed that those causes were related to each other and grouped them, identifying 14 categories of causes that reflect the main concerns practitioners have during the development of software projects:

• planning and management: refers to causes related to the project's planning and management issues. some examples are deadline, inappropriate planning, and not effective project management;
• human factors: groups causes related to people's participation in project issues. some examples are lack of experience and lack of commitment;
• knowledge issues: groups items originating from concerns around the knowledge of team members. two examples are lack of technical knowledge and lack of domain knowledge;
• requirements engineering: encompasses the causes related to requirements issues. examples are change of requirements and requirements elicitation issues;
• verification, validation, and testing: encompasses the causes related to the execution of quality assurance activities. two examples are inappropriate/poorly planned/poorly executed test and lack of code review;
• architectural issues: groups causes related to decisions made regarding the software architecture. examples are inadequate technical decisions and problems in architecture;
• process issues: refers to causes related to the definition or execution of the processes used in the development of the software. two examples are lack of a well-defined process and lack of traceability of bugs;
• design issues: encompasses causes related to the design of the software. there are two causes in this category: poor design and changes in design;
• documentation: groups causes related to documentation. examples of causes in this category are nonexistent documentation and outdated/incomplete documentation;
• external factors: refers to causes associated with external factors, such as customer does not listen to the project team and structural change in the involved organizations;
• infrastructure issues: encompasses causes related to problems in the software development infrastructure, such as required infrastructure unavailable and updating existing tools;
• organizational issues: groups causes from the organizational context, such as lack of awareness of the importance of testing and refactoring and organizational misalignment;
• quality issues: refers to causes (lack of quality) associated with lack of quality in software artifacts;
• td management: encompasses causes related to the management of td items. this category has only the cause lack of perception of the importance of dealing with td.

table 3 shows the categories together with the corresponding number of causes, number of citations, and percentage of cited causes in relation to the other categories. the category planning and management stood out with ~47% of citations, representing more than three times the citations of the second-ranked category. this is an indication that the causes of the occurrence of td are strongly related to project management issues. the results also highlight the importance of human factors, occupying the second position with ~13% of citations. this result is somewhat aligned with previous work on social debt (tamburri et al. 2015, martini et al. 2019). concerns related to requirements engineering and issues related to knowledge were also commonly mentioned.

table 3. categories of causes related to other software development issues.
categories of causes | #causes | #cited causes | ~%cited causes
planning and management | 22 | 733 | 47%
human factors | 10 | 206 | 13%
knowledge issues | 7 | 128 | 9%
requirement engineering | 7 | 120 | 8%
vv&t | 6 | 91 | 6%
architectural issues | 6 | 63 | 5%
process issues | 6 | 54 | 4%
design issues | 2 | 45 | 3%
documentation | 4 | 37 | 2%
external factors | 4 | 25 | 2%
organizational issues | 3 | 25 | 2%
infrastructure issues | 4 | 15 | 1%
quality issues | 1 | 12 | 1%
td management | 1 | 1 | 0.1%

4.3 rq2: are the effects of td more felt in coding issues or other issues in the software development process?

the participants reported a total of 73 td effects, totaling 980 citations. among them, ~64% are related to other development issues and ~36% are related to coding. there are 18 coding-related effects experienced by the participants. the 10 most commonly cited are presented in table 4 (second column). the full list is available at https://bit.ly/37bopif. concerns about the capacity of the team to evolve the code, rework, and the need to employ refactoring practices to improve the internal quality of the software are common. other common effects are bad code, low performance, and stopping development for debt repayment.
table 4. the 10 most cited effects related to coding and other development issues.
rank | coding: effect | # | other development issues: effect | #
1st | low maintainability | 97 | delivery delay | 141
2nd | rework | 86 | low external quality | 78
3rd | need of refactoring | 35 | financial loss | 55
4th | bad code | 31 | increased effort | 41
5th | low performance | 28 | stakeholder dissatisfaction | 34
6th | stop dev. activities for debt repayment | 14 | team demotivation | 24
7th | increase in the amount of maint. activities | 13 | stress with stakeholders | 23
8th | difficulty in impl. the system | 10 | team overload | 16
9th | low code reuse | 8 | fall in productivity | 13
10th | low reliability | 7 | project not completed | 13

we identified 55 effects related to other development issues. the four most commonly cited (third column of table 4) reflect concerns on project management and planning (delivery delay, increased effort, financial loss) and the external quality of the product (low external quality). issues related to human factors were also commonly cited, with emphasis on stakeholder dissatisfaction, team demotivation, and stress with stakeholders. table 5 shows the categories of effects related to other software development issues. the category planning and management has ~47% of citations, revealing that managerial aspects of software development are commonly affected by the presence of debt items. next is the human factors category, with ~18% of the cited effects, showing that td also impacts human aspects of software development. quality issues are also a common concern. the other categories are less commonly cited.

table 5. categories of effects related to other software development issues.
categories of effects | #effects | #cited effects | ~%cited effects
planning and management | 15 | 297 | 47%
human factors | 7 | 110 | 18%
quality issues | 6 | 110 | 18%
vv&t | 3 | 23 | 4%
design issues | 2 | 21 | 3%
knowledge issues | 8 | 21 | 3%
architectural issues | 4 | 18 | 3%
organizational issues | 3 | 10 | 2%
documentation | 1 | 6 | 1%
process issues | 2 | 4 | 1%
requirement engineering | 2 | 4 | 1%
infrastructure issues | 1 | 3 | 0.5%
td management | 1 | 2 | 0.3%

4.4 rq3: is td prevention more related to coding issues or other issues in the software development process?

the data show a total of 89 practices to support the prevention of td items, resulting in 819 citations. of these, ~84% are items related to other development issues, while only ~16% are associated with code. this result indicates a tendency for other development issues to play a key role in the prevention of td. we identified a total of 13 td prevention practices related to coding. table 6, second column, presents the 10 most cited items. the complete list is available at https://bit.ly/37bopif. adoption of good practices, using good design practices, refactoring, code review, increasing time for analysis and design, use of the most appropriate version of the technology, and appropriate reusing of code are the prevention practices most cited by the participants. the adoption of good practices and using good design practices reflect concerns that practitioners should have when carrying out their coding and design activities. the practices refactoring and code review are related to the continuous improvement of the code under development.
lastly, increasing time for analysis and design, use of the most appropriate version of the technology, and appropriate reusing of code are related to concerns that teams must have around an adequate analysis of the functionalities, implementation of the software structure, and software reuse, respectively.

table 6. top 10 most commonly cited td prevention practices related to coding or other development issues.
rank | coding: prevention practice | # | other development issues: prevention practice | #
1st | adoption of good practices | 49 | well-defined requirements | 57
2nd | using good design practices | 26 | better project management | 43
3rd | refactoring | 12 | providing training | 36
4th | code review | 10 | follow the proj. planning | 34
5th | increasing time for analysis and design | 7 | improving software development process | 33
6th | use the appropriate version of the tech. | 7 | improve documentation | 26
7th | appropriate reusing of code | 6 | well planned deadlines | 26
8th | version control | 5 | better project planning | 24
9th | considering technical constraints | 4 | creating tests | 24
10th | improving the project maintainability | 4 | allocation of qualified professionals | 23

on the other hand, we found 76 prevention practices related to other development issues. table 6 (third column) shows the ten most cited. interestingly, five of them reflect different concerns spread through the software development process, such as management (follow the project planning and better project management), the process itself (improving software development process), the documentation (well-defined requirements), and the qualification of the team (providing training). we see in table 7 that td prevention practices are commonly related to project management issues (~34%). the results also highlight the importance of the process followed by the team, ranking second (~12%) among the most cited categories. concerns related to requirements, vv&t, td management, and human factors were also commonly mentioned.

table 7. categories of prevention practices related to other software development issues.
categories of prevention practices | #practices | #cited practices | ~%cited practices
planning and management | 21 | 232 | 34%
process issues | 8 | 80 | 12%
requirement engineering | 5 | 69 | 11%
vv&t | 11 | 67 | 10%
td management | 7 | 64 | 10%
human factors | 11 | 61 | 9%
knowledge issues | 4 | 51 | 8%
documentation issues | 2 | 28 | 4%
architectural issues | 3 | 27 | 4%
organizational issues | 2 | 4 | 1%
infrastructure issues | 2 | 3 | 1%

4.5 rq4: are the reasons for not preventing td more related to coding issues or other development issues?

participants reported 25 reasons that lead to the non-prevention of td items, resulting in 63 citations. of these, ~87% are related to other development issues, while only eight (~13%) are related to coding. again, other development issues have an important role in preventing td. there are only four reasons related to code leading teams not to prevent the occurrence of debt items: lack of technical knowledge, lack of good technical solutions, lack of concern about maintainability, and continuous change of coding standards. on the other hand, we found 21 reasons related to other software development issues (the 10 most cited are presented in table 8). short deadline was the most cited.
table 8. top 10 most cited reasons for not preventing td related to other development issues.
rank | reason | #
1st | short deadline | 14
2nd | ineffective management | 7
3rd | lack of predictability in the soft. development | 5
4th | requirements change | 5
5th | pressure for results | 4
6th | documentation issues | 2
7th | lack of process maturity | 2
8th | lack of qualified professionals | 2
9th | legacy system difficult to heal | 2
10th | accepting the td | 1

table 9 shows the categories identified. planning and management once again stands out with ~38% of citations. the other categories were less commonly cited, each with fewer than seven citations. although not often mentioned, the result suggests that other issues related to software development can also negatively influence teams in td prevention.

table 9. categories of reasons for td non-prevention related to other software development issues.
categories of reasons | #reasons | #cited reasons | ~%cited reasons
planning and management | 2 | 21 | 38%
requirement engineering | 2 | 6 | 11%
coding | 1 | 5 | 9%
external factors | 2 | 5 | 9%
human factors | 4 | 4 | 8%
process issues | 2 | 3 | 6%
design issues | 1 | 2 | 4%
documentation issues | 1 | 2 | 4%
knowledge issues | 1 | 2 | 4%
td management | 2 | 2 | 4%
architectural issues | 1 | 1 | 2%
infrastructure issues | 1 | 1 | 2%
organizational issues | 1 | 1 | 2%

4.6 rq5: is td repayment more associated with coding issues or other issues in the software development process?

we identified 32 td repayment practices, resulting in 315 citations. of them, ~56% are related to other development issues, while ~44% are associated with code. unlike the other td management elements, these percentages differ only slightly, indicating that coding issues play a key role in td repayment initiatives. we recognized eight td repayment practices related to coding, presented in table 10. code refactoring and design refactoring are the most cited practices. both are associated with changes in the internal structure of the system without changing its external behavior. the practices solving technical issues and bug fixing focus on fixing open issues in the code. lastly, the practices using code analysis, code reviewing, and using code reuse can support teams in implementing td repayment initiatives, i.e., although these practices do not repay the debt themselves, they increase the capacity for better repayment. the remaining 24 repayment practices are related to other development issues. table 10 (third column) shows the ten most cited ones. these practices evidence several concerns in software development processes: documentation (update system documentation), organizational decisions (hiring specialized professionals), project management (increasing the project budget, monitoring and controlling project activities, negotiating deadline extension, investing effort on td repayment, and prioritizing td items), process (improving the development process and using short feedback iterations), and software quality (investing effort in testing activities).
rank | coding: repayment practice | # | other development issues: repayment practice | #
1st | code refactoring | 80 | investing effort on td repayment activities | 33
2nd | design refactoring | 25 | investing effort on testing activities | 22
3rd | adoption of good practices | 10 | prioritizing td items | 15
4th | solving technical issues | 9 | negotiating deadline extension | 14
5th | bug fixing | 6 | update system documentation | 9
6th | using code analysis | 3 | monitoring and controlling project activities | 9
7th | code reviewing | 3 | increase the project budget | 9
8th | using code reuse | 2 | improving the development process | 8
9th | - | - | hiring specialized professionals | 8
10th | - | - | using short feedback iterations | 7

table 11 presents the categories of repayment practices. td management and planning and management stand out with ~32% and ~27% of the total citations. the categories vv&t and process issues were cited by ~13% and ~12% of participants, respectively, while the others were less commonly reported.

4.7 rq6: are the reasons for not paying td more related to coding issues or other development issues?

we identified 27 reasons for not repaying td items, totaling 319 citations. of these, 99.7% are related to other development issues; only lack of access to the component code (0.3%) is associated with code. the reasons for td non-repayment thus arise from development issues other than coding.

table 11. categories of repayment practices related to other software development issues.

categories of repayment practices | #practices | #cited practices | ~%cited practices
td management | 4 | 56 | 32%
planning and management | 8 | 47 | 27%
vv&t | 1 | 22 | 13%
process issues | 5 | 21 | 12%
documentation | 1 | 9 | 6%
organizational issues | 1 | 8 | 5%
human factors | 1 | 6 | 4%
requirement engineering | 1 | 3 | 2%
infrastructure issues | 1 | 3 | 2%
design issues | 1 | 2 | 1%

table 12 shows the ten best-positioned reasons for not repaying td. the complete list is available at https://bit.ly/37bopif. we notice that the majority of the reasons (focusing on short-term goals, lack of time, cost, lack of resources, effort, the project was discontinued, complexity of the td item, and insufficient management view about td repayment) are associated with project planning and management. the others refer to external (customer decision) and human (team overload) factors.

table 12. top 10 most cited reasons for not paying off td related to other development issues.

rank | reason | #
1st | focusing on short term goals | 69
2nd | lack of organizational interest | 48
3rd | lack of time | 41
4th | cost | 34
5th | lack of resources | 19
6th | customer decision | 13
7th | complexity of the td item | 12
8th | effort | 11
9th | insufficient management view on td repayment | 10
10th | complexity of the project | 10

the reasons were also grouped into categories. planning and management issues stand out with ~58% of citations, as shown in table 13, pointing out that the reasons in this category are decisive for td non-repayment. the categories organizational issues and td management were also commonly cited, by ~16% and ~11% of the participants.

table 13. categories of reasons for td non-repayment related to other software development issues.
categories of reasons | #reasons | #cited reasons | ~%cited reasons
planning and management | 7 | 185 | 58%
organizational issues | 2 | 50 | 16%
td management | 7 | 34 | 11%
external factor | 1 | 13 | 5%
knowledge issues | 3 | 12 | 4%
human factors | 3 | 11 | 4%
architectural issues | 2 | 11 | 4%
vv&t | 1 | 2 | 1%

5 organizing the td management elements into hump diagrams

we represent the relationship between the investigated td management elements (causes, effects, prevention practices, reasons for td non-prevention, repayment practices, and reasons for td non-repayment) and software development issues in hump diagrams (figure 2). to plot results for coding and for other issues in the same hump diagram, we normalized the number of citations for an element of a specific software development issue by the total number of citations for that element. for example, prevention practices have in total 819 citations, of which 232 concern the issue planning and management. thus, the hump value for planning and management issues of prevention practices is 28% (232/819 × 100). this count is slightly different from the ones we used in tables 3, 5, 7, 9, 11, and 13 because now we consider coding as another software development issue. (a short code sketch of this normalization is given at the end of section 5.1.)

5.1 using the diagram

we can read the hump diagram horizontally and vertically. horizontally, we have a broad view of the impact of each software development issue across the td management elements. for example, in figure 2, we can notice that coding plays an important role in all the analyzed td elements, but mainly in td repayment: there is a high concentration of practices related to td repayment and, at the same time, almost none of the reasons for the non-repayment of debt items is due to coding issues. we also perceive that there are many other issues we need to be aware of when dealing with td in software projects, mainly planning and management. indeed, this is even stronger when combined with td management concerns; much about the non-repayment of td can be understood by looking at these issues. human factors also call our attention, clearly indicating that td, more than the technical aspects of software development, is also about team morale, satisfaction, motivation, communication, and commitment. other issues commonly found in several elements of td management are architectural issues, design issues, documentation, knowledge issues, process issues, requirement engineering, and vv&t.

figure 2. the hump diagram for td management elements and software development issues.

by reading the diagram vertically, we can observe the impact of all identified software development issues on each td management element. in figure 2, for example, we can observe that planning and management, organizational, and td management issues are decisive for the non-repayment of debt items. we also notice that the presence of debt items mainly impacts (effect) planning and management, quality issues, maintenance issues, human factors, and coding.

practitioners can use the hump diagram to have a comprehensive view of how td relates to several issues of their software projects, ranging from organizational to coding-level issues. moreover, for each td management element, they can go through the detailed results presented in section 4 and the auxiliary material to understand how to deal with them.
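to make the hump-value normalization described above concrete, the following minimal python sketch reproduces the computation. only the planning-and-management count (232) and the element total (819) come from the paper; the other counts and category names in the dictionary are illustrative placeholders:

```python
# hump values for one td management element (prevention practices).
# only the planning-and-management count (232) and the element total
# (819) are taken from the paper; the other counts are illustrative.
prevention_citations = {
    "planning and management": 232,
    "coding": 133,            # illustrative placeholder
    "process issues": 80,
    # ... remaining software development issues
}

element_total = 819  # total citations of prevention practices

# normalize each issue's citations by the element's total citations
hump = {issue: 100 * count / element_total
        for issue, count in prevention_citations.items()}

print(round(hump["planning and management"]))  # -> 28, as in the paper
```

applying the same normalization to every td management element yields one curve per element, which is what the hump diagram in figure 2 overlays.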
returning to the diagram: by looking at figure 2, for example, a practitioner can see that the effects of td are commonly related to coding, human factors, maintenance, quality, and planning and management issues. if they are interested in discovering more about the human factors issues, they can observe in the results and auxiliary material that team demotivation, dissatisfaction of the parties involved, and stress with stakeholders are the main concerns to be mitigated.

5.2 specializing the diagram by process models

practitioners can specialize the hump diagram for their context. to illustrate this, we organized the td management elements considering the process model used by the participants who answered the insightd questionnaire, choosing one of the following options: agile, hybrid, and traditional. figures 3, 4, and 5 present the hump diagrams for the agile, hybrid, and traditional process models, respectively. comparing them, we can notice that the diagrams for the agile and hybrid process models are only slightly different from each other, indicating that the views on the td management elements go in the same direction for these two process models. conversely, the traditional process model presents some particularities compared with the other models. for example, prevention practices are more affected by architectural, infrastructure, organizational, and requirement engineering issues in the traditional process model than in the others. reasons for td non-prevention are less affected by coding, design, documentation, human factors, knowledge, maintenance, requirement engineering, and td management in the traditional process model, while external factors and planning and management affect mainly this model.

to further understand the possible impact of different process models on the td management elements, we organized ranked lists of each td management element considering its number of citations per process model (agile, hybrid, and traditional). to verify whether there are differences between the lists, we adopted rbo (rank-biased overlap) analysis (webber et al., 2010), which quantitatively measures how similar two ranked lists are.

figure 3. the hump diagram for the agile process model.

figure 4. the hump diagram for the hybrid process model.

figure 5. the hump diagram for the traditional process model.

rbo gives a value ranging from 0 to 1: the closer this value is to 1, the greater the similarity between the lists. as rbo supports top-weighted ranked lists, the first elements of a list have more impact on the similarity index than the last ones. we can configure which elements will be compared by setting the p-value, which, unlike the statistical p-value, refers to the level of overlapping and the degree of top-weightedness. in the analysis, we chose p-values ranging from 0.5 (only the very first elements of a rank are considered) to 0.9 (almost all elements are considered). a minimal code sketch of the rbo computation is given below; the results of the comparison for each of the td management elements are presented in the following subsections.

5.2.1 comparing td causes between agile, hybrid, and traditional process models

figure 6 shows the results of the comparison between the ranked lists of causes for each process model considering (a) causes related to coding issues and (b) causes related to other software development issues.
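before turning to the individual comparisons, here is the minimal rbo sketch referred to above. it follows the truncated (finite-depth) form of the measure in webber et al. (2010) and assumes duplicate-free ranked lists; the example lists and p value are illustrative only, not data from the study:

```python
def rbo(list1, list2, p=0.9):
    """truncated rank-biased overlap (webber et al., 2010).

    sums the top-weighted agreement between two ranked lists down to
    their shared depth; p controls top-weightedness (lower p gives
    more weight to the first ranks). lists must not contain duplicates.
    """
    depth = min(len(list1), len(list2))
    seen1, seen2 = set(), set()
    overlap = 0      # size of the intersection of the two depth-d prefixes
    score = 0.0
    for d in range(1, depth + 1):
        x, y = list1[d - 1], list2[d - 1]
        if x == y:
            overlap += 1
        else:
            overlap += (x in seen2) + (y in seen1)
        seen1.add(x)
        seen2.add(y)
        agreement = overlap / d          # a_d in webber et al.'s notation
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score

# illustrative ranked lists of causes (not the study's actual data)
agile = ["non-adoption of good practices", "lack of refactoring", "sloppy code"]
traditional = ["non-adoption of good practices", "sloppy code", "lack of refactoring"]
print(rbo(agile, traditional, p=0.5))  # -> 0.75; lower p stresses the top ranks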
the rbo analysis for causes related to coding (figure 6 (a)) reveals that the similarity level is about 80-90% between the three lists. it indicates that the lists are quite similar, with little variation as more causes are included, i.e., as the p-value increases. this similarity can be perceived when we observe the top 5 ranked causes for each process model (table 14). the cause non-adoption of good practices was the most cited for all process models, while lack of refactoring, sloppy code, and adoption of contour solutions as definitive were also perceived, but in different positions. for example, lack of refactoring (agile: 2nd, hybrid: 4th, and traditional: 3rd) and sloppy code (agile: 3rd, hybrid and traditional: 2nd). further, we can see that the cause external component dependency is not perceived in the traditional process model, while lack of reuse practices is only perceived in this process model.

for causes related to other software development issues (figure 6 (b)), we can see that the rbo value is almost constant, with a similarity level of about 80-90% for the agile and hybrid process models. differently, the similarity level is about 65-80% when comparing traditional with agile/hybrid. in table 15, we can see that the cause deadline was the most cited for each process model. the agile and hybrid process models did not share the causes focus on producing more at the expense of quality and lack of experience. further, the causes inaccurate time estimate, inappropriate / poorly planned / poorly executed test, and lack of qualified professional were perceived only in the context of the traditional process model.

table 14. top 5 most cited causes related to coding issues per process model.

rank | agile | hybrid | traditional
1 | non-adoption of good practices (25) | non-adoption of good practices (23) | non-adoption of good practices (6)
2 | lack of refactoring (10) | sloppy code (8) | sloppy code (5)
3 | sloppy code (8) | external component dependency (7) | lack of refactoring (2)
4 | adoption of contour solutions as definitive (6) | lack of refactoring (5) | lack of reuse practices (2)
5 | external component dependency (4) | adoption of contour solutions as definitive (4) | adoption of contour solutions as definitive (1)

table 15. top 5 most cited causes related to other development issues per process model.

rank | agile | hybrid | traditional
1 | deadline (66) | deadline (85) | deadline (18)
2 | inappropriate planning (35) | not effective project management (53) | inaccurate time estimate (14)
3 | not effective project management (35) | inappropriate planning (38) | inappropriate / poorly planned / poorly executed test (13)
4 | lack of technical knowledge (34) | lack of technical knowledge (38) | inappropriate planning (10)
5 | focus on producing more at the expense of quality (30) | lack of experience (32) | lack of qualified professional (10)

figure 6. rbo comparing causes related to (a) coding and (b) other software development issues.

in summary, coding-related causes are perceived in the same way in the agile, hybrid, and traditional process models, while non-coding-related causes are perceived differently by those who follow traditional process models.

5.2.2 comparing td effects between agile, hybrid, and traditional process models

figure 7 shows the results of the comparison between the ranked lists of effects per process model considering (a) coding-related effects and (b) effects related to other software development issues.
the rbo analysis for effects related to coding (figure 7 (a)) reveals that the lists are quite similar, as the similarity level is about 90% between the three lists. analyzing the top 5 ranked effects of each process model (table 16), we can see this similarity. for example, the effects low maintainability and rework were the most cited for all process models, occupying the same positions in the lists. further, the effect difficulty in implementing the system is only perceived by the traditional process model, which in turn did not perceive the effect need for refactoring.

table 16. top 5 most cited effects related to coding issues per process model.

rank | agile | hybrid | traditional
1 | low maintainability (40) | low maintainability (43) | low maintainability (14)
2 | rework (39) | rework (35) | rework (12)
3 | need for refactoring (19) | bad code (17) | bad code (5)
4 | low performance (14) | need for refactoring (14) | low performance (4)
5 | bad code (9) | low performance (10) | difficulty in implementing the system (3)

regarding the effects related to other software development issues (figure 7 (b)), the similarity level is almost 100% for the first effects in the agile and hybrid lists. it means that these process models have the same view on the most critical effects of td, but this similarity level decreases when more effects are considered. table 17 presents the top 5 ranked effects per process model. we can see that the effect delivery delay was the most perceived effect in all process models. besides, the effects in the agile and hybrid lists are quite the same, except for team demotivation and stakeholder dissatisfaction. although the effect design problems is only perceived in the context of traditional process models, the other effects (financial loss, low external quality, and team demotivation) are also present in the other two lists.

table 17. top 5 most cited effects related to other development issues per process model.

rank | agile | hybrid | traditional
1 | delivery delay (51) | delivery delay (69) | delivery delay (21)
2 | low external quality (34) | low external quality (36) | financial loss (10)
3 | financial loss (20) | financial loss (25) | low external quality (8)
4 | increased effort (18) | increased effort (20) | team demotivation (5)
5 | team demotivation (13) | stakeholder dissatisfaction (19) | design problems (3)

figure 7. rbo comparing effects related to (a) coding and (b) other software development issues.

in conclusion, the agile, hybrid, and traditional process models are related to almost the same coding-related effects. this also applies to non-coding-related effects.

5.2.3 comparing td preventive practices between agile, hybrid, and traditional process models

figure 8 shows the results of the comparison between the ranked lists of preventive practices per process model considering (a) preventive practices related to coding and (b) those related to other software development issues. the rbo analysis for preventive practices related to coding (figure 8 (a)) reveals that the lists are different; the similarity level is about 60-80% between the three lists. in table 18, we can see that while the preventive practice adoption of good practices was the most used practice in all process models, the other practices were not shared by all process models.
for example, using good design practices, refactoring, and considering technical constraints are only present in the context of the agile process model, while use the most appropriate version of the technology and bug tracking are only related to the traditional process model.

table 18. top 5 preventive practices related to coding issues per process model.

rank | agile | hybrid | traditional
1 | adoption of good practices (18) | adoption of good practices (25) | adoption of good practices (6)
2 | using good design practices (13) | appropriate reusing of code (3) | increase time for analysis and design (2)
3 | refactoring (8) | code review (2) | use the most appropriate version of the technology (2)
4 | code review (7) | improving the maintainability of the project (4) | appropriate reusing of code (1)
5 | considering technical constraints (4) | increase time for analysis and design (3) | bug tracking (1)

concerning the preventive practices related to other software development issues, the similarity level is 70-80% (figure 8 (b)), indicating that the lists are also different. in table 19, we can see that the preventive practice well-defined requirements was present in all process models, but the others were not shared by all process models. for instance, well-defined architecture, creating tests, and improve documentation were only used by traditional process models.

in summary, the agile, hybrid, and traditional process models did not share the same view on preventive practices, regardless of whether they are related to coding or not.

table 19. top 5 most cited preventive practices related to other development issues per process model.

rank | agile | hybrid | traditional
1 | well-defined requirements (21) | well-defined requirements (26) | well-defined requirements (10)
2 | following the project planning (17) | better project management (22) | well-defined architecture (6)
3 | better project management (16) | training (18) | better project management (5)
4 | training (13) | improving software development process (17) | creating tests (5)
5 | better project planning (12) | well planned deadlines (14) | improve documentation (5)

figure 8. rbo comparing preventive practices related to (a) coding and (b) other software development issues.

5.2.4 comparing reasons for td non-prevention between agile, hybrid, and traditional process models

figure 9 (a) shows the rbo result considering the lists of coding-related reasons for td non-prevention of the agile and hybrid process models. we did not consider traditional process models because their practitioners did not mention any reason for td non-prevention. analyzing the figure, we can see that the similarity level is 10-30%, indicating that agile and hybrid did not share the same vision on reasons for td non-prevention. this low similarity level is also perceived when we compare the lists of reasons for td non-prevention, as shown in table 20.

table 20. top 5 most cited reasons for td non-prevention related to coding issues per process model.

rank | agile | hybrid
1 | lack of technical knowledge (2) | lack of good technical solutions (2)
2 | lack of concern about maintainability (1) | continuous change of coding standards (1)
3 | - | lack of concern about maintainability (1)
4 | - | lack of technical knowledge (1)

regarding the reasons for td non-prevention related to other software development issues, figure 9 (b) shows that the similarity level is about 80-90% for the most cited reasons in the agile and hybrid process models.
however, this value decreases, reaching about 55%, when the full list of reasons is considered. traditional process models did not share the same view on reasons for td non-prevention, as the similarity level is about 30-50%. this low similarity level can be perceived when we analyze the five most cited reasons for td non-prevention (table 21).

in conclusion, the agile and hybrid process models did not share the same vision on coding-related reasons for td non-prevention, but these models have the same view on the most cited non-coding-related reasons. traditional process models did not share the same non-coding-related reasons with the agile and hybrid process models.

table 21. top 5 most cited reasons for td non-prevention related to other development issues per process model.

rank | agile | hybrid | traditional
1 | short deadline (7) | short deadline (5) | pressure for results (2)
2 | ineffective management (3) | ineffective management (3) | short deadline (2)
3 | lack of predictability in the software development (3) | lack of predictability in the software development (2) | ineffective management (1)
4 | requirements change (3) | legacy system difficult to heal (2) | lack of process maturity (1)
5 | architectural evolution (1) | requirements change (2) | -

figure 9. rbo comparing reasons for td non-prevention related to (a) coding and (b) other software development issues.

5.2.5 comparing td repayment practices between agile, hybrid, and traditional process models

figure 10 (a) and table 22 show the rbo result considering the lists of repayment practices related to coding for each process model. we can see that the agile, hybrid, and traditional process models share the same view on repayment practices; the similarity level varies between 80-90%.

concerning the repayment practices related to other software development issues, figure 10 (b) shows the comparison for the three process models. the agile and hybrid process models have used almost the same practices (the similarity level is about 80-90%). on the contrary, the similarity level when comparing the traditional process model with the other two is slightly lower, about 70-80%, for the top 5 ranked elements of their lists, as noticed in table 23.

figure 10. rbo comparing repayment practices related to (a) coding and (b) other software development issues.

table 22. top 5 most cited repayment practices related to coding issues per process model.

rank | agile | hybrid | traditional
1 | code refactoring (38) | code refactoring (37) | code refactoring (5)
2 | design refactoring (14) | design refactoring (7) | design refactoring (4)
3 | adoption of good practices (6) | adoption of good practices (4) | bug fixing (1)
4 | solving technical issues (6) | bug fixing (3) | solving technical issues (1)
5 | code reviewing (3) | solving technical issues (2) | -

table 23. top 5 most cited repayment practices related to other development issues per process model.
rank | agile | hybrid | traditional
1 | investing effort on td repayment activities (13) | investing effort on td repayment activities (16) | investing effort on td repayment activities (4)
2 | investing effort on testing activities (12) | investing effort on testing activities (7) | increasing the project budget (4)
3 | prioritizing td items (9) | negotiating deadline extension (6) | negotiating deadline extension (4)
4 | using short feedback iterations (5) | prioritizing td items (6) | investing effort on testing activities (3)
5 | implementing preventive actions for avoiding td (4) | changing project scope (4) | update system documentation (3)

practitioners using the agile, hybrid, and traditional process models have shared almost the same experience with repayment practices related to coding, but this scenario differs for repayment practices related to other software development issues when considering the context of traditional process models.

5.2.6 comparing reasons for td non-repayment between agile, hybrid, and traditional process models

figure 11 presents the rbo result considering the lists of non-coding-related reasons for td non-repayment. we did not perform the analysis for coding-related reasons for td non-repayment because only one reason (lack of access to the component code) was cited by the participants. analyzing the figure, we can see that the similarity level is around 80-90%, indicating that practitioners have the same view on non-coding-related reasons for td non-repayment. in table 24, we can observe that the reasons focusing on short term goals and lack of organizational interest were the most used reasons for explaining td non-repayment. besides, the other reasons are also very similar among the process models. in summary, practitioners using the agile, hybrid, and traditional process models share the same view on non-coding-related reasons for td non-repayment.

figure 11. rbo comparing reasons for td non-repayment related to other software development issues.

table 24. top 5 most cited reasons for td non-repayment related to other development issues per process model.

rank | agile | hybrid | traditional
1 | focusing on short term goals (28) | focusing on short term goals (32) | focusing on short term goals (9)
2 | lack of organizational interest (20) | lack of organizational interest (21) | lack of organizational interest (7)
3 | lack of time (16) | lack of time (20) | cost (5)
4 | cost (13) | cost (16) | lack of time (5)
5 | effort (7) | lack of resources (13) | lack of technical knowledge (3)

6 discussion

this section presents an overview of the findings and discusses their implications for practitioners and researchers.

6.1 summary of findings

the results indicate that coding issues related to the causes, effects, prevention, non-prevention, repayment, and non-repayment of td are only a small part of the concerns that practitioners face in the presence of td. indeed, td has been more commonly associated with other software development issues. the radar graph presented in figure 12 shows the percentage distribution of the participants' responses for each of the investigated elements across the categories coding issues and other software development issues. for every investigated element, most of the responses are related to other software development issues. the difference is considerably larger for the elements causes, prevention, reasons for not preventing, and reasons for not repaying. the values for td repayment are very close between the two groups (56% vs 44%).
this is an indication that, although practitioners perceive that td is ubiquitous in software development projects, they also see that its repayment is commonly related to coding issues.

figure 12. distribution of the participants' answers on the td management elements.

we organized the td management elements into categories. the category planning and management concentrated the biggest number of citations of causes, effects, preventive practices, reasons for td non-prevention, and reasons for td non-repayment. alternatively, the category td management has the biggest quantity of repayment practice citations.

all identified categories of each td management element were represented in a hump diagram. by analyzing the diagram, practitioners can perceive the influence of each td management element on a specific issue associated with the software development process. these issues correspond to the categories defined in this study. besides, practitioners can specialize the diagrams according to their project context. to illustrate this, we specialized the hump diagram for the agile, hybrid, and traditional process models and compared them with each other. from the comparison, we noticed that the agile and hybrid process models share the same point of view on the td management elements analyzed in this work. on the other hand, practitioners who adopted traditional process models tend to have a different view on these elements. strategies defined to support td management initiatives must consider the specificities of each process model.

6.2 implications for researchers and practitioners

the hump diagram can guide practitioners, showing how each software development issue is related to each td management element. having this information, practitioners can define strategies to mitigate causes, effects, reasons for td non-prevention, or reasons for td non-repayment. also, the combined use of the hump diagram and the detailed results, presented in section 4 and available at https://bit.ly/37bopif, provides comprehensive guidance for software development teams about what to expect from the presence of td and how to react to it considering several software development issues.

for example, practitioners can diagnose the causes of td by consulting the hump diagram. since causes from the category planning and management are more common in agile software projects, if an agile team has defined preventive practices for these causes and still identifies new causes, the team can, by analyzing the diagram, focus on other categories that are common in the agile process, such as human factors.

practitioners can also identify preventive practices to avoid td items in their projects. suppose a traditional team has applied all preventive practices from the category planning and management (the one with the highest concentration of practices), but the team still feels the effects of td. by analyzing the hump diagram, the team can apply preventive practices from other categories, such as requirement engineering and verification, validation, and test.

for researchers, our results point out the need to invest more research effort in non-coding issues of software development. for example, complementary to understanding td at the code level, it is also necessary to investigate strategies to mitigate the managerial reasons that lead software teams to not repay debt items.
another promising topic for investigation would be the relationship between human factors of software development and td. for practitioners and researchers, the results of the rbo analyses bring to the fore the need to further investigate practitioners' perceptions of the td management elements. this investigation may reveal differences that can be used to develop methods, techniques, and tools better suited to professionals' needs. for example, our findings reveal that agile and traditional processes consider td prevention differently. before developing a td prevention strategy, researchers may investigate agile software development characteristics that influence td prevention. also, agile practitioners can learn from traditional practitioners by identifying the differences in perceptions concerning td prevention.

7 threats to validity

as in any empirical study, there are threats to validity in this work. we attempted to remove them when possible and to mitigate their effects when removal was not possible.

the main threat to the validity of the conclusions is related to the coding process, as it is a creative process. to mitigate it, the analyses were carried out separately by two researchers, and consensus was reached with a third, more experienced one. also, additional procedures were adopted to seek consistency in the nomenclature used by each replication team during their coding activities. lastly, the classification of the coded td management elements into code/non-code, as well as the definition of their categories, are essentially subjective tasks. to mitigate them, we followed a rigorous analysis procedure: the classification was always performed individually by two researchers and reviewed by at least one experienced researcher.

another threat is related to the specialization of the hump diagrams per process model. to this end, we relied on the participants' responses to question q8 of the insightd questionnaire, which explicitly states the definition of the three categories of processes considered in this research (agile, hybrid, and traditional).

the questionnaire was designed to eliminate threats to internal validity. as discussed in rios et al. (2020), the questionnaire went through a series of validations (three internal and one external) and a pilot study to identify any issues before its execution. it is also worth mentioning that the participants could act differently from what they usually do because they are part of a study. to avoid this, we clearly explained the purpose of the study and asked participants to answer the questions based on their own experience. we also stated explicitly that the questionnaire is anonymous and that the data collected is analyzed without considering the identity of the participants.

also, participants may have misinterpreted the use of the terms prevention and repayment of td. to investigate whether this threat manifested, all responses on how participants avoided and repaid the debt item (q23 and q27) were analyzed to check for invalid answers. a high proportion of invalid responses would mean that the questions could be misinterpreted. in the end, we did not identify any invalid response, indicating that this threat did not appear in the study.
lastly, external validity threats were reduced by targeting industry professionals and seeking to achieve participant diversity among survey respondents. in search of more generalizable results, insightd is being replicated in other countries.

8 concluding remarks

in this paper, we investigated the relation between td management elements (causes, effects, preventive practices, repayment practices, reasons for td non-prevention, and reasons for td non-repayment) and software development issues related to coding or other activities. also, we categorized these elements and organized them into hump diagrams. further, we defined a hump diagram for each process model (agile, hybrid, and traditional) to demonstrate how the diagram can be specialized by practitioners following one of their project's variables, such as process model and role.

the next steps of this work include (i) investigating whether the type of debt impacts how practitioners see td management elements, (ii) developing a td management instrument encompassing the hump diagram and the detailed results, and (iii) empirically assessing this instrument in supporting td management. we also intend to investigate the main human factors associated with td.

acknowledgements

this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001 and the conselho nacional de desenvolvimento científico e tecnológico - cnpq. this research was also supported in part by funds received from the david a. wilson award for excellence in teaching and learning, which was created by the laureate international universities network to support research focused on teaching and learning.

references

alves, n.s.r., mendes, t.s., mendonça, m.g., spínola, r., shull, f., & seaman, c. (2016). identification and management of technical debt: a systematic mapping study. information and software technology, 70, 100-121. doi: https://doi.org/10.1016/j.infsof.2015.10.008.

berenguer, c., borges, a., freire, s., rios, n., tausan, n., ramac, r., pérez, b., castellanos, c., correal, d., pacheco, a., lópez, g., falessi, d., seaman, c., mandic, v., izurieta, c., & spínola, r. (2021). technical debt is not only about code and we need to be aware about it. in proceedings of the xx brazilian symposium on software quality (sbqs '21). acm, new york, ny, usa, 1-12. doi: https://doi.org/10.1145/3493244.3493285.

besker, t., ghanbari, h., martini, a., & bosch, j. (2020). the influence of technical debt on software developer morale. journal of systems and software, 167. doi: https://doi.org/10.1016/j.jss.2020.110586.

cunningham, w. (1992). the wycash portfolio management system. acm sigplan oops messenger, 4, 2 (april 1993), 29-30. doi: https://doi.org/10.1145/157710.157715.

freire, s., rios, n., mendonça, m., falessi, d., seaman, c., izurieta, c., & spínola, r. (2020a). actions and impediments for technical debt prevention: results from a global family of industrial surveys. in proceedings of the 35th acm/sigapp symposium on applied computing, brno, 1548-1555.

freire, s., rios, n., gutierrez, b., torres, d., mendonça, m., izurieta, c., seaman, c., & spínola, r. (2020b). surveying software practitioners on technical debt payment practices and reasons for not paying off debt items. in proceedings of the evaluation and assessment in software engineering. trondheim, 210-219.
freire, s., rios, n., perez, b., castellanos, c., correal, d., ramac, r., mandic, v., tausan, n., pacheco, a., lópez, g., mendonça, m., izurieta, c., falessi, d., seaman, c., & spínola, r. (2021a). pitfalls and solutions for technical debt management in agile software projects. ieee software, vol. 38, no. 6, pp. 42-49, nov.-dec. 2021. doi: 10.1109/ms.2021.3101990.

freire, s., rios, n., perez, b., castellanos, c., correal, d., ramac, r., mandic, v., tausan, n., lópez, g., pacheco, a., falessi, d., mendonça, m., izurieta, c., seaman, c., & spínola, r. (2021b). how experience impacts practitioners' perception of causes and effects of technical debt. in proceedings of the ieee/acm 13th international workshop on cooperative and human aspects of software engineering (chase). doi: 10.1109/chase52884.2021.00011.

freire, s., rios, n., pérez, b., correal, d., mendonça, m., izurieta, c., seaman, c., & spínola, r. (2021c). how do technical debt payment practices relate to the effects of the presence of debt items in software projects? in proceedings of the ieee international conference on software analysis, evolution and reengineering (saner). doi: 10.1109/saner50967.2021.00074.

guo, y., spínola, r.o., & seaman, c. (2016). exploring the costs of technical debt management - a case study. empirical software engineering, 21, 1 (february 2016), 159-182. doi: https://doi.org/10.1007/s10664-014-9351-7.

izurieta, c., vetrò, a., zazworka, n., cai, y., seaman, c., & shull, f. (2012). organizing the technical debt landscape. in proceedings of the 3rd international workshop on managing technical debt (mtd). zurich, 23-26. doi: https://doi.org/10.1109/mtd.2012.6225995.

lenarduzzi, v., besker, t., taibi, d., martini, a., & fontana, f.a. (2021). a systematic literature review on technical debt prioritization: strategies, processes, factors, and tools. journal of systems and software, 171, 110827.

li, z., avgeriou, p., & liang, p. (2015). a systematic mapping study on technical debt and its management. journal of systems and software, 101, 193-220. doi: https://doi.org/10.1016/j.jss.2014.12.027.

lim, e., taksande, n., & seaman, c. (2012). a balancing act: what software practitioners have to say about technical debt. ieee software, 29, 6 (november 2012), 22-27. doi: https://doi.org/10.1109/ms.2012.130.

martini, a., stray, v., & moe, n.b. (2019). technical-, social- and process debt in large-scale agile: an exploratory case-study. in proceedings of the international conference on agile software development (pp. 112-119). springer, cham.

ramač, r., mandić, v., taušan, n., rios, n., freire, s., pérez, b., castellanos, c., correal, d., pacheco, a., lopez, g., izurieta, c., seaman, c., & spinola, r. (2022). prevalence, common causes and effects of technical debt: results from a family of surveys with the it industry. journal of systems and software, 184, 111114. doi: https://doi.org/10.1016/j.jss.2021.111114.

ribeiro, l.f., farias, m.a.f., mendonça, m., & spínola, r.o. (2016). decision criteria for the payment of technical debt in software projects: a systematic mapping study. in proceedings of the 18th international conference on enterprise information systems (iceis). doi: https://doi.org/10.5220/0005914605720579.

rios, n., freire, s., pérez, b., castellanos, c., correal, d., mendonça, m., falessi, d., izurieta, c., seaman, c., & spínola, r. (2021).
on the relationship between technical debt management and process models. ieee software.

rios, n., mendonça, m., & spínola, r. (2018). a tertiary study on technical debt: types, management strategies, research trends, and base information for practitioners. information and software technology, 102, 117-145. doi: https://doi.org/10.1016/j.infsof.2018.05.010.

rios, n., spínola, r.o., mendonça, m., & seaman, c. (2019). supporting analysis of technical debt causes and effects with cross-company probabilistic cause-effect diagrams. in proceedings of the ieee/acm international conference on technical debt (techdebt). doi: https://doi.org/10.1109/techdebt.2019.00009.

rios, n., spínola, r.o., mendonça, m., & seaman, c. (2020). the practitioners' point of view on the concept of technical debt and its causes and consequences: a design for a global family of industrial surveys and its first results from brazil. empirical software engineering, 25, 3216-3287.

saraiva, d., neto, j.g., kulesza, u., freitas, g., reboucas, r., & coelho, r. (2021). technical debt tools: a systematic mapping study. in proceedings of the 23rd international conference on enterprise information systems. doi: 10.5220/0010459100880098.

strauss, a. & corbin, j. (1998). basics of qualitative research: techniques and procedures for developing grounded theory. sage publications.

tamburri, d.a., kruchten, p., lago, p., & van vliet, h. (2015). social debt in software engineering: insights from industry. journal of internet services and applications, 6(1), 1-17.

webber, w., moffat, a., & zobel, j. (2010). a similarity measure for indefinite rankings. acm transactions on information systems, vol. 28, no. 4.

wohlin, c., runeson, p., host, m., ohlsson, m.c., regnell, b., & wesslen, a. (2012). experimentation in software engineering: an introduction. springer.

zazworka, n., vetro', a., izurieta, c., wong, s., cai, y., seaman, c., & shull, f. (2014). comparing four approaches for technical debt identification. software quality journal, 22, 403-426. doi: https://doi.org/10.1007/s11219-013-9200-8.
journal of software engineering research and development, 2021, 9:15, doi: 10.5753/jserd.2021.1944 this work is licensed under a creative commons attribution 4.0 international license.

software process improvement programs: what are the pitfalls that lead to abandonment?

regina albuquerque [ pontifícia universidade católica do paraná | regina.fabia@pucpr.br ]
gleison santos [ universidade federal do estado do rio de janeiro | gleison.santos@uniriotec.br ]
andreia malucelli [ pontifícia universidade católica do paraná | malu@ppgia.pucpr.br ]
sheila reinehr [ pontifícia universidade católica do paraná | sheila.reinehr@pucpr.br ]

abstract

while many organizations successfully embrace and experience software process improvement (spi) benefits, others abandon the effort before realizing the total potential result of an spi initiative. therefore, researchers' interest has increased in understanding why software organizations that have a successful start in adopting spi abandon improvement initiatives after evaluation. thus, this work aims to investigate how the abandonment of spi programs based on maturity models occurs after the evaluation. the multiple case study method was used with eight organizations. data were analyzed using grounded theory open and axial coding procedures. the results show that spi initiatives failed because of factors internal to the organizational context (people, spi project management, organizational aspects, and processes) and external factors (country economic crisis, outsourcing, governmental political influence, and external pressure from the client). as a contribution, we highlight the identification of these factors, which organizations can use to learn about their initiatives and avoid pitfalls that can lead to the abandonment of spi.

keywords: software and its engineering, software quality, software process improvement, abandonment of software process improvement

1 introduction

software organizations operate in a highly competitive market that demands quality and productivity (canedo et al., 2019). in this sense, software process improvement (spi) aims to offer insights into the software process as it is used within organizations and, thus, lead to the implementation of changes to achieve specific objectives, such as increasing product quality or reducing cost and development time (coleman et al., 2008). several process improvement support models have gained ground in the software industry, such as cmmi-dev (cmmi institute, 2018) and iso/iec 33020 (iso/iec, 2015). in brazil, where this research was conducted, the model resulting from the mps.br (brazilian program for software process improvement) is primarily used.
mps.br is a mobilizing, long-term program that aims to define software and service process improvement and assessment models, targeting primarily micro, small, and medium-sized enterprises to meet business needs (softex, 2020). the mr-mps-sw (brazilian reference model for software process improvement) is structured in seven evolving maturity levels. they combine processes, which are based on iso/iec 12207 (iso/iec, 2017) and compatible with cmmi-dev (cmmi institute, 2018), and their capabilities, which are based on iso/iec 33020 (iso/iec, 2015). the maturity levels establish thresholds of process evolution that characterize improvement stages for spi implementation in software organizations. the maturity evolution begins with level g and progresses up to level a (softex, 2020). to qualify their processes, organizations must undergo an official assessment, which is valid for three years.

previous studies have reported benefits such as higher customer satisfaction, cost reduction, greater predictability of costs and deadlines, and increased productivity and quality (kalinowski et al., 2010). until april 2021, 816 assessments had been successfully completed (http://www.softex.br/). many organizations were assessed at the initial levels g (55%) and f (31%). only 14% of the assessments are associated with the upper levels (level e: 4%, level c: 9%, and level a: 1%), which signifies that progress generally stops at level f. that suggests that most organizations either abandon their spi programs or maintain compliance with the maturity level requirements without undertaking renewal appraisals. therefore, an important question arises: if companies achieve benefits by improving software processes, why do they abandon spi programs?

our previous research has pointed to organizational, human, and process-related issues (albuquerque et al., 2018). other research studies have sought to gather further information on maintaining process practitioners' participation after the appraisal period (uskarci et al., 2017). nalepa et al. (2019) and fontana et al. (2015) have found a different way for organizations that use agile methods to mature. understanding how companies continue to improve their processes after an appraisal is relevant to the software industry, which still faces challenges posed by time and budget constraints that may hinder the continuation of spi initiatives.

given this context, the aim of this study is to understand how abandonment occurs in spi programs after a successful assessment based on maturity models. to accomplish this objective, we conducted case studies in eight brazilian software companies. data were analyzed using open and axial coding procedures from grounded theory (strauss and corbin, 1998). spi managers can use the results of this research to avoid the pitfalls that can lead to abandoning the spi initiative. results from four of these organizations were published in albuquerque et al. (2020). the main contribution of the present paper is the confirmation that factors internal to the organization (human, organizational, spi project, and processes) and factors external to the organization (the economic crisis of the country), when neglected, can cause the abandonment of spi. in addition, new results emerged, such as lack of external demand for evaluation¹, dissolution of the company, merger of companies, and adherence to agile methodologies.

¹ in some parts of this text, the term certification is used to mean evaluation, especially in the transcriptions of the interviews.
the paper is organized into six sections besides this introduction: section 2 presents related works; section 3 describes the research method; section 4 reports the results; section 5 presents the discussion; section 6 presents threats to validity; section 7 presents the final considerations.

2 related works

software process improvement (spi) is an approach that has attracted the interest of software companies because it promises to increase quality and decrease costs and project deadlines (coleman et al., 2008). while many organizations successfully adopt and experience the benefits of spi (kalinowski et al., 2010), others abandon the effort before realizing the potential of spi benefits (albuquerque et al., 2018). therefore, there is an interest in understanding why these companies abandon such improvement initiatives.

almeida et al. (2011) have identified factors that can affect continued adherence to the software process in an organization, focusing on software processes assessed using mr-mps-sw as a basis. the results of their study were classified into four factors: technical factors, sociocultural factors, resources, and commitment. besides, they have shown that project management processes are challenging to maintain in the routine of companies.

uskarci et al. (2017) sought to identify the problems of continuity and participation in software process improvement activities in two cmmi-dev level 3 companies in turkey. they identified higher submission rates of suggestions for improving the process when the assessment date was approaching and lower rates when the assessment was completed. besides, the employees' participation in these activities and their prospects for process improvement are highly dependent on their role within the organization. the authors identified greater involvement of employees in the quality group and process group. on the other hand, practitioners of the process are reluctant to suggest improvements in the process.

albuquerque et al. (2018) present a survey conducted in brazil to identify which factors (based on a systematic literature review) can lead to the maintenance or abandonment of spi programs. the interviewees comprised specialists in spi (consultants and appraisers of the cmmi-dev and mr-mps-sw models). results indicate that spi program continuation is positively influenced by human factors (motivation and acceptance; support, commitment, and involvement; technical and personal competencies), the spi project itself (definition of strategies; resources; adequate external consultancy service), organizational factors (communication; goals; organizational structure; internal and external policies; return on investment and leadership), consultancy, and processes.

albuquerque et al. (2019) investigated how organizations using agile methods evolved their processes after the maturity model assessment. the unit of analysis of the case study was four privately owned software organizations that had been assessed with the mr-mps-sw model and that used agile methods. results showed that companies using agile methods have difficulties in implementing spi initiatives with maturity models. it was found that processes based on maturity models were partially abandoned and that project management practices are the most difficult to maintain, confirming the results found by uskarci et al. (2017).
according to anastassiu et al. (2020), resistance negatively affects spi, both in implementation and in maintenance. they conducted a qualitative study on the causes and effects of change resistance in spi initiatives and on procedures to mitigate resistance, interviewing 21 professionals and specialists in improving software processes. the authors identified 32 causes of resistance, 16 effects, and 29 behaviors related to resistance to change. among the results, it is worth highlighting the effects that resistance creates in spi initiatives:

▪ ef01: rejection of resistant members who boycott the process;
▪ ef02: the firing of members resistant to change and/or to following the process;
▪ ef03: demotivation of the process team due to the resistance of its executors;
▪ ef04: compromised improvement project goals;
▪ ef05: use of bypass solutions;
▪ ef06: abandonment of the process;
▪ ef07: real improvements are not achieved;
▪ ef08: demotivation due to the difficulty in changing the culture;
▪ ef09: skepticism due to the difficulty in changing the culture;
▪ ef10: resignation from employment because of the difficulty in changing the culture;
▪ ef11: inappropriate attitudes (rebellious and deceitful) by some of the leaders;
▪ ef12: feeling of isolation in the organization;
▪ ef13: submission by fear by middle management and executors of the process;
▪ ef14: bad influence on new hires;
▪ ef15: one-off and non-continuous improvements;
▪ ef16: fear of job loss.

although previous studies have provided information on the post-assessment phase, they have limitations for not addressing the abandonment of spi. it is crucial for organizations interested in adopting spi to know which causes can lead to spi failure in order to avoid or mitigate these risks. for example, almeida et al. (2011) and uskarci et al. (2017) reported results from organizations with valid official assessments. albuquerque et al. (2018) reported a survey with spi specialists, and anastassiu et al. (2020) a qualitative study with spi specialists. although these specialists' point of view is relevant, it is essential to conduct qualitative research to identify, from the organizations' point of view, how human, organizational, spi project, and process factors influence the continuity of spi initiatives. albuquerque et al. (2019) presented the difficulty of agile companies in sustaining spi programs using maturity models. however, there is a lack of information about the challenges of organizations whose official assessments are overdue. to understand this topic, it is essential to conduct qualitative research in different contexts and from the organizations' perspective.

3 research method

this paper addresses the following research question: rq: how does abandonment occur in software process improvement programs? to answer the question, we conducted a case study in eight software organizations. yin (2017) states that when the research aims at answering a "how" question, a case study is a method that offers the response. in case studies, the definition of propositions guides data collection and analysis. they also help to accomplish the research objective. based on the literature (albuquerque et al., 2018; almeida et al., 2011; albuquerque et al., 2019; uskarci et al., 2017), the following propositions were defined:

▪ p1. there are human factors that influence the abandonment of the spi program.
▪ p3. there are organizational factors that influence the abandonment of the spi program.
▪ p4. there are process-related factors that influence the abandonment of the spi program.

3.1 context

the unit of analysis, also called a case, is a software organization that was assessed with the mr-mps-sw model and has not carried out new assessments. an organization was considered to be abandoning spi when it reported no longer using the processes (organizations 4 and 8) or only partially using them (organizations 1, 2, 3, 5, 6, and 7). we carried out the case study in eight software organizations with different profiles, as shown in table 1. organizations of various sizes participated in this research: small (2 and 7), medium (3 and 8), large (1, 4, and 6), and micro-enterprise (5). only organization 1 is from the public sector. regarding the main activities, organizations 1, 4, and 8 maintain software products and develop custom software. organizations 2, 3, 5, and 7 perform maintenance on software products. organization 6 develops software and offers software services.

table 1. profile of the studied companies.

org. | size | origin of capital | main activity | participates in bidding | federal grant | maturity level | validity of the assessment
1 | +300 employees | public | ict | no | no | g | june 2016
2 | +40 employees | private | erp product | no | yes | f | january 2017
3 | +80 employees | private | erp product | yes | no | c | november 2018
4 | +100 employees | private | custom/embedded software | no | yes | e | may 2018
5 | 5 employees | private | erp product | no | yes | g | november 2015
6 | +270 employees | private | software factory/services | yes | no | c | january 2020
7 | +30 employees | private | erp product | no | yes | f | august 2019
8 | +50 employees | private | erp product/software factory | yes | yes | f | september 2015

only organizations 3, 6, and 8 participate in government bids. it is worth clarifying that in brazil, the federal government launches bids to carry out software projects. some of them require the company to have a valid assessment compliant with a quality model or standard. therefore, a company that has a maturity model evaluation can achieve a higher score than its competitors. to incentivize organizations to improve their processes, softex developed a business model that offered some financial support for organizations with fewer than 100 employees.
organizations that were interested in implementing the reference models of the mps.br program could receive financial support from the mct (ministry of science and technology) or from sebrae (support service for micro and small companies) (softex, 2020). regarding the federal grant, organizations 2, 4, 5, 7, and 8 received this benefit. table 1 also shows the mps-sw maturity level that each organization accomplished in its last evaluation. the study was conducted with two level g organizations (1 and 5), three level f organizations (2, 7, and 8), two level c organizations (3 and 6), and one level e organization (4).

3.2 data collection

for data collection, we sent the organizations a letter of introduction explaining the research objectives, together with a non-disclosure agreement (nda) signed by the researchers. to obtain the vision of different software development roles, we interviewed people in management positions (sponsor, director, project manager, process improvement team, and quality assurance) and software engineers (analysts, developers, and testers). table 2 shows the participants' profiles.

table 2. profile of the participants.

org. | participants
1 | 1 sponsor; 1 spi manager; 3 project managers; 1 coordinator of project managers; 2 quality assurance analysts; 4 analysts and developers (acting in both roles)
2 | 1 sponsor; 1 project manager; 1 development director; 3 analysts and developers (acting in both roles)
3 | 1 quality assurance manager
4 | 1 process manager
5 | 1 sponsor
6 | 1 sponsor; 1 human resources manager
7 | 1 sponsor; 1 project manager; 1 quality assurance manager
8 | 1 sponsor

as shown in table 2, in some organizations, due to high turnover, only one person who took part in the spi initiative was still in the company to be interviewed. we built a semi-structured script to guide the interviews. the questionnaire consisted of two sets of questions: one to characterize the organization and interviewee profiles, and the other about spi, aiming to gather information about the challenges faced after the company's evaluation and the strategies to deal with these challenges. the second part also helped to obtain information about the processes considered challenging to continue after the assessment. the following questions were used as a semi-structured interview script to guide the researcher. it is worth noticing that the questions asked in the field were broader to allow higher data coverage and richer answers. the questions supported the researcher while conducting the semi-structured interview, acting more as a checklist than as a fixed route:

part 1: characterization questions
▪ can you describe the organization in terms of business and culture?
▪ what position do you currently hold in the organization?
▪ how long have you worked in the organization?
▪ what is your academic background?

part 2: questions about spi
▪ how is top management involved, and which support is offered to the spi program?
▪ what is your perception of the involvement and support of the technical team in the spi program?
▪ is there an ongoing investment in training? which training is offered?
▪ how have the improvement program activities changed your development activities? are the activities easier or harder to work with?
▪ is there a specific budget for the spi project (hours, staff, infrastructure)? how is the spi project structured in terms of infrastructure (environment and tools) and staff?
▪ how are changes in the organization's development process made? who defines the process activities, and who determines how they are executed? how are the changes introduced in the projects?
▪ how did the consultant evaluate the company's previous process before defining the current process? how do you evaluate the external consultancy's performance during the improvement model's implementation period (hours of service, relationship, competence)?
▪ is the company interested in renewing or evolving its maturity level? why (not)? how is the spi program aligned with the organization's strategic planning? how are these business goals monitored in the organization?
▪ is there a software engineering process group (sepg) to lead process improvement implementations? what is the composition of this group? how are the activities of the sepg conducted (meetings, periodicity)? what is the degree of influence of this group on the company's other groups regarding knowledge, reputation, and relationships?
▪ how constant is the organization's project flow? how are the roles and responsibilities shared within the organization? is turnover an issue in the organization? how is it avoided?
▪ how are the improvement project goals communicated to the employees?
▪ how is day-by-day communication performed in the spi project? how are the results of the spi project communicated to the employees?
▪ how are the processes used in the organization? are they used in all areas and projects?
▪ which processes are most challenging to maintain? why?
▪ which processes are more natural to maintain? why?
▪ are there performance indicators for the spi project?
▪ how is the return on investment (roi) of the spi project measured (for instance, product quality, customer satisfaction, market expansion, estimates, cost, and term)? how are the process activities monitored (i.e., detection of nonconformities and their solution)?

3.3 data analysis

yin (2017) guides the researcher to define the logic that links the data to the study's propositions and the criteria to interpret the results. in this research, we used the model proposed by reinehr et al. (2008), which defines points of analysis (pa) that support, based on concepts from the literature review, the evaluation of whether a proposition is confirmed in order to answer the main research question. table 3 shows the defined research propositions and the related points of analysis. the propositions, as previously explained, are statements of what the researchers expect to find in the field study, based on the previous literature. the points of analysis are the connection between the data collected in the field and the analysis of the propositions. the theoretical basis for constructing these research elements (propositions and points of analysis) was the related work presented in section 2, a systematic literature review, and the survey carried out with spi specialists presented in albuquerque et al. (2018). the categories of critical factors for spi maintenance (human, organizational, spi project, and process) were used to define the propositions.
to determine the points of analysis, we used the factors related to each category:

▪ human factors: motivation and acceptance; support, commitment, and involvement; technical competencies;
▪ organizational factors: goals; communication; organizational structure; internal and external policies; return on investment; and leadership;
▪ the spi project itself: definition of strategies; resources; adequate external consultancy service; and
▪ process factors: level of bureaucracy; measurement program for continuous improvement.

table 3. propositions and points of analysis.

proposition p1: there are human factors that influence the abandonment of the spi program.
▪ pa.01: training is offered for the qualification of the employees of the company.
▪ pa.02: there is support, commitment, and involvement of organization members.
▪ pa.03: the technical team members are motivated and willing to carry out the process activities.

proposition p2: there are spi project factors that influence the abandonment of the improvement program.
▪ pa.04: budget and resources are available for the spi initiative.
▪ pa.05: there is a strategy to introduce changes in software processes.
▪ pa.06: existence of an external consultancy with the ability and competence to implement a process compatible with company needs.

proposition p3: there are organizational factors that influence the abandonment of the improvement program.
▪ pa.07: existence of a strategic plan that relates the spi program to business goals achievement.
▪ pa.08: leadership is available to support continuous process improvement.
▪ pa.09: there is an organizational structure favorable to the spi program.
▪ pa.10: there are communication mechanisms for the dissemination of the spi project.

proposition p4: there are process-related factors that influence the abandonment of the improvement program.
▪ pa.11: there is a non-bureaucratic process that meets the needs of the company.
▪ pa.12: there is a measurement program of continuous process improvement.

we used grounded theory (strauss and corbin, 1998) open and axial coding procedures for the qualitative analysis because it is a systematic analysis approach, which adds value in terms of academic rigor, providing validity in terms of traceability from the coding of the initial data to the final result of the analysis (o'connor, 2012). we did not intend to create a theory using the iterative process of conducting interviews and then analyzing the data to guide the following interviews, as oriented by strauss and corbin (1998). we did not achieve saturation as advocated by coleman et al. (2008). all the interviews were recorded and then transcribed. we performed the analysis after all interviews were completed. the transcripts were read (more than once) by the first author and analyzed with the support of the atlas.ti tool. the first author performed the open coding activities, that is, the microanalysis of the interviews. she analyzed each transcript line by line and created codes, merging them with existing codes as appropriate when new evidence appeared. memos were created to support the analysis (also considering the field notes). then, the codes were grouped according to their properties, forming concepts that represent categories. finally, the categories and subcategories were related to each other in the axial coding stage. all the analyses were reviewed and discussed by the other authors.
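to make the traceability chain concrete, the sketch below models it as plain data structures: a quote is open-coded into a finding, findings are grouped into positive or negative factors during axial coding, and factors roll up into a point of analysis tied to a proposition. this is an illustrative sketch only, not the authors' atlas.ti setup; all class and variable names are hypothetical.

# illustrative only: a minimal model of the coding traceability chain
# (quote -> code [a] -> factor [nf]/[pf] -> point of analysis [pa] -> proposition).
from dataclasses import dataclass, field

@dataclass
class Code:                       # open coding: one type of finding [a]
    label: str
    quotes: list = field(default_factory=list)

@dataclass
class Factor:                     # axial coding: negative [nf] or positive [pf]
    label: str
    polarity: str                 # "nf" or "pf"
    codes: list = field(default_factory=list)

@dataclass
class PointOfAnalysis:            # category grouping the factors, e.g., pa.12
    pa_id: str
    proposition: str              # e.g., "p4"
    factors: list = field(default_factory=list)

# the pa.12 example discussed in the text, rebuilt bottom-up
quote = "there is no professional to guarantee the quality"
code = Code("lack of qa professional to guarantee the quality of the process", [quote])
factor = Factor("monitoring the improvement process", "nf", [code])
pa12 = PointOfAnalysis("pa.12", "p4", [factor])

representing the analysis this way makes the validity argument in the text tangible: every proposition can be traced back, level by level, to the interview quotes that support it.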
figure 1 shows how we identified the presence or the absence of a point of analysis in the interview excerpts and related them to the research propositions. as can be seen in figure 1, we used codes that differentiate the coding stages. in open coding, codes called types of findings were identified with an [a]. codes from the axial coding cycle were grouped into negative factors [nf] and positive factors [pf]. subsequently, these positive and negative factors were grouped into the category called points of analysis [pa].

figure 1. extract of codes and citations related to pa.12 monitoring.

the example shows a negative factor [nf]. when the researcher asked, "how is the process monitored?", two participants answered, "no. this has not been done recently, because there is no professional to guarantee the quality" and "there is no follow-up on non-conformities in the process". based on these statements, the code generated was "lack of qa professional to guarantee the quality of the process". the same coding process was applied to the code "lack of control and collection of process evidence", which is contrary evidence to the code "monitoring the improvement process", which, in turn, is part of the point of analysis "pa.12_measurement program". later in the codification process, this point of analysis was related to "proposition p4. processes". during the analysis, new findings emerged from the data. these codes were called new discoveries [nd], with the nd code followed by a number.

4 results

4.1 analysis of individual cases

the following sections present the description of the analysis of each case, listing the points of analysis (pa), the new discoveries (with the nd code followed by a number), and the participants' quotes. in addition, we present the context of spi in the implementation and maintenance periods.

4.1.1 organization 1

implementation period. the reasons for the adoption of the maturity model were process improvement and market. the board appointed a team to work on the spi project, providing training for a group of people who participated in the definition of the mr-mps-sw level g processes. at the beginning of the implementation, the spi was disseminated through different communication means (lectures, training, e-mail, and intranet). still, only the people directly involved with the process group were better informed. no consultants were hired since people in the organization had experience implementing maturity models (pa.06). the quality assurance team monitored the process, and non-compliances were dealt with. the main difficulties were: failure in communication (as the organization is large, some people were uninformed), insufficient training, an overload of work due to the accumulation of functions, lack of human resources, bureaucracy in the process, and resistance to change.

interviewee: training (pa.01) and communication (pa.10). "we feel that people are doing the projects; they take the templates and come to ask. but how do i do this? will i attend the course? because we feel this … that there is still a failure in the issue of communication, because there are more than 300 people in the development area, so there are many people who are not yet having this level of information."

maintenance period. organization 1 reported no intention to evolve its maturity level because the development area remained immature in project management practices.
the training (pa.01) did not cover the whole development area. the lack of support (pa.02) from top management to demand that project coordinators use the processes led process practitioners and the quality assurance team to lose motivation (pa.03). for example, the quality assurance team stopped monitoring the processes because managers did not take the corrective actions needed after quality assessments. the lack of human resources (pa.04) resulted in the outsourcing of projects. there is an active process group that defined strategies to support spi (pa.05). outsourcing (nd.01) was a new aspect that emerged during the analysis. for managers, it is difficult to adhere to the process methodology in outsourced projects. there was an attempt to mentor the outsourced company, but it did not work out due to the high turnover in third-party companies.

interviewee: outsourcing (nd.01). "[outsourcing] makes it very difficult. they [i.e., the contractors] are not manageable. it's not up to us to manage how they work, their productivity. we hire contractors (...) we don't know how the work is done, by how many people or which process is executed. it is not a partnership. it is a contract."

resistance to change is the most prominent issue among respondents. as the company is public, its president and managers may change every four years, which favors some employees' skepticism. we were told that previous management initiatives were discontinued (nd.02), which caused instability among older employees, who tended to show disbelief and disinterest in using the processes. despite the difficulties, the process group continued to improve the process (pa.05), for example with: i) the creation of an agile path for product development using scrum; ii) the use of canvas in the preliminary phase to plan projects with a smaller scope; iii) the use of kanban for task execution; iv) the gamification of the standard process to improve usability and foster the dissemination of process artifacts; and v) the institutionalization of supporting tools (mantis and clarity).

there are no spi program goals aligned with the company's strategic plan (pa.07). there is no effective leadership to support the actions of the process improvement group (pa.08). the organizational structure is not adequate due to a lack of human resources and role overlapping (pa.09). lack of communication also contributed to the demotivation for using the process (pa.10).

interviewee: communication (pa.10): "i think we have many problems. one of the hardest is that we have a serious problem with communication."

the process meets the needs of the organization (pa.11). what hinders the use of the process is the lack of human resources to meet the demands. process monitoring (pa.12) is not performed; no information is collected to indicate the return on investment (roi). project management was identified as the most challenging process to maintain.

interviewee: process monitoring (pa.12): "we did [quality checklists] for a long time, but the reports we generated from non-compliance had no corrective actions because the action is not ours."

currently, the organization seeks to improve maturity in the project management process. for this, it created a group of project managers. however, the organization has not defined whether it will undergo a new level g or f assessment in the future.

4.1.2 organization 2

implementation period. the organization implemented level g and later evolved to level f.
in both implementations, the organization received financial assistance from the federal government. a project for spi was defined, and people from the development team were made available, but there were no resources with dedicated time for process improvement activities. changes in the processes were communicated through lectures and by the group of key people involved in defining them. a consultancy was contracted for both implementations (pa.06), and satisfaction with the consultancy services was reported. two people were hired to work in quality assurance management. the main difficulties were: insufficient training, lack of resources, lack of experience in spi, and the cultural changes that affected the oldest employees, who were more resistant, for example, in the configuration management activities.

interviewee: resistance (pa.03). "the most difficult of all was the acceptance by people who had been here for a long time. the main thing, it was always this. people's acceptance. unfortunately, some people did not adapt to the process, and we had to dismiss them."

maintenance period. the appraisal of organization 2 has expired. there is no intention to evolve the maturity level because managers believe that the current level meets their needs. besides, due to the country's economic crisis (nd.03), the organization had to reduce its maintenance fees to avoid losing customers. as a result, the professionals responsible for the process and product quality assurance (ppqa) activities were dismissed. after the appraisal, training (pa.01) was not available for new employees. the country's economic crisis inhibits new investments in the spi program (pa.02), reflecting on team members' motivation (pa.03) and leading to spi abandonment.

researcher: training (pa.01): "do they have training in the process to get in?" interviewee: "no. training hasn't been done lately."

there is no employee exclusively in charge of managing the spi program (pa.04), and there is no strategy for introducing process improvement changes (pa.05). concerning consulting, the organization reported satisfaction with the services provided (pa.06).

interviewee: resources (pa.04): "due to not pursuing further process appraisals, the quality team was dismissed. but then we reallocated the quality activities of the project to other internal people."

there are no clearly defined goals (pa.07) nor a leading process group to foster continuous improvement in organization 2 (pa.08). although the organization is small, communication about the spi program is flawed (pa.10); for example, there is no information available on the spi program's benefits. besides, organization 2 experiences financial problems (i.e., decreased contract flow), and functions overlap due to its small size (pa.09).

interviewee: strategic plan (pa.07): "last year, we started putting together the organization's strategic plan, so we have the outline of it (...) but, due to time constraints, we decided not to spend too much effort as planning activities require."

the development teams use the process partially, not because it is considered bureaucratic (pa.11), but because there are not enough employees to execute the quality assurance (qa) process. also, no measurement program (pa.12) exists to support process follow-up.

interviewee: measurement (pa.12): "(...)
having no financial resources, we ended up dismissing the quality staff (composed of two employees)."

4.1.3 organization 3

implementation period. the organization implemented level f and evolved to level c (renewing level c once). the motivations for adopting the model were improved software processes, market, and the legal need for maturity models to participate in bids. due to the quality manager's experience in renewing level c, consultancy services (pa.06) were hired to carry out only the assessment. the organization reported satisfaction with the services provided.

maintenance period. organization 3 intends to renew its maturity level depending on its economic recovery. the company was going through a difficult financial situation (nd.03) and, therefore, reduced its staff. the organization does not train its employees regularly (pa.01). however, top management supports the spi program (pa.02) because the company participates in bids. part of the team remains motivated to use the process because it automates activities (pa.03).

interviewee: involvement (pa.02): "today, i see that you can always bring improvements by sharing [experiences] with the team because i think each one knows what can improve their own process."

after downsizing, organization 3 started using open-source tools (redmine) (pa.04). there is no process group anymore (pa.04), and the process support strategies (pa.05) are carried out by the quality manager, who has experience implementing the mr-mps-sw model.

interviewee: tools (pa.04): "so, the automation, it was fundamental to cover the lack of people."

a strategic plan is aligned with the spi program objectives (pa.07), and the communication is appropriate (pa.10). notwithstanding, organization 3's difficult economic situation restricts investments in an assessment to renew its maturity level. currently, there is only one person responsible for process restructuring and monitoring (pa.09); there is no process group (pa.08).

interviewee: structure favorable to spi (pa.09): "in 2015, the quality team consisted of five people. in 2016, it was reduced to three people. currently, there is only me on the quality team."

organization 3 restructured and automated the processes using a free tool (redmine) that suits its needs (pa.11). therefore, the processes are considered easy to maintain. process monitoring is supported by redmine (pa.12).

interviewee: monitoring (pa.12): "i can't identify improvements if i don't have a minimum measurement to monitor it..."

4.1.4 organization 4

implementation period. the motivations for adopting the mr-mps-sw model were the standardization of organizational processes and the ceo's prior knowledge, acquired in a graduate program in software engineering. before the maturity model implementation, some teams in the organization used some scrum practices. thus, the consultancy helped define a process that would combine the scrum practices with the maturity model. the main difficulties were: i) lack of support and employee involvement; ii) lack of a process group (sepg); iii) resistance of the agile teams; iv) an attitude of imposition by the director (who believed in the model) and, sometimes, by the consultant; v) lack of tools; vi) lack of support from team leaders; and vii) focus on the result of the assessment.

interviewee: resistance (pa.03). "there was an area of the company that questioned the process because they worked on an already agile scheme."... "what did we do?
we did a process that was a little bit tailored: some things we used a little agile, some things were a little waterfall."

maintenance period. the organization does not intend to renew or evolve its maturity level. it develops software on demand and does not participate in biddings that demand specific maturity levels. scrum currently meets its needs. although training (pa.01) and top management support (pa.02) were present after the assessment, the employees were unmotivated (pa.03). employees who already worked with scrum on their projects did not accept the new process. new employees, who had previous experience in agile methods, also resisted using the process defined from the maturity model.

interviewee: motivation (pa.03): "so as not to follow the process, she justified: i can't. i am doing this project in scrum, and there is no time to do anything because we have tight deadlines..."

interviewee: veiled resistance (pa.03). "…you saw that they resisted, said it was ok because the ceo was defining it, then they said it was going to be used. but it was always like this: 'no, because i need to put more hours in the estimate because of the model...'"

the consultancy (pa.06) took into consideration the teams that worked with scrum. however, these teams did not tell the truth to the consultant and helped define a process that would not be used after the assessment. after the assessment, the organization continued to invest in the spi program (pa.04) and hired a process manager to make the mr-mps-sw process compatible with scrum. however, he had no experience with agile methods and defined a hybrid process that was also not well accepted by the teams (pa.05). organization 4 had a strategic plan, but it did not consider processes based on maturity models (pa.07). the spi program did not have effective leadership in charge of process improvement (pa.08). concerning the organizational structure, the organization has well-defined roles, which facilitates process execution (pa.09). communication was flawed (pa.10). there was no information on spi return on investment or benefits. the process defined in the implementation phase was abandoned shortly after the official assessment (pa.11). the lack of support from project managers and the organization's agile culture were the main reasons for the spi initiative's failure. project management was pointed out as the most challenging process to maintain, as the time estimated to perform activities increased due to process activities. the measurement process was abandoned after the appraisal (pa.12).

researcher: return on investment (pa.10): "is there information on return on investment?" interviewee: "no. we do not have."

currently, the organization uses scrum, kanban, and squads. the current ceo of the organization, who has experience in agile methods, used the following strategies to manage this software process improvement initiative: i) adapting the process with agile methodologies (pa.06) to meet the needs of the business; ii) training; iii) standardizing tools (jira); iv) creating the organization's agile manifesto (to encourage a sense of belonging); and v) improving communication between teams.

4.1.5 organization 5

implementation period. before implementing the maturity model, the organization used extreme programming (xp) and kanban practices.
however, the organization had only descriptions of isolated procedures, which generated the need for standardization. at the time of implementation, there were three partners, one of whom actively participated in defining the processes and attended training on the model's processes. at that time, there was support from the owners for the spi initiative. the main difficulties were: i) lack of human resources; ii) a change of external consultancy (pa.06) (failure in the model guidelines); iii) the second consultant being located in another region of brazil (difficulties in conducting the implementation); and iv) lack of a strategic plan.

maintenance period. organization 5 has no interest in renewing or evolving the maturity level because the current process meets the business's needs. besides, with the lack of external demand for certification (nd.04), there is no need to maintain an assessment using reference models because its customers do not require such an evaluation. after the evaluation, the organization went through economic difficulties due to the country's financial crisis (nd.03), lost the contracts of the civil engineering sector, and started developing a building automation software product. this affected the motivation (pa.03) and support (pa.02) of the owners, who had intended to implement the model's level e.

interviewee: country's economic crisis (nd.03). "one of our biggest customers, the civil construction company, went into crisis. so, three years ago, we lost an entire segment of civil construction…"

interviewee: disbelief and demotivation (pa.03). "i wonder why i participated in this, but why did we invent this ...?"

the organization is a micro company. therefore, communication is easy (pa.10), and there was no need to provide training in the processes (pa.01). there is a shortage of resources and time (pa.04), and there is no spi project management (pa.05) or spi-specific goals (pa.07). the organization uses redmine as a tool to support daily activities (pa.04). the dissolution of the partnership (nd.05) was the main factor that negatively influenced spi, because it affected the organizational structure (pa.09) and the leadership (pa.08) due to the loss of the partner who believed in the model. the process defined at implementation time was considered bureaucratic (pa.11) and was modified with scrum practices. the current sponsor had experience with agile methods and believes that it is more effective to give the team more decision-making power than to follow processes. project management, which was considered the most bureaucratic process, was adapted with scrum practices. in the requirements management process, user stories were used together with prototyping for requirements specification and validation. there is no measurement program for continuous improvement of the process (pa.12). currently, the organization uses the process adapted with agile methods because it meets the business's needs.

interviewee: the dissolution of the partnership (nd.05). "as the company reduced the number of employees ... because we lost a partner, we didn't have time to renew the certification." "we were in the process of implementing the model's level e. but then, in this process of changing partners and getting it right, we thought it was a good idea not to do it ... we don't have to do it to get the certificate…"

interviewee: bureaucracy (pa.11).
"… we fall into a planning task, and to count within our assessment, then, we had to have, for example, an action to define the communication plan. the communication plan was written once, and no one ever read it afterward... no one else used it..." 4.1.4 organization 6 implementation period. organization 6 assessed level g, level f (renewed once), and level c (renewed once) of the mr-mps-sw model but was undecided about the second renewal of the assessment of level c due to organizational restructuring caused by the fusion of companies (nd.06). the selection of the maturity model was influenced by the sponsor, who has previous project management training. the objective was to improve the process, product quality, and market. another strong motivator was the foreign policy to support spi, promoted by the model's executive body (formed of a cooperative group and external financial support). the most serious difficulty was the organization's lack of experience with process improvement that resulted in a bureaucratic process (pa11). work overload and resistance were caused (pa.03), especially for the project manager. what helped the organization achieve positive evaluation was the experience of external consultants (pa.06) and the networking between companies promoted by the cooperative group's formation. maintenance period. after the first evaluation, senior support management continued (pa.02), made the process group (pa.08) available to make adjustments to the process, intending to reduce bureaucracy (pa.11) and increase acceptance and motivation of the organization's members (pa.03). there is a policy of continuous training (pa.01). training needs are identified, with a technical training schedule (processes, programming language, and others) and behavioral training (motivation, integration, customer service, etc.). at the end of the training, an evaluation is made by the employees. interviewee: training policy (pa.01). "we carry out a needs assessment at the beginning of the year with the managers."… "after the training, hr [human resources] needs to know the attendance list, the initial reaction assessment and three months later an assessment of the effectiveness of the training…" the organizational structure is adequate (pa.09), with human resources and infrastructure (pa.04) (crm dynamics, pro-ject), with a strategic plan with spi goals aligned to the business (pa.07). when the first c-level assessment was renewed, the organization did not use external consultancy (pa.06) because one of the process group members had experience with spi consultancy. the process was tailored to the organization's needs. the audit of the process was automated (pa.12). the awareness of the benefits is subjective because there is no measurement of the return on investment (pa.12). spi's management was carried out by the sponsor, who believed in process improvement and influenced top management with the process group's support (pa.08). the main support strategy used was to facilitate the use of the process through automation and reduction of bureaucracy (pa.05). however, the determining factor for the abandonment of spi was the fusion of companies (nd.06). the fusion resulted in a clash of organizational cultures. there were changes in the business (in addition to the software factory, it started to focus on software services). there have been changes in the development process and in the way of working. 
the new manager of the development area encouraged discussions about the agility of organizational processes and the adoption of agile methods (nd.07): scrum, squads, design sprints, and other methodologies such as design thinking. some members of the process group (pa.08) left the organization, and the process defined from the maturity model ended up abandoned.

interviewee: merger of companies (nd.06), business changes. "… there was a merge with company x ... and company x brought a new portfolio. it brought an infrastructure portfolio, so we have infrastructure projects now, safety nets, so we have safety nets projects, which is very different from building software…"

interviewee: merger of companies (nd.06), change in the way of working. "one of the points, because of the merge, and, already advancing another point, is that it ends up that the software development process has changed a lot."… "we are reformulating our way of working."… "we are in that process like this: we certified a process, and today our process is already totally rigid. we are even looking at whether it will fit in for a reevaluation."

4.1.7 organization 7

implementation period. the purposes of adopting the model were the standardization of processes, product quality, marketing, and the acquisition of public contracts (at the time, there was a requirement for evaluation using the maturity models). the spi initiative was supported by the sponsor (pa.02), who provided hours for the project manager and some members of the organization to define the processes (pa.04) and provided model training (pa.01). people's engagement was requested (pa.03). the organization's members had no experience with spi. what motivated the model's selection was the formation of a group of companies that were implementing the model in the region. before the assessment, they used scrum. they found the first implementation of the model more complex, with bureaucracies they were not used to (pa.11). external consultancy was hired for both implementations of the model. however, in the second implementation, there was a conflict between the external consultant and the person responsible for the implementation in the organization. it was reported that the consultancy was changed because, although it had technical competence (pa.06), it lacked soft skills and had a very imposing posture.

interviewee: consultancy service (pa.06). "our ideas didn't match; he didn't accept the suggestion to change the process. 'no, you have to do it this way.'"... "this also made it very difficult for us, especially for me, who was in charge of this project in the company."

maintenance period. although the organization's members have reached maturity and the processes were standardized, the sponsor has no interest in renewing the assessment (pa.02). even meeting requirements for bids in the public sector, the organization did not achieve the goal defined in the strategic plan (pa.07): acquiring contracts in the public sector.

interviewee: external pressure from customers (nd.04). "…even because concerning public projects, which was one of the ideals for us to have certification, that's not what happened..."… researcher: "but did they ask for certification?" interviewee: "in bidding, yes."

after the evaluation, there was no training available (pa.01) due to low turnover (pa.09).
there were no human resources available (pa.04) to manage the spi (pa.05), and the tools used were not adequate (pa.04). there was no process group (pa.08) to lead continuous process improvement. the members of the organization were not motivated to continue with spi (pa.03). the process, considered bureaucratic (pa.11), was adapted to the organization's needs, and they returned to using scrum with some practices of project management and requirements management. in quality assurance management, only the quality control of the product was carried out. the other level f processes were abandoned.

interviewee: bureaucracy (pa.11). "at level g, i felt the processes were very bureaucratic, rigid ..."

the monitoring of the process stopped being done (pa.12). therefore, there was no process institutionalization and no information on return on investment (pa.10).

interviewee: monitoring of the process (pa.12). "today, we no longer do this audit of the process."

currently, the organization uses scrum, and the organization's members are satisfied with the reduction of bureaucracy.

4.1.8 organization 8

implementation period. the objectives for implementing the model were to improve the process and product quality and to acquire public contracts (at the time, there was a requirement for certification against the models).

interviewee: objectives of spi adoption. "we had two aspects of need. one was to improve our process, aiming for better quality."… "except that there was also a legal need for participation in public bids."

a project for spi was defined, and people were involved in the definition of processes. consultancy services were hired, and the sponsor was satisfied with the consultancy service (pa.06). communication took place through engagement meetings and training (pa.10).

maintenance period. after the evaluation, no training was available (pa.01). support from top management declined (pa.02) due to the country's economic crisis (nd.03) and the cooling of the model evaluation requirements in public bids. the organization no longer had the commercial motivation that came from the requirement of external customers (nd.04). these two factors affected the quality assurance process because no qa professional was hired. therefore, there was no monitoring of the process (pa.12). the team and the sponsor were demotivated (pa.03). the team found the process bureaucratic (pa.11). besides, there was an overload from the product quality assurance activity, which was absorbed by the team. the sponsor thought that the documentation resulted in high costs.

interviewee: country's economic crisis (nd.03). "i think the economic problem also helps, which is a consequence of it all."… "you see, if you don't have a crisis, you have the thriving thing."… "then how to hire someone exclusively for qa? but how do you do it? the budget does not allow it. the difficulties do not allow…"

interviewee: lack of external demand for certification (nd.04). "the bidding processes started not to require it so much because the tcu (federal audit court) understands that... the biddings started to do as follows: if you have a certified development methodology, you present it. if you don't have it, we do an audit. they kind of stopped requiring it. they're not requiring it anymore..."
after the evaluation, there was no spi project management, no availability of resources (pa.04), no support strategies (pa.05), and no process group (pa.08) to define continuous improvements in the process. the organization uses team foundation as a support tool (pa.04). another factor was turnover (pa.09), because new employees have to learn and accept the process (pa.03).

interviewee: adequate organizational structure, turnover (pa.09). "eventually, that professional a or b who was already adhering to the process changes, and then it will hurt us even more to have management."

currently, the organization no longer uses the process defined with the maturity model and has adhered to agile methods (nd.07) due to the need to streamline the process and reduce documentation costs. in addition, the private market accepts scrum well, and the public sector started to sign contracts with the use of scrum. the sponsor reported satisfaction and several benefits from simplifying the process (there is no need to keep creating evidence) and from the reduced conflict with the client (there is no discussion about the project scope).

interviewee: adherence to agile methods (nd.07). "we are now more with the private [sector], but with the private [sector] we can convince them to work with us in the agile model."

4.2 cross-analysis

this section presents the data cross-analysis of the eight organizations based on the research propositions. we used three criteria to characterize the points of analysis (table 4):

▪ n (not identified): the point of analysis was not identified in the organization.
▪ p (partially identified): the point of analysis was partially identified in the organization.
▪ f (fully identified): the point of analysis was fully identified in the organization.

to assess whether a proposition is confirmed, we analyzed whether its points of analysis were not identified (n) or were only partially identified (p) in the organizations. this means that the critical factors for maintaining spi were neglected, and the results indicate that neglecting these factors can lead to the abandonment of the spi program based on maturity models. to assess whether a proposition is not confirmed for the abandonment of spi, we defined that if all points of analysis were fully identified (f) in the organization, the organization continues to address critical spi maintenance factors after the assessment. the following section discusses these results.
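this decision rule is mechanical enough to be stated as code. the sketch below is illustrative only (the function and variable names are ours, not the authors'); it encodes the rule that a proposition is confirmed when any of its points of analysis is rated n or p, and not confirmed only when every rating is f, using the proposition p1 ratings from table 4 as input.

# a minimal sketch of the proposition-confirmation rule described above.
# ratings_by_pa maps each point of analysis to its ratings for organizations 1..8;
# any "n" (not identified) or "p" (partially identified) rating means a critical
# spi maintenance factor was neglected, so the proposition is confirmed.
RATINGS = {"n", "p", "f"}

def proposition_confirmed(ratings_by_pa: dict) -> bool:
    assert all(r in RATINGS for rs in ratings_by_pa.values() for r in rs)
    return any(r in {"n", "p"} for rs in ratings_by_pa.values() for r in rs)

# proposition p1 with the per-organization ratings from table 4
p1 = {
    "pa.01": list("pnnpnfnn"),
    "pa.02": list("pnppnpnn"),
    "pa.03": list("pppnnnnn"),
}
print(proposition_confirmed(p1))  # True: p1 is confirmed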
table 4. analysis of propositions (ratings n, p, f for organizations 1 to 8).

point of analysis | org. 1 | org. 2 | org. 3 | org. 4 | org. 5 | org. 6 | org. 7 | org. 8
p1: there are human factors that influence the abandonment of the spi program.
pa.01: training is offered for the qualification of the employees of the company. | p | n | n | p | n | f | n | n
pa.02: there is support, commitment, and involvement of organization members. | p | n | p | p | n | p | n | n
pa.03: the technical team members are motivated and willing to carry out the process activities. | p | p | p | n | n | n | n | n
p2: there are spi project factors that influence the abandonment of the improvement program.
pa.04: budget and resources are available for the spi initiative. | p | f | p | f | n | n | n | n
pa.05: there is a strategy to introduce changes in software processes. | f | n | f | n | n | n | n | n
pa.06: existence of an external consultancy with the ability and competence to implement a process compatible with the company's needs. | - | f | f | p | f | f | p | f
p3: there are organizational factors that influence the abandonment of the improvement program.
pa.07: existence of a strategic plan that relates the spi program to business goals achievement. | f | n | f | n | n | n | f | n
pa.08: leadership is available to support continuous process improvement. | p | n | p | n | n | n | n | n
pa.09: there is an organizational structure favorable to the spi program. | n | n | n | f | n | f | n | n
pa.10: there are communication mechanisms for the dissemination of the spi program. | n | n | f | n | f | n | f | n
p4: there are process-related factors that influence the abandonment of the improvement program.
pa.11: there is a non-bureaucratic process that meets the needs of the organization. | p | n | f | f | n | f | n | n
pa.12: there is a program for the measurement of continuous process improvement. | n | n | f | n | n | f | n | n

5 discussion

the research question guiding this work is: "how does the abandonment of software process improvement programs occur?" to answer this question, we conducted case studies in software organizations with either an expired assessment (organizations 1, 2, 4, 5, 7, and 8) or an assessment close to expiring (organizations 3 and 6). we identified that an organization is abandoning the improvement process when the interview participants report that the processes are no longer being used at all (organizations 4 and 8) or that they are only partially being used (organizations 1, 2, 3, 5, 6, and 7).

from the data analysis, we identified five pitfalls to spi and their relation to the research question. we found that organizations do not set goals to pursue continuous process improvement. there is a lack of continuity in spi management and in the sponsor's interest in continuing. even after all the effort in implementing the spi, sponsors may not be satisfied with the results. this can lead the organization to return to its previous state or to define a new way of working and improving its processes other than the maturity model.

pitfall 1: negligence with human factors.

explanation: we found that organizations do not provide sufficient training (pa.01) (organizations 1 and 4) or have stopped providing training after the assessment (organizations 2, 3, 5, 7, and 8). in these organizations, the lack of training negatively affected the use of the improved process because people do not use what they do not know. training a group of people only during the spi implementation period is not enough to ensure process understanding. the dissemination of knowledge about process improvement is complex, especially in large organizations (organizations 1 and 4), where communication can be more difficult. top management support (pa.02) can influence the investment provisions for spi initiatives. organization 2 dismissed the quality team, and in organization 3, the quality team's size was reduced to just one member. as for organization 1 (public capital), the quality team stopped monitoring the process due to the lack of top management support. in organizations 5, 6, 7, and 8, senior management's support was perceived only during the implementation period. regarding motivation (pa.03), we identified its partial occurrence in organizations 1, 2, and 3 because motivation depends on key people, and some people show resistance. in organizations 4, 5, and 7, which already used agile methods before implementation, employees were resistant and unmotivated to use the new process. organizations 6 and 8 started to adhere to agile methods (nd.07).
in organization 8, it was possible to observe the sponsor's satisfaction regarding the reduced documentation costs and the better understanding with the client about the project scope. besides, this change in process was well accepted by its employees (especially the younger programmers). thus, proposition p1 is confirmed (table 4).

discussion: these results are consistent with the spi literature, which reports that training is essential for disseminating knowledge (alqadri et al., 2020) and for providing awareness of the benefits of spi (peixoto et al., 2010). the importance of top management being convinced of spi's benefits, for both the implementation and the continuity of spi, is highlighted by almeida et al. (2011). resistance and lack of motivation were present in all organizational contexts. different issues influenced them, but the lack of human resources was a common point. the resistance literature corroborates these findings when reporting that work overload discourages new work practices (narciso et al., 2014; anastassiu et al., 2020). it is worth mentioning the resistance of the agile teams in organizations 4, 5, and 7. this was observed at two distinct moments: a veiled resistance by the organization members in the implementation period (due to the interest of top management in the success of the evaluation) and a more declared resistance after the evaluation. in organization 4, the teams did not use the process, even with the consultancy's effort to involve these teams in discussions to define a process that would meet the organization's needs. this finding corroborates the research by albuquerque et al. (2019), which identified that teams from organizations that use agile methods have difficulties implementing and sustaining spi based on maturity models.

pitfall 2: negligence with factors related to spi projects.

explanation: spi project management is a critical success factor (montoni et al., 2011). however, we identified negligence in this regard. in most of the investigated organizations, it was possible to observe that in the implementation period there was the definition of a project with the availability of dedicated resources (pa.04). however, after the evaluation, there was no continuity in the management of the spi project. in other organizations (for example, 2 and 7), the lack of management occurred even during the implementation period. the lack of a dedicated resource (pa.04) to manage spi negatively affects the continuous improvement of the process and the taking of actions to promote people's motivation, that is, the definition of spi support strategies (pa.05). only organization 1 has a process group (pa.04) that continues to take actions (pa.05) to promote spi. however, it is difficult for a process group to keep the spi program running without senior management support (pa.02). in organizations 3 and 6, processes were automated to increase compliance (pa.05). regarding this proposition, our data were not conclusive because the point of analysis regarding the consultancy (pa.06) could not be evaluated in all organizations. for example, organization 1 did not hire consultancy services. thus, proposition p2 is partially confirmed (table 4).
discussion: according to the spi literature (montoni et al., 2011; coleman et al., 2008; peixoto et al., 2010; almeida et al., 2011), spi initiatives are affected by the lack of human resources, resulting in work overload and, therefore, in the prioritization of activities related to the product. according to sulayman et al. (2012), the spi team needs to have the workforce available to define the processes, train the team members on these processes, and supervise their use. for this reason, having a full-time person for coordination activities is essential for the success of the spi initiative (guerrero et al., 2004).

pitfall 3: negligence with organizational factors.

explanation: there are no clearly defined goals (pa.07) or effective leadership (pa.08) from top management and project managers to foster continuous improvement. besides, there is role overlapping (pa.09), and communication is flawed (pa.10). only organization 4 had no role overlapping. however, its agile culture hindered the acceptance of the new processes. this difficulty also occurred in organizations 5 and 7, which already used agile methodologies before implementation. we identified two new findings, dissolution of the partnership (nd.05) and merger of companies (nd.06), that affected the organizational structure, resulting in spi abandonment. in organization 5, the dissolution of the partnership (nd.05) negatively affected the spi initiative because the organization lost its leadership, that is, the person who believed in the model. thus, the organization returned to agile methods because the remaining partners believe in their value. in organization 6, the merger of companies led to spi's abandonment because there was a restructuring of organizational processes. in this restructuring, the new development manager, with experience in agile methods, defined a new way of working with senior management support. thus, proposition p3 is confirmed (table 4).

discussion: the importance of considering organizational culture in spi initiatives was reported by alqadri et al. (2020) and shih et al. (2010). shih et al. (2010) emphasized that sepg (software engineering process group) leaders should consider culture when a new spi approach is implemented because it may be incompatible with the existing culture. in organizations 4, 5, and 7, with organizational cultures used to working with agile methodologies, it was challenging to continue spi with maturity models. we identified that groups such as the process group and the quality assurance group provided the most effective support and leadership to sustain spi. our results are consistent with the research by uskarci and demirörs (2017). regarding the new findings, it was possible to observe the influence that the organizational structure has on spi initiatives and how it is related to knowledge, previous experience in process methodologies, and decision making. in organizations 4, 5, and 6, the choice to use agile methods was due to the previous experience of managers with decision-making power.

pitfall 4: negligence with process factors.

explanation: regarding the existence of a non-bureaucratic process (pa.11), we found that all organizations adjusted and simplified their processes after the official assessment. in organizations 1, 2, 6, and 7, the process is partially used (quality assurance and measurement are not performed). organizations 4 and 8, which have an agile culture, abandoned the processes entirely.
notably, only organization 3 (which participates in bidding processes) continued to use and monitor the processes (pa.12). however, it had not renewed its maturity level because it was experiencing financial struggles at the time of the interview. we found that some organizations abandoned spi with maturity models due to adherence to agile methodologies (nd.07), as was the case with organizations 6 and 8. these are organizations that started using agile methods after the evaluation. their sponsors reported satisfaction with these methodologies due to the reduction of bureaucracy and documentation costs. thus, proposition p4 is confirmed, as can be seen in table 4. discussion: the results showed that abandoning the spi program does not mean abandoning the organizational processes altogether. organizations 1, 2, and 3 have adapted and simplified their processes to meet their new business needs. these results align with the spi literature, which reports that processes tend to be simplified, stabilizing in a minimum process (coleman and o'connor, 2008). organizations 4, 5, 6, 7, and 8 have been looking for other ways to mature the process using agile methods (fontana et al., 2015). it is worth mentioning that it is possible to implement an spi initiative with agile methodologies and maturity models. however, in the context of this research, only organization 4 tried to make this tailoring, and it was unsuccessful due to the boycott by the agile teams. pitfall 5 negligence with external factors explanation: we identified external factors that impact the support of top management. we identified the negative impact of outsourcing (nd.01) it projects on organization a (a large public company). project managers reported difficulty in applying their processes to outsourced organizations. the main reason was the high turnover, which made learning difficult and hindered the use of the processes. the country's economic crisis (nd.03) has restricted investments in resources for spi. also, we found that regular changes in the state government (nd.02) demotivate process managers from adhering to the changes made by top management, because the company's board can change every four years and, therefore, potentially change the internal software process quality policies. the lack of external pressure from customers (nd.04) is another factor that discouraged some organizations that had a commercial motivation to adopt spi with maturity models, that is, the interest in participating in public biddings. however, currently, not all public bodies in the country impose this requirement. organizations working in the private sector reported no requirements to use an officially evaluated process. discussion: unlike the literature, our study identified new findings negatively influencing spi, called external factors. outsourcing (nd.01) hindered the use of the improvement process due to the lack of standardization of outsourced contracts. this indicates that it is vital for the organization's top management to define procedures for managing third-party contracts. regarding the regular changes in the state government (nd.02), the results show that consistency in quality policies is necessary. the frequent change in software process methodologies, or in the definition of work procedures, may demotivate organization members at any organizational level.
it is quite possible that this lack of managerial constancy may demotivate members in private organizations as well; this is a point worth further investigation. the country's economic crisis (nd.03) has been causing economic instability in the organizations. they react by cutting resources, prioritizing the people who develop the software and dismissing the quality team. finally, the lack of external pressure from the client (nd.04) indicates that organizations that adopted spi for purely commercial reasons, and not to improve their processes, tend to be frustrated with the results, because the public sector has changed its way of acquiring software development services. thus, we formulated a new proposition: p5. there are external factors that influence the abandonment of the improvement program. 6 limitations and threats to validity to evaluate the research quality and validity, we used the guidelines defined by yin (2017) and runeson et al. (2012) regarding quality criteria for empirical research. regarding construct validity, the propositions are based on the research carried out by albuquerque et al. (2018). propositions and analysis points were validated in a workshop held with experienced professionals in spi programs. regarding internal validity, grounded theory procedures were followed: the propositions were investigated using only the data collected from the interviews. the first author analyzed the interviews and built the networks. the other authors (professionals with experience in maturity model implementation and assessment) reviewed and analyzed quotes, codes, and categories. regarding external validity, we interviewed participants from eight different software organizations. we included organizations of various sizes, locations, and businesses. three organizations do not participate in biddings, and only one is a public company. some organizations provided only one participant for the interview (due to high turnover). still, we were careful to select those who had effectively participated since the maturity model implementation. as expected in in-depth qualitative research, the results cannot be broadly generalized (eisenhardt, 1989) but present relevant evidence on how abandonment occurs after valid spi appraisals. nonetheless, we plan to replicate the research in more organizations. finally, to ensure research reliability, all the research protocol and data analysis steps were defined and followed. 7 conclusion this study aimed to understand how abandonment occurs in spi programs after successful assessments based on maturity models. results from four organizations (1, 2, 3, and 4) were published in albuquerque et al. (2020), which indicated that abandonment occurs when there is negligence of factors internal to the organization (human, organizational, spi project, and processes) and of factors external to the organization (outsourcing nd.01, political change nd.02, and economic crisis of the country nd.03). in this paper, results from four more organizations (5, 6, 7, and 8) were presented. concerning internal factors, they all corroborated our previous research (albuquerque et al., 2020). however, new findings were identified: two organizational factors (dissolution of the company nd.05 and merger of companies nd.06) and a process factor (adherence to agile methodologies nd.07).
concerning external factors, this research confirmed the negative influence of the country's economic crisis on spi and identified a new external factor (lack of external demand for certification nd.04). another point that draws attention is that some organizations carried out management activities during the spi project only until the official assessment; after that, they neglected the proper management of the spi project. moreover, other organizations neglected management activities from the beginning of the spi project. considering that the literature and our experience indicate that adequate management is a critical success factor for an spi project, it is not surprising that such organizations fail to continue the spi activities carried out so far. as a contribution, we highlight the practical applicability of our results for the software industry. industry professionals can use this study's results to examine their own initiatives and avoid the pitfalls that can lead to abandoning spi. for example, before starting an spi initiative, evaluate the organization's business and assess whether it is the best time to invest in process improvement. evaluate whether the organizational structure is appropriate and whether there is a flow of ongoing projects, so as to avoid restricting investments in training or reducing teams, such as the quality team. before starting an spi initiative, know the improvement model that will be implemented, and be aware that the results come in the long term. it is also essential to involve the development team in selecting the process improvement model and in the process definition to avoid resistance. the consultancy can only help define a valuable process for the organization; it is the development team's commitment that will lead to spi success. the technical skill of the consultancy is useless without the spontaneous participation of the team members. effectively combining agile methods and maturity models requires experienced consultants to overcome the natural barriers of this integration. a balanced process can combine agile methods and model requirements in a sustainable way. as future work, we are starting to replicate this study in other software organizations that use maturity models (mps-sw and cmmi), considering different sizes, maturity levels, companies' capital, and organizational contexts. our goal is to deepen our understanding of the moves organizations make after the official appraisal. acknowledgments we thank the financial support provided by the araucária foundation (fa), agreement number 001/2017. we also thank unirio for its financial support (edital ppq-unirio 2019 and 2020). references albuquerque, r., fontana, r.m., malucelli, a., reinehr, s. (2019). agile methods and maturity models assessments: what's next? in: proceedings of the systems, software and services process improvement conference (eurospi), edinburgh, scotland, pp. 619-630. albuquerque, r., malucelli, a., reinehr, s. (2018). software process improvement programs: what happens after official appraisal. in: proceedings of the international conference on software engineering and knowledge engineering (seke), san francisco, usa. albuquerque, r., santos, g., malucelli, a., reinehr, s. (2020). abandonment of a software process improvement program: insights from case studies. in: proceedings of the brazilian symposium on software quality (sbqs), maranhão, brazil. almeida, c.d.a., albuquerque, a.b., macedo, t.c. (2011).
analysis of the continuity of software processes execution in software organizations assessed in mps.br using grounded theory. in: proceedings of the international conference on software engineering and knowledge engineering (seke), miami, florida, usa. alqadri, y., budiardjo, e.k., ferdinansyah, a., rokhman, m.f. (2020). the cmmi-dev implementation factors for software quality improvement: a case of xyz corporation. in: proceedings of the 2nd asia pacific information technology conference (apit), pp. 34-40. anastassiu, m., santos, g. (2020). resistance to change in software process improvement: an investigation of causes, effects and conducts. in: proceedings of the brazilian symposium on software quality (sbqs), maranhão, brazil. canedo, e.d., santos, g.a. (2019). factors affecting software development productivity: an empirical study. in: proceedings of the xxxiii brazilian symposium on software engineering (sbes), brazil, pp. 307-316. cmmi institute (2018). cmmi for development v2.0. available at: https://cmmiinstitute.com/products/cmmi/cmmi-v2products. cmmi institute (2019). radix: delivers results with cmmi and behavioral driven development in agile environment. submitted by: cmmi institute. published: 25 july, 2019. coleman, g., o'connor, r. (2008). investigating software process in practice: a grounded theory perspective. journal of systems and software, v.81, issue 5, pp. 772-784. eisenhardt, k. (1989). building theories from case study research. academy of management review, v.14, issue 4, pp. 532-550. fontana, r.m., meyer jr., v., reinehr, s., malucelli, a. (2015). progressive outcomes: a framework for maturing in agile software development. journal of systems and software, v.102, pp. 88-108. guerrero, f., eterovic, y. (2004). adopting the sw-cmm in a small it organization. ieee software, v.21, issue 4, july-aug. 2004, pp. 29-35. iso/iec (2015). iso/iec 33020:2015: information technology – process assessment – process measurement framework for assessment of process capability. geneva: iso. iso/iec (2017). iso/iec/ieee 12207:2017: systems and software engineering – software life cycle processes. kalinowski, m., weber, k., franco, n., zanetti, d., santos, g. (2014). results of 10 years of software process improvement in brazil based on the mps-sw model. in: proceedings of the international conference on the quality of information and communications technology (quatic), portugal, pp. 28-37. montoni, m.a., rocha, a.r.c. (2011). using grounded theory to acquire knowledge about critical success factors for conducting software process improvement implementation initiatives. international journal of knowledge management, v.7, issue 3 (jul 2011), pp. 43-60. doi: 10.4018/jkm.2011070104. nalepa, g., fontana, r.m., reinehr, s., malucelli, a. (2019). using agile approaches to drive software process improvement initiatives. in: proceedings of the systems, software and services process improvement conference (eurospi), edinburgh, scotland, pp. 495-506. narciso, h., allison, i. (2014). overcoming structural resistance in spi with change management. in: proceedings of the international conference on the quality of information and communications technology (quatic), pp. 8-17. o'connor, r. (2012). using grounded theory coding mechanisms to analyze case study and focus group data in the context of software process research. information science reference (an imprint of igi global), 2012,
ch. 13, pp. 256-270. doi: 10.4018/978-1-4666-0179-6.ch013. peixoto, d.c.c., batista, v.a., resende, r.f., isaías, c. (2010). how to welcome software process improvement and avoid resistance to change. in: proceedings of the international conference on software process (icsp), germany, pp. 138-149. reinehr, s., pessôa, m.s.p., burnett, r.c. (2008). software product lines in the financial sector in brazil. in: proceedings of the xxviii national congress on production engineering (enegep), rio de janeiro, brazil. runeson, p., höst, m., rainer, a., regnell, b. (2012). case study research in software engineering: guidelines and examples. wiley. shih, c.c., huang, s.j. (2010). exploring the relationship between organizational culture and software process improvement deployment. information & management, v.47, pp. 271-281. society for the promotion of brazilian software excellence – softex (2020). mps general guide to software. http://www.softex.br/mpsbr. strauss, a., corbin, j. (1998). basics of qualitative research, 2nd ed. sage publications, thousand oaks. sulayman, m., urquhart, c., mendes, e., seidel, s. (2012). software process improvement success factors for small and medium web companies: a qualitative study. information and software technology, v.54, pp. 479-500. uskarci, a., demirörs, o. (2017). do staged maturity models result in organization-wide continuous process improvement? insight from employees. computer standards & interfaces, v.52, pp. 25-40. yin, r. (2017). case study research: design and methods (applied social research methods), 6th edn. los angeles: sage publications. journal of software engineering research and development, 2023, 11:5, doi: 10.5753/jserd.2023.2582 this work is licensed under a creative commons attribution 4.0 international license. naming practices in object-oriented programming: an empirical study remo gresta [ federal university of são joão del-rei | remoogg@aluno.ufsj.edu.br ] vinicius durelli [ federal university of são joão del-rei | durelli@ufsj.edu.br ] elder cirilo [ federal university of são joão del-rei | elder@ufsj.edu.br ] abstract currently, research indicates that comprehending code takes up far more developer time than writing code. given that most modern programming languages place little to no limitations on identifier names, and developers are thus allowed to choose identifier names at their own discretion, one key aspect of code comprehension is the naming of identifiers. research in naming identifiers shows that informative names are crucial to improving the readability and maintainability of programs: essentially, intention-revealing names make code easier to understand and act as a basic form of documentation. poorly named identifiers tend to hurt the comprehensibility and maintainability of software systems. however, most computer science curricula emphasize programming concepts and language syntax over naming guidelines and conventions.
consequently, programmers lack knowledge about naming practices. this article is an extension of our previous study on naming practices. previously, we set out to explore the naming practices of java programmers. to this end, we analyzed 1,421,607 identifier names (i.e., attribute, parameter, and variable names) from 40 open-source java projects and categorized these names into eight naming practices. as a follow-up study to further investigate naming practices, we examined 40 open-source c++ projects and categorized 1,181,774 identifier names according to the previously mentioned eight naming practices. we examined the occurrence and prevalence of these categories across c++ and java projects, and our results also highlight in which contexts identifiers following each naming practice tend to appear more regularly. finally, we also conducted an online survey questionnaire with 52 software developers to gain insight from the industry. all in all, we believe the results, based on the analysis of 2,603,381 identifier names, can help enhance programmers' awareness and contribute to improving educational materials and code review methods. keywords: naming identifiers, program comprehension, mining software repositories 1 introduction reading and comprehending source code plays a vital role in software development (allamanis et al., 2014). evidence suggests that choosing proper names for identifiers in software systems can positively impact code comprehension (lawrie et al., 2007b; fakhoury et al., 2018; oliveira et al., 2020). although giving meaningful names to identifiers is a widely accepted best practice, coming up with proper names is challenging (deissenboeck and pizka, 2006). as stated by host and ostvold (2007), even though naming is part of daily life for programmers, it entails a great deal of time and thought: names should convey to others the purpose of the code (martin, 2008) and reflect the meaning of domain concepts (marcus et al., 2004). meaningful identifier names are key to bridging the gap between intention and implementation (wainakh et al., 2021). therefore, given that poorly chosen identifier names might hinder source code comprehension (schankin et al., 2018), using meaningful identifier names is a recommended practice present in several coding style guides and conventions. according to the java language naming conventions (oracle.com/java/technologies/javase/codeconventions-namingconventions.html), names should be "short yet meaningful". in a similar fashion, the google c++ style guide (google.github.io/styleguide/cppguide.html) states that names should be "as descriptive as possible". martin (2008) argues that programmers should choose intention-revealing names as a way to avoid disinformation. he also advocates that names have to contain meaningful distinctions and be descriptive (not abbreviated). the gnu coding standards (www.gnu.org/prep/standards/) posit that programmers should not "choose terse names – instead, [they should] look for names that give useful information about the meaning of the variable". although programming communities and internationally renowned experts have proposed best practices related to naming identifiers, little is known about the extent to which programmers follow these naming practices (arnaoudova et al., 2016). we argue that, without proper guidance, programmers are more prone to resort to less than ideal naming practices, such as using number series or noise words.
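a hypothetical java fragment makes the contrast concrete (the classes, methods, and names below are ours, for illustration only, and are not drawn from the studied projects):

```java
// hypothetical domain class used only for this illustration
class Person {
    String getStreet() { return "elm st."; }
}

public class NamingContrast {
    public static void main(String[] args) {
        // number-series names: the suffixes 1 and 2 say nothing about each object's role
        Person person1 = new Person();
        Person person2 = new Person();

        // noise-word name: "string" merely repeats type information already in the declaration
        String streetString = person1.getStreet();

        // intention-revealing alternatives convey the role of each identifier
        Person customer = person1;
        Person salesAgent = person2;
        String street = customer.getStreet();

        System.out.println(street + " / " + streetString + " / " + salesAgent.getStreet());
    }
}
```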
for example, bad naming practices can foster the sense that names such as person person1 and person person2 are intuitive and understandable. careless naming practices might hinder not only code comprehension but also overall team communication. therefore, we argue that it is crucial for software engineering researchers to learn how to support programmers by understanding how naming practices are used "in the wild" and, through this better understanding, defining naming guidelines for educational materials (charitsis et al., 2021) and code review (nyamawe et al., 2021). in our previous study (gresta et al., 2021), we set out to investigate naming practices in the context of java programs; thus, we looked only into how java programmers name attributes, parameters, and variables. this article is an extension of our previous work on naming practices, in which we also investigate naming practices in the context of c++ programs. to investigate how c++ and java programmers name attributes, parameters, and variables, we carried out an empirical study in which we analyzed 1,421,607 identifier names from 40 open-source java projects and 1,181,774 identifier names from 40 open-source c++ projects. we performed repository mining to determine how often eight categories of naming practices occur within and across these projects. we also looked at how prevalent these naming practices are in certain code contexts (i.e., attribute, parameter, method, for, while, if, and switch). in this extended version, our results are based on two large samples of programs: the previous version of this study analyzed 40 open-source java programs, and the results in this extended version also include the analysis of 40 open-source c++ projects. moreover, to understand industry practices, we conducted an online survey questionnaire to gain insight from software programmers. through the survey, we gathered quantitative data on programmers' perceptions about the use and occurrence of the investigated naming practices. the online survey questionnaire ran from november 2021 to january 2022 and had 52 responses.

table 1. java programs used in our experiment. for each project: lines of code (loc), contributors, commits, the number of names in each category with its percentage of the project total in parentheses, and the total number of categorized names.

project | loc | contributors | commits | kings | median | ditto | cognome | diminutive | shorten | index | total
aeron | 108,442 | 86 | 14,409 | 606 (6.34) | 450 (4.71) | 5,205 (54.46) | 933 (9.76) | 1,932 (20.21) | 114 (1.19) | 318 (3.33) | 9,558
androidutilcode | 39,030 | 32 | 1,317 | 179 (7.74) | 21 (0.91) | 1,170 (50.56) | 385 (16.64) | 73 (3.15) | 77 (3.33) | 409 (17.68) | 2,314
archunit | 100,276 | 49 | 1,499 | 91 (3.07) | 16 (0.54) | 1,744 (58.86) | 596 (20.11) | 303 (10.23) | 9 (0.30) | 204 (6.88) | 2,963
boofcv | 650,019 | 14 | 4,520 | 7,483 (23.19) | 1,696 (5.26) | 1,573 (4.87) | 266 (0.82) | 880 (2.73) | 1,354 (4.20) | 19,017 (58.93) | 32,269
butterknife | 13,279 | 97 | 1,016 | 135 (21.95) | 8 (1.30) | 358 (58.21) | 68 (11.06) | 14 (2.28) | 4 (0.65) | 28 (4.55) | 615
corenlp | 581,374 | 107 | 16,280 | 2,372 (9.53) | 831 (3.34) | 4,281 (17.20) | 3,864 (15.52) | 610 (2.45) | 1,622 (6.52) | 11,310 (45.44) | 24,890
dropwizard | 74,215 | 364 | 5,789 | 53 (1.85) | 14 (0.49) | 1,993 (69.64) | 343 (11.98) | 269 (9.40) | 29 (1.01) | 161 (5.63) | 2,862
dubbo | 179,477 | 386 | 4,681 | 754 (6.39) | 81 (0.69) | 6,983 (59.19) | 1,096 (9.29) | 644 (5.46) | 369 (3.13) | 1,870 (15.85) | 11,797
eventbus | 8,369 | 20 | 507 | 4 (1.33) | 0 (0.00) | 195 (65.00) | 59 (19.67) | 23 (7.67) | 1 (0.33) | 18 (6.00) | 300
fastjson | 179,996 | 158 | 3,863 | 8,205 (49.88) | 77 (0.47) | 4,255 (25.87) | 1,264 (7.68) | 243 (1.48) | 387 (2.35) | 2,019 (12.27) | 16,450
glide | 76,418 | 129 | 2,583 | 105 (2.77) | 22 (0.58) | 2,442 (64.47) | 629 (16.61) | 194 (5.12) | 45 (1.19) | 351 (9.27) | 3,788
guice | 72,980 | 59 | 1,931 | 178 (2.85) | 46 (0.74) | 3,871 (61.92) | 1,043 (16.68) | 216 (3.45) | 51 (0.82) | 847 (13.55) | 6,252
hdiv | 30,631 | 11 | 1,086 | 106 (9.72) | 11 (1.01) | 573 (52.52) | 63 (5.77) | 177 (16.22) | 31 (2.84) | 130 (11.92) | 1,091
ical4j | 24,130 | 35 | 2,303 | 132 (11.22) | 15 (1.28) | 682 (57.99) | 167 (14.20) | 48 (4.08) | 2 (0.17) | 130 (11.05) | 1,176
j2objc | 1,810,274 | 75 | 5,284 | 5,523 (10.13) | 866 (1.59) | 9,302 (17.06) | 4,750 (8.71) | 1,276 (2.34) | 3,978 (7.30) | 28,827 (52.87) | 54,522
jenkins | 175,150 | 654 | 31,156 | 658 (6.15) | 161 (1.51) | 3,273 (30.61) | 794 (7.43) | 314 (2.94) | 185 (1.73) | 5,308 (49.64) | 10,693
jtk | 204,105 | 9 | 1,373 | 2,627 (13.03) | 4,557 (22.60) | 1,008 (5.00) | 55 (0.27) | 37 (0.18) | 1,068 (5.30) | 10,813 (53.62) | 20,165
junit4 | 31,242 | 151 | 2,474 | 55 (3.15) | 18 (1.03) | 985 (56.38) | 248 (14.20) | 32 (1.83) | 47 (2.69) | 362 (20.72) | 1,747
keywhiz | 23,337 | 32 | 1,538 | 89 (5.67) | 23 (1.46) | 1,036 (65.99) | 178 (11.34) | 90 (5.73) | 14 (0.89) | 140 (8.92) | 1,570
libgdx | 272,510 | 505 | 14,661 | 49,315 (47.83) | 21,653 (21.00) | 11,800 (11.44) | 1,831 (1.78) | 2,041 (1.98) | 2,252 (2.18) | 14,215 (13.79) | 103,107
litiengine | 75,877 | 20 | 3,324 | 316 (11.86) | 46 (1.73) | 771 (28.94) | 448 (16.82) | 253 (9.50) | 21 (0.79) | 809 (30.37) | 2,664
lottie-android | 16,258 | 102 | 1,292 | 80 (7.41) | 104 (9.64) | 442 (40.96) | 145 (13.44) | 126 (11.68) | 21 (1.95) | 161 (14.92) | 1,079
mockito | 55,751 | 220 | 5,523 | 234 (9.87) | 12 (0.51) | 1,288 (54.35) | 285 (12.03) | 126 (5.32) | 38 (1.60) | 387 (16.33) | 2,370
mpandroidchart | 25,232 | 69 | 2,068 | 134 (6.85) | 36 (1.84) | 385 (19.69) | 232 (11.87) | 155 (7.93) | 38 (1.94) | 975 (49.87) | 1,955
nutch | 141,710 | 43 | 3,215 | 236 (7.68) | 28 (0.91) | 1,353 (44.01) | 467 (15.19) | 113 (3.68) | 164 (5.34) | 713 (23.19) | 3,074
okhttp | 48,465 | 235 | 4,848 | 455 (16.01) | 39 (1.37) | 1,902 (66.92) | 161 (5.67) | 126 (4.43) | 21 (0.74) | 138 (4.86) | 2,842
orienteer | 55,681 | 12 | 2,274 | 63 (2.68) | 27 (1.15) | 1,122 (47.77) | 584 (24.86) | 395 (16.82) | 22 (0.94) | 136 (5.79) | 2,349
picasso | 9,136 | 97 | 1,368 | 64 (8.82) | 36 (4.96) | 546 (75.21) | 27 (3.72) | 10 (1.38) | 7 (0.96) | 36 (4.96) | 726
rest-assured | 73,511 | 105 | 2,020 | 121 (5.85) | 32 (1.55) | 1,440 (69.57) | 288 (13.91) | 107 (5.17) | 14 (0.68) | 68 (3.29) | 2,070
rest.li | 523,972 | 89 | 2,617 | 2,158 (9.26) | 533 (2.29) | 10,054 (43.16) | 4,712 (20.23) | 3,458 (14.84) | 237 (1.02) | 2,143 (9.20) | 23,295
retrofit | 26,513 | 152 | 1,865 | 60 (2.49) | 7 (0.29) | 1,691 (70.14) | 352 (14.60) | 18 (0.75) | 6 (0.25) | 277 (11.49) | 2,411
riptide | 27,072 | 18 | 2,131 | 4 (0.52) | 0 (0.00) | 650 (85.08) | 22 (2.88) | 46 (6.02) | 8 (1.05) | 34 (4.45) | 764
rxjava | 468,957 | 277 | 5,877 | 2,371 (10.25) | 34 (0.15) | 4,275 (18.48) | 573 (2.48) | 115 (0.50) | 373 (1.61) | 15,387 (66.53) | 23,128
spring-boot | 343,138 | 804 | 32,096 | 443 (2.74) | 95 (0.59) | 10,868 (67.24) | 1,354 (8.38) | 3,002 (18.57) | 91 (0.56) | 309 (1.91) | 16,162
tomcat | 343,703 | 61 | 23,140 | 1,142 (6.68) | 263 (1.54) | 7,374 (43.16) | 1,675 (9.80) | 696 (4.07) | 846 (4.95) | 5,089 (29.79) | 17,085
twelvemonkeys | 99,418 | 42 | 1,334 | 379 (8.43) | 123 (2.73) | 912 (20.28) | 808 (17.96) | 588 (13.07) | 327 (7.27) | 1,361 (30.26) | 4,498
unirest-java | 15,979 | 43 | 1,603 | 12 (1.75) | 1 (0.15) | 310 (45.19) | 58 (8.45) | 23 (3.35) | 22 (3.21) | 260 (37.90) | 686
webmagic | 12,926 | 40 | 1,119 | 28 (2.87) | 3 (0.31) | 763 (78.26) | 80 (8.21) | 27 (2.77) | 10 (1.03) | 64 (6.56) | 975
xchart | 24,406 | 50 | 1,451 | 119 (7.93) | 31 (2.07) | 628 (41.84) | 338 (22.52) | 50 (3.33) | 26 (1.73) | 309 (20.59) | 1,501
zxing | 107,064 | 109 | 3,582 | 208 (9.78) | 137 (6.44) | 695 (32.68) | 267 (12.55) | 108 (5.08) | 157 (7.38) | 555 (26.09) | 2,127
total | 7,111,470 | 5,519 | 217,869 | 87,297 (20.79) | 32,153 (7.65) | 110,198 (26.24) | 31,508 (7.50) | 18,958 (4.51) | 14,088 (3.35) | 125,688 (29.93) | 419,890
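this categorization (detailed in section 3.2) boils down to simple lexical checks on each declared name and its associated type. the following java sketch illustrates one way such rules could be implemented; the regular expressions and the rule order are a simplification for illustration, not the exact implementation used in the study:

```java
import java.util.regex.Pattern;

// minimal sketch of lexical categorization rules; illustrative only
public class NamingCategorizer {
    static final Pattern ENDS_WITH_DIGITS = Pattern.compile(".*\\d+$");
    static final Pattern DIGITS_IN_MIDDLE = Pattern.compile(".*[a-zA-Z]\\d+[a-zA-Z].*");

    static String categorize(String name, String type) {
        String n = name.toLowerCase();
        String t = type.toLowerCase();
        if (ENDS_WITH_DIGITS.matcher(n).matches()) return "kings";   // name1, arg2
        if (DIGITS_IN_MIDDLE.matcher(n).matches()) return "median";  // base64bytes
        if (n.equals(t)) return "ditto";                             // TimeZone timeZone
        if (n.length() > 1 && t.contains(n)) return "diminutive";    // EngineTestListener listener
        if (n.contains(t)) return "cognome";                         // String nameString
        if (n.length() == 1 && t.startsWith(n)) return "shorten";    // Person p
        if (n.length() == 1) return "index";                         // int j
        return "uncategorized";
    }

    public static void main(String[] args) {
        System.out.println(categorize("person1", "Person"));             // kings
        System.out.println(categorize("listener", "EngineTestListener")); // diminutive
        System.out.println(categorize("j", "int"));                       // index
    }
}
```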
this extended version of our study makes the following contributions:
• our results show that the naming practice categories (kings, median, ditto, diminutive, cognome, shorten, index, and famed) appear in all 80 open-source projects and are prevalent in practice;
• we identified the most common names across projects. the top-3 recurrent names are: value, result, and name. many single-letter names are also commonly used in projects (e.g., i, e, s, c). we also observed that the majority of common names are associated with integer or string values;
• we perceived that programmers' naming practices are context-specific. single-letter names (index and shorten) seem to be more present in conditional or loop statements (if, for, while). in contrast, identifiers with the same name as their types tend to appear in large-scope contexts (e.g., attribute);
• we noted that, in general, a project's characteristics might not impact the prevalence of one particular naming practice category: there is no representative correlation between size, number of contributors, or number of commits and the predominance of some naming practice category;
• finally, we observed that diminutive is the most adopted naming practice category among survey respondents and median is the least adopted one. this result seems to align well with our observation about the prevalence of the naming practices in 80 open-source object-oriented programs.
the remainder of this paper is organized as follows. section 2 presents the background and related work on naming practices. section 3 details how we carried out our study. section 4 outlines the results of our empirical study and provides a general discussion. section 5 describes the threats to validity. finally, section 6 presents some concluding remarks.

table 2. c++ programs used in our experiment. for each project: lines of code (loc), contributors, commits, the number of names in each category with its percentage of the project total in parentheses, and the total number of categorized names.

project | loc | contributors | commits | kings | median | ditto | cognome | diminutive | shorten | index | total
asio | 196,656 | 53 | 3,034 | 135 (3.65) | 27 (0.73) | 1,664 (44.99) | 32 (0.87) | 657 (17.76) | 220 (5.95) | 964 (26.06) | 3,699
assimp | 614,926 | 462 | 10,934 | 78 (6.76) | 74 (6.41) | 739 (64.04) | 10 (0.87) | 94 (8.15) | 13 (1.13) | 146 (12.65) | 1,154
bitcoin | 541,474 | 853 | 32,661 | 46 (4.58) | 27 (2.69) | 621 (61.79) | 8 (0.80) | 11 (1.09) | 39 (3.88) | 253 (25.17) | 1,005
bluematter | 812,822 | 2 | 5 | 3,972 (29.20) | 1,350 (9.92) | 1,893 (13.91) | 1,560 (11.47) | 506 (3.72) | 685 (5.03) | 3,639 (26.75) | 13,605
calligra | 1,602,456 | 263 | 101,573 | 47 (3.41) | 2 (0.15) | 743 (53.92) | 137 (9.94) | 267 (19.38) | 14 (1.02) | 168 (12.19) | 1,378
chaste | 587,473 | 25 | 5,384 | 2,954 (40.46) | 882 (12.08) | 673 (9.22) | 667 (9.14) | 470 (6.44) | 14 (0.19) | 1,641 (22.48) | 7,301
citra | 428,966 | 222 | 9,141 | 27 (5.11) | 19 (3.60) | 255 (48.30) | 4 (0.76) | 36 (6.82) | 27 (5.11) | 160 (30.30) | 528
clickhouse | 1,422,903 | 921 | 83,445 | 114 (4.13) | 40 (1.45) | 2,228 (80.78) | 66 (2.39) | 108 (3.92) | 14 (0.51) | 188 (6.82) | 2,758
core | 9,262,610 | 25 | 3,058 | 4,044 (5.29) | 1,516 (1.98) | 45,465 (59.47) | 10,741 (14.05) | 10,799 (14.13) | 420 (0.55) | 3,459 (4.52) | 76,444
freecad | 4,842,675 | 383 | 27,647 | 528 (6.94) | 210 (2.76) | 4,705 (61.83) | 100 (1.31) | 513 (6.74) | 181 (2.38) | 1,372 (18.03) | 7,609
gacui | 504,062 | 3 | 2,238 | 8 (0.62) | 50 (3.91) | 576 (45.00) | 44 (3.44) | 294 (22.97) | 15 (1.17) | 293 (22.89) | 1,280
gecko-dev | 28,303,180 | 4,910 | 785,724 | 1,116 (4.57) | 1,548 (6.34) | 11,737 (48.11) | 2,567 (10.52) | 4,805 (19.69) | 311 (1.27) | 2,314 (9.48) | 24,398
godot | 4,976,013 | 1,590 | 41,538 | 525 (9.87) | 270 (5.08) | 1,711 (32.17) | 128 (2.41) | 1,934 (36.36) | 107 (2.01) | 644 (12.11) | 5,319
gromacs | 1,680,900 | 74 | 20,825 | 89 (5.03) | 104 (5.88) | 994 (56.16) | 38 (2.15) | 250 (14.12) | 54 (3.05) | 241 (13.62) | 1,770
grpc | 717,441 | 708 | 50,493 | 76 (3.40) | 49 (2.19) | 799 (35.75) | 68 (3.04) | 842 (37.67) | 44 (1.97) | 357 (15.97) | 2,235
kdenlive | 205,469 | 94 | 15,645 | 4 (0.43) | 0 (0.00) | 671 (72.93) | 66 (7.17) | 36 (3.91) | 34 (3.70) | 109 (11.85) | 920
kdevelop | 338,648 | 245 | 42,650 | 52 (4.70) | 3 (0.27) | 723 (65.37) | 61 (5.52) | 93 (8.41) | 10 (0.90) | 164 (14.83) | 1,106
krita | 983,754 | 336 | 57,706 | 80 (5.93) | 12 (0.89) | 573 (42.48) | 109 (8.08) | 216 (16.01) | 44 (3.26) | 315 (23.35) | 1,349
lammps | 1,626,808 | 185 | 29,307 | 281 (11.35) | 56 (2.26) | 1,272 (51.37) | 199 (8.04) | 169 (6.83) | 85 (3.43) | 414 (16.72) | 2,476
mediapipe | 235,825 | 2 | 111 | 11 (1.54) | 47 (6.58) | 511 (71.57) | 13 (1.82) | 1 (0.14) | 26 (3.64) | 105 (14.71) | 714
mlir | 75,845 | 2,285 | 415,644 | 9 (5.70) | 18 (11.39) | 83 (52.53) | 24 (15.19) | 8 (5.06) | 2 (1.27) | 14 (8.86) | 158
mongo | 5,015,374 | 571 | 63,227 | 917 (3.17) | 381 (1.32) | 14,644 (50.66) | 761 (2.63) | 2,770 (9.58) | 2,019 (6.99) | 7,412 (25.64) | 28,904
mysql-server | 3,733,193 | 88 | 170,220 | 803 (6.94) | 124 (1.07) | 7,941 (68.60) | 713 (6.16) | 949 (8.20) | 141 (1.22) | 904 (7.81) | 11,575
obs-studio | 482,886 | 477 | 10,466 | 22 (3.42) | 9 (1.40) | 429 (66.72) | 57 (8.86) | 59 (9.18) | 5 (0.78) | 62 (9.64) | 643
opencv | 2,166,493 | 1,360 | 31,603 | 1,598 (11.96) | 859 (6.43) | 5,672 (42.45) | 367 (2.75) | 376 (2.81) | 730 (5.46) | 3,761 (28.14) | 13,363
openoffice | 6,894,647 | 21 | 7,657 | 3,977 (5.82) | 1,703 (2.49) | 39,683 (58.06) | 9,796 (14.33) | 9,453 (13.83) | 335 (0.49) | 3,397 (4.97) | 68,344
percona-server | 3,777,210 | 238 | 185,334 | 849 (7.35) | 127 (1.10) | 7,887 (68.32) | 712 (6.17) | 913 (7.91) | 142 (1.23) | 914 (7.92) | 11,544
proxysql | 121,989 | 90 | 4,680 | 7 (1.38) | 12 (2.37) | 219 (43.20) | 10 (1.97) | 46 (9.07) | 37 (7.30) | 176 (34.71) | 507
pytorch | 1,792,819 | 2,155 | 43,944 | 56 (2.10) | 111 (4.15) | 1,472 (55.07) | 35 (1.31) | 164 (6.14) | 115 (4.30) | 720 (26.94) | 2,673
qtbase | 2,714,097 | 783 | 55,238 | 185 (4.51) | 89 (2.17) | 2,403 (58.54) | 258 (6.29) | 229 (5.58) | 132 (3.22) | 809 (19.71) | 4,105
rocksdb | 497,140 | 628 | 10,766 | 41 (1.66) | 52 (2.10) | 1,494 (60.36) | 21 (0.85) | 34 (1.37) | 59 (2.38) | 774 (31.27) | 2,475
server | 1,967,124 | 300 | 195,145 | 22 (1.59) | 2 (0.14) | 874 (63.01) | 40 (2.88) | 172 (12.40) | 33 (2.38) | 244 (17.59) | 1,387
tensorflow | 3,284,592 | 3,068 | 125,560 | 778 (5.67) | 747 (5.45) | 8,108 (59.13) | 235 (1.71) | 279 (2.03) | 499 (3.64) | 3,067 (22.37) | 13,713
terminal | 360,717 | 313 | 2,855 | 159 (3.69) | 49 (1.14) | 2,640 (61.20) | 118 (2.74) | 311 (7.21) | 124 (2.87) | 913 (21.16) | 4,314
vtk | 3,690,369 | 352 | 81,218 | 500 (7.78) | 216 (3.36) | 2,167 (33.74) | 147 (2.29) | 1,137 (17.70) | 503 (7.83) | 1,753 (27.29) | 6,423
winget-cli | 305,116 | 317 | 539 | 64 (2.56) | 62 (2.48) | 1,252 (50.00) | 65 (2.60) | 111 (4.43) | 312 (12.46) | 638 (25.48) | 2,504
xbmc | 1,094,954 | 785 | 59,641 | 42 (9.77) | 2 (0.47) | 208 (48.37) | 29 (6.74) | 83 (19.30) | 20 (4.65) | 46 (10.70) | 430
yarp | 1,029,531 | 77 | 17,416 | 45 (2.25) | 18 (0.90) | 1,021 (51.13) | 91 (4.56) | 352 (17.63) | 65 (3.25) | 405 (20.28) | 1,997
yuzu | 488,099 | 203 | 20,860 | 30 (19.61) | 7 (4.58) | 76 (49.67) | 0 (0.00) | 6 (3.92) | 3 (1.96) | 31 (20.26) | 153
zerotierone | 137,784 | 58 | 5,409 | 34 (2.05) | 64 (3.85) | 975 (58.70) | 12 (0.72) | 62 (3.73) | 56 (3.37) | 458 (27.57) | 1,661
total | 99,515,040 | 25,525 | 2,830,541 | 24,325 (7.28) | 10,938 (3.27) | 177,801 (53.24) | 30,109 (9.01) | 39,615 (11.86) | 7,689 (2.30) | 43,444 (13.01) | 333,921

2 background and related work this section presents some background about names and related studies on naming identifiers. we introduce this section by presenting an overview of the role of names in software development. 2.1 naming names identify classes, attributes, methods, variables, and parameters (lawrie et al., 2006). they were originally designed to be pieces of code used to represent values in memory (tofte and talpin, 1997), and now they have become the primary source of information in software development (lawrie et al., 2006; ratiu and deissenboeck, 2006): programmers rely on existing names in their code comprehension journey (takang et al., 1996). indeed, high-quality names have a significant influence on the comprehension of source code (avidan and feitelson, 2017). arnaoudova et al. (2016) have acknowledged the critical role that the source code lexicon plays in the psychological complexity of software systems and coined the expression "linguistic antipatterns" (las) to denote poor practices in the naming, documentation, and choice of identifiers that might hinder program understanding. they argue that poor practices might lead programmers to make wrong assumptions and waste time understanding source code (arnaoudova et al., 2016). deissenboeck and pizka (2006) characterized a name as being a fully spelled word or even an abbreviation. names can also be composed of two or more words, might include words that do not exist, or even be single alphabetical characters. however, the proper use of words in names is a significant issue in software development (feitelson et al., 2020). in martin's book (martin, 2008), tim ottinger set out a series of simple rules to guide programmers on naming identifiers. according to ottinger, programmers have to focus on creating intention-revealing names (the name by itself should be capable of informing what it does). they also have to avoid using non-informative words (e.g., words with multiple meanings, words with little differentiation between themselves, or number series). ottinger also advocates that names should be pronounceable and searchable. for instance, it is impractical to discuss, in a code review session, any source code composed of words that programmers cannot pronounce. coding style guides and conventions also aim to address the challenges of naming identifiers (dos santos and gerosa, 2018). however, their rules are usually hard to enforce, as discussed in martin's book clean code (martin, 2008). caprile and tonella (2000) proposed an approach for improving the meaningfulness of identifier names. the approach entails the following steps: (i) extracting identifier names; (ii) normalizing identifier names; and (iii) applying the changes to the source code.
the proposed rules for creating meaningful names aim to guarantee that each word composing a name belongs to a dictionary of standard words and complies with existing grammar. deissenboeck and pizka (2006) proposed a set of precise rules for constructing concise and consistent names. in the interest of preserving consistency, the authors advocate that a single name must represent only one concept. the rules, therefore, ensure that one concept will not be taken into consideration in multiple identifier names. in order to preserve conciseness, the rules ensure that names chosen by programmers stand for the concepts they are indeed trying to convey. more recently, feitelson et al. (2020) suggested a three-step method to help programmers systematically come up with meaningful names. the model encompasses the following steps: (i) selecting the concepts to include in the name; (ii) choosing the words to represent each concept; and (iii) creating a name from these words. the authors demonstrated that programmers could use the model to guide choosing names that are superior (in terms of meaningfulness) to randomly chosen names. 2.2 names in software quality there have been many studies that examine how names affect comprehension and programmers' efficiency. avidan and feitelson (2017) conducted an experiment involving ten programmers in hopes of understanding the impact of identifier names on program comprehension. they observed that, when changing identifier names from fully spelled words to single-letter ones, the fully spelled version was perceived as more understandable. hofmeister et al. (2017) also concluded that abbreviations and single-letter names decrease code comprehension and could indicate low-quality code, as observed by butler et al. (2010) and kawamoto and mizuno (2012). butler et al. (2010) showed that source code containing poor-quality identifier names was associated with findbugs warnings. kawamoto and mizuno (2012) also observed that concise identifier names have a substantial effect on fault-proneness in netbeans. takang et al. (1996), based on a survey conducted with 89 computer science students, concluded that the combination of identifier names and comments in the code provides a minor improvement in code comprehension. hence, improving identifier names seems to be a better option than including comments in the code. spending more time choosing meaningful identifier names can result in less work during software maintenance (lawrie et al., 2007a). low-quality names can affect code negatively by causing confusion and misinformation. the study conducted by lawrie et al. (2007a) found that the quality of identifier names improves over time and is also related to the software license. modern software systems contain more high-quality names, and proprietary ones include more abbreviations than open-source projects. moreover, a study investigating the semantic nature of identifier names in four large-scale open-source projects showed that the number of commits and contributors tended to influence the quality of names. projects with a high number of commits and contributors tend to have more identifier names drawn from large text corpora of existing words (gresta and cirilo, 2020). 3 empirical study setup this section describes the empirical study design. we conducted an empirical study to characterize how c++ and java programmers name attributes, parameters, and variables.
specifically, we analyzed 1,421,607 identifier names (i.e., attribute, parameter, and variable names) from 40 java projects and categorized these names into eight naming practice categories. afterwards, we expanded our analysis by selecting a sample of 40 c++ projects. upon analyzing this sample, we found 1,181,774 identifier names, which we then categorized according to the aforementioned eight naming practice categories. we used the results of categorizing identifier names from these two samples to provide answers to the research questions discussed in the next subsection. 3.1 goal and research questions we set out to probe into how common eight naming practices are "in the wild" (i.e., in real-world software systems) – see section 3.2. more specifically, our goal is to contribute towards a better understanding of their prevalence in attribute, parameter, and variable naming in c++ and java. we believe a more insightful interpretation of the results of our study can be obtained from the standpoint of a researcher interested in helping programmers by defining naming guidelines for educational material and code review. our main goal is to provide answers to the following research questions (rqs):
• rq1: how prevalent are the eight naming practice categories? we set out to investigate whether identifier names in open-source projects can be categorized according to eight naming practice categories and how common these naming practices are across c++ and java projects;
• rq2: are there context-specific naming practice categories? we set out to examine if specific naming practice categories tend to occur more often in certain contexts (e.g., attribute, parameter, method, if, for, while, switch);
• rq3: do the naming practice categories carry over across different c++ and java projects? we attempt to explore the prevalence of the categories spanning multiple c++ and java projects and to identify any correlation between software metrics and programmers' naming practices;
• rq4: what is the perception of software developers about the investigated naming categories? we set out to probe into programmers' perceptions regarding the use and occurrence of the eight investigated naming practices.
3.2 naming practice categories the categories presented in this subsection are a compilation of programmers' practices reported in several studies (arnaoudova et al., 2016; beniamini et al., 2017; alsuhaibani et al., 2021) and books (martin, 2008; dileo, 2019). inspired by antipattern templates (brown et al., 1998), in order to explain the naming practice categories, we frame the discussion of each category in terms of the following elements: category name, examples, motivation (why), consequences of the naming practice, and recommendations. 3.2.1 kings this category represents identifier names composed of numbers at the end. example: string name1 and string name2 or integer arg1 and integer arg2 represent arbitrary distinctions as number series. why: programmers often opt to employ names that fall into this category to distinguish between identifiers that appear in the same scope. consequences: names with numbers at the end, however, are not very informative and do not represent intentional naming (martin, 2008; dileo, 2019). recommendation: usually, identifiers represent different things; whenever that is the case, they should be named accordingly (martin, 2008). 3.2.2 median this category is a variation of the kings category and comprises identifier names composed of numbers in the middle.
example: the names fastuint64tobuffer and base64bytes contain numbers that might represent 64-bit values. why: numbers in the middle, in general, are used to denote the value stored in the attribute/variable or even to provide some distinction among similar identifier names. consequences: names with numbers in the middle can potentially be harder to search for in the source code, hard to pronounce, and also very similar to other names that differ only in terms of the numbers that appear somewhere in the middle (martin, 2008). recommendations: programmers should use numbers only when necessary and surround numbers with pronounceable words (martin, 2008). 3.2.3 ditto the ditto category consists of identifier names spelled in the same way as their types. example: timezone is spelled as its type timezone, in the same way that the name object has the same name as its type (object). why: naming identifiers according to the respective type is an easy option to avoid mental mappings (which usually are associated with problem domain concepts). consequences: this naming practice might result in names that are harder to map to their purposes when used in larger scopes, and it tends to cause misinformation when the type name changes but the identifier names do not (martin, 2008; alsuhaibani et al., 2021). recommendations: avoid using ditto-based names in very large scopes and/or in contexts in which other names can conflict with them (martin, 2008). 3.2.4 diminutive this category encompasses identifier names that are a chunk of their respective type name. example: listener is an example of a name in this category when its associated type is named enginetestlistener. the name ruleset (declared as nfruleset ruleset) is also considered a chunk of its type. why: developers usually rely on short names to avoid overloading the reader with many concepts. consequences: when used in large-scope contexts, names that fall into this category might impair code comprehension (martin, 2008). recommendations: programmers should use names that properly convey the identifier's purpose within the local context and scope (martin, 2008). 3.2.5 cognome identifier names in this category contain the name of the respective type as an additional suffix or prefix. example: an identifier namestring includes in its name the respective type name (string). why: usually, programmers resort to adding suffixes to names to help them remember the types. consequences: encoding the type into names might place an extraneous cognitive load on the programmer (martin, 2008; dileo, 2019). recommendations: give identifiers names that are meaningful without having to resort to adding type information to the names (martin, 2008). 3.2.6 index and shorten these categories represent similar naming practices: naming an identifier with a single-letter word. the index category represents names with one arbitrary letter. names in the shorten category are the starting letters of their respective types. example: the names integer i and integer j fall into the index category, and person p and string s are examples of shorten names. why: single-letter names are traditionally used to identify counters in loops. consequences: single-letter names usually are not easy to locate in the source code (unsearchable) and, when employed in large scopes, can be hard to understand (martin, 2008; dileo, 2019; beniamini et al., 2017).
recommendations: use single-letter names only in local and small scopes; otherwise, intent-revealing names are better (martin, 2008). 3.2.7 famed this category includes very common names, that is, names chosen when naming becomes arbitrary and programmers need to come up with convenient defaults. famed names appear in almost every codebase, potentially in similar contexts, such as loop statements (e.g., for). example: the word i is a recurrent identifier name used in loops to denote counters. why: very popular identifiers are part of the programmer's mindset and can be quickly remembered and understood. implications: when used in an indiscriminate fashion, they may cause misinformation (martin, 2008; alsuhaibani et al., 2021). recommendations: use intent-revealing names even in short-scope contexts (martin, 2008; alsuhaibani et al., 2021). 3.3 data extraction and analysis projects selection our sample comprises 40 open-source java projects and 40 c++ projects hosted on github. these projects are listed in tables 1 and 2. we included widely used projects, most of which have been under development for at least five years (e.g., fastjson, jenkins, junit4, mockito, retrofit, spring-boot, tomcat, pytorch, and tensorflow). also, some projects were taken into account because they appear in a curated list of "awesome" projects (java-lang.github.io/awesome-java). tables 1 and 2 give an overview of the examined projects. as shown in these tables, our java and c++ samples cover somewhat small codebases (with less than 10k loc) and large-scale ones (with over 100k loc). overall, we selected heterogeneous java and c++ projects from a broad range of domains: e.g., software testing, game design, web application development, image manipulation, and natural language processing. the selected projects also have a reasonable number of attribute, parameter, and variable names and were developed collaboratively by a diverse group of programmers. therefore, we consider that we have selected a somewhat representative set of java and c++ projects. the java projects were collected in july 2021 from github by cloning and storing their respective repositories. in a similar fashion, we extracted the information from the selected c++ projects in january 2022. after storing the repositories, we extracted three common software metrics: (i) the total lines of code (we excluded non-functional code such as comments and white space); (ii) the number of commits; and (iii) the number of contributors. to answer rq3, we correlated these metrics with the prevalence of the categories in the projects. names extraction in order to extract identifier names from each project, we created a parser based on the srcml tool (collard et al., 2013). srcml is a multi-language parsing tool for the analysis and manipulation of source code. srcml turns source code into a document-oriented xml format (srcml.org), which allows for queries using xpath. for example, the srcml format contains structural information (markup tags) about identifier declarations (<decl>), associated types (<type>), and the surrounding context (e.g., <if>, <for>). we extracted 2,603,381 names from the 80 collected projects. after applying the naming categorization (see section 3.2), we got a total of 753,811 identifier names distributed across the categories (kings, median, ditto, diminutive, cognome, index, shorten), as shown in tables 1 and 2. the experimental package is available on github (github.com/rng-lab/naming-practices-analysis). to investigate and get an overview of the elements in the famed category, we used the entire dataset extracted from both programming languages.
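a minimal java sketch of this kind of xpath query over srcml output — the file name and the simplified element layout (a <decl> directly containing a <name>) are illustrative assumptions, not the exact structure of every srcml document:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SrcmlNameExtractor {
    public static void main(String[] args) throws Exception {
        // parse the xml document produced by srcml for one source file (illustrative file name)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("Example.java.xml"));

        // select every declared identifier name nested under a declaration element;
        // local-name() sidesteps srcml's xml namespace regardless of parser configuration
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList names = (NodeList) xpath.evaluate(
                "//*[local-name()='decl']/*[local-name()='name']",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < names.getLength(); i++) {
            System.out.println(names.item(i).getTextContent());
        }
    }
}
```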
we examined the name of each extracted identifier and the associated type to answer rq1 and rq3. therefore, for each naming practice category, we report the occurrences in the studied projects and across them. to answer rq2, we analyzed the context where identifiers were declared. survey design and sampling to answer rq4, we designed an online questionnaire containing fifteen closed-ended questions related to naming practices. a brief description (in portuguese) and an example accompanied these questions (see appendix a). we also included two initial questions to collect the demographic information of the respondents. the respondents had to point out their experience in software development as a single choice from four options: under two, two to five, six to ten, or over ten years; and also their education level (undergraduate, graduate, postgraduate). we selected a web-based questionnaire to conduct our survey because it maximizes the number of possible respondents. google forms (www.google.com/forms) was chosen to host the questionnaire and enable data collection and pre-processing. the questionnaire was first trialed within the authors' organizations, with one of the authors registering possible observed issues. some minor adjustments were made to ensure the consistency and clarity of the questions. finally, the questionnaire link was posted to multiple websites (e.g., forums) and online groups (e.g., discord, whatsapp). 4 experimental results in this section, we present the results of our empirical study around the rqs described in the previous sections. 4.1 rq1: how prevalent are the naming practice categories? to answer rq1, we analyzed the categories kings, median, ditto, diminutive, cognome, shorten, and index regarding how commonly they appear in the projects in our samples. tables 1 and 2 list how common each of these categories is across the 80 investigated projects.

table 3. the top names in the ditto category (number of repetitions and number of projects in which each name appears).

ditto in java programs:
name | num. repetitions | num. projects
url | 2,421 | 24
list | 1,464 | 32
file | 1,444 | 32
method | 1,044 | 29
context | 1,042 | 25
object | 991 | 29
uri | 968 | 25
node | 844 | 21
type | 593 | 30
date | 526 | 25

ditto in c++ programs:
name | num. repetitions | num. projects
t | 1,227 | 34
string | 1,134 | 18
uint8_t | 564 | 15
args | 247 | 22
t | 231 | 20
std | 143 | 19
type | 141 | 19
handle | 96 | 17
mode | 45 | 16

considering the identifier names in the chosen java projects, 20.79% are composed of numbers at the end (kings), 7.65% have numbers in their middle (median), 26.24% are spelled the same as their types (ditto), 7.50% contain the whole type as a sub-part (cognome), 4.51% have in their spelling a sub-part of their respective types (diminutive), 3.35% are single-letter names composed of the first letter of their types (shorten), and 29.93% are arbitrary single-letter names (index). as for the c++ projects in our sample, only approximately 7.28% of the identifier names fall into the kings category, 53.24% of the identifiers are named according to their respective types (ditto), around 9% follow the cognome naming practice, 11.86% of the c++ identifier names are diminutive, only 2.3% belong to the shorten category, and approximately 13% of the c++ identifier names are single-letter names (index). these results indicate that the use of single-letter names (index) is a widespread naming practice adopted in object-oriented programming. indeed, beniamini et al. (2017) have observed that single-letter names account for 9–20% of names in java programs.
as stated by them, the most commonly occurring single-letter name is i, and in some cases, j is also highly used. in addition, we observed that single-letter names representing contractions of their respective type are not so common (shorten), but are prevalent across projects (see section 4.3). programmers seem to be conscious of the implications of single-letter names (hofmeister et al., 2017), and thus avoid choosing such a naming practice: this category represents only 3.35% (14,088) of the examined java names and 2.3% (7,689) of the identifier names in c++ projects.

figure 1. naming practices distribution over java programming statements (values in %):
statement | kings | median | ditto | diminutive | cognome | index | shorten
attr | 30.84 | 13.56 | 29.01 | 6.20 | 9.01 | 10.63 | 0.76
for | 17.49 | 10.60 | 13.15 | 2.38 | 6.32 | 45.70 | 4.36
if | 7.98 | 2.07 | 13.28 | 2.84 | 6.31 | 52.99 | 14.53
method | 18.64 | 3.46 | 27.31 | 5.43 | 9.36 | 32.78 | 3.04
param | 19.53 | 8.90 | 29.10 | 2.95 | 4.81 | 32.46 | 2.24
switch | 12.16 | 2.11 | 14.32 | 8.59 | 4.62 | 44.77 | 13.42
while | 9.51 | 0.91 | 13.43 | 2.11 | 5.98 | 55.79 | 12.27

table 4. the most common names (famed): number of repetitions, number of projects, most common associated type with its number of occurrences, and number of different associated types.

famed in java programs:
name | num. repetitions | num. projects | common type | num. occurrences | num. different types
value | 16,940 | 40 | string | 3,345 | 598
result | 12,975 | 39 | int | 1,924 | 887
name | 11,374 | 40 | string | 10,208 | 116
i | 11,172 | 39 | int | 9,794 | 139
e | 10,225 | 40 | throwable | 1,851 | 589
index | 8,224 | 38 | int | 7,184 | 83
key | 7,696 | 35 | string | 3,187 | 205
s | 7,442 | 35 | string | 2,771 | 318
c | 7,337 | 35 | int | 1,468 | 441
t | 6,989 | 37 | throwable | 1,210 | 336
a | 6,970 | 34 | float | 739 | 575
b | 6,511 | 38 | int | 983 | 486
type | 6,162 | 40 | class | 1,523 | 315
input | 6,008 | 37 | string | 565 | 277
p | 5,256 | 35 | int | 381 | 443
source | 5,025 | 37 | string | 765 | 263
n | 5,010 | 34 | int | 2,930 | 165
request | 4,719 | 32 | request | 1,489 | 212
context | 4,437 | 37 | context | 1,042 | 241
id | 4,216 | 36 | string | 1,523 | 104

famed in c++ programs:
name | num. repetitions | num. projects | common type | num. occurrences | num. different types
i | 5,421 | 40 | int | 2,362 | 151
value | 3,912 | 40 | double | 427 | 268
x | 3,856 | 36 | double | 858 | 250
result | 3,771 | 40 | t | 448 | 231
index | 3,106 | 38 | int | 869 | 88
n | 3,027 | 37 | int | 729 | 159
ctx | 2,964 | 22 | opkernelconstruction | 622 | 105
name | 2,545 | 37 | string | 950 | 187
type | 2,534 | 40 | int | 306 | 426
b | 2,370 | 39 | bool | 386 | 219
p | 2,351 | 37 | void* | 190 | 412
size | 2,285 | 39 | size_t | 619 | 119
context | 2,279 | 34 | opkernelconstruction | 501 | 133
s | 2,254 | 35 | status | 427 | 243
len | 2,101 | 34 | uint32 | 463 | 47
node | 2,093 | 30 | node | 154 | 286
v | 1,983 | 38 | double | 118 | 253
data | 1,832 | 37 | void* | 441 | 211
val | 1,821 | 35 | int | 192 | 199
c | 1,776 | 38 | char | 246 | 199

names that fall into the ditto naming practice category make up the lion's share of all identifier names in c++ projects (53.24%) and are the second most common naming practice in java programs (26.24%). even though it might be argued that ditto is a sound naming practice, given that it leads to pronounceable names and many ides suggest names that include the identifier type, in most cases the practice does not lead to the creation of intention-revealing names. table 3 lists the most recurring names in this category for java and c++ projects. according to table 3, the use of identifier names such as list, object, args, uint8_t, and t is common, but these names do not reveal intentions. when the context is not explicit or is broad, programmers have to trace back what kinds of data are in an identifier named list or t. these names are generic and hurt the reader's understanding. moreover, if the type name changes, the identifier names will become misleading, as in cases such as string and type.
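as a hypothetical illustration of why ditto names obscure intent (the identifiers below are ours, not drawn from the studied projects):

```java
import java.util.ArrayList;
import java.util.List;

public class DittoIllustration {
    public static void main(String[] args) {
        // ditto: the identifier merely repeats its type and says nothing about its content
        List<String> list = new ArrayList<>();
        list.add("ord-421");

        // intention-revealing alternative: the reader immediately knows what the collection holds
        List<String> pendingOrderIds = new ArrayList<>();
        pendingOrderIds.add("ord-421");

        System.out.println(list + " vs. " + pendingOrderIds);
    }
}
```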
according to avidan and feitelson (2017), misleading names are the evil face of naming. the habit of choosing names that represent arbitrary sequential distinctions (kings) also proved to be a common practice among java and c++ programmers. however, number-series naming is considered a bad practice in object-oriented programming when it comes to creating meaningful names: it is a non-informative option, which might disturb code comprehension and maintainability. the use of numbers in the middle of names, although present in the studied names, does not appear to be a recurrent naming practice. we observed that the most common numbers used in the middle of names are: (i) 0, 1, 2, 3, 4, 5, and 6, conveying some distinction; and (ii) 8, 16, 32 and 64, marking identifiers that might represent 8-, 16-, 32- or 64-bit values, respectively.

figure 2. naming practices distribution over c++ programming statements (percentage per statement: kings | median | ditto | diminutive | cognome | index | shorten)
attr: 9.34 | 9.16 | 20.75 | 24.27 | 32.68 | 3.41 | 0.38
for: 21.98 | 6.39 | 26.81 | 4.96 | 4.78 | 32.42 | 2.65
if: 13.31 | 3.75 | 17.01 | 6.42 | 2.97 | 44.80 | 11.75
method: 22.64 | 6.14 | 25.17 | 8.44 | 5.18 | 28.61 | 3.81
param: 3.98 | 1.68 | 65.86 | 10.54 | 5.65 | 10.32 | 1.97
switch: 9.59 | 1.26 | 20.25 | 5.33 | 1.26 | 51.74 | 10.56
while: 8.71 | 3.42 | 17.17 | 7.49 | 4.88 | 48.90 | 9.44

the scenarios in which programmers choose names that are variants of their type are also common. for example, names that contain sub-parts of their type (cognome) account for 7.50% of the identifier names in java projects and around 9% in c++ programs. often, these identifier names carry prefix/suffix (noise word) conventions, such as streetstring, listpersons, and floatarg. noise words are redundant and should never appear in names; in general, streetstring is not better than street. short names are in general easier to comprehend, and one of the first things a programmer can do to keep identifier names short is to avoid adding unnecessary information. in contrast, names that are part of their type (diminutive) are not so common. these names are hard to search for and are not very meaningful in most contexts.

4.1.1 very common names

in feitelson et al. (2020), the authors observed that the probability of two programmers choosing the same name is low: the median probability was only 6.9%. at the same time, when a specific name is chosen, it is usually understood and often used by most programmers (avidan and feitelson, 2017; swidan et al., 2017). in fact, we observed that there are some frequently used names. the top-3 most common names in java programs are (see table 4): (i) value (16,940 occurrences); (ii) result (12,975 occurrences); and (iii) name (11,374 occurrences). it might be expected that i is a widespread name (beniamini et al., 2017), but many other single-letter names are also commonly used across java projects (e.g., e, s, c, t, a, b, p, n), and most of them are among the top-10 most common names. another interesting observation is index and key as part of the top-10 most common names. overall, some of the common identifier names in table 4 are popular in the programmer's vocabulary: value, result, name, index, key, type, input, source, request, context, id. as for c++ programs, the three most common identifier names are (i) i (5,421 occurrences), (ii) value (3,912 occurrences), and (iii) x (3,856 occurrences).
according to our results, many of the identifier names shown in table 4 are widely common in programs written in both java and c++: value, result, name, index, type, context, i, b, n, p, and s. it turns out that value appears among the top three most used identifier names both in java and c++. java programmers seem to have a slight preference for the names result and name in comparison to c++ programmers. as mentioned, some single-letter names are widely used by programmers in both languages, with i being the most commonly used single-letter name in java and c++. further analysis of the names in table 4 and their corresponding most common types led to interesting results about programmers' rationale when programming in java and c++. as noted by beniamini et al. (2017), analyzing this link yields interesting results because it makes it possible to understand the meaning related to names frequently used by programmers, especially single-letter names. we can observe that most identifier names are associated with int variables (e.g., result, i, index, c, b, p, n) or string types (e.g., value, name, key, s, input, source, id). as shown in a survey conducted by beniamini et al. (2017), single-letter names such as i and j are understood as counter variables (integer values) and are most of the time used as loop control variables. there are other interesting findings. for example, in java programs the single-letter name e is usually correlated with error and exception (beniamini et al., 2017); our results show that e is mainly associated with the throwable type. in the same way, s is a single-letter name essentially associated with string (see table 4). however, we also found some counter-intuitive results. for instance, contrary to our expectations, we observed that in programs written in java the single-letter name b is not linked with boolean values (beniamini et al., 2017) but with integer values. additionally, the identifier name t is mainly associated with throwable, which is somewhat counter-intuitive because t is also often used to convey the idea of time-related constant values and variables, or of variables that hold temporary values (beniamini et al., 2017). other names that seem to have meaningful associations are the following: type, which is generally associated with the class type; and context and request, which are often associated with the context and request types. our results would seem to suggest that the underlying meaning of identifier names varies a lot. for example, the name result was associated with 855 different types. the name i, which intuitively is associated with index (int), also assumes 139 different types; nevertheless, in most cases (9,794 out of 11,172), this name is associated with integer values. the name name seems to be usually associated with the string type: 10,208 out of 11,374 occurrences are associated with string.
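as a side note, the name-to-type association behind table 4 (the most common type per name and the number of distinct types) can be tabulated from the extracted pairs in a few lines. a minimal sketch, with hypothetical input data standing in for the real corpus:

    from collections import Counter, defaultdict

    # hypothetical (identifier, type) pairs as extracted from the corpus
    pairs = [("i", "int"), ("i", "int"), ("i", "Iterator"),
             ("value", "String"), ("value", "int"), ("e", "Throwable")]

    by_name = defaultdict(Counter)
    for name, type_name in pairs:
        by_name[name][type_name] += 1

    # most repeated names first, each with its dominant type
    for name, types in sorted(by_name.items(), key=lambda kv: -sum(kv[1].values())):
        common_type, occurrences = types.most_common(1)[0]
        print(f"{name}: {sum(types.values())} uses, most common type "
              f"{common_type} ({occurrences}), {len(types)} distinct types")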
4.2 rq2: are there context-specific naming practice categories?

to answer rq2, we investigated the predominance of the naming practice categories over particular contexts (attribute, parameter, method, for, while, if, and switch). the results are presented in figures 1 and 2. we found that, while some naming conventions (allamanis et al., 2014) acknowledge the use of single-letter words (index and shorten) to name a local, temporary or loop variable, this practice is much more pervasive than any other, except when naming attributes in java and c++: in this case, java programmers prioritize the ditto and kings naming practices, while c++ programmers tend to use cognome, ditto, and diminutive. surprisingly, names with numbers at the end appear 30,655 times in our study as java attributes and only 4,066 times in class attributes in c++ projects. especially in large-scope contexts, kings names should always be avoided by programmers. in contrast, using ditto names in such a case seems to be a reasonable choice. ides (e.g., eclipse and intellij idea) usually analyze the scope and generate suggestions from the current context, and these suggestions often include information regarding the respective type.

focusing on particular contexts, we might see that programmers' practices are context-specific. for example, the use of practices that might result in meaningful names (e.g., ditto) is more common in long-scope contexts (attribute and method) than in short-scope ones (if, for, while, switch). especially in c++ projects, ditto makes up the lion's share of parameter names. java and c++ programmers seem to adopt less descriptive names in the context of switch and while statements. as shown in figures 1 and 2, index names appear more often inside contexts surrounded by if, for, switch, and while statements, where their occurrence is widely accepted (kernighan and pike, 1999; beniamini et al., 2017). however, as observed by avidan and feitelson (2017), hiding plural names behind single-letter words may camouflage the meaning of the respective identifier: it might not be a natural interpretation that the identifier stores more than one object.

the predominance of kings and index as parameter names does not agree with the findings of avidan and feitelson (2017). their experiment indicated that parameter names contribute more to code comprehension than any other names (e.g., attributes or local variables). since parameters are part of the method header and the starting point of the comprehension task, programmers pay special attention to parameter names in order to better understand the method behavior (avidan and feitelson, 2017). still, every naming practice category we studied is used to name parameters, although, as observed by avidan and feitelson (2017), parameter names are often more carefully chosen by programmers.
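for completeness, the declaration context used in this analysis can be recovered from the parse tree produced by srcml by walking up from each declaration to the nearest enclosing statement. a rough sketch with lxml; the tag names in CONTEXT_TAGS reflect our reading of the srcml markup and may differ across srcml versions:

    from lxml import etree

    SRC = "http://www.srcML.org/srcML/src"
    # enclosing statements we map declarations to (assumed srcml tag names)
    CONTEXT_TAGS = {"for", "while", "if_stmt", "switch",
                    "function", "parameter_list", "class"}

    def declaration_context(decl):
        # walk up from a <decl> element to the nearest enclosing statement
        node = decl.getparent()
        while node is not None:
            tag = etree.QName(node).localname
            if tag in CONTEXT_TAGS:
                return tag
            node = node.getparent()
        return "other"

    # Example.java.xml would be produced by, e.g.: srcml Example.java -o Example.java.xml
    tree = etree.parse("Example.java.xml")
    for decl in tree.iter(f"{{{SRC}}}decl"):
        name = decl.find(f"{{{SRC}}}name")
        if name is not None and name.text:
            print(name.text, "->", declaration_context(decl))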
4.3 rq3: do the naming practice categories carry over across different java and c++ projects?

to answer rq3, we analyzed the prevalence of the naming practices spanning multiple projects. tables 1 and 2 list the categories by project. all selected projects turned out to have problematic names, which suggests that the investigated naming practice categories are probably not uncommon. even the most popular projects (e.g., fastjson, jenkins, junit4, mockito, retrofit, spring-boot, tomcat, tensorflow, and pytorch) have naming practices that might result in less meaningful names. as highlighted in tables 1 and 2, ditto and index are very common naming practices. these practices are even dominant (representing more than 50% of the analyzed identifiers) in some projects. for example, ditto names are widely used in java and c++ programs, accounting for 85.08% of the identifier names in riptide (java), 80.78% in clickhouse (c++), 78.26% in webmagic (java), 72.93% in kdenlive (c++), 68.60% in mysql-server (c++), 68.32% in percona-server (c++), 65.99% in keywhiz, and 54.46% in aeron. the problem with ditto is that when the type changes, the identifier name might lose its meaning (scalabrino et al., 2017). index names appear to be more common in java programs. for instance, these identifier names account for 58.93% of all identifiers in boofcv (java) and 66.53% in rxjava (java). index names seem not to be very common in c++: proxysql, the program in which index names are most common, has around 34.7% of its identifier names following this naming practice. rocksdb and citra also include a substantial amount of identifiers named according to the index naming practice: 34.71% and 30.30%, respectively. in some isolated cases, a naming practice seems to be dominant, such as kings in fastjson (49.88%) and libgdx (47.83%). on the other hand, the naming practices cognome, diminutive and shorten are not dominant in any specific project. specifically, shorten seems to be a naming practice that most programmers try to avoid: programmers avoid naming identifiers using only the first letter of the type. as mentioned, shorten names usually are not easy to search for in the source code and, when employed in large-scope contexts, they tend to be hard to understand.

to better comprehend whether a project's characteristics may influence the prevalence of one practice, we looked at the correlation between common software metrics (lines of code, number of contributors, and number of commits) and the predominance of the naming practice categories. table 5 summarizes the spearman test results.

table 5. spearman correlation (corr, p-value) between project characteristics and naming practice categories
category | loc java | loc c++ | commits java | commits c++ | committers java | committers c++
kings | 0.337, 0.038 | 0.391, 0.014 | 0.150, 0.365 | 0.199, 0.222 | 0.053, 0.748 | 0.090, 0.583
median | 0.254, 0.123 | 0.004, 0.978 | 0.054, 0.743 | -0.197, 0.226 | -0.081, 0.627 | 0.070, 0.668
ditto | -0.517, 0.001 | -0.049, 0.763 | -0.216, 0.191 | 0.074, 0.649 | 0.101, 0.545 | -0.041, 0.801
diminutive | -0.021, 0.898 | 0.335, 0.037 | 0.008, 0.959 | 0.225, 0.166 | -0.171, 0.304 | -0.025, 0.875
cognome | -0.227, 0.169 | 0.268, 0.098 | -0.300, 0.066 | 0.188, 0.250 | -0.178, 0.283 | -0.103, 0.532
index | 0.341, 0.036 | -0.330, 0.040 | 0.133, 0.421 | -0.311, 0.054 | -0.098, 0.554 | 0.010, 0.950
shorten | 0.387, 0.016 | -0.196, 0.229 | 0.124, 0.453 | -0.110, 0.501 | -0.068, 0.681 | 0.128, 0.435

the results show no representative correlation between the investigated project characteristics and the categories of naming practices. overall, we can observe a low correlation between the number of contributors and the prevalence of any category. one might surmise that an increase in the number of programmers would be beneficial towards removing bad naming practices; however, this does not seem to be the case. the same rationale might be applied to the number of commits: as the project evolves, the quality of the identifier names might evolve or decay. however, in contrast to deissenboeck and pizka (2006), who stated that identifier names are subject to decay during software evolution, our results suggest that this might not be the case. looking especially at loc, we can observe some compelling correlations. for example, there is a negative correlation (rho = -0.517) between size and the category ditto for java programs. therefore, names spelled in the same way as their respective types tend to be considerably more common in small projects. on the other hand, large java projects tend to contain names involving practices such as index (rho = 0.341) and shorten (rho = 0.387).
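for reference, correlations like those in table 5 are a one-liner with scipy once the per-project measurements are at hand. a minimal sketch with hypothetical numbers mirroring the negative loc-ditto trend reported for java:

    from scipy.stats import spearmanr

    # hypothetical per-project measurements: size and share of ditto names
    loc = [4_200, 12_000, 35_000, 80_000, 150_000, 400_000]
    ditto_share = [61.0, 48.5, 30.2, 27.8, 22.1, 12.4]

    rho, p_value = spearmanr(loc, ditto_share)
    # a negative rho indicates that ditto fades as projects grow
    print(f"rho = {rho:.3f}, p = {p_value:.4f}")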
as shown in table 1, ditto and index are the most dominant practices across java projects. considering only these two categories, they account for 235,886 identifier names, representing 56.17% of all analyzed names in java projects. these results are consistent with the findings of beniamini et al. (2017). although code conventions and style guides may constrain identifier naming practices, programmers seem to be heavily influenced by the content assist capabilities of ides. as programmers work in the editor, content assist analyzes their code and recommends elements to complete partially entered statements. therefore, it is indispensable to provide more sophisticated and context-aware capabilities to assist programmers in naming and renaming identifiers (jiang et al., 2019; isobe and tamada, 2018; peruma et al., 2018, 2019). finally, programmers would seem to prioritize single-letter names in contexts where they are widely accepted (see section 4.2).

4.4 rq4: what is the perception of software developers about the investigated naming categories?

this section presents the results of our survey with 52 programmers. we start by characterizing the respondents (section 4.4.1). next, we assess the relevance of the naming practice categories by how often they are used by programmers (section 4.4.2). we then analyze how the adoption of naming practice categories varies according to programming statements (section 4.4.3).

4.4.1 respondents' demographics

figure 3. respondents' demographics: (a) experience in software development: less than 2 years 5.8%, between 2 and 5 years 38.5%, between 5 and 10 years 32.7%, more than 10 years 23.1%; (b) education level: undergraduate 26.9%, graduate 44.2%, graduand 28.8%.

figure 3 depicts the respondents' experience in software development and the corresponding frequencies and percentages. a total of 5.8% of the respondents have less than two years of experience, while 55.8% have more than five years of experience, suggesting that most survey respondents are experienced programmers. moreover, we seem to have collected a reasonably balanced distribution of programmers in terms of education level. figure 3 also shows the respondents' education level; as the majority of the respondents (73%) have a graduate degree, we claim that this increases our confidence in the validity of the responses.

4.4.2 most commonly used naming practices

the respondents were queried about how often they choose identifier names conforming to the naming practice categories. a five-point likert scale was used to capture respondent opinions, ranging from "never" to "very often". figure 4 shows how frequently respondents have been using each naming practice category.

figure 4. how often respondents use each naming practice category (never | rarely | occasionally | often | very often)
kings: 30.8% | 48.1% | 17.3% | 3.8% | 0.0%
median: 73.1% | 19.2% | 7.7% | 0.0% | 0.0%
ditto: 50.0% | 9.6% | 17.3% | 15.4% | 7.7%
diminutive: 11.5% | 11.5% | 46.2% | 21.2% | 9.6%
cognome: 36.5% | 28.8% | 19.2% | 13.5% | 1.9%
index: 25.0% | 21.2% | 26.9% | 17.3% | 9.6%
shorten: 40.4% | 26.9% | 19.2% | 13.5% | 0.0%
in our sample, diminutive is the naming practice category respondents report using most frequently (i.e., "often" or "very often"), followed by index and ditto. this result aligns only in part with our observations about the prevalence of the naming practices in open-source object-oriented programming (see section 4.1). notably, from the survey we can make the following observations:

• all the respondents adopt at least one naming practice category "occasionally" or "often", with 26% (13) of the respondents claiming to adopt at least one naming practice "very often".
• diminutive is the naming practice category most adopted by respondents. however, as we could observe, this naming practice category is not as prevalent in the analyzed object-oriented projects (see section 4.1) as claimed by the surveyed programmers.
• median is the least adopted naming practice category (see figure 4), with just 26% (14) of the respondents using it "rarely" or "occasionally". the low use of this naming practice corroborates our observation that programmers seem to be conscious of how harmful this practice is in object-oriented programming.
• ditto is not a widespread naming practice among the survey respondents. only 12 out of 52 programmers (23%) indicated a tendency to write identifier names spelled in the same way as their types, which does not confirm our previous observations about the prevalence of ditto across java and c++ projects (see section 4.3). this contrasting result suggests that programmers might not be aware of their actual use of naming practices. moreover, this might also be a sign that the naming assistant features present in modern ides do not influence the respondents.

4.4.3 most commonly used naming practices according to context

in order to specify the locations in which programmers mainly observe the occurrence of the naming practice categories, the respondents were allowed to select multiple locations (attribute, method, loop, conditional, and none). this was expected to be done by remembering instances of naming practice categories encountered by the respondents in their software development work. the two most common answers from the respondents were attribute and method (see figure 5). these findings share similarities with those presented in section 4.2, wherein 56% of the names occur as attributes or are declared in the context of a method.

figure 5. locations where respondents observe each naming practice category (number of respondents; columns: kings | median | ditto | diminutive | cognome | index | shorten)
attribute: 14 | 4 | 19 | 27 | 21 | 4 | 10
method: 24 | 11 | 23 | 39 | 30 | 6 | 19
loop: 8 | 0 | 13 | 20 | 10 | 43 | 13
conditional: 4 | 0 | 12 | 18 | 12 | 13 | 8
none: 16 | 39 | 22 | 7 | 17 | 9 | 22

one notable exception is index, in which case 43 out of 52 respondents indicated that this naming practice occurs mainly inside contexts surrounded by loop statements (for or while). indeed, as observed by beniamini et al. (2017), single-letter names can be used safely in a short-scope context. finally, as expected, the majority of respondents (39 out of 52) indicated that they usually do not observe median in their daily work (see figure 5).

5 threats to validity

as with most empirical studies, our study has some practical limitations, i.e., it is subject to some threats to its validity. in this section, we present potential threats and how we tried to mitigate some of those issues.

conclusion & external validity

one potential threat is that the samples we used in our study might not be representative of the target population: our analysis took into account 40 open-source java projects and 40 c++ projects. to mitigate this threat concerning the conclusion and generalization of the study results, we tried to select a heterogeneous sample.
we think the impact of this threat is minimal for three reasons: (i) java and c++ are two popular programming languages (see www.tiobe.com/tiobe-index/); (ii) our sample covers both somewhat small code-bases (with less than 10k loc) and large-scale ones (with over 100k loc); and (iii) we selected projects from a broad range of domains. thus, we argue that our study can be seen as an initial step towards identifying trends java and c++ programmers follow when picking identifier names. however, given the sizes of our samples, we cannot rule out the possibility that our results do not reflect how java and c++ programmers in general name identifiers; that is, the results might not be generalizable beyond the study samples and the participants that took part in our survey.

to understand the prevalence of naming categories across java and c++ projects, we employed a set of metrics: program size (loc), number of commits, and number of contributors. nevertheless, as with many software metrics, one potential threat is that these measurements might not be sophisticated enough for our investigation. thus, our findings might not carry over to other settings and similar programming languages. it is also worth emphasizing that context and scope would seem to play an important role in determining identifier names. for instance, some of the most common identifier names listed in table 3 would seem to be context-dependent, e.g., node. we surmise that this is the case because programmers might want to include relevant domain information when turning concepts into names. although we tried our best to maximize sample heterogeneity during sample selection, we cannot rule out the fact that the most common domains (e.g., xml file parsing) from which the programs in our sample were extracted might have an impact on variable naming.

finally, the representativeness of the survey respondents cannot be guaranteed. our target population was programmers, but we did not take any measures to verify the identity of the respondents. however, we included two initial questions, which might have permitted us to filter out individuals not belonging to our target population. there might also exist other factors that bias our conclusions. one example is the environment in which the respondents worked. another one is whether or not the respondents had a correct understanding of each category; to mitigate the latter, we included in the questionnaire a brief description and an example of each category. future studies can ask respondents to consider this factor and evaluate how it impacts the adoption of the naming practice categories.

construct & internal validity

a threat to the construct validity of our study comes from the number of identifier names we analyzed. it might be argued that a larger number of names could lead to better and more conclusive results. to mitigate this threat, we analyzed 2,603,381 identifier names in highly diverse sets of java and c++ projects. additionally, another potential threat has to do with how well the naming practices we identified reflect extant research and current industry practices. we tried to mitigate this threat by drawing from previous research, which helped us to get a better understanding of whether or not some of the naming practices we identified are indeed recurring practices. we also conducted a survey with 52 participants in order to gather programmers' perceptions about the use and occurrence of the investigated naming practices.
we tried to minimize possible construct and internal validity threats associated with the survey by disseminating it online through multiple websites and online groups, and by introducing a brief description and an example in each question.

6 conclusion

coming up with proper identifier names is challenging (brooks, 1983). as stated by host and ostvold (2007), even though programmers have to name identifiers on a daily basis, it still entails a great deal of time and thought. to make matters more challenging, identifier names are pivotal for program comprehension: developers have to go over identifier names to comprehend the code that they need to update, and poorly chosen names might hinder source code comprehension (avidan and feitelson, 2017). given that it has been estimated that identifiers contribute to about 70% of a software system's codebase (deissenboeck and pizka, 2006), it cannot be disputed that there is a need to define what makes up a good identifier as well as to assist developers in naming identifiers. similarly, identifying practices that result in poor identifier names might enhance programmers' awareness and contribute to improving educational materials and code review methods.

as an initial foray into creating an approach to optimal identifier naming (i.e., how to assign the proper words to an identifier), we investigated eight naming practice categories "in the wild". the categories provide examples of naming practices from real-world software projects. we illustrated their possible consequences and also outlined their prevalence across projects and code contexts (i.e., attribute, parameter, method, for, while, if, and switch). our results, based on 2,603,381 identifier names extracted from 80 real-world java and c++ projects and on a survey, would seem to suggest the following:

• the eight categories are recurrently found in practice, but two are more common in java and c++ projects: naming identifiers with the same name as their type (ditto) and using single-letter names denoting counters (index). specifically, index and ditto are by far the most frequently occurring naming practices across java projects: index occurrences account for approximately 30% of all naming practice occurrences in the examined java projects, while ditto occurrences amount to roughly 27%. as for c++ programs, ditto is the most widely used naming practice, accounting for around 54% of all naming practice occurrences. index and diminutive are also popular among c++ coders, accounting for 13% and 11% of all naming practice occurrences. shorten seems to be the least used naming practice among both java and c++ programmers. additionally, programmers seem to be heavily influenced by ide-like features that help them to choose identifier names, even though only 12 out of 52 surveyed programmers (23%) acknowledged a tendency to write identifier names spelled in the same way as their types;
• there are several very common names (e.g., value, result, and name) and recurrent single-letter names (e.g., i, e, s, c) used in practice. the lion's share of these names denote identifiers that store either integer or string values. according to our results, single-letter identifiers are more commonly used by java programmers: i, e, s, c, t, a, b, p, and n would seem to be widely used. in c++ (in contrast to java), coders tend to prefer a smaller set of single-letter names: i, x, b, n, p, s, and v.
thus, differently from java, in c++ e, c, t, and a do not rank among the most common single-letter identifier names;
• programmers' naming practices are context-specific: single-letter names (index and shorten) seem to be more common in short-scope contexts (if, for, while), although they can also be found in large-scope contexts (e.g., attribute). results from our survey questionnaire showed that programmers acknowledge that the index naming practice occurs mainly inside contexts surrounded by loop statements (for or while);
• diminutive is the naming practice category most adopted by survey respondents, and median is the least used naming practice. all the respondents adopt at least one naming practice category "occasionally" or "often".
• code reviews could benefit from including checks for poor naming practices. current practices follow extensive checklists, but none addresses naming issues. a more nuanced take is to consider variable names that depart from commonly used naming practices as a potential source of problems.

we believe our results have the potential to inspire several future research directions. our work highlights the need for further research on how naming practices are prevalent in source code and on how better names can be chosen. in this direction, an aspiring goal would be to devise tools capable of automatically evaluating and suggesting renaming opportunities during code review. similarly, code generation tools can capitalize on commonly used naming practices to generate names automatically. additionally, since our results would seem to suggest that some identifier names are context-dependent, we believe that tools (e.g., ide-based identifier name recommendation systems) can take advantage of context information during software development by constantly monitoring how programmers name identifiers, so that they can help developers new to a given project through the automated recognition of context- and project-specific naming conventions. such an automated identifier naming assistant could support developers by identifying inappropriate naming choices and making recommendations. as a result, our long-term goal is to support the identification of opportunities to rename identifiers and to understand more about programmers' naming practices. finally, as future work, we plan to perform a qualitative study on commits, code changes, and review discussions. another possible future research avenue would be to account for the role of human factors in choosing identifier names by exploring how programmer experience, team size, and mood influence naming practices throughout different software projects. although our results give practitioners and researchers alike a good glimpse into the most common options for naming identifiers in c++ and java, we did not investigate how each naming practice contributes, if at all, to improving code comprehension. therefore, future research efforts should aim to better understand how these commonly used naming practices influence readability during code comprehension.

references

allamanis, m., barr, e. t., bird, c., and sutton, c. (2014). learning natural coding conventions. in international symposium on foundations of software engineering.
alsuhaibani, r. s., newman, c. d., decker, m. j., collard, m. l., and maletic, j. i. (2021). on the naming of methods: a survey of professional developers. in international conference on software engineering.
arnaoudova, v., di penta, m., and antoniol, g. (2016). linguistic antipatterns: what they are and how developers perceive them. empirical software engineering, 21(1):104–158.
avidan, e. and feitelson, d. g. (2017). effects of variable names on comprehension: an empirical study. in 25th international conference on program comprehension.
beniamini, g., gingichashvili, s., orbach, a. k., and feitelson, d. g. (2017). meaningful identifier names: the case of single-letter variables. in international conference on program comprehension, pages 45–54.
brooks, r. (1983). towards a theory of the comprehension of computer programs. international journal of man-machine studies, 18(6):543–554.
brown, w. h., malveau, r. c., mccormick, h. w. s., and mowbray, t. j. (1998). antipatterns: refactoring software, architectures, and projects in crisis. john wiley & sons, inc., usa, 1st edition.
butler, s., wermelinger, m., yu, y., and sharp, h. (2010). exploring the influence of identifier names on code quality: an empirical study. in 2010 14th european conference on software maintenance and reengineering, pages 156–165. ieee.
caprile, b. and tonella, p. (2000). restructuring program identifier names. in icsm, pages 97–107.
charitsis, c., piech, c., and mitchell, j. (2021). assessing function names and quantifying the relationship between identifiers and their functionality to improve them. in conference on learning@scale.
collard, m. l., decker, m. j., and maletic, j. i. (2013). srcml: an infrastructure for the exploration, analysis, and manipulation of source code: a tool demonstration. in 2013 ieee international conference on software maintenance, pages 516–519. ieee.
deissenboeck, f. and pizka, m. (2006). concise and consistent naming. software quality journal, 14(3):261–282.
dileo, c. (2019). clean ruby.
dos santos, r. m. and gerosa, m. a. (2018). impacts of coding practices on readability. in international conference on program comprehension.
fakhoury, s., ma, y., arnaoudova, v., and adesope, o. (2018). the effect of poor source code lexicon and readability on developers' cognitive load. in international conference on program comprehension.
feitelson, d., mizrahi, a., noy, n., shabat, a. b., eliyahu, o., and sheffer, r. (2020). how developers choose names. ieee transactions on software engineering.
gresta, r. and cirilo, e. (2020). contextual similarity among identifier names: an empirical study. in workshop de visualização, evolução e manutenção de software, pages 49–56. sbc.
gresta, r., durelli, v., and cirilo, e. (2021). naming practices in java projects: an empirical study. in xx brazilian symposium on software quality, pages 1–10. acm.
hofmeister, j., siegmund, j., and holt, d. v. (2017). shorter identifier names take longer to comprehend. in 2017 ieee 24th international conference on software analysis, evolution and reengineering (saner), pages 217–227. ieee.
host, e. w. and ostvold, b. m. (2007). the programmer's lexicon, volume i: the verbs. in international working conference on source code analysis and manipulation.
isobe, y. and tamada, h. (2018). are identifier renaming methods secure? in international conference on software engineering, artificial intelligence, networking and parallel/distributed computing.
jiang, l., liu, h., and jiang, h. (2019). machine learning based recommendation of method names: how far are we. in international conference on automated software engineering.
kawamoto, k. and mizuno, o. (2012). predicting fault-prone modules using the length of identifiers.
in 2012 fourth international workshop on empirical software engineering in practice, pages 30–34. ieee.
kernighan, b. w. and pike, r. (1999). the practice of programming. addison-wesley longman publishing co., inc.
lawrie, d., feild, h., and binkley, d. (2007a). quantifying identifier quality: an analysis of trends. empirical software engineering, 12(4):359–388.
lawrie, d., morrell, c., and feild, h. (2007b). effective identifier names for comprehension and memory. innovations in systems and software engineering, 3(1):303–318.
lawrie, d., morrell, c., feild, h., and binkley, d. (2006). what's in a name? a study of identifiers. in 14th ieee international conference on program comprehension.
marcus, a., sergeyev, a., rajlich, v., and maletic, j. i. (2004). an information retrieval approach to concept location in source code. in 11th working conference on reverse engineering, pages 214–223. ieee.
martin, r. c. (2008). clean code: a handbook of agile software craftsmanship.
nyamawe, a. s., bakhti, k., and sandiwarno, s. (2021). identifying rename refactoring opportunities based on feature requests. international journal of computers and applications, pages 1–9.
oliveira, d., bruno, r., madeiral, f., and castor, f. (2020). evaluating code readability and legibility: an examination of human-centric studies. in international conference on software maintenance and evolution.
peruma, a., mkaouer, m. w., decker, m. j., and newman, c. d. (2018). an empirical investigation of how and why developers rename identifiers. in 2nd international workshop on refactoring.
peruma, a., mkaouer, m. w., decker, m. j., and newman, c. d. (2019). contextualizing rename decisions using refactorings and commit messages. in international working conference on source code analysis and manipulation.
ratiu, d. and deissenboeck, f. (2006). programs are knowledge bases. in 14th ieee international conference on program comprehension (icpc'06), pages 79–83. ieee.
scalabrino, s., bavota, g., vendome, c., linares-vásquez, m., poshyvanyk, d., and oliveto, r. (2017). automatically assessing code understandability: how far are we? in international conference on automated software engineering.
schankin, a., berger, a., holt, d. v., hofmeister, j. c., riedel, t., and beigl, m. (2018). descriptive compound identifier names improve source code comprehension. in international conference on program comprehension.
swidan, a., serebrenik, a., and hermans, f. (2017). how do scratch programmers name variables and procedures? in international working conference on source code analysis and manipulation (scam), pages 51–60.
takang, a. a., grubb, p. a., and macredie, r. d. (1996). the effects of comments and identifier names on program comprehensibility: an experimental investigation. j. prog. lang., 4(3):143–167.
tofte, m. and talpin, j.-p. (1997). region-based memory management. information and computation, 132(2):109–176.
wainakh, y., rauf, m., and pradel, m. (2021). idbench: evaluating semantic representations of identifier names in source code. in international conference on software engineering.

appendix a. survey questionnaire

education level: ◦ undergraduate ◦ graduate ◦ graduand
experience in software development: ◦ under two years ◦ two to five years ◦ six to ten years ◦ over ten years

1. how often do you choose identifier names with numbers at the end? examples: people people1; people people2
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names with numbers at the end?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

2. how often do you choose identifier names with numbers in the middle? example: char int2char
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names with numbers in the middle?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

3. how often do you name identifiers after their type names? examples: string string, people people
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names spelled in the same way as their types?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

4. how often do you name identifiers as a chunk of their respective type name? example: engineexecutiontestlistener listener
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names that are a chunk of their respective type name?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

5. how often do you include in identifier names an additional suffix or prefix that is the name of the respective type? example: string namestring
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names containing an additional suffix or prefix that is the name of the respective type?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

6. how often do you choose single-letter identifier names? example: integer j
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see single-letter identifier names?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

7. how often do you name identifiers with the starting letters that correspond to their respective types? example: people p
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see names which are the starting letters that correspond to their respective types?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

journal of software engineering research and development, 2022, 10:12, doi: 10.5753/jserd.2022.2576 this work is licensed under a creative commons attribution 4.0 international license.

understanding and analyzing factors that affect merge conflicts from the perspective of brazilian software developers

barbara beato ribeiro [ universidade federal do estado do rio de janeiro | barbara.ribeiro@edu.unirio.br ]
catarina costa [ universidade federal do acre | catarina.costa@ufac.br ]
rodrigo pereira dos santos [ universidade federal do estado do rio de janeiro | rps@uniriotec.br ]

abstract

merge conflicts are very common in collaborative software development, which is supported mainly by the use of branches that can be potentially merged. in this context, several studies have proposed mechanisms to avoid conflicts whenever possible, and some identified factors that lead to conflicts.
in this article, we report on an investigation of factors that can lead to conflicts or that can somehow reduce the chances of conflict, from the developers' perspective. to do so, based on related work, we conducted two empirical studies with brazilian software developers to both understand and analyze factors that affect merge conflicts. firstly, we conducted survey research with 109 software developers to understand how they use branches, the occurrence of conflicts and the resolution process, and factors that can lead to or avoid conflicts. results showed that the use of branches is very common and mostly has the purpose of creating a new feature or fixing a bug. according to the participants, in most projects developers have the autonomy to create new branches, and sometimes conflicts happen. the main factors that can lead to conflicts are "the time a branch is isolated" and "lack of communication". on the other hand, the factors cited as good practices to avoid conflicts were "improve team communication" and "less branching duration". secondly, we conducted a field study based on interviews with 15 software developers to analyze those factors and better understand what leads to or avoids conflicts in a merge. finally, this work allowed us to conclude that communication with the team, checking code updates, shorter branch duration, and management are important for software developers, especially when they think about what increases and decreases merge conflicts.

keywords: version control, merge conflicts, survey research, field study, software developers

1 introduction

version control systems (vcs) allow the creation of parallel branches in a simplified way. however, there is a cost regarding merge conflicts, which are common in collaborative software development. developers usually combine the work they have performed in parallel and may have changed the same parts of a specific file. although the solution is frequently present in one or both conflicting versions, this does not necessarily mean that resolving the conflict is a trivial task (ghiotto et al., 2018). conflict resolution might degrade the quality of the merged code and requires a deeper understanding of the program's structure and goals (shihab et al., 2012; brindescu et al., 2020a). the person in charge may not have all the necessary knowledge to make the best decision, or may not feel comfortable making decisions alone over source code that was written by other developers (shihab et al., 2012; costa et al., 2014). in some cases, it may be necessary to verify the knowledge of the developers involved in the changes made in the branches to choose one or more developers to resolve the conflict (costa et al., 2019). in this context, recent studies (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020, 2021; vale et al., 2020) have investigated factors, indicators and attributes that can lead to merge conflicts. such studies have found evidence that some factors can impact merge conflicts more than others. therefore, we decided to use this knowledge as a reference to verify the software developers' perspective on factors that can lead to or help to avoid merge conflicts. as such, based on related work, we conducted two empirical studies to both understand and analyze factors that affect merge conflicts.
firstly, we conducted survey research with 109 brazilian software developers to understand the way they use branches, the occurrence of conflicts and the resolution process, and factors that can lead to or avoid merge conflicts. the following three research questions guided our survey:

• rq1 (branches): how often are branches created in software projects?
• rq2 (merge conflicts): what factors lead to merge conflicts?
• rq3 (resolve conflicts): which practices do developers generally adopt to avoid merge conflicts?

we found that the main factors that can lead to conflicts are "the time a branch is isolated" and "lack of communication". this communication refers to the awareness of parallel changes: sometimes developers forget to communicate what they are changing, resulting in two developers changing the same functionality or something very close to it. on the other hand, the factors cited as good practices to avoid conflicts were "improve team communication" and "less branching duration". others mentioned by the participants were "divide the work among the team", "small changes", and "frequent commits". we also identified that the main reasons to create a branch are "create new features" and "bug fixes", and participants mentioned that developers create branches "frequently". secondly, we conducted a field study based on interviews with 15 brazilian software developers to analyze those factors and obtain a better understanding of what leads to or avoids merge conflicts. the following two new research questions guided our field study:

• rq4 (produce conflicts): how do the factors identified in the survey research mostly contribute to increasing merge conflicts?
• rq5 (avoid conflicts): how do the factors identified in the survey research mostly contribute to decreasing merge conflicts?

we explored the factors highlighted in the survey in more depth and observed that most software developers agree with them and have been through situations that reinforce their opinion. furthermore, time of experience was mentioned, highlighting that experience can modify a software developer's perception of the question, and that the technology itself could evolve in this time, improving the work. this article is an extended version of a conference paper (costa et al., 2021) in which we answered the first three research questions, focused on the characterization of software developers' perceptions of factors related to merge conflicts. we complement our previous work by adding two new research questions, analyzing how developers see these factors and whether and how they contribute to increasing and/or decreasing the chances of a merge conflict occurring.

this article is organized as follows. we explain the merge conflict scenario and discuss related work in section 2. in section 3, we describe the research method. we present the studies conducted in this work, as well as their results and findings, in sections 4 and 5. discussion and implications are presented in section 6. section 7 refers to threats to validity and credibility. finally, section 8 concludes this paper with some final remarks and opportunities for future work.
2 background

in this section, we discuss the concepts of merge conflicts and other works that also investigated factors or attributes that can lead to conflicts.

2.1 merge conflicts

textual or physical conflicts occur due to simultaneous modifications (e.g., addition, removal or editing) over the same physical parts of a file (e.g., the same line) by several developers. direct conflicts are detected by a vcs and require resolution by a developer or a project team. figure 1 shows an example of a conflicting chunk detected by git, where each part of the chunk has a version of a function to sum two values in the python programming language. in this case, a developer in charge must choose one of the versions, since they have the same intention.

[figure 1. conflict detected by vcs]
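figure 1 itself is not reproduced here, but based on the description above the conflicting chunk would look roughly like the following (the function bodies and the branch name are hypothetical):

    <<<<<<< HEAD
    def sum_values(a, b):
        return a + b
    =======
    def sum_values(x, y):
        result = x + y
        return result
    >>>>>>> feature-branch

both sides implement the same intention, so the developer in charge keeps one of them (or writes a combined version) and removes the conflict markers.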
ghiotto et al. (2018) verified how developers resolved conflicting chunks across 2,731 java projects. the authors found that the resolution of conflicting chunks is frequently present in one of the versions, and three quarters of the conflicting chunks were resolved by choosing one of the versions: version 1 (50%) or version 2 (25%). in some cases, a concatenation (3%), a combination (9%), or even new code (13%) was necessary. this does not necessarily mean that resolution is a trivial task: the person in charge must understand the conflicting intentions and generate a single version. vale et al. (2021) investigated the influence of some factors on conflict resolution time and found that the number of chunks, lines of code, conflicting chunks, developers involved, conflicting lines of code, conflicting files, and the complexity of the conflicting code influence the merge conflict resolution time.

accioly et al. (2018) found that merge conflicts happened in 9.38% of their data set. the authors also mentioned that merging branches is not likely to be a simple task, since one needs to understand and merge contributions performed by different developers, probably working on different assignments (accioly et al., 2018). menezes et al. (2020) found that merge conflicts happened in 7.11% of their data set, but the number of merge conflicts is more than 20% in some projects. kasi and sarma (2013) analyzed a set of projects and found that merge conflicts ranged from 7.6% to 19.3%. in the study conducted by brun et al. (2011), 17% of merge operations required human assistance to resolve a textual conflict. as conflicts can be common, their consequences can be a problem for the quality of some projects. as mentioned by brindescu et al. (2020a), this situation can affect code quality: even when developers follow an established process of peer review of code submissions, a solution of lower quality can be produced during the resolution of the merge.

in fact, merge conflicts are widely discussed in the literature. some works (sarma et al., 2008; brun et al., 2011; sarma et al., 2011; guimarães and silva, 2012; estler et al., 2013) aim to prevent conflicts by monitoring workspaces and notifying developers of potential conflicts. such approaches are important initiatives, but they do not guarantee conflict-free merges, mainly due to the adoption of branches. others (cavalcanti et al., 2015; mckee et al., 2017; accioly et al., 2018; ghiotto et al., 2018) try to characterize merge conflicts in order to learn more about the topic and support initiatives that help to reduce the number of conflicts. on the other hand, researchers (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020, 2021; vale et al., 2020) have more recently started looking at factors, attributes and indicators that can lead to or avoid conflicts.

2.2 related work

the studies (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020, 2021; vale et al., 2020) that investigated factors, attributes or indicators that may lead to conflicts analyze timing and size attributes of merge scenarios, such as commits, committers, lines of code, files, and others. these studies and the factors, attributes or indicators are summarized in table 1.

table 1. related work attributes (x = considered; columns: leßenich et al., 2018 | owhadi-kareshk et al., 2019 | vale et al., 2020 | dias et al., 2020 | menezes et al., 2020 | menezes et al., 2021)
abstract syntax tree (ast) nodes changed | x | – | – | – | – | –
changed chunks | x | – | x | – | – | x
changed files | – | x | x | x | x | x
changed files in both branches (intersection) | x | x | – | – | x | x
changed lines of code | x | – | x | x | – | x
changes inside class declarations | x | – | – | – | – | –
commit density | x | x | – | – | – | x
commits | x | x | x | x | x | x
communication measures | – | – | x | – | – | –
developers | x | x | x | x | x | x
duration | x | x | x | x | x | x
files with merge conflict | – | – | – | x | x | x
length of commit messages | – | x | – | – | – | –
merge conflict occurrence | x | – | x | x | x | x
modularity | – | – | – | x | – | –
predefined keywords in commit messages | – | x | – | – | – | –
programming language | – | – | – | – | – | x
self-conflict | – | – | – | – | x | x

leßenich et al. (2018) investigated indicators to predict the number of merge conflicts. such indicators were inferred from a survey with 41 developers, in which developers mentioned what causes merge conflicts: formatting changes, large-scale refactoring, structural changes in long-living forks, and import statements. next, the authors conducted an empirical study with 163 open source projects, including 21,488 merge scenarios. they investigated the correlation of some indicators (commits, files, chunks, lines of code, developers, and others) with the number of conflicts. for example, they explored the commit density, with the hypothesis that "many commits within a small time span are more likely to produce conflicts than the same number of commits over longer time spans". they did not observe any strong correlation with the number of conflicts and rejected this hypothesis. in fact, they found that no indicator analyzed in the work can predict the number of merge conflicts, as suggested by the survey.

owhadi-kareshk et al. (2019) also investigated whether conflict prediction is feasible and designed a classifier for predicting merge conflicts. the authors conducted an empirical study with 744 open source projects, including 267,657 merge scenarios, written in seven programming languages. they created and used a set of potentially predictive features for merge conflicts based on the literature on software merging. similarly to the work of leßenich et al. (2018), they also investigated the commit density, with the intuition that "lots of recent activity may increase the chance of conflicting changes". moreover, they did not find a correlation between their feature sets and conflicts, but they were able to indicate merge scenarios that are not likely to have conflicts.

dias et al. (2020) investigated the effect of modularity, size, and timing of developers' contributions on merge conflicts.
the authors conducted an empirical study with 125 open source projects, including 73,504 merge scenarios, written in two programming languages. they found that "conflict occurrence significantly increases when contributions to be merged are not modular". they also mentioned that "conflict occurrence increases when contributions to be merged have more developers, commits, and changed files" and "contributions developed over longer periods of time are more likely associated with conflicts".

in a previous study, we also investigated size and timing attributes that can lead to conflicts (menezes et al., 2020). we conducted an empirical study with 80 open source projects, including 182,273 merge scenarios, written in ten programming languages. we performed statistical tests and mined association rules. we found that some attributes in the branch that is being integrated (branch 2) have more influence than the same attributes in the other branch. for example, committers, commits, and changed files in branch 2 have a large impact on the occurrence of merge conflicts. timing attributes, commits in branch 1, and changed files in branch 1 have a small influence. it is relevant to mention that this work calculated the metrics (except the timing attributes) by branch; the timing attributes were calculated by merge scenario, as were the attributes in the other works described here.
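to make these branch-level attributes concrete: given a merge commit, its two parents and their common ancestor delimit what each branch contributed, and counts like the ones above can be pulled directly from git. a minimal sketch; the repository path and merge hash are placeholders, and octopus merges with more than two parents are not handled:

    import subprocess

    def git(*args, repo="."):
        # run a git command in the given repository and return its stdout
        result = subprocess.run(["git", "-C", repo, *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    def branch_attributes(merge_hash, repo="."):
        # commits, committers, and changed files on each side of a merge scenario
        parents = git("rev-list", "--parents", "-n", "1", merge_hash, repo=repo).split()
        p1, p2 = parents[1], parents[2]          # assumes a two-parent merge commit
        base = git("merge-base", p1, p2, repo=repo).strip()
        attrs = {}
        for label, tip in (("branch1", p1), ("branch2", p2)):
            commits = git("rev-list", f"{base}..{tip}", repo=repo).split()
            committers = set(git("log", "--format=%ae", f"{base}..{tip}", repo=repo).split())
            files = set(git("diff", "--name-only", base, tip, repo=repo).splitlines())
            attrs[label] = {"commits": len(commits),
                            "committers": len(committers),
                            "changed_files": len(files)}
        return attrs

    # usage: print(branch_attributes("<merge-commit-hash>", repo="/path/to/project"))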
vale et al. (2020) investigated the role of communication activity in the occurrence or avoidance of merge conflicts. the authors conducted an empirical study with 30 open source projects involving 19,000 merge scenarios. they mined and linked contribution (git) and communication (github) data. they quantified the amount of github communication in merge scenarios in three ways: awareness-based (the communication of all active contributors), pull-request-based (communication by means of pull requests and related issues), and changed-artifact-based (the communication mapped to artifacts that have been changed in the merge scenario). the authors found no significant relation between communication measures and the number of merge conflicts. they also performed a multivariate analysis using merge scenarios' characteristics, such as size, number of developers, and duration. against their expectations, they did not find a strong correlation between the size of merge scenario code changes and the occurrence of merge conflicts.

in summary, related work investigated similar attributes, although reaching different results. it is worth mentioning that they used different analysis techniques, projects, and languages, and some attributes are calculated differently as well. the important implication of such related studies for the present work is the possibility of gathering some knowledge and investigating the developers' perspective through a qualitative method, complementing empirical studies addressing characteristics of open source projects.

3 research method

based on the related work identified as the first step of this work (section 2), we conducted two empirical studies to both understand and analyze factors that affect merge conflicts. firstly, we conducted survey research with 109 software developers to understand the way they use branches, the occurrence of conflicts and the resolution process, and factors that can lead to or avoid conflicts. secondly, we conducted a field study based on interviews with 15 software developers to analyze those factors and obtain a better understanding of what contributes to increasing or decreasing merge conflicts.

we conducted the survey research with brazilian software developers. the survey aimed to collect opinions on the actions that software developers usually take when they need to create or work in branches and merge code files. the study was directed to software developers who used any vcs to coordinate changes in their projects. next, we performed a field study with 15 developers based on interviews. the field study aimed to deepen and detail the answers obtained in the survey research. these studies allowed us to organize a discussion and point out implications for researchers and practitioners in the field.

4 understanding factors that affect merge conflicts

in this section, we present details on the survey planning and execution, as well as information about the survey participants. finally, we answer our first three research questions.

4.1 planning and execution

we adopted the following steps to run the survey based on the principles presented by pfleeger and kitchenham (2001): (1) setting specific and measurable objectives, (2) planning and scheduling the survey, (3) preparing the data collection instrument, (4) validating the instrument, (5) selecting participants, (6) analyzing the data, and (7) reporting the results. we planned and constructed our questionnaire from the first three research questions presented in section 1 and based on the factors mentioned in related work (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020; vale et al., 2020), mainly in the survey provided by leßenich et al. (2018). this questionnaire was divided into three sections: (1) basic information and professional experience, (2) use of branches, and (3) merge conflicts. our previous work and survey responses in portuguese are publicly available on github (https://github.com/catarinacosta/mactool/blob/master/surveyanswerssbes2021.xlsx).

we performed a pilot with four software development practitioners aiming at validating the questionnaire and estimating response time. based on the answers and suggestions, we adjusted and improved the questionnaire. we sent out the questionnaire to developers via email, together with some contextual information such as the research objective, expected knowledge in version control, and estimated time to answer (5 minutes). as we used mailing lists and asked developers to share the survey with colleagues, we cannot compute a response rate. open and closed questions were used in the survey. the questions included in the survey are:

1. age (less than 24 years old, between 25 and 34 years old, between 35 and 44 years old, between 45 and 54 years old, more than 55 years old);
2. level of education (high school, technical education, bachelor's degree, specialization degree, master's degree, phd);
3. job sector (private sector, public sector, both, self-employed);
4. experience (between 1 and 5 years, between 6 and 10 years, between 11 and 15 years, between 16 and 20 years, more than 20 years);
5. average size of the project teams (between 1 and 5 people, between 6 and 10 people, between 11 and 15 people, more than 15 people);
6. version control tools (clear case, cvs, git, jazz, mercurial, pvcs version manager, rsc, subversion, team foundation server, visual source safe, others);
7. branch creation frequency (rarely, sometimes, frequently, very frequently, always);
8. reason (test, bug fixes, release, new features, refactoring, others);
9. branch creation policy (developers have autonomy to create new branches, only the project manager or the person who maintains the software, the team decides, others);
10. conflicts frequency (rarely, sometimes, frequently, very frequently, always);
11. factors that contribute to the occurrence of conflicts (number of changed files, number of changed lines, number of commits, number of developers, branching duration, lack of communication, developer working in several branches, others);
12. time to resolve a merge conflict (some hours (less than 24 hours), some days (1 to 6 days), one week, more than a week);
13. difficulty in resolving a merge conflict (very easy, easy, medium, difficult, very difficult);
14. practices to avoid conflicts (team communication, less branching duration, small changes, frequent commits, divide the work among the team, others).

we adopted the card sorting approach (spencer, 2009; zimmermann, 2016) to analyze the answers to the open-ended questions (in this questionnaire, optional questions 6, 8, 9, 11, and 14, in which the participants could enter other data) and obtained some answers not listed in the initial survey options. to do so, we grouped similar responses to the open-ended questions into codes. the coding was performed by two researchers, who discussed the codes and categories, and was then reviewed by another researcher with 10 years of experience in qualitative studies. an example of the coding is presented in figure 2, in which the codes are first extracted and the categories emerge after checking the similarity.

figure 2. example of coding
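card sorting itself is a manual, judgment-based activity; purely to illustrate the grouping step, the sketch below clusters free-text answers under codes via keyword matching. the keyword-to-code map is hypothetical and merely mimics codes that emerged in our analysis.

    from collections import defaultdict

    # hypothetical keyword-to-code map distilled while reading the answers;
    # the actual card sorting was performed manually by the researchers
    CODES = {
        "sync": "do not keep repositories up to date",
        "indent": "code formatting",
        "format": "code formatting",
        "task": "tasks not mapped correctly",
    }

    def assign_codes(answers):
        """Group free-text answers under the codes whose keywords they mention."""
        grouped = defaultdict(list)
        for answer in answers:
            text = answer.lower()
            codes = {code for keyword, code in CODES.items() if keyword in text}
            for code in codes or {"uncoded"}:
                grouped[code].append(answer)
        return dict(grouped)

    print(assign_codes([
        "developers forget to sync with the remote repository",
        "each editor applies a different indent style",
    ]))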
4.2 results

from the 109 brazilian software developers that answered the questionnaire, 38.5% are between 25 and 34 years old, and 33% are between 35 and 44 years old. participants less than 24 years old are 12.8%, and those between 45 and 54 years old are 11%. finally, those more than 55 years old are only 5%. regarding education, 35.8% have a bachelor's degree, 29.4% a master's degree, and 22% a specialization degree.

we asked participants where they worked and how much experience in software development they had. regarding experience as a developer, 27.5% have between 11 and 15 years of experience, and also 27.5% have between 6 and 10 years of experience. moreover, 23.9% have between 1 and 5 years of experience, 11% have more than 20 years of experience, and 10.1% have between 16 and 20 years of experience. additionally, we asked participants about the number of people in the last project they participated in (or on average in their career). 38.5% answered that they worked in teams from 1 to 5 members, 36.7% worked in teams from 6 to 10 members, and 15.6% worked with more than 15 members. finally, 9.2% answered that they worked in teams from 11 to 15 members.

we also wanted to identify which tools developers adopted for version control. in this question, the participant was allowed to mark more than one answer. 105 (96.3%) developers marked that they have experience with git, 39 (35.8%) have experience with subversion, and 18 (16.5%) have experience with team foundation server. mercurial, cvs, and visual source safe were also mentioned. the developers were free to include other types of vcs not listed in the questionnaire, but no one answered anything different from the list. the information is shown in figure 3.

figure 3. experience with version control systems

4.2.1 rq1 (branches): how often are branches created in software projects?

we asked participants how often they create branches in software development. in the case of named branches, we believe there is a scenario that may be more likely to conflict and be more complex to resolve. we would also like to know the reasons for branch creation as well as the policies adopted to do so. respondents could answer: rarely, sometimes, frequently, very frequently, or always. the prevalent answer was “always” (45.9%). developers also chose “very frequently” (19.3%) and “frequently” (15.6%). therefore, we can say branching is a very common practice among the participants. results are shown in figure 4. we also verified that developers create branches “always” in projects of private companies (64.2%) more than in government projects (26.5%).

figure 4. frequency of branching

we verified the main reasons for creating branches. in this question, developers were allowed to mark more than one answer. 94 (86.2%) participants answered that the main reason is “to create new features”, 81 (74.3%) answered “to fix bugs”, and 46 (42.2%) mentioned “refactoring”. “test” (35.7%) and “release” (35.7%) were also chosen by 39 respondents each as main reasons. participants could also use the open field to write other reasons. two developers mentioned “proof of concept”, and one mentioned “enable the collaboration of different people”. the results are shown in table 2.

table 2. reasons for creating branches
reasons | # | %
new features | 94 | 86.2%
bug fixes | 81 | 74.3%
refactoring | 46 | 42.2%
testing | 39 | 35.7%
release | 39 | 35.7%
reasons also mentioned | #
proof of concept | 2
enable the collaboration of different people | 1

some software developers (hereafter referred to as sd) used the open field not to add a new reason, but to explain the selected reasons:

“we usually want to implement new features and this ends up generating a new branch, (...) many times to make releases for the client we have to use a new branch.” (sd48)

“refactoring is what we do most in the private company i work for.” (sd59)

other developers also mentioned different reasons:

“test new features and create proofs of concept.” (sd06)

“for different people to be able to participate in the collaborative development.” (sd83)

moreover, we evaluated the policies adopted by the participants' projects for creating a new branch. they responded that developers have “autonomy to create new branches” (68.8%). in contrast, others answered that the “team decides when a new branch will be created” (23.9%). only 7.3% marked “only the project manager or the person who maintains the software”. participants could also use the open field for other response options. four developers mentioned the use of git flow, a set of guidelines and a tool for creating and standardizing the use and naming of branches in a project.
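for readers unfamiliar with it, git flow organizes work around a long-lived develop branch and short-lived feature/* branches that are merged back with explicit merge commits. the helper below sketches that discipline with plain git commands driven from python; it is illustrative only and not the actual git-flow tool, whose cli wraps these steps.

    import subprocess

    def git(*args):
        subprocess.run(["git", *args], check=True)

    def start_feature(name):
        """Open a short-lived feature branch off develop, as git flow prescribes."""
        git("checkout", "develop")
        git("pull")
        git("checkout", "-b", f"feature/{name}")

    def finish_feature(name):
        """Merge the feature back into develop (with an explicit merge commit)
        and delete the branch."""
        git("checkout", "develop")
        git("merge", "--no-ff", f"feature/{name}")
        git("branch", "-d", f"feature/{name}")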
one developer also pointed out that “branches are automatically created by code review and pipeline systems”, and another mentioned that the policy is “not to use branches”. the results are shown in table 3.

table 3. policies for creating branches
policies | # | %
developers have autonomy to create new branches | 75 | 68.8%
team decides when a new branch will be created | 26 | 23.9%
only the project manager or the person who maintains the software | 8 | 7.3%
policies also mentioned | #
git flow | 4
automatically created (by code review and pipeline/continuous delivery systems) | 1
commit to a master branch (no branch) | 1

some developers used the open field not to add a new policy, but to explain the selected policy, as exemplified in the following:

“the developers create as many branches as they think it is necessary, but each one is responsible to constantly update the branch and integrate with the work of the others or exclude it if it does not have a well-defined purpose.” (sd48)

“the team always discuss when it is really worth creating a new branch, managing new branches is difficult and if we are not in control something can be wrong.” (sd87)

two developers selected the “team decides when a new branch will be created” option and mentioned the git flow strategy. two developers selected that “the developers have autonomy to create new branches” and also mentioned git flow. as such, although the projects adopt a similar strategy and tool, some projects give more autonomy to members and others prefer to discuss each decision in depth:

“we use git flow, where both developers and managers have responsibilities when creating branches.” (sd108)

“a production and a development branch, based on the concept of git flow.” (sd28)

answer to rq1: branches are created frequently. developers have the autonomy to decide when to create a branch. the main reasons are to create new features and to fix bugs.

4.2.2 rq2 (merge conflicts): what factors lead to merge conflicts?

we found that the use of branches is very common. however, as mentioned by shihab et al. (2012), such a level of isolation sometimes implies the cost of having to resolve integration conflicts. to measure how often merge conflicts occur, we asked developers to estimate the frequency: rarely, sometimes, frequently, very frequently, or always. for 45% of the participants, conflicts occur “sometimes”. the second most chosen option was “frequently” (24.8%). in turn, the third most chosen option was “rarely” (16.5%). it is important to mention that for 13.8%, conflicts occur “very frequently”. this leads us to conclude that conflict occurrence commonly falls between “sometimes” and “frequently”. results are shown in figure 5. we also found that conflicts are more common in government projects (52.9% of developers working for the government scored “frequently” or “very frequently”) than in projects of private companies (24% of developers working for private companies selected “frequently” or “very frequently”).

figure 5. frequency of conflict occurrence
we also checked the factors that lead to the occurrence of conflicts. in this question, participants were allowed to mark more than one answer. 76 (69.7%) developers marked the option “branching duration”, i.e., the time a branch is isolated. the “lack of communication” among a team's members was also chosen by 64 (58.7%), and the “number of changed files” was cited by 53 (48.6%). the “number of developers” was also chosen by 42 (38.5%). developers could also use the open field for other response options. five developers said that “not synchronizing the repositories” can lead to conflicts. three developers mentioned the “difference in the code formatting” as a reason that can lead to conflicts, and two developers mentioned “tasks not mapped correctly”. results are shown in table 4.

table 4. factors that lead to conflicts
factors | # | %
branching duration | 76 | 69.7%
lack of communication | 64 | 58.7%
number of changed files | 53 | 48.6%
number of developers | 42 | 38.5%
number of lines of code | 31 | 28.4%
same developers in many branches | 28 | 25.6%
number of commits | 24 | 22.0%
factors also mentioned | #
do not keep repositories up to date | 5
code formatting | 3
tasks not mapped correctly | 2
coupling level of the code | 1
complex features | 1
long time to deploy | 1
many features in development | 1
tasks not correctly mapped/broken into small pieces | 1
technical debt | 1

some developers used the open field to explain the selected factors that can lead to conflicts and also to add more factors. they selected and mentioned factors such as “time between the branch and the merge” and “lack of communication”, but they also mentioned that “repositories are not kept up to date” and “complex functionalities” can lead to conflicts, as exemplified next:

“generally the longer the time between the branch and the merge, the more files tend to be changed (...), resulting in greater possibilities of conflicts. another point that influences is the non-practice of constant rebase, leaving the branch out of date with respect to its origin (usually the master). complex functionality can also influence branches that take longer to merge.” (sd04)

“the lack of communication is the worst of them. because if the team communicates daily, one knows what the others are up to, and conflicts are mitigated/reduced. if conflicts are not easy, you need more communication or more frequent integration (minor merge).” (sd55)

“whenever there's a conflict, it is because developers forgot to communicate what they were changing, resulting in two developers changing the same functionality or something very close.” (sd16)

some developers mentioned unlisted factors, such as the difference in the “code formatting”, “tasks not mapped correctly”, and “technical debt”. as conflicts occur in modifications in the same code region, criticism of minor issues such as code writing and style is understandable:

“lack of configuration in the editors, which change between indentation with tab/space, amount of space, line break... just opening and saving the file, and lack of style in the code, where each one writes the code in a different way, and another developer adds/removes spaces, parentheses (...) this causes a change in one line to reflect the entire file.” (sd26)

“tasks not mapped correctly/broken into small enough parts correctly that lead to interfering with the same pieces of code. accumulated technical debt that requires changes in many places, for example regarding code formatting, use of depreciated techniques, etc...” (sd76)
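several respondents tie conflicts to branches drifting from their origin; sd04, for instance, points to the “non-practice of constant rebase”, i.e., periodically replaying a branch on top of its origin so it never drifts far. a minimal sketch of that routine, assuming a remote named origin and a main branch (both our assumptions):

    import subprocess

    def git(*args):
        subprocess.run(["git", *args], check=True)

    def update_branch(branch, upstream="origin/main"):
        """Replay the branch's local work on top of the latest upstream
        history, so the branch stays close to its origin."""
        git("fetch", "origin")
        git("checkout", branch)
        git("rebase", upstream)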
answer to rq2: conflicts occur sometimes. the main factors that can lead to merge conflicts are the time a branch is isolated and the lack of communication.

4.2.3 rq3 (resolve conflicts): which practices do developers generally adopt to avoid merge conflicts?

as conflicts are common, we asked participants about the difficulty in resolving a conflict and which practices they believe may contribute to avoiding merge conflicts. to verify the time to resolve a conflict, we asked developers to estimate the duration: some hours, some days, one week, or more than a week. most of them (80.7%) answered that they spent “less than 24 hours” to resolve a conflict. some of them (17.4%) answered that they spend “some days” (1 to 6) to resolve a conflict. only 2 (1.9%) participants answered “one week”. we also verified the difficulty level in resolving a merge conflict from their perspective since, as greiler et al. (2022) put it, “factors may impact a specific developer's experience and depends on his/her personal, team, organization, and project contexts”. some developers answered “easy” (32.1%) and “medium” (32.1%), and some of them answered “very easy” (22.9%). the results about the time to resolve a conflict and the level of difficulty are shown in table 5.

table 5. time to resolve a conflict and difficulty level
time to resolve conflict | very easy | easy | medium | difficult | very difficult | total
less than 24 hours | 25 | 31 | 27 | 5 | – | 88
some days (1–6) | – | 4 | 6 | 9 | – | 19
one week | – | – | 2 | – | – | 2
more than one week | – | – | – | – | – | 0

finally, we investigated practices to avoid conflicts. developers were allowed to mark more than one answer. the two most frequent answers that may contribute to reducing merge conflict occurrence were: “improve team communication”, by 78 (71.5%) participants, and “less branching duration”, by 75 (68.8%) participants. these factors really seem to be very important, given that “branching duration” and “lack of communication” were the most cited factors that can lead to conflicts. participants also selected “divide the work among the team” (57.7%), “small changes” (54.1%), and “frequent commits” (52.2%) as good practices to avoid conflicts. developers could use the open field for other response options. some participants informed that they do “not use new branches” and commit directly on the main branch, “adopt code style” tools and the “git flow tool”, and always keep the “workspace branch up to date with the remote repository”. results are shown in table 6.

table 6. factors to avoid conflicts
factors | # | %
improve team communication | 78 | 71.5%
less branching duration | 75 | 68.8%
divide the work among the team | 63 | 57.7%
small changes | 59 | 54.1%
frequent commits | 57 | 52.2%
factors also mentioned | #
do not use branches | 3
adopt code style tool | 2
keep the branch up to date with the master/trunk/main | 2
git flow | 2
adopt awareness tool | 1
architecture patterns (more cohesion and less coupling) | 1
branch by task | 1
continuous integration | 1
gui to interact with repository | 1
frequent deploy | 1
feature flags | 1
keep only experts | 1
language syntax | 1

moreover, some developers used the open field to explain the selected factors that can avoid conflicts or even to add unlisted factors:

“always check for code updates in the master / trunk / main.” (sd34)

“improve communication channels and also use other awareness tools to know what each one is changing.” (sd68)

“i encourage people on my team to avoid branches as much as possible and implement techniques such as feature flags for everyone to always work at master/main. rather than dealing with conflicts, i would like ‘devs’ to become more experienced in trunk based development.” (sd31)

“use of techniques like git flow.” (sd77)

“adopt a tool that validates the code style.” (sd26)
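sd31's suggestion combines trunk-based development with feature flags: unfinished work is merged to the main branch early but kept dark behind a runtime switch, so no long-lived branch is needed. a deliberately simplified sketch (a real team would typically use a configuration service or a feature-flag library rather than an environment variable; the flag and function names are hypothetical):

    import os

    def is_enabled(flag):
        # hypothetical flag store: one environment variable per flag
        return os.environ.get(f"FF_{flag.upper()}", "off") == "on"

    def legacy_checkout_flow(cart):
        return f"legacy checkout of {len(cart)} items"

    def new_checkout_flow(cart):
        return f"new checkout of {len(cart)} items"

    def checkout(cart):
        # the new code path ships to main daily but stays dark until the
        # flag is flipped
        if is_enabled("new_checkout"):
            return new_checkout_flow(cart)
        return legacy_checkout_flow(cart)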
answer to rq3: developers take no more than some hours to resolve a conflict, since it is usually easy to do. the main factors to avoid conflicts refer to improving team communication and reducing the time of isolation.

5 analyzing factors that affect merge conflicts

the first study (survey research) was grounded (planned and constructed) on the factors for merge conflicts mentioned in solid related work, as described previously, focusing on the brazilian software developers' perceptions. based on the quantitative results, we decided to deepen the understanding of how the factors contribute to increasing or decreasing merge conflicts through interviews in a qualitative study (field study). in this section, we present details on the planning and execution of this qualitative study with semi-structured interviews, and we answer the two additional research questions.

5.1 planning and execution

we grounded this study on singer et al. (2008)'s work. according to the authors, a field study aims to investigate practitioners in the context of any task or activity and, based on a specific technique, identify how they cope with their work in practice or how they solve some problems in their contexts. based on singer et al. (2008)'s recommendations, software developers with at least one year of experience were invited to answer a set of questions regarding two major aspects: how the factors identified in the survey research contribute to increasing merge conflicts, and how they contribute to decreasing them. these participants were invited from the researchers' networks, considering their availability to participate in an interview session.

we planned and constructed our interview questions (iq) from the results of the first study (survey research) and focused on the last two research questions presented in section 1. as such, we took the main factors pointed out by the software developers in the survey research and designed eight questions for the interview sessions of the field study: four questions about factors that contribute to increasing merge conflicts (table 4) and four questions about those that contribute to decreasing merge conflicts (table 6). the goal is to understand why these factors are so important and how time of experience (years of working, studying, dealing with merge conflicts, and collaborating with other developers) influences and modifies the software developers' perceptions over the years.
the eight grouped questions are listed below. regarding factors that contribute to increasing merge conflicts, we asked the following questions:

• iq1: questions on the factors
  – do you agree with the table presenting the factors that can lead to merge conflicts?
  – how was your experience in coping with merge conflicts early in your career?
  – how about coping with merge conflicts nowadays?
• iq2: in your opinion, what makes the branching duration so important?
• iq3: questions on lack of communication
  – what is your opinion on the lack of communication?
  – does the lack of communication affect other factors presented in the table?
• iq4: questions on negative effects
  – which factor brings more negative effects? why?
  – has the time of experience changed your answer?

regarding factors that contribute to decreasing merge conflicts, we asked the following questions:

• iq5: what is your opinion on the factors “lack of communication” and “branching duration” changing their positions in the table?
• iq6: questions on the team influence
  – how did communication with the team influence the way to avoid these conflicts?
  – have you felt any improvement over time?
• iq7: questions on past experiences
  – can you cite or have you worked on a project that covered any of these factors?
  – was it a successful experience or not?
• iq8: how do you see that “less branching duration” contributes to decreasing merge conflicts?

a total of 15 software developers participated in the second study (hereafter referred to as fd, from field study developer). all of them were brazilians and answered questions about merge conflicts and their perceptions. the goal was to deepen the understanding of how branches are adopted, as well as of conflict resolution in this context. with this in mind, an interview session of about 30 minutes was conducted with each software developer and recorded via the zoom platform. all data and information were collected anonymously and treated specifically for academic purposes, as explained to the participants in the email invitation, in the informed consent form, and in the conversation before each interview.

as mentioned above, the interviewees were contacted by email. the main selection criterion was to have used some version control system (e.g., git, subversion, version manager), which means that the interviewee has probably already faced some merge conflict. they were informed that they could withdraw at any time, that they were allowed not to answer some questions (if they so wished), and that all video and sound data collected would not be made public (just collected for the study analysis purposes).

before starting an interview, we asked about the developer's time of experience, how long he/she has been dealing with merge conflict situations (which is not necessarily linked to professional experience), and what kind of industry sector he/she works in (or has worked in). the interviewees' time of experience is presented in table 7. of the 15 developers interviewed, 7 (46.6%) have about ten or more years of experience.

the first four questions of each interview referred to factors that contribute to increasing merge conflicts. from tables 4 and 6, we could deepen this discussion by also understanding whether the interviewees agree with them and/or would like to add more factors (or correlated factors). in turn, factors that contribute to decreasing merge conflicts were covered by the last four questions.
the goal was the same as for the previous questions, but we also would like to know if the software developers' time of experience affects their perceptions over time. firstly, a pilot was run with four software developers to verify the interview session protocol and duration. the pilot helped us to check whether the questions were clear enough, as well as how long a session would last on average, to avoid stressing the interviewee and losing focus in complex questions.

5.2 results

this section presents the results obtained from the interviews in the field study and the answers to the last two research questions of our work, as mentioned in section 1. to do so, for each research question, we took as the main codes from the interviewees' answers the most frequent factors that contribute to increasing and decreasing merge conflicts, based on those reported in the first study (survey research). this strategy allowed us to get a better understanding of how those factors have an impact on practice.

table 7. interviewees' time of experience
developer | time of experience
fd01 | almost 6 years
fd02 | 4 years
fd03 | 6 years
fd04 | 2.5 years
fd05 | 12 years
fd06 | 7 years
fd07 | 24 years
fd08 | 14 years
fd09 | 3 years
fd10 | 10 years
fd11 | 12 years
fd12 | 1 year
fd13 | 14 years
fd14 | 4 years
fd15 | 15 years

5.2.1 rq4 (produce conflicts): how do the factors identified in the survey research mostly contribute to increase merge conflicts?

regarding the factors that contribute to increasing merge conflicts and their order of importance, most of the software developers (9) agree (fd03, fd04, fd05, fd06, fd07, fd09, fd10, fd12, and fd14) with the table presented during the interview session (table 4), and some (4) partly agree (fd01, fd02, fd08, and fd13). only two interviewees (fd11 and fd15) declared that they do not completely agree with the table of factors. the interviewees who did not completely agree mentioned that some factor should perhaps be considered more meaningful for a specific context or scenario. “number of lines of code” was highlighted by three software developers (fd03, fd10, and fd14) as something of greater importance. one interviewee reported an experience with one of the critical factors that contribute to increasing merge conflicts:

“i had a lot of problems when i worked, even when the team was small (...) four people developing (...) it was like parallel editing of the same code and people did not have much experience in sharing code (what they did, what they edited...).” (fd11)

we invited the interviewees to comment a little bit more about their career in order to compare their beginning against their current perception. six software developers (fd01, fd04, fd08, fd11, fd12, and fd14) explained that they did not work with either git or other repositories early in their career. additionally, they were running academic projects that were characterized as small and without merge problems:
“[starting with academic projects] is common in our career. as far as your projects are more and more scaling, even the culture of the company where you work, you may have merges that end up being complicated to cope with.” (fd14)

four developers (fd07, fd10, fd13, and fd15) with more years of experience pointed out how important the evolution of version control tools is, especially for assisting in situations regarding merge conflict resolution and for detecting conflicts as well. according to one of those interviewees:

“as time goes by (...), the existing tools started to carefully address this kind of activity (merge) (...) a diff not correctly done was complex for us at the first years of research and practice in version control systems (...) there was a free tool, but it was complicated to work with it considering a lot of existing bugs...” (fd07)

team communication/behavior was mentioned by three developers (fd02, fd03, and fd06) as something noticed both early in their careers and also currently in their work. fd03 even highlighted that the size of a project (a factor mentioned directly and indirectly by more than one interviewee) also affects merge conflicts. fd06 pointed out that he noticed a programming language barrier in open source projects. on the other hand, long branch duration was referred to as important due to the changes made over time, i.e., how much code has been modified/moved in a project (fd01, fd04, fd09, fd12, and fd13). according to one of these interviewees:

“i believe that more long branches you have, more changes and modifications of code you have to cope with, implying in implementation of new features, deprecating other functions of the program and methods, and so on. as such, long branches bring bigger merge conflict problems...” (fd04)

the interviewees often argued their concerns not only about what brings merge conflict problems in long branches, but also about why small branches would be more convenient. they mentioned some cases, such as a branch being outdated when compared to the main branch/another branch (fd05 and fd06), or a branch being associated with a sprint/short time interval (fd01, fd14, and fd15), as presented next:

“i believe that shorter branches (...) can decrease the number of merge problems.” (fd04)

“... if you are running an agile method and stories that are better defined, broken into sub-tasks (...), for example, you do not have this problem, because (theoretically) you have a story in a certain subscope of the development of your project that will be somehow isolated.” (fd01)

“...if we have a branch with a very long, very extensive time frame, (...) sometimes we cannot collect such an accurate feedback from the business area and this would be what we really need to change for production.” (fd03)

communication was called “essential” by one interviewee, “important” by another, and “fundamental” by a third one. the lack of communication was related not only to problems previously described in table 4, but was also declared a behavioral problem, as exemplified next:

“(...) because it leads to merge conflicts and frequently it makes some behaviors in the software development keep happening throughout the project and this affects code, functionalities, (...) the implementation of the project as a whole.” (fd02)

some developers (fd01, fd03, and fd09) mentioned that communication problems go beyond the technical aspect.
this fact is highlighted in the following fragment:

“the lack of communication will lead to conflicts, not only in git, but also any way of working. here it leads mainly (...) to cases in which you will end up messing with something that someone else was already working on...” (fd03)

moreover, agile methodology was also cited by two interviewees as a strategy to support communication in the software development project. this is pointed out in the next fragment:

“it is clear the difference between those who use the [agile] methodology or not.” (fd06)

it is worth highlighting that the factors mentioned in table 4 were also reinforced by the interviewees. one of them stated that:

“the lack of communication usually causes problems regarding the branching conflict, in merge, (...) within the development team itself (...). this is critical for the understanding the time the story is started and that you are doing a part of a whole, (...) for example, not keeping the repository updated (...) will affect the parallel editing of the same code.” (fd01)

tools such as configuration management tools (fd10), project management tools (fd10), task organization tools (fd10), change tracking tools (fd05), and screen sharing tools (fd13) were cited as kinds of support for communication:

“...depending on the tool for change tracking, continuous change etc., i believe that (verbal) communication (...) helps you eliminate the problem a little bit. you can see ‘someone’ (...) touching exactly such and such point of the system (...) and you can verify where you can touch or not, and i believe that this impacts less on merge problems.” (fd05)

communication problems can also lead to rework (fd11, fd12, fd13, and fd14), either because of an added feature or because of a change not communicated to the team:

“lack of communication (...) you end up having to redo what you did. you thought it was right, but it was not what was supposed to be done. this generates so much rework for the developer, stress for the manager...” (fd11)

when we asked about which factors they consider the most negative ones, nine different answers were obtained, as shown in figure 6.

figure 6. factors that negatively affect merge conflicts

communication was the most mentioned factor, in the opinion of six interviewees (fd02, fd04, fd06, fd09, fd11, and fd14).

“...they [conflicts in merge] occur because the distraction of several people, myself included of course. it can lead to some error that will lead to a headache until we can solve it ....” (fd09)

in this context, two problems were pointed out by two interviewees each: outdated repository (fd03 and fd07) and long branch duration (fd08 and fd15). some interviewees selected other problems: many developers in the same branch (fd05); long implementation time (fd07); number of commits (fd08); and number of lines of code (fd12). only one of them (fd13) pointed out that it was hard to solve conflicts in the beginning of his career, but he would not see it as a problem at the present moment, but rather as something expected from the learning process over the years.

when the interviewees were asked to remember their experiences from the beginning of their career, only three of them (fd02, fd10, and fd11) believe they would not have responded to the questions differently. the majority, 12 developers (fd01, fd03, fd04, fd05, fd06, fd07, fd08, fd09, fd12, fd14, and fd15), believe they would think differently about what contributes to increasing merge conflicts over time.
as a conclusion, seven factors were perceived as the main problems regarding merge conflicts according to the interviewees, but they have been rethought over time: number of modified files (fd01 and fd06); communication and duration of the branches (fd03); organization (fd04); tools used in the projects (fd07); lack of attention (fd09); and number of lines modified in the projects (fd14). when the interviewees were asked to compare their perception at the beginning of their career against their current perception, it is not clear if there is a pattern of “x-answers from the beginning changed to y-answers at the present moment”. the interviewees mentioned situations they had experienced to justify the choice of a factor that affected their work early in their career versus others impacting their current projects.

answer to rq4: the interviewees mostly agree with the factors that lead to merge conflicts presented in table 4. long branches, software development methodology, and communication problems were pointed out as some of the main factors to be considered in this context.

5.2.2 rq5 (avoid conflicts): how do the factors identified in the survey research mostly contribute to decrease merge conflicts?

when we asked the interviewees if communication should be in the first position in table 4 (factors that lead to merge conflicts) as well as in the first position of table 6 (factors that avoid merge conflicts), 10 interviewees (fd02, fd03, fd04, fd05, fd06, fd09, fd10, fd11, fd12, and fd13) agreed with the greater importance of this factor. this relevance is exemplified in the following fragment:

“...because communication is in fact very important and it impacts on several factors...” (fd03)

in addition, the interviewee fd05 emphasized that the effective stimulus and use of communication tools (e.g., email, chat, awareness-support systems etc.) contributes to decreasing merge conflicts. the interviewee fd09 argued that good communication reduces rework, and the interviewee fd06 mentioned that the problems are more related to human aspects of the software development process. as such, no specific type of communication was specifically recommended in the interviews. the answers referred to talking more and better before working on a project, using some computational tools, and applying agile methodology as a way to improve and keep communication frequent.

some interviewees (fd07, fd08, fd14, and fd15) believe that “branch duration” is a major factor even when the goal is to decrease merge conflicts. the interviewee fd08 mentioned that some factors in this context may be related to inexperience or “post-conflict” thinking. the interviewee fd14 raised a concern about the extent to which we notice the communication surrounding us:

“... i think it is normal to change your mind, and i think what happens is that you start thinking about the situations you have been through on the team, then you start thinking ‘if that guy had talked to me, it would be less torturous to resolve the conflict’...” (fd14)

the interviewee fd01 reported that communication was highlighted because of cultural reasons. in other words, it refers to the idea of pointing out that there is a problem, and saying that “lack of communication” would be the main factor on this subject would be similar to “pointing the finger” at someone.
by indicating communication as a strategy to decrease merge conflicts, people feel more comfortable in communicating with each other:

“(the communication) as something to avoid a problem rather than being the problem itself.” (fd01)

all interviewees mentioned the communication in the team, either based on a previous project (past experience) or on the one they are currently working on. in both scenarios, nine of them (fd01, fd02, fd05, fd06, fd07, fd09, fd10, fd11, and fd14) noticed improvements regarding effective communication and its positive effects. as suggestions for improving communication, some interviewees cited management, planning, and infrastructure:

“there are several factors that will influence that aspect of improving communication. i think the first one is management.” (fd03)

“there is also that question related to technological limitation. if we think about the current pandemic scenario, (...) several companies have adapted their infrastructure with resources to foster and ensure good communication.” (fd03)

“communication works from the moment you plan how that communication is going to be done.” (fd04)

the improvements mentioned by the interviewees resulted in problem-solving (fd01), collaboration between team members (fd04), less rework (fd11), and less time doing merges (fd13). an example of this report is presented next:

“the fact that you have a person with more knowledge helps a lot who is there working on that project and who may not have enough knowledge, especially related to the business in which that project is inserted.” (fd04)

the interviewee fd15 mentioned that communication would help to resolve, rather than avoid, a merge conflict. this fragment is presented as follows:

“i don’t think there has been much change regarding this topic in the last few years (...) i think all of that is still a problem related to our inability to clearly record or summarize the developer’s intention at the time he/she writes a particular piece of code.” (fd15)

finally, it is worth mentioning that not everyone may have faced a situation in which they realized that communication would be the key factor for avoiding a merge conflict. this is indicated in the following fragment:

“i did not have maturity on this subject before [i.e., thinking about communicating to avoid conflict]. so, if this ability is improved over time, i could only have the notion of its importance to avoid merge conflicts currently...” (fd04)

of the 15 software developers who were interviewed in our field study, 12 have some experience in implementing practices to address the factors they completely or partially agree with. based on the factors listed in table 6, communication (fd01, fd02, fd03, and fd09), task division (fd01, fd02, fd11, and fd15), repository up to date (fd01 and fd05), task-based branches (fd01 and fd05), branching strategy (fd01 and fd08), more frequent commits (fd01), small changes (fd01), and short branches (fd15) were the most prominent. devops culture was also mentioned (fd01 and fd13). some interviewees also included factors beyond the previously mentioned ones, e.g., training (fd04) and some support tools (fd05, fd07, fd08, fd09, fd12, and fd13), such as trello, discord, and vscode.
these tools were mentioned when the interviewees talked about aspects regarding communication, change history, standardization, and task division, as exemplified next:

“we standardized vscode as our ide. there were many people who used other environments.” (fd13)

“... it required improvements. communication should be a frequent concern. you cannot have communication only when there is a problem, i.e., it has to be a daily target.” (fd01)

another factor mentioned by some interviewees was the branching duration (fd03 and fd08), especially because of the business. this was also mentioned by the interviewee fd11:

“...branching duration is directly linked to the business. what comes in and leaves depends on the business, the owner of the company, the client (...) and there is a little margin for negotiation.” (fd08)

factors that contribute to decreasing merge conflicts also mentioned by the interviewees refer to continuous integration (fd01 and fd08), project management environment (fd10), and more frequent commits (fd14). branching duration was confirmed as an important factor to avoid conflicts by nine interviewees (fd04, fd05, fd06, fd07, fd10, fd11, fd13, fd14, and fd15). three others (fd02, fd09, and fd12) did not know how to evaluate it, and the other three (fd01, fd03, and fd08) commented that it is not really the duration of a branch itself that prevents conflicts. in this regard, we found arguments for a shorter branching duration, such as the repository being up to date (fd04, fd07, and fd13), less time for code changing (fd05), less divergence (fd06), memory of what has been done and what is not affected (fd10), the speed of development (fd11), less chance of conflicts (fd13), and faster merges (fd14). the interviewee fd15 summarized the mentioned factors:

“the longer you are isolated in a branch, the more likely another developer will come and change the code that is in parallel with you (...). you will generally remember less and less about it. knowing how the code was before and having fewer developers working in the same code area as you are factors that help you solve (...) or avoid a merge conflict.” (fd15)

answer to rq5: the interviewees mostly agree with the factors that avoid merge conflicts presented in table 6. communication (from simple conversations to those based on computational tools), team management, and infrastructure were pointed out as some of the main factors to be considered in this context.

6 discussion and implications

in this section, we present the main findings of this research on factors that affect merge conflicts based on a quanti-qualitative method.

1) branches are very common and developers have the autonomy to create new branches: most software developers create branches frequently or all the time. the use of branches is very common according to the developers' perspective collected from the survey questionnaire, and no participant in the field study interviews mentioned not using them. only a few developers marked the option that they discuss the creation of new branches with their teams. in a large study at microsoft, shihab et al. (2012) identified that developers should be careful about branch creation, since it may lead to an increase in the likelihood of failures. they suggest aligning the branching structure with the architectural structure and with the organizational structure of their teams (shihab et al., 2012). as mentioned by bird et al.
(2011), branches do not come without a price, given that a branch is normally integrated into others at some point.

2) new features and fixing bugs are the main reasons for creating branches: our results confirm the findings of other studies. zou et al. (2019) found similar results in their investigation with 2,923 projects developed on github: branches are mainly used to implement new features, conduct version iteration, and fix bugs. owhadi-kareshk et al. (2019) and vale et al. (2020) state that developers often use branches to add features or fix bugs. according to bird et al. (2011), branches are created to implement a feature, perform a maintenance exercise, do continued maintenance on a subsystem, or fix several related bugs. premraj et al. (2011) mentioned that branches help developers, architects, build managers, testers, and other people to change software artifacts. additionally, the agile methodology was cited by some software developers in the field study interviews as one of the strategies to cope with the creation of a new branch without contributing to increasing or decreasing merge conflicts.

3) branching duration and lack of communication are the main problems: in the related work (table 1), attributes related to the branching duration are very common, but only two studies mention the branch duration as an indicator of conflict. dias et al. (2020) and menezes et al. (2020) found a relation between the duration of the merge scenario and conflict occurrence. dias et al. (2020) mentioned that “contributions developed over longer periods of time are more likely associated with conflicts”. menezes et al. (2020) found that the timing attributes have a (small) impact on the conflicts. vale et al. (2020) verified the relation between github communication and the occurrence of merge conflicts. the authors found no significant relation between communication measures and the number of merge conflicts. however, the communication recorded by the authors was based only on the communication extracted from github: the communication of all active contributors, the communication by means of pull requests and related issues, and the communication mapped to artifacts that have been changed in the merge scenario. the communication mentioned by the software developers who responded to the survey questionnaire regards the awareness of parallel changes. sometimes developers forget to communicate what they are changing, resulting in two developers changing the same functionality or something very close. in the field study interviews, some software developers also suggested that keeping shorter branches is the best decision to avoid merge conflict problems, especially those related to developers' communication and lack of memory of the changes performed in the project and team over time.

4) most of the time, conflicts are not difficult: most conflicts offer no difficulty (medium or easy) and are resolved in some hours. accioly et al. (2018), ghiotto et al. (2018), and pan et al. (2021) identified the most common conflict patterns and resolutions. accioly et al. (2018) found that 84.57% of merge conflicts happen because developers modify the same lines or consecutive lines of the same method.
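this finding is easy to reproduce mechanically: a three-way textual merge conflicts exactly when both sides touch the same or adjacent lines of the common base. the sketch below provokes such a conflict with the standard git merge-file command on three file versions; the file names and contents are ours, chosen only for illustration.

    import pathlib, subprocess, tempfile

    def write(dirpath, name, text):
        path = pathlib.Path(dirpath, name)
        path.write_text(text)
        return str(path)

    with tempfile.TemporaryDirectory() as d:
        # both sides edit the same line of the common ancestor
        base = write(d, "base.py", "def greet():\n    return 'hello'\n")
        ours = write(d, "ours.py", "def greet():\n    return 'hello world'\n")
        theirs = write(d, "theirs.py", "def greet():\n    return 'hi'\n")
        # git merge-file merges in place into `ours`; the exit status is
        # the number of conflicts found
        result = subprocess.run(["git", "merge-file", ours, base, theirs])
        print("conflicts:", result.returncode)
        print(pathlib.Path(ours).read_text())  # shows <<<<<<< ... >>>>>>> markers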
ghiotto et al. (2018) found that conflicting chunks generally contain all the necessary information to resolve them. pan et al. (2021) found in their study on conflict resolution that 28% of changes are of 1-2 lines for both main and forked branches, and that 39.5% of the resolution strategies involved concatenating the main and the forked branch's changes. mckee et al. (2017) performed a survey and found nine factors developers use when attempting to determine whether a conflict is difficult; the complexity of conflicting lines of code and files, the knowledge in the area of conflicting code, and the number of conflicting lines were the most cited. it is interesting to mention that some of these factors were used in related work to predict conflict occurrence. brindescu et al. (2020b) also investigated the characteristics of merge conflicts that are associated with their difficulty. the authors found a subset of ten factors that can predict the difficulty of merge conflicts, including complexity, diffusion, size, and development pattern. the more experienced developers pointed to the improvement of version control tools over time as a factor that has improved conflict resolution. it is worth highlighting that the field study interviews also raised that the project's size somehow influences the resolution of merge conflicts, especially in large projects (and large teams), where the chance of merge conflicts is higher. moreover, when a developer is at the beginning of his/her career, he/she does not usually pay attention to this kind of situation, especially to the importance of communication in a project (de farias junior et al., 2022).

5) improving team communication and less branching duration can avoid conflicts: as mentioned previously, dias et al. (2020) and menezes et al. (2020) found that timing measures have an influence on conflict occurrence. so, we believe that a good practice is to pay attention to the isolation time and not postpone the merge too much. when developers are less isolated, the repositories are synchronized and people are aware of what other people are doing. software developers in the survey questionnaire mentioned the importance of knowing what parts others are working on to avoid conflicts. communication and its relation with merge conflicts are investigated mainly in studies addressing awareness. some specific studies (sarma et al., 2008; brun et al., 2011; sarma et al., 2011; guimarães and silva, 2012; estler et al., 2013) focus on the prevention of conflicts through awareness, i.e., detecting conflicts early. basically, these tools monitor workspaces and inform developers of ongoing parallel changes in other workspaces. as also mentioned by some software developers in the field study interviews, it is relevant to improve communication channels and also use awareness tools to know what each one is changing. moreover, other points related to the factors referred to having more and better conversations before starting a branch (or even a project), based on computational tools such as trello, as well as applying agile methodology as a strategy to reduce time spans.
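the core of such awareness checks can be approximated with a dry-run merge: attempt the merge without committing, record whether it conflicts, and undo it. a minimal sketch, assuming a clean working tree and a remote named origin; the cited tools are considerably more sophisticated than this.

    import subprocess

    def would_conflict(upstream="origin/main"):
        """Report whether merging `upstream` into the current branch would
        conflict, then restore the pre-merge state."""
        subprocess.run(["git", "fetch", "origin"], check=True)
        merge = subprocess.run(
            ["git", "merge", "--no-commit", "--no-ff", upstream],
            capture_output=True, text=True,
        )
        # undo the attempted merge (a no-op if there was nothing to merge)
        subprocess.run(["git", "merge", "--abort"], capture_output=True)
        return merge.returncode != 0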
6) qualitative analysis findings: the answers to our survey questionnaire and field study interviews show that software developers also use branches to create proofs of concept, and git flow seems to be a good strategy to coordinate the use of branches. in addition, they suggest that not keeping the repository up to date can cause problems, so developers need to bring in the changes constantly. attention to code formatting is also important. accioly et al. (2018) noticed that part of the merge conflicts is simply caused by changes to code indentation or consecutive line edits. regarding this problem, some software developers suggest adopting a code style tool. furthermore, as good practices to avoid conflicts, some of them also mention the option of not using branches and adopting techniques such as feature flags. they also cited always communicating with the team and checking for code updates in the master/trunk as good practices, as noticed in the field study interviews.

7 threats to validity and credibility

this work applied a quanti-qualitative method. therefore, there are two different empirical studies (survey research and field study), and each of them has specific threats and limitations. each subsection below informs their threats as well as the strategies to mitigate them.

1) survey research: a) protocol. we adopted predefined answers for some closed questions, given that they were grounded on previous studies published in the literature (owhadi-kareshk et al., 2019; leßenich et al., 2018; dias et al., 2020; menezes et al., 2020; vale et al., 2020). moreover, we also left an open field allowing a participant to comment on different factors not listed in the question. we developed the questionnaire very carefully. as it would be our main source for all sections of this study, we discussed at length and took a long time to construct our questionnaire. in addition to our experience on the subject, we spent a lot of time looking at the literature and building our survey based on the pieces of evidence from these studies and some similar initiatives (condina et al., 2020; kamei et al., 2020). we also conducted a pilot with four developers and asked for feedback on the questions, and whether they were understandable and relevant to the study.

b) sample. the software developers who responded to the questionnaire were invited by email via contact lists, and they were asked to share the survey with their colleagues with experience in software development (snowballing invitation). we tried to make sure that only people with experience in the use of vcs answered the questionnaire, either in the invitation, in the survey description, or in the question specifically referring to the use of any vcs. such an approach was important to avoid any participant with a lack of experience or knowledge.

c) context. we only had the participation of brazilian software developers in our study. results may not generalize to the context of all software developers all over the world. some results confirmed the findings presented in related work, but others require more in-depth investigation. in addition, according to smith et al. (2013), high-quality research on the human side of software engineering requires real software developers, but getting high levels of participation remains a challenge for researchers. nonetheless, it is relevant to emphasize that our results reflect the perspective of a large group (109 participants).

2) field study: a) protocol. we used the results from the survey research as the input for the questions prepared for the interview sessions, considering the main factors that affect merge conflicts according to the brazilian software developers who answered the survey questionnaire.
the developers who were interviewed were invited by email, and they were requested to share the invitation with their colleagues with some experience in resolving merge conflicts (snowballing invitation). only brazilian software developers participated in the field study interviews. therefore, the results may not generalize, especially considering the interpretive validity of a qualitative study, i.e., the possibility, even without the researcher's intention, of substituting his/her own perception for a real understanding of what the interviewee meant. b) sample. our intention was to have at least 20 interviewees, based on guest et al. (2006)'s work regarding the occurrence of saturation with at least 12 interviews, given that in this research "the aim is to understand common perceptions and experiences among a group of relatively homogeneous individuals". moreover, steglich et al. (2019) and greiler et al. (2022) conducted field studies with software developers considering guest et al. (2006)'s work and reinforce that the most important criterion is saturation, i.e., the point at which a new interview with relatively homogeneous individuals does not provide any new data or information. for example, steglich et al. (2019) reached saturation with 11 interviews. in our study, 15 developers were able to participate in the period when the field study was run. based on the interviews, saturation was obtained with 12 interviews, which is in accordance with guest et al. (2006)'s work. it is important to remark that the main goal of our field study was to collect the brazilian software developers' perceptions on merge conflicts in a qualitative setting and not through a large-scale, quantitative study based on software repository analysis. c) context. the same concern pointed out by smith et al. (2013) is valid for the field study, i.e., "high-quality research on the human side of software engineering requires real software developers, but getting high levels of participation remains a challenge for researchers". this includes the software developers' hesitation about the interview questions, given the fear of leaking confidential information from their own projects and/or the companies in which they work. it is a critical barrier faced in field studies, given their qualitative, in-person nature (singer et al., 2008), especially when requesting participation. nonetheless, it is important to highlight that the result of this study reflects the vision of a group of brazilian software developers, and that its focus was on deepening the understanding of the results from the previous survey research (smith et al., 2013).
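the saturation criterion discussed in the sample paragraph above can be operationalized as a stopping rule over the number of new codes that each additional interview contributes. a minimal sketch, assuming a fixed-size window of interviews with no new codes (the window size is our assumption; guest et al. (2006) do not prescribe one):

```python
def saturated(new_codes_per_interview: list[int], window: int = 3) -> bool:
    """illustrative stopping rule: saturation once `window` consecutive
    interviews contribute no new codes (window size is an assumption)."""
    run = 0
    for new_codes in new_codes_per_interview:
        run = run + 1 if new_codes == 0 else 0
        if run >= window:
            return True
    return False

# hypothetical coding log: interviews 12-14 add nothing new
print(saturated([9, 6, 4, 3, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0]))  # True
```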
8 conclusion this research aimed to investigate factors that lead to or help to avoid merge conflicts. to do so, based on related work, we conducted two empirical studies to both understand and analyze factors that affect merge conflicts. firstly, we conducted survey research with 109 software developers to understand the adoption of branches as well as the occurrence and resolution of conflicts. results suggest that the main factors that can lead to conflicts are "the time a branch is isolated" and "lack of communication". on the other hand, the factors cited as good practices to avoid conflicts were "improve team communication" and "less branching duration". "divide the work among the team", "small changes", and "frequent commits" were also marked many times by the participants of the survey research. communication here refers to the awareness of parallel changes, considering the importance of knowing what others are working on. we also performed a qualitative analysis to extract codes and categories from the open fields of five questions responded to by the participants. we identified that git flow is a common strategy adopted to coordinate branches, along with synchronizing the repository constantly and paying attention to code formatting to avoid conflicts. next, we conducted a field study based on interviews with 15 software developers to analyze those factors and obtain a better understanding of what contributes to increasing or decreasing merge conflicts. results show that communication with the team, checking code updates, shorter branch duration, and management (which comprises software development methodology, communication strategies, and awareness-support systems) seem to be key policies, not only for merge conflict resolution but also for decreasing conflicts. moreover, the developers' time of experience can change their perception of the problems faced in this context and help them avoid or resolve a merge conflict, besides the fact that version control systems have evolved to a great extent, being also an important support on this topic. finally, this study allowed us to conclude that most software developers agree on the factors that lead to and the factors that avoid merge conflicts, and the underlying problems and how to resolve them are still a concern for all of them. in future work, we intend to evaluate the application of some good practices suggested in this work. we could evaluate supporting processes and tools to improve communication and reduce isolated work, among other mentioned factors. another opportunity is to perform a quantitative study based on mining software repositories in order to analyze some github projects against some findings of this work, for example, through a study on the projects' branching duration, communication tactics, and merge conflict resolution. finally, this work can be replicated with software developers from other contexts (e.g., different cultures, countries, genders, etc.) to produce other indications and allow systemic analyses. acknowledgements we thank all the participants who answered our survey and interviews. the author also thanks unirio and faperj (grant: 211.583/2019) for partial support. references accioly, p., borba, p., and cavalcanti, g. (2018). understanding semi-structured merge conflict characteristics in open-source java projects. empirical software engineering, 23:2051–2085. bird, c., zimmermann, t., and teterev, a. (2011). a theory of branches as goals and virtual teams. in proceedings of the 4th international workshop on cooperative and human aspects of software engineering, pages 53–56. brindescu, c., ahmed, i., jensen, c., and sarma, a. (2020a). an empirical investigation into merge conflicts and their effect on software quality. empirical software engineering, 25:562–590. brindescu, c., ahmed, i., leano, r., and sarma, a. (2020b). planning for untangling: predicting the difficulty of merge conflicts. in 42nd international conference on software engineering (icse), pages 801–811. brun, y., holmes, r., ernst, m. d., and notkin, d. (2011). proactive detection of collaboration conflicts.
in 19th acm special interest group on software engineering symposium and the 13th european conference on foundations of software engineering (sigsoft), pages 168–178. cavalcanti, g., accioly, p., and borba, p. (2015). assessing semistructured merge in version control systems: a replicated experiment. in 2015 acm/ieee international symposium on empirical software engineering and measurement (esem), pages 1–10. ieee. condina, v., malcher, p., farias, v., santos, r., fontão, a., wiese, i., and viana, d. (2020). an exploratory study on developers opinions about influence in open source software ecosystems. in proceedings of the 34th brazilian symposium on software engineering, pages 137–146. costa, c., figueiredo, j. j., ghiotto, g., and murta, l. (2014). characterizing the problem of developers' assignment for merging branches. international journal of software engineering and knowledge engineering, 24:1489–1508. costa, c., figueiredo, j. j., pimentel, j. f., sarma, a., and murta, l. g. p. (2019). recommending participants for collaborative merge sessions. ieee transactions on software engineering. costa, c., menezes, j., trindade, b., and santos, r. (2021). factors that affect merge conflicts: a software developers' perspective. in brazilian symposium on software engineering, pages 233–242. de farias junior, i., marczak, s., dos santos, r. p., rodrigues, c., and moura, h. (2022). c2m: a maturity model for the evaluation of communication in distributed software development. empirical software engineering. dias, k., borba, p., and barreto, m. (2020). understanding predictive factors for merge conflicts. information and software technology, 121:106256. estler, h. c., nordio, m., furia, c. a., and meyer, b. (2013). unifying configuration management with merge conflict detection and awareness systems. in 22nd australian software engineering conference (aswec), pages 201–210. ghiotto, g., murta, l., barros, m., and hoek, a. v. d. (2018). on the nature of merge conflicts: a study of 2,731 open source java projects hosted by github. ieee transactions on software engineering, 46:892–915. greiler, m., storey, m.-a., and noda, a. (2022). an actionable framework for understanding and improving developer experience. ieee transactions on software engineering. guest, g., bunce, a., and johnson, l. (2006). how many interviews are enough? field methods, 18:59–82. guimarães, m. l. and silva, a. r. (2012). improving early detection of software merge conflicts. in 34th international conference on software engineering (icse), pages 342–352. kamei, f., wiese, i., pinto, g., ribeiro, m., and soares, s. (2020). on the use of grey literature: a survey with the brazilian software engineering research community. in proceedings of the 34th brazilian symposium on software engineering, pages 183–192. kasi, b. k. and sarma, a. (2013). cassandra: proactive conflict minimization through optimized task scheduling. in 35th international conference on software engineering (icse), pages 732–741. leßenich, o., siegmund, j., apel, s., kästner, c., and hunsen, c. (2018). indicators for merge conflicts in the wild: survey and empirical study. automated software engineering, 25:279–313. mckee, s., nelson, n., sarma, a., and dig, d. (2017). software practitioner perspectives on merge conflicts and resolutions. in 33rd ieee international conference on software maintenance and evolution (icsme), pages 467–478. menezes, j. w., trindade, b., pimentel, j. f., moura, t., plastino, a., murta, l., and costa, c. (2020).
what causes merge conflicts? in 34th brazilian symposium on software engineering (sbes), pages 203–212. menezes, j. w., trindade, b., pimentel, j. f., plastino, a., murta, l., and costa, c. (2021). attributes that may raise the occurrence of merge conflicts. journal of software engineering, 9:14. owhadi-kareshk, m., nadi, s., and rubin, j. (2019). predicting merge conflicts in collaborative software development. in 13th acm/ieee international symposium on empirical software engineering and measurement (esem), pages 1–11. pan, r., le, v., nagappan, n., gulwani, s., lahiri, s., and kaufman, m. (2021). can program synthesis be used to learn merge conflict resolutions? an empirical analysis. in 2021 ieee/acm 43rd international conference on software engineering (icse), pages 785–796. ieee. pfleeger, s. l. and kitchenham, b. a. (2001). principles of survey research: part 1: turning lemons into lemonade. acm sigsoft software engineering notes, 26:16–18. premraj, r., tang, a., linssen, n., geraats, h., and van vliet, h. (2011). to branch or not to branch? in proceedings of the 2011 international conference on software and systems process, pages 81–90. sarma, a., redmiles, d., and van der hoek, a. (2008). empirical evidence of the benefits of workspace awareness in software configuration management. in proceedings of the 16th acm sigsoft international symposium on foundations of software engineering, pages 113–123. sarma, a., redmiles, d. f., and hoek, a. v. d. (2011). palantir: early detection of development conflicts arising from parallel code changes. ieee transactions on software engineering, 38:889–908. shihab, e., bird, c., and zimmermann, t. (2012). the effect of branching strategies on software quality. in 12th acm/ieee international symposium on empirical software engineering and measurement (esem), pages 301–310. singer, j., sim, s. e., and lethbridge, t. c. (2008). software engineering data collection for field studies, pages 9–34. springer london, london. smith, e., loftin, r., murphy-hill, e., bird, c., and zimmermann, t. (2013). improving developer participation rates in surveys. in 2013 6th international workshop on cooperative and human aspects of software engineering (chase), pages 89–92. spencer, d. (2009). card sorting: designing usable categories. rosenfeld media. steglich, c., marczak, s., de souza, c. r., guerra, l. p., mosmann, l. h., figueira filho, f., and perin, m. (2019). social aspects and how they influence mseco developers. in 2019 ieee/acm 12th international workshop on cooperative and human aspects of software engineering (chase), pages 99–106. vale, g., hunsen, c., figueiredo, e., and apel, s. (2021). challenges of resolving merge conflicts: a mining and survey study. ieee transactions on software engineering. vale, g., schmid, a., santos, a. r., almeida, e. s. d., and apel, s. (2020). on the relation between github communication activity and merge conflicts. empirical software engineering, 25:402–433. zimmermann, t. (2016). card-sorting: from text to themes. in perspectives on data science for software engineering, pages 137–141. elsevier. zou, w., zhang, w., xia, x., holmes, r., and chen, z. (2019). branch use in practice: a large-scale empirical study of 2,923 projects on github. in 2019 ieee 19th international conference on software quality, reliability and security (qrs), pages 306–317. ieee.
journal of software engineering research and development, 2019, 4:1, doi: 10.5753/jserd.2020.719 this work is licensed under a creative commons attribution 4.0 international license. editorial letter for cibse 2019 special edition beatriz marin [ universidad diego portales, chile | beatriz.marin@mail.udp.cl ] isabel sofia brito [ instituto politécnico de beja, portugal | isabel.sofia@ipbeja.pt ] this issue of the jserd contains seven extended and peer-reviewed papers from the xxii ibero-american conference on software engineering (cibse 2019), which was held in la habana, cuba, in april 2019. cibse was conceived as a space dedicated to the dissemination of research results and activities on software engineering in ibero-america. this conference aims to promote high-quality scientific research in ibero-american countries, supporting the researchers in this community in publishing and discussing their work. cibse is organized in three tracks: software engineering track (set), experimental software engineering latin american workshop (eselaw), and requirements engineering track (ret). cibse received 154 submissions, of which 60 were finally accepted as papers. for this special issue, we selected the best papers from each track, which were extended and reviewed in two rounds. all papers were refereed by three well-known experts in the field. the selected papers are described as follows: the paper "supporting a hybrid composition of microservices: the eucaliptool platform", by pedro valderas, victoria torres, and vicente pelechano, presents a hybrid solution based on the choreography of business process pieces that are obtained from a previously defined description of the complete microservice composition. to support this solution, the eucaliptool platform is presented. the authors face the challenge of defining a hybrid solution to compose microservices that combines the benefits of the choreography and orchestration approaches. https://doi.org/10.5753/jserd.2020.457 the paper "requirements engineering base process for a quality model in cuba", by yoandy lazo alvarado, leanet tamayo oro, odannis enamorado pérez, and karine ramos, proposes a quality model for software development that contributes to raising the percentage of successful projects in cuban software development organizations, regarding the fulfillment of the agreed requirements. the solution proposal contains specific requirements and support elements (graphic and textual description of the process), divided by the three levels of maturity proposed by the model. the satisfaction of the final user was also measured by applying iadov techniques.
https://doi.org/10.5753/jserd.2020.459 the paper "towards a new template for the specification of requirements in semi-structured natural language", by raúl mazo, carlos andrés jaramillo, paola vallejo, and jhon harvey medina, addresses the problems in the specification of the requirements of a system by means of an adaptable and extensible template for specifying requirements of different domains (application systems, software product lines, cyber-physical systems, self-adapting systems). through the action research method, the authors observed that the reference template needed improvement and that such improvement was feasible. they also found that the new template could be used in industrial cases. https://doi.org/10.5753/jserd.2020.473 the paper "characterization of software testing practices: a replicated survey in costa rica", by christian quesada-lópez, erika hernandez-agüero, and marcelo jenkins, characterizes the state of the practice based on practitioners' use and perceived importance of software testing practices. to make a more in-depth analysis of software testing practices among practitioners, the authors replicated a previous survey conducted in south america. this study shows the state of the practice in software testing in a thriving and very dynamic industry that currently employs most of the country's computer science professionals. the benefits are twofold: for academia, it provides a road map to revise the academic offer, and for practitioners, it provides a first set of data to benchmark their practices. https://doi.org/10.5753/jserd.2019.472 in the paper "specifying the process model for systematic reviews: an augmented proposal", by pablo becker, luis olsina, denis peppino, and guido tebes, the proposed systematic literature review (slr) process considers with higher rigor the principles and benefits of process modeling, enabling slrs to be more systematic, repeatable, and auditable for researchers and practitioners. the authors have documented the slr process specification by using process-modeling perspectives and mainly the spem language. it is a recommended flow for the slr process, since the authors are aware that in a process instantiation there might be some variation points, such as the parallelization of some tasks. https://doi.org/10.5753/jserd.2019.460 the paper "a revisited systematic literature mapping on the support of requirement patterns for the software development life cycle", by taciana n. kudo, renato f. bulcão-neto, alessandra a. macedo, and auri m. r. vincenzi, describes a revisited systematic literature mapping (slm) that identifies and analyzes research in order to demonstrate the benefits of using requirement patterns for software design, construction, testing, and maintenance. the slm protocol includes automatic search over two additional sources of information and the application of the snowballing technique, resulting in ten primary studies for analysis and synthesis. results indicate that there is still an open field for research that demonstrates, through empirical evaluation and usage in practice, the pertinence of requirement patterns to software design, construction, testing, and maintenance.
https://doi.org/10.5753/jserd.2019.458 the paper "the rocs framework to support the development of autonomous robots", by leonardo ramos, gabriel lisboa, guimarães divino, guilherme cano lopes, breno bernard nicolau de frança, leonardo montecchi, and esther luna colombini, addresses the need to organize and modularize software for the correct functioning of robotic systems, given that developing software for controlling robots is a complex and intricate task. based on the well-known ibm autonomic computing reference architecture (known as mape-k), this work defines a refined architecture following the robotics perspective. to explore the capabilities of the proposed refinement, the authors implemented the rocs (robotics and cognitive systems) framework for autonomous robots. https://doi.org/10.5753/jserd.2019.470 we would like to thank the authors, track chairs, and members of the program committee of each track at the conference for their effort and rigorous work done in the review process, as well as the jserd editorial board for offering us the opportunity of preparing this special issue. enjoy the reading! beatriz marín isabel sofia brito journal of software engineering research and development, 2019, 6:3, doi: 10.5753/jserd.2019.14 this work is licensed under a creative commons attribution 4.0 international license. towards a more in-depth understanding of the iot paradigm and its challenges rebeca campos motta [ universidade federal do rio de janeiro e lamih cnrs umr 8201 | rmotta@cos.ufrj.br ] valéria martins da silva [ universidade federal do rio de janeiro | vsilva@cos.ufrj.br ] guilherme horta travassos [ universidade federal do rio de janeiro | ght@cos.ufrj.br ] abstract the internet of things (iot) is a new technological paradigm that brings together the physical and virtual worlds to provide software systems everywhere through daily life objects. the iot can transform how we interact with the environment surrounding us, leading to a significant multidisciplinary technological shift. however, since it is a new field of research and development, there is a lack of consensus and understanding of its concepts and features, as we observed when engineering some software systems in the field. therefore, we performed investigations to characterize iot regarding its definition, characteristics, and applications, organizing the area and revealing its challenges and research opportunities, focusing on software engineering for the iot. a structured literature review of secondary studies supported the answering of three research questions: what is the "internet of things"? which characteristics can define an iot domain? which are the areas of iot application? the structured literature review led to 15 secondary studies, from which we recovered 34 definitions, discussed in light of the technical evolution, 29 characteristics, and several iot application areas. furthermore, the results include an iot characterization based on identification, sensing, and actuation capabilities, besides a discussion of the relation between iot and cyber-physical systems (cps) and of other research areas and terms often associated with iot, aiming to bring clarification to the field. in this work, we offer an essential overview of the iot state of the art and a characterization, presenting issues that should be addressed to contribute to its strengthening and establishment.
keywords: internet of things, systems engineering, evidence-based software engineering 1 introduction the internet of things (iot) has emerged as a new paradigm in which software systems are no longer limited to computers, specific users' goals, or closed environments, but spread across a great variety of different connected objects. the interaction between humans and the cyber-physical world is changing, since software can be deployed everywhere and in everything, such as cars, smartphones, and clothes, and in different environments (atzori, iera, and morabito, 2010; kraijak and tuwanut, 2016; datta et al., 2017; wortmann, combemale and barais, 2017; cicirelli et al., 2018), characterizing the iot domain and vision. it enables a pervasive interaction between connected things enhanced with identification, sensing, actuation, and processing capabilities, which enable them to interact with the environment. together with the benefits proposed by the iot paradigm, new challenges also arise. the constant evolution of technology, application heterogeneity and diversity of devices, and other particularities, such as a lack of division of roles, scale, and different lifecycle phases, differentiate iot applications from traditional ones (patel and cassou, 2015). this can challenge the current software technologies used to develop iot applications and to consolidate such a paradigm (skiba, 2013; zambonelli, 2016; larrucea et al., 2017). one of the recurrent difficulties regards the natural multidisciplinarity and novelty of iot. since iot is a modern paradigm, some fundamental points are still under discussion and involve converging topics of different research streams (motta, de oliveira, and travassos, 2018). in our previous research regarding ubiquitous (spínola, pinto and travassos, 2008; spínola and travassos, 2012) and context-aware software systems (matalonga, rodrigues and travassos, 2017; santos et al., 2017), we identified some gaps and the need for software technologies that can also be observed in the iot domain. however, as a constant challenge in this area, the lack of a unified iot perception, together with some experiences in engineering iot software systems, motivates this research as a starting point for further investigation and development activities at our research group. in this scenario, we performed a structured literature review of secondary studies on iot to understand the "internet of things" concept, as well as its characteristics and the application domains making use of it. therefore, this research aims to characterize the internet of things paradigm, considering the scenario of invisible and pervasive complex systems that support daily activities in the world. this review intends to answer the following questions: what is the "internet of things"? which characteristics can define an iot domain? which are the areas of iot application? the primary goal of this review is to strengthen the understanding of the iot paradigm, characterizing it based on its properties and identifying the current iot applications (the domains that are currently getting some benefit from iot) under the perspective of engineering iot software systems. we made this decision since the advancement of technologies makes society highly dependent on engineered software systems. we aim to discuss the software engineering scenario in the iot paradigm, the results of this review being the first step of research towards the understanding of engineering iot software systems.
therefore, the intention is to promote a high-level discussion of the identified iot paradigm characteristics and give an overview of the area, aiming to promote a better perception of current development needs and opportunities. important works from the literature review supported our discussions and the answers to the research questions. there are many definitions of iot available in the technical literature, and even though they are different, they share similar points. from this diverse content, it was necessary to build our own understanding of the iot concept and of what the "things" represent in the iot context. besides the iot characterization, we discuss the relation among iot, cps, and other related terms to highlight some points that lead to considering some areas as building blocks for iot or, on the other hand, dependent on its evolution. the remainder of the paper is structured as follows. in the next section, the methodology is introduced, and we explain how it was applied in this study. then, in section 3, the results of the literature review are presented. these results are further discussed in section 4, together with the validity threats. the main conclusions from the paper are summarized in section 5. 2 research methodology the purpose of this literature review is to contribute to a more in-depth understanding of the internet of things and its challenges, identifying its definitions, characteristics, and current areas of use. 2.1 review planning before undertaking any literature review, it is essential to observe its necessity (budgen and brereton, 2006). therefore, we started with an ad-hoc search looking for any existing secondary studies on iot. considering the iot paradigm as a new motivating area for investigation, we decided to review the technical literature more systematically, adopting existing practices to compose our study plan. in our perspective, "secondary studies" are studies that survey primary studies to present a bigger picture of a domain, the iot in this case. all secondary studies that meet the selection criteria should be included, even if they do not mention their research protocol. the research protocol followed the recommendations proposed by (budgen and brereton, 2006; de almeida biolchini et al., 2007) and, for the sake of space, has only some of its details presented below. the research goal is gqm-based (basili, caldiera and rombach, 1994) and defined as follows: to analyze the internet of things with the purpose of characterizing it regarding its definitions, characteristics, and application areas from the point of view of software engineering researchers in the context of knowledge previously organized and presented in secondary studies regarding iot available in the technical literature. from this goal, we defined the research questions (rq): (rq1) what is the "internet of things"? (rq2) which characteristics can define an iot domain? (rq3) which are the areas of iot application? with this goal, the secondary studies were searched according to the following information: search strategy: the search strategy used scopus combined with snowballing procedures.
scopus (https://www.scopus.com/) was chosen as the search engine since it indexes several databases of peer-reviewed sources, covering repositories such as ieee xplore (https://ieeexplore.ieee.org), and favors the repeatability of the search results (matalonga, rodrigues and travassos, 2017; santos et al., 2017). in turn, backward and forward snowballing refers to using the reference list of cited papers or the citations to a paper to identify additional sources of data, complementing and extending the initial set of papers (wohlin, 2014). also, as far as our experience shows, the strategy of using scopus with snowballing procedures mitigates an eventual lack of content, avoids duplicated filtering work, and provides a representative set of papers for a characterization study such as this one (motta, oliveira, and travassos 2016; motta, oliveira, and travassos 2018). search string: since the review focus is to retrieve information from secondary studies, it was: title-abs-key (( "*systematic literature review" or "systematic* review*" or "mapping study" or "systematic mapping" or "structured review" or "secondary study" or "literature survey" or "survey of technologies" or "driver technologies" or "review of survey*" or "technolog* review*" or "state of research") and ( "internet of things" or "iot")). selection criteria: works presented as articles shall be available on the web, retrieved from the search engine, and written in english. the inclusion criteria were: provide an iot definition, and provide iot properties or iot application areas. the exclusion criteria were: duplicate publication/self-plagiarism, or register of proceedings. selection procedure: read the title and abstract of each retrieved study and evaluate it according to the inclusion and exclusion criteria. two distinct readers evaluated each secondary study. the acceptance of studies happened as follows: if both readers accept, the study is included; if one reader accepts and the other is in doubt, the study is included; if one reader accepts or is in doubt and the other excludes, the study is discussed; if both readers exclude, the study is not included. data extraction: data extraction aims to capture information from the selected articles to answer the proposed research questions. the data extraction form was proposed during the review planning and used throughout the process. the information was extracted as presented in table 1. 2.2 review execution the review process was executed according to the following steps: step 1: ad-hoc search. it is based on the researchers' experience, without any explicit or planned process in comparison to a systematic literature review. the primary objective of the ad-hoc search was to verify the need to carry out an initial literature review on the target topic and to identify control articles to guide the formulation of a search string for further searches. two researchers performed this step to identify the existence of any secondary study related to iot. the search perspective was established from the software engineering point of view for paper reading and analysis. since we identified secondary studies, we decided to review the existent articles instead of relying on primary studies.
from the results of this ad-hoc search, three articles were selected as a starting point for the next step, since they met the selection criteria (atzori, iera, and morabito, 2010; bandyopadhyay and sen, 2011; li, xu and zhao, 2015). step 2: scopus search. we organized the terms of the search string based on synonyms and similar terms. the search string was adjusted to recover the three articles previously selected. the total of items found was 76; the search was executed at the end of may 2017, considering the papers available in the database until that date. step 3: title and abstract reading. the list of 76 articles was reviewed to remove duplicates and proceedings, according to the selection criteria. the remaining articles were then read based on title and abstract and reviewed by a third researcher with more experience in the research area. 24 articles were selected for further reading, following the criteria established in the research protocol. step 4: full reading. the two researchers read the full text of the 24 articles (12 each, with cross-checking), considering the inclusion and exclusion criteria. seven of them met the criteria, being those finally selected. step 5: snowballing. it refers to using the reference list of an article or its citations to identify additional material (wohlin, 2014). in this step, we performed backward and forward snowballing sampling, tracking down references in the seven articles selected in the previous step and their citations. the total of articles was divided, and each researcher was responsible for performing the snowballing in part of the articles. nineteen articles were identified as candidates, and the reviewers cross-checked the articles to be included, considering the selection criteria. this step resulted in the inclusion of five new articles. step 6: review update. the previous five steps were carried out between march and may 2017. the update was performed in december 2018 to cover new publications made available between 2017 and 2018. we re-executed the same string in scopus and analyzed the results following the criteria previously established. the three reviewers conducted the update, repeating steps 3 and 4 for the new scopus results and the forward snowballing (step 5) for the whole set. this step resulted in the inclusion of three new articles. the review steps resulted in 15 articles, composing the final set: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; miorandi et al., 2012; gubbi et al., 2013; singh, tripathi and jara, 2014; borgia, 2014; whitmore, agarwal and da xu, 2015; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015; gil et al., 2016; sethi and sarangi, 2017; trappey et al., 2017; burhanuddin et al., 2017; ray, 2018; carcary et al., 2018). see the details of each step in table 2.
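steps 3 and 4 apply the two-reader acceptance rule defined in the selection procedure (section 2.1). the rule can be encoded as a small decision function over the readers' votes; a minimal sketch (ours, since the authors state the rule only in prose):

```python
def selection_decision(reader_a: str, reader_b: str) -> str:
    """two-reader acceptance rule from the protocol; votes are
    'accept', 'doubt', or 'exclude'. the 'both in doubt' case is not
    specified in the paper, so we route it to discussion (assumption)."""
    votes = sorted([reader_a, reader_b])
    if votes in (["accept", "accept"], ["accept", "doubt"]):
        return "include"
    if votes == ["exclude", "exclude"]:
        return "exclude"
    return "discuss"  # one excludes while the other accepts/doubts, or both doubt

# example: disagreement between readers sends the study to discussion
print(selection_decision("accept", "exclude"))  # discuss
```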
table 1 information extraction fields. reference information: authors, title, year, and venue. abstract: the study abstract. iot definition: verbatim, as presented in the article (definition research-based, derived, or with reference). iot related terms: other definitions it is associated with (ubiquitous, context-aware, pervasive, machine-to-machine, and others). iot application features: particular traits, features, properties, and attributes that make iot what it is (that achieve the iot definition/concept). iot application areas: the areas (and their related applications) that will benefit from the full iot idea deployment. development strategies for iot: the development strategies used to build iot software (requirements analysis, design, and so on). type of study: it is expected to have only secondary studies, represented by survey, systematic literature review, and others. study properties: protocol, research questions, search string, selection criteria. challenges: open opportunities in practice or research. article focus: main concerns presented in the articles (architecture, security, and others). things: a list of the kinds of things explicitly stated in the article (coffeemaker, refrigerator, incubator, and others). table 2 total of articles selected at each step of the review: step 1, –; step 2, 76; step 3, 24; step 4, 7; step 5, 5; step 6, 3; final set, 15. 3 results the dataset contains papers from 2010 to 2018. it is possible to observe a growing interest in the area over the years. the results show that most of the available publications in the technical literature were from 2015 to 2018, considering the period of the search. since it is a topic that has recently gained strength, initiatives from both industry and research are still in the early stages. table 3 presents the study types considering the classification initially presented by the authors. table 3 study types. systematic literature review: (carcary et al., 2018). literature review: (atzori, iera and morabito, 2010; miorandi et al., 2012; singh, tripathi and jara, 2014; li, xu, and zhao, 2015; gil et al., 2016; burhanuddin et al., 2017; sethi and sarangi, 2017; ray, 2018). literature survey: (bandyopadhyay and sen 2011; madakam, ramaswamy, and tripathi 2015; whitmore, agarwal, and da xu 2015; trappey et al. 2017). not defined: (gubbi et al. 2013; borgia 2014). despite being a current trend, our initial research did not return secondary studies conducted systematically; the retrieved studies did not present the methodology followed, nor the research questions they intended to answer. the papers, except (whitmore, agarwal and da xu, 2015; carcary et al., 2018), do not present the research protocol or make explicit the study properties (research questions, search strings, databases, selection criteria, selected articles, among others). for this reason, we have not performed a quality assessment, since there is no methodology-related information to be evaluated. therefore, not performing the quality assessment represents a threat to this study's validity. from this result, it is possible to observe the need to provide research data based on a sound scientific methodology. despite the evolution and enthusiasm that new technology such as iot can provide with recent developments, the lack of scientific rigor is still one of the significant challenges to strengthening the basis of software engineering knowledge (de almeida biolchini et al., 2007). this work was conducted by following established guidelines and in a protocolled way, accounting for the strength of the evidence found and its replicability. the questions that this review seeks to answer are aligned with the objective of characterizing iot, and with this result we aim to contribute to strengthening the discussions and the evolution of the area.
from the selected papers, seven essential topics were addressed (figure 1): concepts, presenting discussions regarding the fundamentals, definitions, and visions behind the iot paradigm; articles: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; miorandi et al., 2012; gubbi et al., 2013; borgia, 2014; singh, tripathi and jara, 2014; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015; gil et al., 2016; trappey et al., 2017; carcary et al., 2018). technology, introducing enabling technologies and solutions to develop and deploy iot applications; articles: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; gubbi et al., 2013; borgia, 2014; whitmore, agarwal and da xu, 2015; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015; burhanuddin et al., 2017; sethi and sarangi, 2017; trappey et al., 2017; ray, 2018). applications, describing the current state of the existing solutions and the applications of different domains, as well as future possibilities to be achieved by using iot; articles: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; gubbi et al. 2013; borgia 2014; singh, tripathi, and jara 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; whitmore, agarwal, and da xu 2015; sethi and sarangi 2017; trappey et al. 2017). open issues and challenges, presenting opportunities for research and development aiming to evolve iot; articles: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; miorandi et al., 2012; gubbi et al., 2013; borgia, 2014; singh, tripathi and jara, 2014; li, xu and zhao, 2015; whitmore, agarwal and da xu, 2015; burhanuddin et al., 2017; carcary et al., 2018). architecture, discussing possible implementations of iot based on different architecture proposals; articles: (bandyopadhyay and sen, 2011; singh, tripathi and jara, 2014; madakam, ramaswamy and tripathi, 2015; whitmore, agarwal and da xu, 2015; gil et al., 2016; sethi and sarangi, 2017; trappey et al., 2017; ray, 2018). characteristics, making explicit general features and requirements of iot; articles: (borgia, 2014; gil et al., 2016). initiatives, covering research organizations, industries, standardization bodies, and governments that have an interest or put some effort into iot; articles: (miorandi et al. 2012; gubbi et al. 2013; borgia 2014; madakam, ramaswamy, and tripathi 2015). figure 1 most common topics in the articles. 3.1 studies overview gil et al. (2016) reviewed surveys regarding iot, focusing mostly on the context-aware feature and on how both topics are related. the main difference from our work is that they lack a research methodology, and their discussion revolves around the general purpose of the selected articles and context-aware iot. another work that contains an analysis of the trends and coverage of the iot literature is from whitmore et al. (whitmore, agarwal and da xu, 2015). it presents an overview of the area; however, it is not concerned with answering research questions, but with describing open questions and future directions to assist researchers. it differs from our work, which concerns the characterization of iot regarding its definition and characteristics. numerous iot definitions exist in the technical literature due to different visions from the research community.
some authors (miorandi et al., 2012; gubbi et al., 2013) discuss iot as an overall vision, while (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; borgia, 2014; singh, tripathi and jara, 2014) describe iot as realizable through particular visions or pillars. the conceptualization of iot is the focus of (miorandi et al. 2012; gubbi et al. 2013; madakam, ramaswamy, and tripathi 2015). other topics are also presented, such as a taxonomy for iot (gubbi et al., 2013; sethi and sarangi, 2017) and iot patents (trappey et al., 2017). the works of (burhanuddin et al., 2017; ray, 2018) focus on the critical discussion of architectural issues and options to deal with the immense number of interconnected devices proposed in iot. besides, they also describe fundamental requirements along with implementation challenges and future directions. the work of (carcary et al., 2018) argues that the adoption of iot is not yet widespread and examines the existing literature on the key determinants (drivers, benefits, barriers, and challenges) that influence the adoption of iot by organizations. it is important to highlight that, of the 15 selected secondary studies, none covers all the topics, showing that the researchers have distinct perspectives and concerns. however, together these studies provide a wealth of information for our research topic. the application of a sound research protocol in this work provides an improvement over the previous ones, since some do not make clear the procedures performed. besides, we offer a research protocol that can be replicated. in this work, we further improve the current state because we not only quantitatively point out the results but also provide discussions and answers to research questions grounded in data. we would also like to highlight that one can value the findings and discussions in this article since we are relying on secondary studies; in this case, the several primary studies reported in these 15 secondary studies support our findings with evidence. 3.2 answering the research questions we based our analysis procedure on textual analysis, using codes to assign concepts to portions of data and identifying patterns from the similarities and differences emerging from the extracted data. two researchers conducted it, with cross-checking to reach a consensus on the analysis and to decrease potential misinterpretation and bias. a third researcher reviewed the extractions and findings. this process was performed over all the data extracted and led to the discussions of the research questions, presented in the following subsections. 3.3 rq1: what is the "internet of things"? the 15 selected papers supported the extraction of 34 different iot definitions. from the analysis of these 34 definitions, we noticed that they follow a specific structural pattern, concerned with explaining the involved actors, the requirements, and the consequences of the relations among actors as parts of a system, although not all elements are present in every definition. we considered this structure not to limit our interpretation, but to support a more thorough conceptual understanding of iot and thus find an appropriate and updated definition for this work. we organized some of the definitions found in chronological order to observe how the concept has evolved. ''an intelligent infrastructure linking objects, information, and people through the computer networks, and where the rfid technology found the basis for realization.'' defined in 2001 by (brock, 2001), cited by (borgia, 2014).
in this 2001 definition, we can observe that the idea is to connect objects, information, and people, where both objects and people can be actors in the system. it makes clear the necessity of a network as a way to connect the actors, and the realization was limited to the rfid identification technology (finkenzeller, 2010), which represents the starting point of the iot vision. "internet of things as a paradigm in which computing and networking capabilities are embedded in any conceivable object. we use these capabilities to query the state of the object and to change its state if possible." defined in 2005 by (itu, 2005), cited by (sethi and sarangi, 2017). this definition from 2005 does not propose the use of any particular technology, like rfid, but includes the idea of expanding the original capabilities of an object through technology to perceive changes in the object's state; this is only possible by first addressing objects, making them identifiable. once that is achieved, things can communicate automatically (dunkels and vasseur, 2008). it can be considered an evolution, since this kind of requirement was not previously discussed. the next definition addresses this idea: "a world where things can automatically communicate to computers and each other providing services to the benefit of the humankind." defined in 2008 by (dunkels and vasseur, 2008), cited by (atzori, iera, and morabito, 2010; gil et al., 2016). another definition is: ''a dynamic global network infrastructure with self-capabilities based on standard and interoperable communication protocols where physical and virtual ''things'' have identities, physical attributes, virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network'' defined in 2009 by (gusmeroli, sundmaeker and bassi, 2015), cited by (borgia, 2014; whitmore, agarwal and da xu, 2015). in this 2009 definition, we can see that the central concept of communication and integration remains. it leads to an effort to make things identifiable (in the network sense, not physically) and introduces requirements such as interoperability and seamless integration. this definition also details what the things in iot are: things can be virtual or physical, can have different personalities, and may use different communication protocols. "the basic idea of this concept is the pervasive presence around us of a variety of things or objects such as radiofrequency identification (rfid) tags, sensors, actuators, mobile phones, etc. which, through unique addressing schemes, are able to interact with each other and cooperate with their neighbors to reach common goals." defined in 2010 by (atzori, iera, and morabito, 2010), cited by (miorandi et al., 2012; gubbi et al., 2013; singh, tripathi and jara, 2014). this iot definition from 2010 is one of the most used. it can be considered broader regarding the "actors, relations among actors, requirements, and enablers" structure. it presents the vast number and heterogeneity of actors that can engage in an interaction, and a requirement to achieve that through unique addressing schemes. in this case, new actors are included, and we can observe that sensing and actuation are other possible behaviors that a system can possess, differing from the initial definitions. therefore, these actors can cooperate to reach some goals.
“interconnection of sensing and actuating devices providing the ability to share information across platforms through a unified framework, developing a common operating picture for enabling innovative applications. this is achieved by seamless large-scale sensing, data analytics and information representation using cuttingedge ubiquitous sensing and cloud computing.” defined in 2012 by (gubbi et al., 2013). once more, sensing and actuation have essential roles in iot, as presented in this definition from 2012. the vast amount of data collection and sharing among actors can be a source to compose diversified, innovative applications. it makes clear the multidisciplinary nature of iot, as the integration of different disciplines for the accomplishment of successful iot systems, as there are areas that support or leverages it, such as data analytics, ubiquitous and cloud computing. “everyday objects can be equipped with identifying, sensing, networking and processing capabilities that will allow them to communicate with one another and with other devices and services over the internet to achieve some useful objective (…). every day “things” will be equipped with tracking and sensing capabilities. when this vision is fully actualized, “things” will also contain more sophisticated processing and networking capabilities that will enable these smart objects to understand their environments and interact with people.” defined in 2015 by (whitmore, agarwal and da xu, 2015). once the everyday things can sense the environment, they become more aware of what is around them, which characterizes context-awareness. in this 2015 definition, we see again that the primary concern in iot is to leverage the connection among different things to achieve a system objective. also, the authors explain that things in the iot context are those objects equipped with identifying, sensing, networking, and processing capabilities, whereas other definitions exemplify things as being the providers of such capabilities, that is, tags, sensors, and actuators. in our understanding, things exist in the physical realm, such as sensors, actuators and anything that is equipped with identification (tag reading), sensing or actuation capabilities, which excludes entities in the internet domain (hosts, terminals, routers, among others). the things should also have communication, networking and processing functionalities varying according to the systems requirements. as one can notice, the capabilities of the things evolved over time as observed from the definitions presented and the examples in figure 2. figure 2 iot evolution. as things evolved, the understanding and discussions should also follow the changes. in the beginning, the things in iot based systems were objects attached to electronic tags, so these systems present the behavior of identification. subsequently, sensors and actuators composing the systems enabled the sensing and actuation behaviors respectively. it means that an iot system may have identification, sensing or actuation behaviors, or a combination of them. the explaining of each behavior and examples of applications can be seen in figure 3 and table 4 towards a more in-depth understanding of the iot paradigm and its challenges motta et al. 2019 figure 3 iot behaviors. when discussing the previous definitions, it was necessary to distinguish the meaning of “identification” referred to objects. 
when discussing the previous definitions, it was necessary to distinguish the meaning of "identification" as referred to objects. the reason is that an object can be identifiable in the sense of connectivity (e.g., through ip addresses) or in the sense of physical identification, when objects are tagged with electronic tags containing specific information, making it possible to identify objects through tag readers. further, it is also relevant to elucidate the meaning of "actuation", as it may bring diverse interpretations. when focusing on the iot context, the adequate meaning for "actuation" is precisely the one presented in table 4. it diverges from actions represented by methods in the object-oriented paradigm, and it is not related to the objects' processing capabilities mentioned in the iot definitions discussed previously. actuation is exclusively related to the possibility of virtually intervening in the real world by mechanical means. it is important to note this distinction in iot systems due to their capabilities, since it is possible to have different compositions of systems. in an industrial plant, for example, identification tags are attached to products and provide real-time location and status. dashboards with the data recovered from products and machines (from a sensing activity) keep managers updated along the production line, and the company is now able to monitor and control production almost automatically (actuation), including processing capabilities. it is a real-case scenario, already deployed, where the three behaviors and benefits of iot can be seen, such as providing more process visibility, more accurate work, and improved production effectiveness (cisco, 2014). it is interesting to structure the characteristics and applications retrieved in this review within these three behaviors because iot does not necessarily have to present all of them, but only one or a combination of them. it can clarify and delimit iot solutions, contributing as a guide for their application engineering. to answer rq1 from the review results, iot can be defined as a paradigm that allows composing systems from uniquely addressable objects (things) equipped with identifying, sensing, or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal. 3.4 rq2: which characteristics can define an iot domain? the 15 papers provided 263 excerpts, which were coded following the principles of open coding, as described in grounded theory (strauss and corbin, 1990), from which we identified 29 characteristics (table 5). one point of discussion is that the authors do not define all the characteristics presented in the articles or refer to the original work defining them (table 6). table 4 iot behaviors (behavior, description, example). identification: the primary function is to identify things, by labeling them and enabling them to have an identity, then recovering (through reading) and broadcasting information related to the thing and its state. examples: identifying patients with electronic tags (rfid) to be detected throughout hospitals using receivers (readers) placed in departments to accelerate the identification of empty beds (kannry et al., 2007); and the application of short-range identification technology for drug interaction and drug-allergy detection (alabdulhafith, sampangi, and sampalli, 2013), which operates by identifying patients (nfc tags integrated into their wristbands) and drugs (nfc tags integrated), each tag holding a unique id; nurses read the patient's and the drug's nfc tags using the smartphone's nfc reader, and the server verifies whether the patient is allergic to the drug or whether there might be a potential interaction.
sensing. description: the primary function is to sense environment information, requiring information aggregation, data processing, and transmission. it enables awareness, thus acting as a bridge between the physical and digital worlds. example: to illustrate the capability of sensors in the real world, one interesting application comes from the geophysics area. sensors such as microphones and seismometers have been deployed for long-distance volcanic monitoring, collecting seismic and acoustic data on volcanic activity (werner-allen et al., 2006).

actuation. description: mechanical interventions in the real world according to decisions based on aggregated data or even upon an actor’s direct trigger; relies on responses to the collected information to perform actions in the physical world and change the object’s state. example: the control of things, robots or even animals in the real world, as in (wark et al., 2007), where actuators are used in an attempt to prevent fighting between bulls in on-farm breeding paddocks by autonomously triggering stimuli, such as audio warning signals or mild electric shocks, when one bull approaches another.

table 5. iot characteristics.

all characteristics identified: 29
characteristics not defined: 20
characteristics defined: 9

the lack of definitions hinders the research and understanding of the area, since we cannot know the characteristic’s meaning or what the authors meant by it. although some characteristics, such as interoperability and scalability, are well defined, it is essential to establish a common understanding of the characteristics, since they inspire different concepts when contextualized to distinct domains.

table 6. characteristics not defined.

accuracy. cited by: (borgia, 2014; burhanuddin et al., 2017). reference: -.
adaptability. cited by: (atzori, iera, and morabito 2010; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; ray, 2018). reference: (nami and sharifi 2007; hackmann et al. 2008; sampigethaya, poovendran, and bushnell 2008; lee and sokolsky 2010; azimi et al. 2011; barro-torres et al. 2012; hur and kang 2012).
availability. cited by: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; gubbi et al., 2013; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015). reference: (gluhak et al., 2011).
connectivity. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; gubbi et al. 2013; whitmore, agarwal, and da xu 2015; gil et al. 2016; burhanuddin et al., 2017; ray, 2018; carcary et al., 2018). reference: (weiser et al. 1999; infso d.4 2008; conti 2006; dunkels and vasseur 2008; vermesan et al. 2009).
efficiency. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; sethi and sarangi 2017; trappey et al. 2017; burhanuddin et al., 2017). reference: (hackmann et al. 2008; sampigethaya, poovendran, and bushnell 2008; lee and sokolsky 2010; azimi et al. 2011; hur and kang 2012; barro-torres et al. 2012).
extensibility. cited by: (bandyopadhyay and sen, 2011; li, xu and zhao, 2015). reference: -.
flexibility. cited by: (li, xu, and zhao 2015; sethi and sarangi 2017). reference: -.
manageability. cited by: (bandyopadhyay and sen, 2011; borgia, 2014). reference: -.
modularity. cited by: (bandyopadhyay and sen, 2011). reference: -.
performance. cited by: (gubbi et al., 2013; li, xu and zhao, 2015). reference: -.
privacy. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; sethi and sarangi 2017; whitmore, agarwal, and da xu 2015). reference: (xianrong zheng et al., 2014a).
reliability. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; sethi and sarangi 2017). reference: (koren and krishna 2007; hackmann et al. 2008; lee and sokolsky 2010; azimi et al. 2011; hur and kang 2012; barro-torres et al. 2012).
robustness. cited by: (atzori, iera and morabito, 2010; miorandi et al., 2012). reference: (koren and krishna, 2007).
scalability. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; sethi and sarangi 2017; burhanuddin et al., 2017). reference: (gluhak et al., 2011).
smartness. cited by: (li, xu and zhao, 2015; ray, 2018). reference: -.
sustainability. cited by: (borgia, 2014). reference: -.
traceability. cited by: (atzori, iera and morabito, 2010). reference: -.
trust. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; borgia 2014; li, xu, and zhao 2015; sethi and sarangi 2017). reference: -.
ubiquity. cited by: (carcary et al., 2018). reference: -.
visibility. cited by: (atzori, iera and morabito, 2010). reference: -.

for instance, “efficiency” is open to many interpretations even when the iot domain is in focus: it can be related to an object’s data-collection efficiency, energy efficiency, security efficiency, information-processing efficiency, as well as service-adaptability efficiency. this makes it challenging to characterize iot and to develop more suitable solutions that meet all the desired characteristics, since they were only listed, not defined. for the same reason, it is not possible to infer that the authors are discussing the same issues: efficiency, for instance, may from the sources refer to cost, size, resources or energy.

table 7. defined characteristics.

addressability: the ability to distinguish objects using unique ids. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; borgia 2014). reference: -.
unique id: unique identification is necessary for every physical object. once the object is identified, it is possible to enhance it with personality and other information and to enable control over it. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; burhanuddin et al., 2017; ray, 2018). reference: (atzori, iera, and morabito 2010; finkenzeller 2010; gubbi et al. 2013).
object autonomy: smart objects can have individual autonomy, not needing direct human interaction to perform established actions, while reacting to or being influenced by real/physical world events. cited by: (atzori, iera and morabito, 2010; gubbi et al., 2013; madakam, ramaswamy and tripathi, 2015). reference: -.
mobility: object availability across different locations. cited by: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; borgia, 2014; sethi and sarangi, 2017). reference: (akyildiz, jiang xie and mohanty, 2004; sharma, gusain and kumar, 2013).
autonomy: refers to systems not needing direct human intervention to perform established actions such as data capture, autonomous behavior, and reaction. cited by: (atzori, iera, and morabito 2010; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; burhanuddin et al., 2017; ray, 2018; carcary et al., 2018). reference: (chlamtac, conti, and liu 2003; nami and sharifi 2007; gusmeroli, sundmaeker, and bassi 2015).
context-awareness: the use of context to provide task-relevant information and/or services to a user. cited by: (atzori, iera, and morabito 2010; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; whitmore, agarwal, and da xu 2015; sethi and sarangi 2017; ray, 2018). reference: (abowd et al. 1999; schmidt and van laerhoven 2001; nami and sharifi 2007; o’reilly and pahlka 2009; perera et al. 2014).
heterogeneity: several services taking part in the system, presenting very different capabilities from the computational and communication standpoints. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; burhanuddin et al., 2017; carcary et al., 2018). reference: (infso d.4 2008; gluhak et al. 2011; nuzzo and sangiovanni-vincentelli 2014).
interoperability: interoperability is of three types: network interoperability, which deals with communication protocols; syntactic interoperability, which ensures the conversion of different formats and structures; and semantic interoperability, which deals with abstracting the meaning of data within a domain. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; sethi and sarangi 2017; burhanuddin et al., 2017; ray, 2018). reference: (panetto and cecil 2013; jardim-goncalves et al. 2013; chengen wang, zhuming bi, and li da xu 2014; borgia 2014).
security: to ensure the security of data, services and the entire iot system, a series of properties, such as confidentiality, integrity, authentication, authorization, non-repudiation, availability, and privacy, must be guaranteed. cited by: (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; whitmore, agarwal, and da xu 2015; sethi and sarangi 2017; burhanuddin et al., 2017). reference: (sampigethaya, poovendran, and bushnell 2008; lee and sokolsky 2010; andreini et al. 2010; andreini et al. 2011; azimi et al. 2011; barro-torres et al. 2012; hur and kang 2012; cirani, ferrari, and veltri 2013; xianrong zheng et al. 2014b; chasaki and mansour 2015).

even with this lack of definition, the characteristics pointed out in table 5 are relevant for the characterization scenario of iot systems. in table 6, we list the characteristics pointed out by the authors (cited by) and the original references used by them (reference); some references may have been used by more than one author, and a null value (-) indicates that no reference was given. this distinction matters because we can give more weight to characteristics referenced by others, since more sources strengthen the results. to continue with our research, we consider only the characteristics whose definitions were made explicit (table 7); these definitions came from our interpretation of the original material and the compilation of the references cited. from the characteristics presented in table 7, we can observe that some of them are fundamental for an application to fulfill our iot definition: “a paradigm that allows composing systems from uniquely addressable objects equipped with identifying, sensing or actuation behaviors and processing capabilities that are able to communicate and cooperate to reach a goal”.
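to make this definition concrete, the sketch below models it in python. it is a minimal illustration under our own assumptions: the class names, the greenhouse things and the temperature threshold are hypothetical and do not come from the reviewed sources. the point is only that a thing carries a unique id plus any subset of the three behaviors, and that a system is a set of things cooperating towards a goal.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Thing:
    """a uniquely addressable object carrying any subset of the three behaviors."""
    uid: str                                           # addressability / unique id
    identify: Optional[Callable[[], str]] = None       # identification (tag reading)
    sense: Optional[Callable[[], float]] = None        # sensing (environment data)
    actuate: Optional[Callable[[float], None]] = None  # actuation (real-world intervention)

@dataclass
class IoTSystem:
    """things that communicate and cooperate to reach a goal."""
    things: List[Thing] = field(default_factory=list)

    def step(self) -> Dict[str, float]:
        # cooperation: collect readings from sensing things, then let
        # actuating things react to the aggregated data.
        readings = {t.uid: t.sense() for t in self.things if t.sense}
        if readings:
            mean = sum(readings.values()) / len(readings)
            for t in self.things:
                if t.actuate:
                    t.actuate(mean)
        return readings

# hypothetical greenhouse: a temperature sensor cooperating with a vent.
vent_state = {"open": False}
system = IoTSystem([
    Thing(uid="sensor-1", sense=lambda: 31.5),
    Thing(uid="vent-1", actuate=lambda avg: vent_state.update(open=avg > 30.0)),
])
system.step()
print(vent_state)  # {'open': True}
```

in the sketch, the goal (keeping the hypothetical greenhouse below 30 °c) is reached only through the cooperation of the two things, which is the defining trait of the paradigm as stated above.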
addressability, unique id, heterogeneity, interoperability, mobility, and security are the essential characteristics necessary for an application to follow the iot paradigm. from this primary setting, an iot-based software system can be engineered with identification, sensing and/or actuation capabilities. each one of them requires additional characteristics: for instance, context-awareness is required to enable the sensing behavior, and autonomy is needed for the actuation behavior. table 7 represents an initial set of iot characteristics as defined in the technical literature. we wish to perform more extensive research on the characterization of the three behaviors, since new characteristics specific to each kind of iot application may also be required. having a clearer and well-defined set of characteristics can aid the development of applications with higher quality and support quality assurance and assessment.

3.5 rq3: which are the areas of iot application?

several application domains will leverage the advantages of the internet of things paradigm. all the application domains are only examples of areas that benefit from iot or are expected to do so in the future. as declared by whitmore et al., “the domain of the application areas for the iot is limited only by imagination at this point” (whitmore, agarwal and da xu, 2015). although the application scenarios were described at different levels of detail, we tried to categorize some of them into the three behaviors (table 4), as presented in table 8. atzori et al. (atzori, iera, and morabito, 2010) describe five domains: (a) transportation and logistics, (b) healthcare, (c) smart environment (home, office, plant), (d) personal/social and (e) futuristic domain (whose applications are still too complicated to implement). gubbi et al. (gubbi et al., 2013) describe the (a) personal and home, (b) enterprise, (c) utilities, and (d) mobile domains. also, there is a classification of applications into consumer (home, lifestyle, healthcare, transport) and business (manufacturing, retail, energy, transportation, agriculture, and others) (trappey et al., 2017). those domain categorizations can be seen as subparts of a categorization which grouped the applications into three major domains (borgia, 2014): (a) industrial domain, (b) smart city domain, and (c) health well-being domain. they are not isolated from each other; there is a partial overlap, since some applications are shared across contexts. for example, the tracking of products can be a demand of both the industrial and the health well-being domains.

table 8. application type.

identification. touristic maps equipped with tags that allow nfc-equipped phones to browse them and automatically call web services, materials tracking to prevent left-ins during surgery (atzori, iera, and morabito, 2010); patient triage, resource management and distribution (gubbi et al., 2013); medical equipment tracking, secure access, indoor environment management, personnel tracking, bike/car/van sharing, mobile tickets, luggage management, animal tracking, fast payment, warehouse management and inventory, identification of materials and goods (borgia, 2014); verifying the authenticity of aircraft, storing health records (bandyopadhyay and sen, 2011).
sensing. patient monitoring, remote personnel monitoring (health, location), sensors built into building infrastructure to guide first responders in emergencies or disaster scenarios or to monitor structural fatigue and other maintenance needs, sensing of water quality, leakage, usage and distribution, air pollution and noise monitoring, support to diagnoses, video/radar/satellite surveillance, road condition monitoring, product deterioration (borgia, 2014); monitoring chronic disease using wearable vital-signs sensors in body sensor networks (bandyopadhyay and sen, 2011).

actuation. room lighting changing, alarm systems, remote switching off of electrical equipment (atzori, iera, and morabito, 2010), temperature and humidity control (gubbi et al., 2013), irrigation control (borgia, 2014), muscle stimuli for paraplegic individuals (bandyopadhyay and sen, 2011).

hybrid. buildings adjusting locally to conditions while also taking into account outdoor conditions, robot taxis that respond to real-time traffic movements of the city and are calibrated to reduce congestion at bottlenecks and to service the most frequently used pick-up areas (atzori, iera and morabito, 2010), water waste management (gubbi et al., 2013), parking systems, traffic management (borgia, 2014).

4 discussion

4.1 the things in iot

alongside the application areas, we also extracted the things, as we are interested in recovering which natural objects are currently in use under the iot paradigm. in many cases, the authors listed usage possibilities and existing solutions based on iot. forty-one different things were extracted, and figure 4 shows the ten most cited ones.

figure 4. most common things in iot.

these are everyday objects enhanced with identification, sensing and actuation capabilities. for example, sensors attached to vehicles can collect information about the roads (e.g., about traffic density or surface conditions), reporting back to the city center; and, from thing-thing interaction, a vehicle can communicate with another, enabling smart parking and faster communication of problems in traffic. extracting information on things from already deployed iot applications has helped our research group to better realize the innovative potential of this paradigm. also, the results on the real use of things and the examples of applications (such as those described in table 8) might be a contribution for practitioners working on innovative problem-solving projects, as a source of possibilities for stimulating thinking and creativity and for expanding initial ideas. the three well-established behaviors (identification, sensing, and actuation) can support different usage scenarios, varying according to the kind of objects used, the data to be collected, business requirements and users’ needs. for instance, a door lock with the “acting” behavior can open/close different sorts of doors in different scenarios according to rules, e.g., from authentication by electronic tag reading, eye or finger scanning, human/animal/robot proximity sensing and many other possibilities (a minimal sketch of such a rule is given below). even though an iot solution is taken as a massive amount of various connected objects of our everyday life, the three behaviors highlighted in this work are expressly the basis among iot objects.
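the door-lock composition just described can be pictured as a small rule combining the behaviors. the sketch below is our own simplification; the tag ids, the stubbed readings and the proximity threshold are hypothetical and only illustrate how identification and sensing jointly drive an actuation decision.

```python
AUTHORIZED_TAGS = {"tag-42", "tag-77"}  # hypothetical electronic tags
PROXIMITY_THRESHOLD_M = 1.5             # hypothetical actuation rule

def read_tag() -> str:
    """identification behavior: electronic tag reading (stubbed)."""
    return "tag-42"

def sense_proximity_m() -> float:
    """sensing behavior: distance of the approaching person (stubbed)."""
    return 0.8

def actuate_lock(open_door: bool) -> None:
    """actuation behavior: mechanical intervention in the real world."""
    print("door opened" if open_door else "door stays locked")

def door_controller() -> None:
    # the rule combines identification and sensing to decide on actuation
    authorized = read_tag() in AUTHORIZED_TAGS
    near = sense_proximity_m() <= PROXIMITY_THRESHOLD_M
    actuate_lock(authorized and near)

door_controller()  # prints "door opened"
```

swapping the stubs for real drivers (an nfc reader, a proximity sensor, a lock relay) would not change the rule at all, which suggests why the behaviors, rather than the concrete devices, are the useful unit of analysis.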
identifying and elucidating this common property is another contribution for practitioners, who can consider these three behaviors and the issues concerning them when idealizing, engineering and developing iot-based systems.

4.2 iot related terms

internet of things sometimes sounds like a buzzword, so some terms seem to be synonyms or even “aliases” (madakam, ramaswamy and tripathi, 2015). however, not every term can be used interchangeably with it. from the analysis and interpretation, we categorized the related terms as presented in table 9. all the extracted data and other details can be found in the research protocol (https://goo.gl/ctyzut).

related technology: technologies related to iot that support its development.
related areas: other research areas frequently associated with iot because they share some similarities or are considered iot drivers.

by looking at the related terms, we argue that the iot paradigm proposal is to enable a connected world, believing that different research areas can also be enablers in a joint effort for research, development, and evolution. also, there are areas which need further research to deal with the challenges of this novel paradigm. from our understanding, iot is an umbrella combining the advances of many areas, and we discuss the points that make those areas connected to iot or the convergence points that make some topics sound like iot synonyms. the definition of the terms is out of the scope of this discussion. from the table below, we discuss only the related areas.

table 9. iot related terms.

related technology: cloud computing; internet protocol; communication middleware; rfid; universal identifier architecture; wireless sensor networks.
related areas: ambient intelligence; context-aware systems; cyber-physical systems; human-computer interaction; industry 4.0; internet of computers; internet of objects; internet of people; intranet/extranet of things; machine-to-machine interaction; micro-electro-mechanical systems; network of things; pervasive computing; social iot; ubiquitous computing; web of things.

ambient intelligence. ambient intelligence is a developing technology that will increasingly make our everyday environment more sensitive and responsive (madakam, ramaswamy and tripathi, 2015). according to (miorandi et al., 2012), iot may well inherit concepts and lessons learned in ambient intelligence, taking ambient intelligence to a larger scale.

context-aware systems. in our understanding, things are those objects equipped with identification and sensing capabilities, and they are the bridge from the physical to the virtual realm. from identification technologies such as rfid, it is possible to get the identity and location of entities. sensors enable sensing environment information such as sound, temperature, humidity, among others (atzori, iera, and morabito, 2010). in our interpretation, these capabilities of things in iot make the field related to context-awareness, because from sensors and tag reading, the environment and entities’ context information can be perceived (not explicitly input to the system). such context information can then be used to provide task-relevant information and/or services to a user (abowd et al., 1999).
even though context-awareness is considered an essential aspect of iot (sethi and sarangi, 2017), it does not mean that every iot system is context-aware, unless the gathered information is used as a relevant resource for decision-making and for dynamically taking actions, such as system customization.

cyber-physical systems (cps). cloud computing, wireless sensor networks (wsn), m2m, iot, and others are all fields that collaborate somehow to reach the broad goal of cps, that is, “to bring the cyber-world of computing and communications together with the physical world” (rajkumar et al. 2010; madakam, ramaswamy, and tripathi 2015). according to (miorandi et al., 2012), “a cyber-physical infrastructure is the result of the embedding of electronics into everyday physical objects, making them ‘smart’ and letting them integrate seamlessly within the global.” as discussed previously, we understand that wsns are enablers for m2m and consequently for iot; m2m systems are the precursors of cps, as devices allow the bridge between the physical and virtual worlds, in the same manner that m2m is the basis for the internet of things. this leads us to interpret iot as a form of realizing cps, which is consistent with (chen, 2012), who proposes that “cps is an evolution of m2m by the introduction of more intelligent and interactive operations, under the architecture of internet of things (iot)”.

human-computer interaction. hci is an area that needs further research to deal with this novel iot context, where human intervention is low or even absent. it usually involves the study, planning, and design of the interaction between people and computers (madakam, ramaswamy and tripathi, 2015).

industry 4.0. iot is described as a critical enabler for industry 4.0 (trappey et al., 2017). iot has been deployed in factories and production environments, making them more intelligent and leading toward the fourth industrial revolution.

internet of computers. mentioned not as a synonym of iot but as an orthogonal term (gil et al., 2016). in their description, the internet of computers refers to traditional internet environments, where both the leading data producers and consumers are human beings (not things).

internet of objects. considering some of the iot definitions found in the technical literature, we can interpret “objects” and things as equivalent. for instance, “iot implies that objects in an iot can be identified uniquely in the virtual representations” (li, xu, and zhao, 2015). in addition, “[iot is] the pervasive presence around us of a variety of things or objects – such as radio-frequency identification (rfid) tags, sensors, actuators, mobile phones, etc.” (wan et al., 2013) and “a worldwide network of interconnected objects uniquely addressable, based on standard communication protocols” (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; gil et al. 2016).

internet of people. the internet of things is not synonymous with the internet of people, as mentioned by borgia (borgia, 2014), but the author does not elaborate on that. for this reason, we searched for works addressing this subject and could not find any consensus. nevertheless, (miranda et al., 2015) explain that iot technology needs people-centric enhancements to achieve the more desirable iot scenarios, that is, scenarios which consider people’s context, learning from it, reasoning and taking actions proactively. therefore, achieving those desired scenarios requires moving from the internet of things to the internet of people (iop).
some essential features of iop systems are: being social, personalized, proactive and predictable.

intranet/extranet of things. intranet/extranet of things and iot are not synonymous (borgia, 2014). however, as far as we know, they share a broad concept; the difference is that in an intranet/extranet the connections are restricted to limited areas, while on the internet the connections are publicly accessible.

machine-to-machine interaction. m2m means no human intervention while devices are communicating end-to-end (madakam, ramaswamy and tripathi, 2015). this leads us to think that m2m and iot are similar, but m2m is more of a paradigm leading towards iot (atzori, iera, and morabito, 2010). m2m refers to technologies that allow both wireless and wired systems to communicate with other devices of the same ability (wan et al., 2013). unlike devices in iot, devices in m2m are meant to operate in a specific application, which means that m2m solutions do not allow the broad sharing of data or the open connection of devices to the internet (holler et al., 2014).

micro-electro-mechanical systems. mems technology is one of the enablers for developing miniature devices capable of sensing, computing and communicating (gubbi et al., 2013). when connected, these miniature devices form a wireless sensor network and, consequently, are crucial building blocks for developing machine-to-machine systems, iot, among others.

network of things. network of things is similar to intranet/extranet of things regarding connection restrictions. it refers to operation in a restricted locale, within a work environment, like an enterprise-based application. only the owners use the information collected from such networks, and the data may be released selectively (gubbi et al., 2013).

pervasive or ubiquitous computing. these two terms are intimately connected, and some authors have used them interchangeably (satyanarayanan 2001; baldauf, dustdar, and rosenberg 2007; spínola, pinto, and travassos 2008). our interpretation of the relation between iot and ubicomp is that iot projects can be considered ubiquitous according to their adherence to ubiquity characteristics (spínola and travassos 2012). such characteristics are context-sensitivity, adaptable behavior, service omnipresence, heterogeneity of devices, experience capture, spontaneous interoperability, scalability, privacy and trust, fault tolerance, quality of service, and universal usability (spínola and travassos 2012). that is, ubiquity becomes a transversal property of iot systems as they fulfill ubiquity characteristics.

social iot. the term social iot (siot) is mentioned as a newly proposed paradigm (atzori, iera, and morabito 2010; li, xu, and zhao 2015; gil et al. 2016; sethi and sarangi 2017). siot means that things are now seen as “beings,” and the interconnections among them are compared to human social relations. the authors describe three main facets of a siot system: (i) the siot is navigable; (ii) a need for trustworthiness (relationship strength) is present between devices; and (iii) models used to study human social networks are similar to the social networks of iot devices.

web of things. it refers to the re-use of web standards to connect and integrate iot objects into the web (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; borgia, 2014; madakam, ramaswamy and tripathi, 2015).
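for the web of things in particular, re-using web standards usually amounts to exposing a thing’s state through plain http. the sketch below is an illustration only, using python’s standard library; the endpoint path, the payload fields and the stubbed reading are our own assumptions, not an api prescribed by the cited surveys.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_temperature_c() -> float:
    """sensing behavior, stubbed for illustration."""
    return 22.4

class ThingHandler(BaseHTTPRequestHandler):
    """exposes a thing's state as an ordinary web resource."""

    def do_GET(self):
        if self.path == "/things/sensor-1/temperature":
            body = json.dumps({"uid": "sensor-1",
                               "temperature_c": read_temperature_c()}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # any ordinary web client (browser, curl) can now read the thing's state
    HTTPServer(("localhost", 8080), ThingHandler).serve_forever()
```

the appeal of this style is exactly the re-use: no new protocol is needed for a client to integrate the object into a larger application.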
it is possible to observe that the evolution of some areas and the collaboration among them enable the realization of the iot paradigm. once it is possible to develop small devices, embed intelligence, and provide seamless communication, thing-thing interaction, wireless connections, and so on, all of these become iot-enabling technologies. this discussion of terms related to the iot paradigm might be a contribution for further investigations, which may depend on grounded concepts and clarity about the convergence points that make other topics seem like iot synonyms. in addition, practitioners and researchers can benefit from this discussion in circumstances where there are doubts on whether iot is indeed the right term to consider for their software projects and/or future investigations.

4.3 iot challenges

to foster our discussions and research directions, one type of information extracted from the selected articles was challenges, which we understand as open opportunities in industry or academia. the extracted data were analyzed based on grounded theory procedures (strauss and corbin, 1990). the process started by retrieving the excerpts related to iot challenges (an excerpt could be a word, a phrase or a full paragraph). the 15 papers provided 38 excerpts regarding iot challenges, which were organized into seven categories (table 10). we used codes to assign concepts to portions of data, with a constant comparative analysis to identify patterns from similarities and differences emerging from the data (a toy illustration of this coding tally is given at the end of this subsection). this textual analysis was conducted by two researchers, with cross-checking to achieve consensus. the excerpts were organized into the categories, and we present each category with a definition and an example excerpt to support its comprehension. it is interesting to notice that the concerns are usually interrelated, confirming the multidisciplinary nature of iot. for example: “for technology to disappear from the consciousness of the user, the internet of things demands software architectures and pervasive communication networks to process and convey the contextual information to where it is relevant” (gubbi et al., 2013); this excerpt is coded both for an architectural issue and for a network issue. another example is “central issues are making full interoperability of interconnected devices possible, providing them with an always higher degree of smartness by enabling their adaptation and autonomous behavior, while guaranteeing trust, privacy, and security” (ieee, 2004), which was coded both for interoperability and security issues. providing solutions to the issues presented in the technical literature can be tricky due to the diversity of concerns, the variety of devices and the uncertainties in the area. from the findings recovered in this review, our research perspective will be directed to support the proposed definition: iot is a paradigm that allows composing systems from uniquely addressable objects (things) equipped with identifying, sensing or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal. our focus will be on the perspective of the software orchestration necessary for the composition of the systems that will arise in this contemporary paradigm. despite our decision to direct the research, the article may contribute to other areas of research by providing the definitions, characteristics, and challenges presented in this section.
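as a side note on the procedure, the coding step can be pictured as a many-to-many mapping from excerpts to categories, since a single excerpt may receive more than one code (as in the two examples above). the toy sketch below uses invented excerpt ids and an arbitrary assignment merely to show the tallying; it is not our actual data.

```python
from collections import Counter

# hypothetical excerpt -> codes mapping; one excerpt may carry several
# codes, as with the excerpt coded for both architecture and network.
coded_excerpts = {
    "e01": ["architecture", "network"],
    "e02": ["interoperability", "security"],
    "e03": ["data"],
    "e04": ["security"],
}

category_frequency = Counter(code
                             for codes in coded_excerpts.values()
                             for code in codes)
print(category_frequency.most_common())
# [('security', 2), ('architecture', 1), ('network', 1), ...]
```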
4.4 threats to validity

since only scopus was used as a search engine, some relevant studies may be missing, but from our experience, we know that it can give reasonable coverage when performed together with snowballing procedures (backward and forward) (matalonga, rodrigues, and travassos 2015; motta, oliveira, and travassos 2016). in addition, a recurrent issue in literature reviews regards inconsistent terminology and restrictive keywords. we searched for other reviews and observed the terms used to compose our search string, to reduce the researchers’ bias. data extraction and interpretation biases were mitigated with cross-checking between two researchers and by having a third researcher revise the results. all phases of this review were peer-reviewed, and any doubt was discussed among the readers to reduce selection bias. we have not performed a quality assessment regarding the research methodology of the selected studies, due to the lack of information in the secondary reports; this is a threat to the validity of this study.

5 conclusion

this work presented our research on the iot paradigm, detailing the activities performed for the literature review and analyzing the findings and discussions to answer the following research questions: (rq1) what is “internet of things”? (rq2) which characteristics can define an iot domain? (rq3) which are the areas of iot application? as the iot concept is currently under discussion, there are still significant issues regarding its understanding that need to be clarified and established. one contribution of this work is to present an organized perspective on the current state of the art of the iot paradigm. besides, it allows observing which areas of application are making use of iot (rq3). all of these findings were related and summarized to enrich the comprehension of the iot paradigm. from the discussion of rq1, we understand that iot is a paradigm allowing the composition of software systems from uniquely addressable objects equipped with identifying, sensing or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal. the idea of composing software systems from available components is not new, but one of the issues that sets iot apart is the scale at which this can be achieved and the actors involved in these new software systems. from this, shared concerns regarding the development and evaluation of such software systems should be reframed to cover the particularities of these new types of devices. a critical step towards this is to establish what quality characteristics should be contemplated. with the second research question, we moved forward in this direction. regarding the iot characteristics (rq2), we recovered 29 different attributes from the technical literature, of which this paper discussed the nine with clear evidence from the sources of information. considering that the results were retrieved from secondary studies, the characteristics represented reflect more than just the 15 secondary studies, but rather the whole set of primary studies involved in them, which strengthens these results. the most commonly cited characteristics are efficiency, interoperability, scalability, privacy, and security, which reassure the definition reached in the paper. this work is a first step towards future investigations focusing on aspects such as software development and quality control of iot.
apart from that, the grounded concepts, properties and terms related to the iot paradigm can be a contribution to any future related research. besides, the identification of and discussions on already deployed applications and the three behaviors of things can contribute to practitioners in the processes of idealizing, engineering and developing iot software systems.

table 10. iot challenges.

architecture: issues and concerns regarding design decisions, styles and the structure of iot systems. example: “finding a scalable, flexible, secure and cost-efficient architecture, able to cope with the complex iot scenario, is one of the main goals for the iot adoption.” (borgia, 2014).

data: refers to the management of a significant amount of data, and to how to recover, represent, store, interconnect, search, and organize the data generated by iot from so many different users and devices. example: “this new field offers many research challenges, but the main goal of this line of research is to make sense of data in any iot environment. it has been pointed out that it is always much easier to create data than to analyze them. with this in mind, new conceptual modeling, as well as new paradigms of data mining techniques will be crucial to provide value and meaning to initially empty data.” (gil et al. 2016).

interoperability: related to the challenge of making different systems, software, and things interact for a purpose; standards and protocols are also included as issues. example: “the end goal is to have plug n’ play smart objects which can be deployed in any environment with an interoperable backbone allowing them to blend with other smart objects around them.” (gubbi et al., 2013).

management: the application of management activities, such as planning, monitoring and controlling, in iot systems, which will raise the interaction of different things. example: “from the viewpoint of the network, iot is a very complex heterogeneous network, which includes the connections among various types of networks through various communication technologies. the devices and methodologies for addressing things management is still a challenge.” (li, xu and zhao, 2015).

network: technical challenges related to communication technologies, routing, access and addressing schemes, considering the different characteristics of the devices. example: “designing an appropriate topology, routing, and mac layer is critical for scalability and longevity of the deployed network” (gubbi et al., 2013).

security: issues related to several aspects of ensuring data security in iot systems; for that, a series of properties, such as confidentiality, integrity, authentication, authorization, non-repudiation, availability, and privacy, should be investigated. example: “security issues are central in iot as they may occur at various levels, investing technology as well as ethical and privacy issues [...] this is extremely challenging due to the iot characteristics.” (borgia, 2014).

social: concerns related to the human end-user, to understand the situation of users and their appliances. example: “for a lay person to fully benefit from the iot revolution, attractive and easy to understand visualization have to be created.” (gubbi et al., 2013).

at last, it is expected that the knowledge organized and presented in this paper can contribute to stimulating discussions and future investigations on providing software technologies to promote the engineering of high-quality iot software systems.
6 declarations

abbreviations
iot: internet of things; cps: cyber-physical systems; rfid: radio-frequency identification; mems: micro-electro-mechanical systems; m2m: machine-to-machine; hci: human-computer interaction.

availability of data and materials
details of the protocol are available at https://goo.gl/ctyzut.

authors’ contributions
we present a review supported by established guidelines that aims to contribute to the iot field with awareness and understanding of its concepts and features, and with a characterization regarding its definition, characteristics, and applications. we answer the research questions characterizing the area, present challenges and opportunities, and offer an essential overview of the internet of things state of the art, presenting issues that should be addressed to contribute to its strengthening and establishment.

acknowledgments
the authors thank cnpq and capes for supporting this research.

funding
prof. travassos is a cnpq researcher (grant 305929/2014-3). this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001.

competing interests
the authors declare that they have no competing interests.

consent for participation and publication
not applicable.

7 references

abowd, g. d. et al. (1999) ‘towards a better understanding of context and context-awareness’, in computing systems, pp. 304–307. doi: 10.1007/3-540-48157-5_29.
akyildiz, i. f., jiang xie and mohanty, s. (2004) ‘a survey of mobility management in next-generation all-ip-based wireless systems,’ ieee wireless communications, 11(4), pp. 16–28. doi: 10.1109/mwc.2004.1325888.
alabdulhafith, m., sampangi, r. v. and sampalli, s. (2013) ‘nfc-enabled smartphone application for drug interaction and drug allergy detection,’ in 2013 5th international workshop on near field communication (nfc). ieee, pp. 1–6. doi: 10.1109/nfc.2013.6482450.
de almeida biolchini, j. c. et al. (2007) ‘scientific research ontology to support systematic review in software engineering,’ advanced engineering informatics, 21(2), pp. 133–151. doi: 10.1016/j.aei.2006.11.006.
andreini, f. et al. (2010) ‘context-aware location in the internet of things,’ in 2010 ieee globecom workshops. ieee, pp. 300–304. doi: 10.1109/glocomw.2010.5700330.
andreini, f. et al. (2011) ‘a scalable architecture for geolocalized service access in smart cities,’ in 2011 future network & mobile summit, pp. 1–8.
atzori, l., iera, a. and morabito, g. (2010) ‘the internet of things: a survey,’ computer networks. elsevier b.v., 54(15), pp. 2787–2805. doi: 10.1016/j.comnet.2010.05.010.
azimi, s. r. et al. (2011) ‘vehicular networks for collision avoidance at intersections’, sae international journal of passenger cars - mechanical systems, 4(1), pp. 2011-01-0573. doi: 10.4271/2011-01-0573.
baldauf, m., dustdar, s. and rosenberg, f. (2007) ‘a survey on context-aware systems,’ international journal of ad hoc and ubiquitous computing, 2(4), p. 263. doi: 10.1504/ijahuc.2007.014070.
bandyopadhyay, d. and sen, j. (2011) ‘internet of things: applications and challenges in technology and standardization’, wireless personal communications, 58(1), pp. 49–69. doi: 10.1007/s11277-011-0288-5.
barro-torres, s. et al. (2012) ‘real-time personal protective equipment monitoring system,’ computer communications, 36(1), pp. 42–50. doi: 10.1016/j.comcom.2012.01.005.
basili, v. r., caldeira, g. and rombach, h. d. (1994) ‘goal question metric paradigm.’
borgia, e.
(2014) ‘the internet of things vision: key features, applications, and open issues,’ computer communications. elsevier b.v., 54, pp. 1–31. doi: 10.1016/j.comcom.2014.09.008.
brock, d. l. (2001) ‘integrating the electronic product code (epc) and the global trade item number (gtin),’ mit auto-id center, (february 1), pp. 1–25.
budgen, d. and brereton, p. (2006) ‘performing systematic literature reviews in software engineering,’ in proceedings of the 28th international conference on software engineering - icse ’06. new york, new york, usa: acm press, p. 1051. doi: 10.1145/1134285.1134500.
burhanuddin, m. a. et al. (2017) ‘internet of things architecture: current challenges and future direction of research,’ international journal of applied engineering research, 12(21), pp. 11055–11061.
carcary, m. et al. (2018) ‘exploring the determinants of iot adoption: findings from a systematic literature review,’ in zdravkovic, j. et al. (eds) ceur workshop proceedings. cham: springer international publishing (lecture notes in business information processing), pp. 113–125. doi: 10.1007/978-3-319-99951-7_8.
chasaki, d. and mansour, c. (2015) ‘security challenges in the internet of things,’ international journal of space-based and situated computing, 5(3), p. 141. doi: 10.1504/ijssc.2015.070945.
chen, m. (2012) ‘machine-to-machine communications: architectures, standards, and applications,’ ksii transactions on internet and information systems, 6(2), pp. 480–497. doi: 10.3837/tiis.2012.02.002.
chengen wang, zhuming bi and li da xu (2014) ‘iot and cloud computing in automation of assembly modeling systems,’ ieee transactions on industrial informatics. ieee, 10(2), pp. 1426–1434. doi: 10.1109/tii.2014.2300346.
chlamtac, i., conti, m. and liu, j. j. n. (2003) ‘mobile ad hoc networking: imperatives and challenges,’ ad hoc networks, 1(1), pp. 13–64. doi: 10.1016/s1570-8705(03)00013-1.
cicirelli, f. et al. (2018) ‘a metamodel framework for edge-based smart environments’, in 2018 ieee international conference on cloud engineering (ic2e). ieee, pp. 286–291. doi: 10.1109/ic2e.2018.00067.
cirani, s., ferrari, g. and veltri, l. (2013) ‘enforcing security mechanisms in the ip-based internet of things: an algorithmic overview,’ algorithms. multidisciplinary digital publishing institute, 6(2), pp. 197–226. doi: 10.3390/a6020197.
cisco (2014) leading tools manufacturer transforms operations with iot. available at: http://www.cisco.com/c/dam/en_us/solutions/industries/docs/manufacturing/c36-732293-00-stanley-cs.pdf.
datta, s. k. et al. (2017) ‘vehicles as connected resources: opportunities and challenges for the future,’ ieee vehicular technology magazine, 12(2), pp. 26–35. doi: 10.1109/mvt.2017.2670859.
dunkels, a. and vasseur, j. (2008) the internet of things: ip for smart objects, ipso alliance white paper.
finkenzeller, k. (2010) rfid handbook: fundamentals and applications in contactless smart cards, radio frequency identification, and near-field communication. nj: wiley.
gil, d. et al. (2016) ‘internet of things: a review of surveys based on context-aware intelligent services,’ sensors, 16(7), p. 1069. doi: 10.3390/s16071069.
gluhak, a. et al. (2011) ‘a survey on facilities for experimental internet of things research,’ ieee communications magazine, 49(11), pp. 58–67. doi: 10.1109/mcom.2011.6069710.
gubbi, j. et al.
(2013) ‘internet of things (iot): a vision, architectural elements, and future directions,’ future generation computer systems, 29(7), pp. 1645–1660. doi: 10.1016/j.future.2013.01.010.
gusmeroli, s., sundmaeker, h. and bassi, a. (2015) ‘internet of things strategic research roadmap,’ the cluster of european research projects, tech. rep., pp. 9–52.
hackmann, g. et al. (2008) ‘a holistic approach to decentralized structural damage localization using wireless sensor networks,’ in 2008 real-time systems symposium. ieee, pp. 35–46. doi: 10.1109/rtss.2008.40.
holler, j. et al. (2014) from machine-to-machine to the internet of things. elsevier. doi: 10.1016/c2012-0-03263-2.
hur, j. and kang, k. (2012) ‘dependable and secure computing in medical information systems,’ computer communications. elsevier b.v., 36(1), pp. 20–28. doi: 10.1016/j.comcom.2012.01.006.
ieee (2004) guide to the software engineering body of knowledge, ieee. ieee computer society press. available at: http://www.computer.org/portal/web/swebok.
infso d.4 (2008) ‘networked enterprise and rfid infso g.2 micro and nanosystems’, co-operation with the working group rfid of the etp eposs, internet of things in 2020, roadmap for the future, version 1.1, 2020(4).
itu (2005) itu internet report 2005: the internet of things. doi: 10.1038/nphys3028.
jardim-goncalves, r. et al. (2013) ‘systematisation of interoperability body of knowledge: the foundation for enterprise interoperability as a science,’ enterprise information systems. taylor & francis, 7(1), pp. 7–32. doi: 10.1080/17517575.2012.684401.
kannry, j. et al. (2007) ‘small-scale testing of rfid in a hospital setting: rfid as bed trigger,’ amia annual symposium proceedings, pp. 384–388. available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2813671&tool=pmcentrez&rendertype=abstract.
koren, i. and krishna, c. m. (2007) ‘fault-tolerant systems’. elsevier. available at: https://ebookcentral.proquest.com/lib/feupebooks/reader.action?docid=294597&query=.
kraijak, s. and tuwanut, p. (2016) ‘a survey on the internet of things architecture, protocols, possible applications, security, privacy, real-world implementation and future trends,’ international conference on communication technology proceedings, icct, 2016-february, pp. 26–31. doi: 10.1109/icct.2015.7399787.
larrucea, x. et al. (2017) ‘software engineering for the internet of things,’ ieee software, 34(1), pp. 24–28. doi: 10.1109/ms.2017.28.
lee, i. and sokolsky, o. (2010) ‘medical cyber-physical systems,’ in proceedings of the 47th design automation conference on dac ’10. new york, new york, usa: acm press, p. 743. doi: 10.1145/1837274.1837463.
li, s., xu, l. da and zhao, s. (2015) ‘the internet of things: a survey,’ information systems frontiers, 17(2), pp. 243–259. doi: 10.1007/s10796-014-9492-7.
madakam, s., ramaswamy, r. and tripathi, s. (2015) ‘internet of things (iot): a literature review,’ journal of computer and communications, 3(5), pp. 164–173. doi: 10.4236/jcc.2015.35021.
matalonga, s., rodrigues, f. and travassos, g. (2015) ‘challenges in testing context-aware software systems,’ in 9th workshop on systematic and automated software testing. belo horizonte, brazil, pp. 51–60.
matalonga, s., rodrigues, f. and travassos, g. h. (2017) ‘characterizing testing methods for context-aware software systems: results from a quasi-systematic literature review,’ journal of systems and software. elsevier inc., 131, pp. 1–21. doi: 10.1016/j.jss.2017.05.048.
miorandi, d. et al.
(2012) ‘internet of things: vision, applications and research challenges,’ ad hoc networks. elsevier b.v., 10(7), pp. 1497–1516. doi: 10.1016/j.adhoc.2012.02.016.
miranda, j. et al. (2015) ‘from the internet of things to the internet of people’, ieee internet computing, 19(2), pp. 40–47. doi: 10.1109/mic.2015.24.
motta, r. c., oliveira, k. m. de and travassos, g. h. (2016) ‘characterizing interoperability in context-aware software systems,’ in 2016 vi brazilian symposium on computing systems engineering (sbesc). ieee, pp. 203–208. doi: 10.1109/sbesc.2016.039.
motta, r. c., de oliveira, k. m. and travassos, g. h. (2018) ‘on challenges in engineering iot software systems,’ in proceedings of the xxxii brazilian symposium on software engineering - sbes ’18. new york, new york, usa: acm press, pp. 42–51. doi: 10.1145/3266237.3266263.
nami, m. r. and sharifi, m. (2007) ‘a survey of autonomic computing systems,’ in intelligent information processing iii. boston, ma: springer us, pp. 101–110. doi: 10.1007/978-0-387-44641-7_11.
nuzzo, p. and sangiovanni-vincentelli, a. (2014) ‘let’s get physical: computer science meets systems,’ in from programs to systems. the systems perspective in computing. springer, pp. 193–208. doi: 10.1007/978-3-642-54848-2_13.
o’reilly, t. and pahlka, j. (2009) ‘the web squared era,’ forbes, september 2009.
panetto, h. and cecil, j. (2013) ‘information systems for enterprise integration, interoperability, and networking: theory and applications,’ enterprise information systems. taylor & francis, 7(1), pp. 1–6. doi: 10.1080/17517575.2012.684802.
patel, p. and cassou, d. (2015) ‘enabling high-level application development for the internet of things’, journal of systems and software. elsevier ltd., 103, pp. 62–84. doi: 10.1016/j.jss.2015.01.027.
perera, c. et al. (2014) ‘context-aware computing for the internet of things: a survey,’ ieee communications surveys & tutorials, 16(1), pp. 414–454. doi: 10.1109/surv.2013.042313.00197.
rajkumar, r. et al. (2010) ‘cyber-physical systems,’ in proceedings of the 47th design automation conference on dac ’10. new york, new york, usa: acm press, p. 731. doi: 10.1145/1837274.1837461.
ray, p. p. (2018) ‘a survey on internet of things architectures,’ journal of king saud university - computer and information sciences. king saud university, 30(3), pp. 291–319. doi: 10.1016/j.jksuci.2016.10.003.
sampigethaya, k., poovendran, r. and bushnell, l. (2008) ‘secure operation, control, and maintenance of future e-enabled airplanes,’ proceedings of the ieee, 96(12), pp. 1992–2007. doi: 10.1109/jproc.2008.2006123.
santos, i. de s. et al. (2017) ‘test case design for context-aware applications: are we there yet?’, information and software technology. elsevier b.v., 88, pp. 1–16. doi: 10.1016/j.infsof.2017.03.008.
satyanarayanan, m. (2001) ‘pervasive computing: vision and challenges,’ ieee personal communications, 8(4), pp. 10–17. doi: 10.1109/98.943998.
schmidt, a. and van laerhoven, k. (2001) ‘how to build smart appliances?’, ieee personal communications. ieee, 8(4), pp. 66–71. doi: 10.1109/98.944006.
sethi, p. and sarangi, s. r. (2017) ‘internet of things: architectures, protocols, and applications’, journal of electrical and computer engineering, 2017. doi: 10.1155/2017/9324035.
sharma, v., gusain, p. and kumar, p. (2013) ‘near field communication,’ setlabs briefings, 2013(cac2s), pp. 342–345.
singh, d., tripathi, g. and jara, a. j.
(2014) ‘a survey of internet-of-things: future vision, architecture, challenges and services,’ 2014 ieee world forum on internet of things, wf-iot 2014, pp. 287–292. doi: 10.1109/wf-iot.2014.6803174.
skiba, d. j. (2013) ‘the internet of things (iot),’ nursing education perspectives, 34(1), pp. 63–64. doi: 10.5480/1536-5026-34.1.63.
spínola, r. o., pinto, f. c. r. and travassos, g. h. (2008) ‘supporting requirements definition and quality assurance in ubiquitous software project,’ in communications in computer and information science, pp. 587–603. doi: 10.1007/978-3-540-88479-8_42.
spínola, r. o. and travassos, g. h. (2012) ‘towards a framework to characterize ubiquitous software projects,’ information and software technology, 54(7), pp. 759–785. doi: 10.1016/j.infsof.2012.01.009.
strauss, a. and corbin, j. (1990) basics of qualitative research: techniques and procedures for developing grounded theory. newbury park: sage publications, inc.
trappey, a. j. c. et al. (2017) ‘a review of essential standards and patent landscapes for the internet of things: a key enabler for industry 4.0’, advanced engineering informatics. elsevier ltd, 33, pp. 208–229. doi: 10.1016/j.aei.2016.11.007.
vermesan, o., friess, p., guillemin, p., gusmeroli, s., sundmaeker, h., bassi, a., jubert, i. s., mazura, m., harrison, m. and others (2009) ‘towards the web of things: web mashups for embedded devices’, workshop on mashups, enterprise mashups and lightweight composition on the web (mem 2009), pp. 1–8.
wan, j. et al. (2013) ‘from machine-to-machine communications towards cyber-physical systems,’ computer science and information systems, 10(3), pp. 1105–1128. doi: 10.2298/csis120326018w.
wark, t. et al. (2007) ‘the design and evaluation of a mobile sensor/actuator network for autonomous animal control,’ in 2007 6th international symposium on information processing in sensor networks. ieee, pp. 206–215. doi: 10.1109/ipsn.2007.4379680.
weiser, m. et al. (1999) ‘the origins of ubiquitous computing research at parc,’ ibm systems journal, 38(4), pp. 693–696. doi: 10.1147/sj.384.0693.
werner-allen, g. et al. (2006) ‘deploying a wireless sensor network on an active volcano,’ ieee internet computing, 10(2), pp. 18–25. doi: 10.1109/mic.2006.26.
whitmore, a., agarwal, a. and da xu, l. (2015) ‘the internet of things—a survey of topics and trends,’ information systems frontiers, 17(2), pp. 261–274. doi: 10.1007/s10796-014-9489-2.
wohlin, c. (2014) ‘guidelines for snowballing in systematic literature studies and a replication in software engineering,’ proceedings of the 18th international conference on evaluation and assessment in software engineering - ease ’14, pp. 1–10. doi: 10.1145/2601248.2601268.
wortmann, a., combemale, b. and barais, o. (2017) ‘a systematic mapping study on modeling for industry 4.0’, in 2017 acm/ieee 20th international conference on model driven engineering languages and systems (models). ieee, pp. 281–291. doi: 10.1109/models.2017.14.
xianrong zheng et al. (2014a) ‘cloud service negotiation in internet of things environment: a mixed approach,’ ieee transactions on industrial informatics, 10(2), pp. 1506–1515. doi: 10.1109/tii.2014.2305641.
xianrong zheng et al. (2014b) ‘cloudqual: a quality model for cloud services,’ ieee transactions on industrial informatics. ieee, 10(2), pp. 1527–1536. doi: 10.1109/tii.2014.2306329.
zambonelli, f.
(2016) ‘towards a general software engineering methodology for the internet of things.’ available at: http://arxiv.org/abs/1601.05569.

journal of software engineering research and development, 2022, 10:10, doi: 10.5753/jserd.2022.2554
this work is licensed under a creative commons attribution 4.0 international license.

on the use of uml in the brazilian industry: a survey

ed wilson júnior [ universidade do vale do rio dos sinos | edwjr7@edu.unisinos.br ]
kleinner farias [ universidade do vale do rio dos sinos | kleinnerfarias@unisinos.br ]
bruno da silva [ california polytechnic state university | bcdasilv@calpoly.edu ]

abstract

over the past decade, uml modeling has been used in industry in software development tasks, such as documenting design decisions and promoting better communication between teams, as pointed out in recent studies. however, little is known about the factors, practitioners’ perceptions, and practices that affect uml use in real-world projects. this article, therefore, reports exploratory research focused on investigating how uml is used in practice in the brazilian software industry. in total, 376 professionals from 210 information technology companies answered an online questionnaire about the factors affecting use, difficulty and frequency of use, perceived benefits, and contextual factors that prevent the adoption of uml models. in addition, 20 professionals participated in a semi-structured interview, answering basic questions about professional experience, vision on software modeling, use of tools, and other aspects of uml. the main results show that 74% of the participants answered that they do not use uml frequently. factors such as (1) high time pressure to develop features, (2) the cost of disseminating a common model understanding among diverse audiences, and (3) the difficulty of evaluating the quality of the models affect the effective use of uml. in general, most participants know uml but do not use it frequently (or do not use it at all) in their projects. finally, this article outlines some challenges, implications and research directions that can be explored in upcoming studies for promoting uml modeling in practice.

keywords: uml, unified modeling language, practice, industry, survey

1 introduction

uml models can play a crucial role in software development tasks such as documenting design decisions and promoting better communication within and across teams (omg, 2017). some previous studies (bucchiarone et al., 2021; fernández-sáez et al., 2018; chaudron et al., 2012) highlight that the use of uml modeling can provide benefits to the software development process, such as providing a common understanding among team members, understanding the details of design decisions, and ultimately making the process more efficient after all. however, in practice, such benefits are often overlooked or not observed. some studies (fernández-sáez et al., 2018; chaudron et al., 2012; störrle, 2017) argue that such benefits can be realized when there is a consistent and (in)formal application of modeling, where developers typically use uml throughout the project and have precise control over its use. as we can rarely find such a scenario, researchers (fernández-sáez et al., 2018; petre, 2014) have tried to draw a clear picture of uml use in real-world projects.
today, the current literature (akdur et al., 2021; fernández-sáez et al., 2018; petre, 2014; chaudron et al., 2012) lacks a broad and exploratory understanding of practitioners' perceptions of the factors that affect or even compromise the adoption of uml modeling in real-world projects. more specifically, little is known about how practitioners deal with software modeling in the context of the brazilian software development industry. previous studies (petre, 2014, 2013) have focused on collecting opinions from participants to understand which uml diagrams are most used. however, this assumes that participants' perceptions and experiences worldwide match those at the regional level (i.e., country or significant geographic region). these studies neither explore, for example, whether the project context can influence uml adoption nor discuss practitioners' views on the perceived usefulness of uml itself. this article investigates the state of the practice regarding the use of uml in the brazilian industry by surveying and interviewing software practitioners in that country. specifically, this work seeks to investigate (1) how practitioners use uml and (2) the relevance of its use in real-world software projects. therefore, this study surveyed 376 professionals from 210 brazilian information technology companies. we selected participants based on two criteria: (1) level of knowledge and practical experience related to software modeling; and (2) programming experience in regular projects. participants answered an online questionnaire about their experience with uml, the difficulties of adopting it, factors that affect its practical use, frequency of use, and the benefits it brings (or could bring). also, in the second phase, we interviewed 20 participants following a semi-structured interview protocol to further explore the survey results. our findings are encouraging: they help bridge the literature gap regarding the impact of organizational culture on uml use, analyze the factors that hinder uml use, and clarify the broader landscape of uml adoption. some evidence already reported in the literature is reinforced. this study can help companies and software practitioners understand the broader landscape of uml use, thus supporting their future decision-making around software practices and techniques in their future projects. academia and industry can benefit from our insights on how to improve software modeling practices or develop new tools and processes. besides, this study also benefits researchers and practitioners by providing additional empirical knowledge about practical issues concerning uml modeling in a broader view. this article is an extended version of our previous work (júnior et al., 2021) in several ways. first, the article underwent a careful review and was significantly improved as a whole. second, the research protocol was improved by adding the list of interview questions and considering the location of the companies where the participants work. third, the number of survey participants increased from 314 to 376 (i.e., 62 new participants), new findings were generated from this larger sample, and the discussions regarding the six research questions were made more thorough.
in addition, this article presents additional discussions, identifies open challenges and implications, and describes the key underlying issues that need to be addressed in future investigations. the article is structured in seven sections. section 2 discusses related work. section 3 details the adopted methodology. section 4 describes the results for each research question. section 5 brings up qualitative reflections and insights for future work. section 6 presents the main threats to the study's validity. section 7 wraps up the article and includes some ideas for future work. 2 related work the selection of related works was performed in two steps: (1) an initial search in digital repositories, such as google scholar (https://scholar.google.com/) and scopus (https://www.scopus.com/), was done to identify articles regarding uml usage and surveys in this research field; and (2) filtering of the selected articles considering their alignment with the objective of our article (section 3.1). we selected studies from 2014 until now, as our study builds on the findings reported in petre (2014). after that, nine studies were analyzed (section 2.1) and compared to identify research opportunities (section 2.2). 2.1 analysis of related works petre (2014). this work performed an empirical study about the use of uml in practice, involving interviews conducted over two years with more than fifty software developers. the participants were mainly from north america and europe, but some were from brazil, india, and japan, and many had worked in more than one country. petre found that participants did not use uml universally but used it consistently in specific contexts such as embedded systems (e.g., automotive, aerospace, etc.). in addition, petre reported that uml models are not used homogeneously; on the contrary, the interviewees reported heterogeneity in how the models are used in practice. typically, interviewees assumed different roles throughout the development cycle, using uml models differently in each role. petre also reported that the way practitioners used uml diagrams depended on the problem domain faced. ozkaya and erata (2020). this research involved 109 professionals from 34 countries, representing different profiles, positions, types of software projects, and years of experience, to understand how professionals use uml to model software architecture from different viewpoints: functional, information, concurrency, development, deployment, and operational. they found that the information and functional viewpoints are the most popular ones. moreover, the obtained results showed that most participants (88%) used uml when they needed to model system architecture from different viewpoints. fernández-sáez et al. (2015). this study presents a survey on the use of uml in software maintenance. they surveyed 178 practitioners working on software maintenance projects in 12 different countries. their results indicate that companies can improve system maintenance by leveraging uml diagrams while executing maintenance tasks; however, it would require a significant effort to update uml diagrams as the source code evolves. farias et al. (2018). we reported research findings from a shorter survey to identify uml use in practice in the brazilian industry.
two hundred and twenty-two practitioners from 140 different information technology companies answered a questionnaire concerning their experiences with uml, the difficulty in adopting it, and what should be done to increase adoption in practice. the results show that: (1) only 60 participants (28.2%) had used uml in their daily work; (2) 55.41% of the surveyed participants did not disagree with the statement that uml is the "lingua franca" in software modeling; (3) 61.26% reported finding that the automatic creation of uml diagrams to represent a big picture of the system under development would be useful to boost uml use. ciccozzi et al. (2019). this work carried out a systematic review that involved 63 research studies and 19 tools from more than 5400 initial entries. the objective was to identify, classify, and evaluate the existing solutions for uml model execution (i.e., automatically interpreting or translating models into running software). the main results of this study are: (1) there is a growing scientific interest in the execution of uml models; (2) model-level debugging is supported in very few cases; (3) only a few studies provide evidence of industrial use, with very limited empirical assessments; and (4) the most common limitation is the coverage of the uml language. störrle (2017). this article conducted an online survey involving 82 professionals to determine whether and to what extent they use conceptual models and for what purposes. specifically, the author sought to grasp (1) whether practitioners use uml and bpmn (business process model and notation) for software modeling; (2) for what purposes these modeling languages are used; (3) what the different ways of using these models in practice are; and (4) how often practitioners use these modeling languages. störrle found that models are perceived to be widely used by study participants, and uml is the leading language. störrle reported three distinct usage modes of models, of which the most frequent is informal usage for communication and cognition. fernández-sáez et al. (2018). this study performed a case study in a multinational company's ict department and involved 31 interviews with employees who work on software maintenance projects. the study mainly focused on the use of uml in software maintenance. they found that using software modeling notations such as uml is considered beneficial for software maintenance but needs to be tailored to its context. the authors also provided a list of recommended practices that contribute to the increased effectiveness of software modeling. ho-quang et al. (2017). the authors conducted a large-scale survey with 485 responses from contributors from 458 different open source projects. in that context, they found that collaboration was the most important motivation for using uml in open source projects, as teams use uml during communication and planning of joint implementation efforts. uml models seem to benefit new contributors' onboarding but do not seem to be a significant factor in attracting new contributors. neto et al. (2021). this study presents an overview of the adoption of uml in it companies in são carlos (brazil) and the surrounding region through a survey of 21 questions answered by 24 participants. it also aims to compare how the language is taught in universities.
the results show a significant use of uml, including in companies that adopt agile methods, and the authors suggest that uml content be preserved in the curricula of educational institutions, in an updated and optimized way, in line with the trends presented by it companies. the study also points out that the opportunities in the area of modeling, given the mastery of agile methodologies and the trend of continuous acceleration of processes, are vast. one of them is adapting uml modeling to agile methodologies without consuming the most valued asset in these methodologies: time. 2.2 comparative analysis and opportunities six comparison criteria (cc) were defined to assist in identifying similarities and differences between the proposed work and the selected articles. this comparison is crucial to identify research opportunities using objective rather than subjective criteria. we describe the six comparison criteria below:
• context (cc01): studies that involved professionals in the brazilian industry.
• participant profile (cc02): studies that collected participant data for screening and profile characterization.
• specific geographic region (cc03): works that explored uml use in a specific regional scope.
• applicability of uml (cc04): studies that evaluated which factors prevent the adoption of uml in the software industry.
• interviews with participants (cc05): studies that triangulated quantitative and qualitative data.
• different domains (cc06): studies that involved software developers working in different problem domains or business segments.
table 1 compares the selected papers against our work, summarizing whether each one meets the criteria completely, partially, or not at all, thus highlighting the similarities and differences between them. we observe that only our work fulfills all criteria. in this sense, two research opportunities were identified: (1) few studies broadly inspect the adoption of uml models from the perspective of the brazilian industry; and (2) no study produced empirical evidence from a survey and interviews conducted at the same time. the next section outlines a methodology to explore these identified research opportunities. table 1. comparative analysis of the selected related works, rating each study against cc1–cc6 (legend: ● completely meets; ◐ partially meets; ○ does not meet); the proposed work is the only one rated ● on all six criteria. 3 methodology this section presents the research methodology followed for conducting our survey. this protocol was formulated based on well-known guidelines (wohlin et al., 2012; kitchenham and pfleeger, 2008) for designing and running empirical studies, as well as on our experience in carrying out previous surveys (farias et al., 2018; júnior et al., 2021). this section is organized as follows. section 3.1 introduces the main objective and research questions. section 3.2 describes the adopted experimental process. section 3.3 describes the questionnaire and interview formulated and applied in the study.
3.1 objective and research questions the study objectives are twofold: (1) to understand the diffusion and relevance of the use of uml in the brazilian industry; and (2) to analyze at what level developers understand the benefits of uml in real-world projects. we formulated six research questions (rq) to analyze different facets of these objectives. table 2 describes the formulated rqs.
table 2. research questions investigated in this article
rq1: what factors influence the effective use of uml? (motivation: reveal the influencing factors in a broader usage of uml models in practice; variable: usage-influencing factors)
rq2: what makes uml modeling a challenging practice? (motivation: understand the challenges practitioners face that hinder the adoption of uml modeling; variable: adoption-hindering factors)
rq3: what benefits do practitioners realize when it comes to using uml? (motivation: reveal the most commonly realized benefits when using uml modeling; variable: perceived benefits)
rq4: how often do practitioners use uml? (motivation: understand how often practitioners use uml modeling; variable: frequency of use)
rq5: how does the context of software projects limit the use of uml in organizations? (motivation: identify context factors that limit the use of uml in organizations; variable: project context)
rq6: how do practitioners view uml modeling? (motivation: reveal the practitioners' vision regarding the adoption of uml modeling; variable: practitioner view)
3.2 experimental process figure 1 introduces the adopted experimental process, composed of three phases discussed as follows. figure 1. experimental process. phase 1: selection of participants. participants were selected based on the following criteria: level of knowledge, practical experience related to software modeling, and programming experience in industrial software development projects. using these criteria, we sought to select participants with academic backgrounds and practical experience in the brazilian industry. this set of all possible participants represents the target population (kitchenham and pfleeger, 2008; wohlin et al., 2012). more specifically, the target population comprises practitioners working in brazil, including developers (of different seniority levels), software architects, and project managers, with academic backgrounds obtained from brazilian universities. this population represents those who are in a position to answer the questions asked and to whom the research results apply (kitchenham and pfleeger, 2008; wohlin et al., 2012). in total, 376 participants answered the questionnaire. phase 2: application of the questionnaire and interviews. this phase focused on administering the questionnaire and conducting the interviews. we conducted interviews to collect additional qualitative data related to the research questions. such data is essential to triangulate the results (section 4) obtained from our questionnaire and interviews. the questionnaire (discussed in section 3.3) was sent by e-mail to the target population, totaling more than 406 invited people. in total, the study had 376 participants. we carefully selected the target population to avoid collecting data from people with inadequate profiles. we invited undergraduates, graduate students (master's and doctorate), industry professionals with a recognized academic background, and professionals identified on professional social networks such as linkedin. the 376 participants worked in 210 companies in different brazilian regions (midwest, south, southeast, and northeast).
after the questionnaire stage was completed, we randomly invited 27 participants (out of 376) for a semi-structured interview (wohlin et al., 2012; farias et al., 2015). 20 participants, referred to as p1–p20 hereafter, accepted the invitation. the script was direct, starting from basic questions about professional experience, the vision of software modeling, the use of tools, and other aspects of uml. the interviews were performed and recorded using the microsoft teams software. in a further step, we triangulated the qualitative and quantitative data from the interviews and the questionnaire to explore complementary aspects of the data. phase 3: data analysis. this phase sought to carefully analyze the data collected through the questionnaire and interviews. for this, we first analyzed the collected data (interviews and survey) separately and then compared them (triangulation). initially, we analyzed the data collected through the survey and tabulated it. then, we used those initial survey results as the basis to formulate the interview questions. therefore, the interviewees answered questions that sought to explore the results obtained through the survey more deeply, seeking consistency in the data analysis. the investigation provided interaction and reflection between the researchers and the participants through a dialectical process. we performed the interview data analysis manually and went from a broad view to a more focal one without divergences. that helped us obtain complementary evidence to explain the quantitative results and then derive concrete conclusions from a chain of evidence formed by the systematic alignment of quantitative and qualitative data. 3.3 questionnaire and interviews data were collected from interviews and an online questionnaire created in google forms (https://forms.gle/tfrwsgj7ufucpafn7). the study repository (https://github.com/edwjr/surveyquestionnaire) has more information. participants reflected on their experience with uml software modeling in practice through our semi-structured interviews. table 3 presents the list of questions used in the interview. these interviews helped us to enrich the body of qualitative data. the authors asked a list of predefined questions of all respondents, and new questions were formulated based on the answers given by the participants. we chose the online survey instrument because it enabled quick application and fast distribution, thus reaching a larger number of individuals in geographically diverse locations at no additional cost. the survey questions examined research gaps identified in previous studies and retained the structure of our previously developed questionnaire. in addition, we based the design of the questionnaire and interview questions on the findings reported by petre (2014).
table 3. list of questions used in the interview
q1 which company do you currently work for?
q2 what is your view on software modeling?
q3 how is uml used where you work?
q4 what is the main difficulty in using uml?
q5 why do developers tend not to use uml in organizations?
q6 when is the use of uml worth it?
q7 do you use any specific software modeling tools to visualize and edit diagrams?
q8 how often do you not consult the software documentation and work directly with source code?
q9 how much effort do you put into reading uml diagrams?
q10 what improvements should be made to enhance the use of uml?
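as an aside, the per-question tabulation described in phase 3 can be reproduced with a few lines of code. the sketch below is ours and purely illustrative (the likert-style answer scale and the sample data are hypothetical, not the study's raw data); it computes the percentage distribution of answers for one questionnaire item, the same kind of figure reported in section 4:

    from collections import Counter

    # hypothetical likert-style answers collected for one questionnaire item
    answers = [
        "fully agree", "partially agree", "fully agree", "neutral",
        "partially agree", "partially disagree", "fully agree",
    ]

    counts = Counter(answers)               # tally each answer option
    total = sum(counts.values())
    for option, n in counts.most_common():  # print distribution, largest first
        print(f"{option}: {n} ({100 * n / total:.1f}%)")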
4 results this section presents the obtained results concerning the formulated research questions (described in section 3.1). we used histograms to provide an overview of the data collected from the responses of 376 survey participants and 20 interviews. 4.1 analysis of the participants' profile table 4 summarizes the participants' profile, reporting different facets including education, undergraduate degree, job role, overall experience, professional experience with software modeling, experience with software development, and location. the 376 participants who responded to the survey came from 210 companies in brazil (at the time of data collection). as some questions were not required, the sum (n) is not necessarily equivalent to the total number of participants (376).
table 4. the profile data of the participants (n=376; answer, count, percentage)
education: technical certificate, 77 (20.6%); undergraduate student, 117 (31.2%); graduate, 138 (36.9%); specialization, 22 (7.9%); master, 14 (3.7%)
undergraduate degree: systems analysis, 195 (51.9%); computer science, 108 (28.7%); information systems, 42 (11.2%); others, 31 (8.2%)
position: developer, 187 (50.7%); systems analyst, 87 (23.6%); software architect, 9 (2.4%); manager, 7 (1.9%); others, 79 (19.6%)
overall experience: < 2 years, 138 (37.5%); 2-4 years, 129 (35.1%); 5-6 years, 56 (15.2%); 7-8 years, 10 (2.7%); > 8 years, 18 (4.9%)
professional experience with software modeling: < 2 years, 227 (61.2%); 2-4 years, 91 (24.5%); 5-6 years, 25 (6.7%); 7-8 years, 10 (2.7%); > 8 years, 18 (4.9%)
professional experience with software development: < 2 years, 126 (34.1%); 2-4 years, 120 (32.5%); 5-6 years, 54 (14.6%); 7-8 years, 28 (7.6%); > 8 years, 41 (11.1%)
geographical distribution of companies: northeast, 3 (1%); midwest, 31 (15%); south, 102 (42%); southeast, 13 (6%); more than one location, 61 (29%)
education. the majority (68.1%) either had already graduated from college (36.9%) or were pursuing a degree as undergraduate students (31.2%), while 11.6% had already completed either a postgraduate specialization (7.9%) or a master's degree (3.7%) in the field of computing. 20.6% of the participants were "certified technicians" in the field of computing (in brazil, some schools offer high school degrees with an additional professional/technical certificate). only one participant did not earn an undergraduate degree in computing but rather in mathematics, subsequently pursuing a master's degree in applied computing. regardless of their level of education, all participants were professionals with experience in the industry. undergraduate degree. most participants (91.8%) had an undergraduate degree in computing. in brazil, universities offer computing degrees under different names, including systems analysis (51.9%), computer science (28.7%), and information systems (11.2%). this shows our participant pool has a strong academic background, which complements the participants' practical experience. considering their job roles, 50.7% were software developers, 23.6% were systems analysts, 2.4% were software architects, and 1.9% were managers. thus, about 80% of the participants were in job positions directly related to software development practices. overall experience. the experience level is diverse in our participant pool, with a higher concentration in the 2 to 6 years range (62.5%), while 7.6% had seven years or more of overall professional experience. modeling experience. regarding the characteristics of modeling experience, participants were experienced, but not highly, with software modeling.
a lack of experience would be the expected result, since previous empirical studies point to low adoption of uml models in industry. about 38% of the participants had more than two years of professional experience in software modeling, while the others reported less than two years of experience. development experience. regarding software development, participants overall reported more years of experience compared to software modeling (when software modeling is considered a separate activity). as expected, practitioners are generally more exposed to programming tasks than modeling tasks. that is why we see more years of experience in "software development" than in "software modeling" when these are considered separate activities. geographical distribution of companies. regarding work location, our participants came from 210 different companies located in all regions of the country except the northern region. the largest concentration was in the southern region, with 102 companies, representing 42% of the sample. the midwest and southeast regions accounted for 15% (31) and 6% (13), respectively, and the northeast region represented 1% (3). companies located in more than one region represent 29% (61). given the participant demographics, we consider the participants' profile adequate to answer the research questions of our study for two main reasons. first, the participants came from a diverse set of companies (210), avoiding responses biased by experiences obtained in a limited set of companies. the large number of companies also increases the chances of participants having experience in diverse business contexts and organizational cultures, thus improving the quality of the signal we can get in the study. second, all the participants had some formal education in computing, thus increasing the chances that they had some level of training in software modeling. this reduces the risk of biased answers from participants who had never known uml or heard about software modeling before the survey. moreover, the 20 interviewed participants reported modeling experience greater than five years, and they worked in software development in areas such as education (4 participants), agribusiness (3), e-commerce (2), government (3), trading (3), product exports (2), and finance (3). that diversity of areas, experience, and knowledge enriched the discussion. for ethical and privacy reasons, we chose not to present the names of the companies where participants worked. the following sections discuss the obtained results organized by research question. 4.2 rq1: what factors influence the effective use of uml? figure 2 presents the collected data concerning the uml usage-influencing factors (rq1). we explored three factors to answer rq1: (a) time pressure that leads developers not to do software modeling, focusing only on working on the code; (b) the cost of promoting a common model understanding among the involved people with different levels of education/experience; and (c) the difficulty in assessing the quality of the created models. time. figure 2(a) indicates that 52% of the survey participants and 18 of the 20 interviewees reinforced that short development time and high demands are the main factors influencing the use of uml, since the software systems being developed grow larger and more complex every day due to increasing customer demand.
"currently the projects are large and with a very short delivery time, you can barely deliver 100% software, imagine a documentation that would have to be updated at every step" (p17). this also leads to complex software projects that cannot be easily managed by project stakeholders and causes software systems to be delivered late (or with budget overruns) or incorrectly developed (ozkaya and erata, 2020). consequently, developers end up opting for other complementary methods, such as screen prototyping, or not creating uml models at all. cost of promoting understanding. figure 2(b) shows that most of the participants either fully agree (34%) or partially agree (34%) that the cost of promoting a common understanding among team members is a significant influencing factor on uml use. conversely, when we approached the interviewees with this question, most of them (12 out of 20) considered that the cost of promoting accurate modeling understanding between different people with different levels of education/experience and viewpoints is low, diverging from the survey data. this divergence possibly emerged because most interviewees worked in teams where all members had the same level of experience/training, thus leading to a smoother alignment regarding model understanding. the academic skill set, that is, where and how stakeholders have learned software modeling, influences their modeling approaches and their relevant practices through the modeling experience (akdur et al., 2017). difficulty evaluating. figure 2(c) shows that the difficulty in evaluating the quality of uml models is another significant usage-influencing factor (21% fully agree, 40% partially agree). data from the interviews also supported the difficulty in evaluating the created models and identified it as one of the factors that affect the effective use of uml in the industry. moreover, the results on the usage-influencing factors support previous findings (chaudron et al., 2012; fernández-sáez et al., 2015; bucchiarone et al., 2021; störrle, 2017). bucchiarone et al. (2021) advocate that stakeholders model informally to support communicative and cognitive processes, using emergent and flexible graphical notations in the early stages of the software development process. störrle (2017) also indicates that informal modeling (e.g., sketching on a whiteboard) is considered more effective in promoting communication, collaboration, and understanding. however, it is worth noting that such diagrams can be scrapped or become inaccurate since they are not maintained together with the updated source code. jackson (2019) points out that informal representations can be a good start for modeling, but they are limited, give inconsistent interpretations, and cannot be analyzed mechanically. additionally, previous experimental studies such as (ho-quang et al., 2017; petre, 2014; scanniello et al., 2014) revealed issues that challenge uml's effectiveness: for instance, the complexity of the uml notation as a whole, the preference for other modeling approaches (e.g., informal sketches), and the fact that certain problem domains or industries might be more suitable than others for uml modeling. nevertheless, professionals have developed ad hoc practices that employ uml models in reasoning and communication about design, both individually and in collaborative dialogue. on the other hand, in some scenarios and industries, models can be transformed into programs using the proper tools. in such cases, models have a longer service life and must be kept up to date.
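as a minimal illustration of the lightweight end of that tooling spectrum, the sketch below is our own (not an artifact of the study; the Order class is hypothetical): it derives a plantuml-like class description from python classes via reflection. regenerating the diagram source from the code on demand is one way to keep a model and its implementation from drifting apart:

    import inspect

    def class_to_uml(cls) -> str:
        """emit a plantuml-like class block for one python class."""
        lines = [f"class {cls.__name__} {{"]
        # public attributes declared via class-level annotations
        for name, ann in getattr(cls, "__annotations__", {}).items():
            lines.append(f"  +{name}: {getattr(ann, '__name__', ann)}")
        # public methods defined directly on the class
        for name, member in vars(cls).items():
            if callable(member) and not name.startswith("_"):
                params = ", ".join(
                    p for p in inspect.signature(member).parameters if p != "self"
                )
                lines.append(f"  +{name}({params})")
        lines.append("}")
        return "\n".join(lines)

    class Order:  # hypothetical domain class used only for illustration
        total: float
        def add_item(self, sku: str, qty: int) -> None: ...
        def checkout(self) -> bool: ...

    print(class_to_uml(Order))  # prints a class block ready for a uml renderer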
it is also often observed that different teams and sub-organizations within the same company can use different modeling approaches for different purposes at different stages of the software development lifecycle (heldal et al., 2016). therefore, either informal modeling or "traditional uml modeling" with automated code generation can become alternatives when time is a first-class constraint. figure 2. usage-influencing factors (rq1). summary of rq1: the results show that most participants indicate three points that affect the use of uml diagrams: (1) limited available time to create and maintain diagrams; (2) the high cost of promoting proper understanding among different people with different levels of education/experience and viewpoints; and (3) difficulty in evaluating the quality of the diagrams. we understand that companies may need different modeling practices for different projects or roles within projects. practitioners should consider those three points when considering uml modeling as part of their development processes. 4.3 rq2: what makes uml modeling a challenging practice? figure 3 shows the collected data regarding rq2. from the survey responses, we highlight three adoption-hindering challenges: (a) the company's culture, which affects the way uml is used; (b) the effort necessary to keep different uml diagrams in sync; and (c) the high effort to create and maintain the models. company culture. figure 3(a) indicates that 56% of the participants totally agree, 30% partially agree, and 10% were neutral. from the interviews, participants pointed out that, in some organizations, there is a culture of taking risks and failing as a path to learning quickly and meeting customer needs, even if it requires much rework, thus sometimes neglecting planning and upfront design. in addition, one of our interviewees mentioned: "i believe that the greatest difficulty is to change paradigms, especially when working with more mature teams that have grown without this modeling" (p4). although the current state of practice has reached some degree of automation in systems engineering, its tasks still require many human resources. thus, introducing process change in an organization already in operation is not easy (böhm et al., 2014). it is important to note that organizations may need different modeling approaches for different projects or even for different engineering roles within projects (akdur et al., 2021). as also described in heldal et al. (2016), different units within the same company tend to use different modeling approaches. in addition, in the same project, different engineers may use different modeling practices, depending on their tasks and responsibilities (akdur et al., 2021). synchronization of diagrams. figure 3(b) shows that 37% of the participants partially agree and 30% fully agree that keeping diagrams in sync is a significant challenge that hinders uml use, corroborating the majority of the interviewees (19). although collaborative tools for software modeling exist, our result reinforces the findings reported in other studies conducted with industry participants (chaudron et al., 2012; cicchetti et al., 2016; kuhn et al., 2012; liebel et al., 2018), which pointed out problems related to insufficient support for collaboration. there is a gap between uml tools and advanced solutions specialized in supporting collaboration.
in addition, the next generation of modeling tools should support round-trip engineering to synchronize related uml diagrams and source code. since modeling a software system's structural and behavioral aspects within a single model is not a trivial task, uml proposes a set of diagrams to support a multi-view modeling approach. thus, different aspects of the system under development are represented by different diagrams. high effort. figure 3(c) revealed that 41% totally agree, 38% partially agree, 13% are neutral, 7% partially disagree, and 1% totally disagree. therefore, the vast majority consider the effort invested in creating and maintaining uml models high, a view unanimously shared by the interviewees. "the biggest problem is the cost of keeping the diagrams as the system changes. in addition, it is still difficult to maintain a strong culture of maintenance and updating of models" (p17). another interviewee complements: "from a maintenance point of view, i think that some improvements would be necessary for the diagrams to provide a better figure of the big picture, allowing to identify more quickly relevant issues such as impact and points that can be taken into attention" (p4). ozkaya and erata (2020) mention that modeling software architectures with uml from the concurrency point of view attracts relatively little interest from professionals. one important reason could be uml's lack of support for modeling concurrency and race conditions. in addition, based on the findings of this study, most professionals are not used to planning development issues (e.g., source code organization and software construction and release processes) during modeling and design, and this is usually postponed until implementation. interviewee 11 reports: "uml is used at the beginning of the project, more specifically the projection phase, but with the progress being left aside, it ends up being outdated, since most developers focus only on the code and management does not make large charges on its use" (p11). in this context, fernández-sáez et al. (2015) pointed out that the modeling tool used to maintain/modify uml diagrams is an important factor when deciding whether to use uml in a software development process. there are different types of tools with different benefits: licensed tools (which imply an investment, but also a return through possible training, customizations, etc.) vs. open tools, and uml-specific modeling tools (which check syntax correctness) vs. general modeling tools (which are more "accessible"). uml was identified as the dominant notation in forward and lethbridge (2008). the authors found that uml modeling tools are primarily used for initial design, while uml is not widely used for code generation. the study participants seemed open to incorporating modeling into their processes. however, the difficulty of keeping models up to date with code changes is a significant depreciation factor (68% agreement on this in forward and lethbridge (2008)). the analysis performed in forward and lethbridge (2008) is particularly interesting, finding that programmers are more likely to agree that modeling tools are "heavy-weight." given this scenario, fernández-sáez et al.
(2018) point out that it would be desirable to have a tool that creates and maintains documentation containing a mix of text and diagrams, with features that improve traceability between model and text to avoid leaving the documentation and the model out of sync. it would also be useful to have a tool that supports diagram versioning matched to the system version, searching model elements, and presenting different views of the diagrams (for different consumers of diagram information). in addition, another point we noted is that most participants are not used to putting effort into upfront planning and design (such as modeling) when they attempt to tackle coding issues. figure 3. adoption-hindering factors (rq2). summary of rq2: the results show that (a) organizational culture represents a significant challenge to the adoption of uml models, since the adopted engineering practices and the culture of agility sometimes do not give room to modeling; we thus observe that modeling in agile processes follows a distinct pattern of uml use; (b) synchronization between uml artifacts makes uml difficult to use in highly collaborative software teams; and (c) the overall effort to develop and maintain models is high, and room for it is scarce in current organizational cultures. 4.4 rq3: what benefits are realized when using uml? figure 4 shows a summary of the collected data related to rq3. we asked three questions related to (a) whether using uml selectively (only a few diagrams) helps to minimize complexity and avoid problems of completeness and inconsistency between diagrams; (b) whether uml models are helpful during application integration discussions; and (c) whether uml helps to form a common system understanding among developers. figure 4(a) indicates that 39% fully agree, 39% partially agree, and 15% are neutral. figure 4(b) shows that 49% fully agree, 41% partially agree, and 7% are neutral. figure 4(c) reveals that 41% fully agree, 41% partially agree, and 11% are neutral. all twenty interviewees unanimously agreed that using uml benefits software development, as it helps in the general understanding of the system context, thus facilitating communication in the team. "the use of this language enables the understanding and discussion of the architecture of a project by the entire team and allows the representing more complex and difficult flows" (p17). "uml is a powerful language for understanding software at various levels of abstraction. when used properly it contributes to creating a better product. when used improperly (in a forced way) ends up consuming resources and not helping much. in short, diagrams should be used as a means to understand various aspects of the software to be developed and not as the end. the goal of development is software and not diagrams" (p9). these factors are also identified in ho-quang et al. (2017), where most participants (79%) found uml useful for understanding systems, improving communication between developers, guiding implementation, and managing project quality. interviewees also mentioned that uml could help with defect detection and with designing/implementing the integration of heterogeneous applications. however, inconsistent model interpretations can have serious consequences, especially when multiple and conflicting stakeholders are involved. for example, different interpretations between the development team, customers, and regulatory bodies can lead to rework, delays, and financial and legal repercussions.
this risk may be exacerbated because compliance verification is usually performed later in the software development process. consequently, any problem discovered in the compliance check (when applicable) is expensive to repair (usman et al., 2020). participants of petre (2014) reported using uml more enthusiastically when working in a more scope-focused manner, keeping the artifacts manageable in size and suitable for avoiding synchronization and consistency issues. the interest revolves around problem-solving or decision-making to avoid undue costs. one area that deserves further research is how the use of uml is shaped by the domain context, an investigation that requires much more access to a variety of software industries. this demonstrates that it is necessary to understand what actually facilitates effective software development. all this evidence highlights the need to consider the relationship of tools, including notation, both with the community of practice and with the application domain. participants reinforced that software developers are open to understanding the concepts and that, at the same time, they want to use tools that make the process effective; otherwise, they tend to discard tools that are at odds with their practices. figure 4. perceived benefits (rq3). summary of rq3: selectively using only a few uml diagrams helps minimize complexity and avoid problems of completeness and inconsistencies between diagrams. in participants' view, using uml is beneficial and can help avoid issues in the project, enabling better system understanding and assisting in integration discussions. 4.5 rq4: how often is uml used? figure 5 presents the participants' responses on the use of uml in their work. as the question was not mandatory, 365 of the 376 participants answered it. 74% answered that they do not use uml frequently, while 26% answered that they use uml quite often. this result reinforces findings in ozkaya and erata (2020), in which the authors report that 35 of the 50 subjects in the study do not use uml in practice. similarly, gorschek et al. (2014) found that practitioners do not frequently use uml; when they do, they do it informally, with minimal or no tool support, and the notation is not necessarily enforced to be uml. figure 5. frequency of use (rq4). the twenty interviewees stated that they did not use uml frequently. however, they acknowledged the various benefits of using it in software development. "i understand that uml has a very strong semantic power, which favors its use in the elaboration of architecture, as well as in the construction of the system" (p4). störrle (2017) pointed out the importance of understanding the ever-changing demands of the software industry, pointing to organizational and software development culture differences as potential factors influencing uml use. on the other hand, the results of ozkaya and erata (2020) show that the majority of professionals (88%) use uml in modeling their software systems from different architectural points of view. among the architectural views (i.e., functional, information, concurrency, development, deployment, and operational), the most popular are the functional and information views (96–99%). the operational point of view is the least popular, ignored by 61% of participants in their software modeling with uml.
studies (kobryn, 2002; dori, 2002; thomas, 2004) argue that uml is not fulfilling the role of a "lingua franca" or standard because of issues such as size, complexity, semantics, consistency, and model transformation. summary of rq4: the collected results show that uml modeling has low adherence in companies, although participants recognize the benefits of using uml models in software projects. these results are consistent with previous studies. 4.6 rq5: how does the context of software projects in companies limit the use of uml? figure 6 presents the collected data associated with rq5. we summarize three project-context issues that may affect uml use: (a) uml formalism (or lack thereof): would more formalism in uml lead developers to use it more frequently?; (b) whether practitioners can adapt uml use for a specific purpose; and (c) the fact that companies tend to develop relatively small software that undergoes continuous modification. figure 6. context of use (rq5). participants indicated that the high demand for software development may end up limiting the use of uml in practice. thus, developers start to keep design decisions "in mind" (or in informal communication channels) and communicate without any formal diagram. more formalism. regarding uml formalism, figure 6(a) shows that 28% are neutral, 27% partially agree, and 21% totally agree that more formalism would help uml use. of the 20 participants we interviewed, 15 consider that a high degree of formalism becomes a negative factor for the applicability of uml, since the processes are highly dynamic and agile, requiring less formal and more interactive use. the project contexts our interviewees were involved in are usually very dynamic and agile, thus leading to constant changes in design, documentation, and uml models when they exist. more formalism in the language may lead to higher effort in producing and maintaining up-to-date models in such dynamic and agile scenarios. therefore, even though some participants seem to understand the benefits of having more formalism in modeling languages (e.g., more code generation and model transformations), most of today's projects do not have enough resources to take on the high cost of creating and maintaining semantically rich models (with a higher degree of formalism). adaptation of use. figure 6(b) summarizes to what extent participants agree that uml use correlates with whether they can adapt it to their specific needs. the majority of the interviewees (12) pointed out that uml can be adapted to a specific purpose (e.g., a project domain, a specific section of the architecture, or a specific stakeholder's view), but this adaptation is complex due to factors such as (1) the high cost of ensuring that documents/models stay in sync with the code; (2) the difficulty of measuring the return on investment of adopting modeling practices; (3) uml use in legacy software; and (4) the fear of adopting process changes, especially when working with more mature teams that have grown without modeling practices. all of this leads us to believe that much research is still needed. continuous modification. figure 6(c) summarizes data on whether participants agree that the continuous-modification nature of relatively small to medium projects makes it difficult to use uml. that data also matches the interviewees' perceptions.
even when practitioners work on larger projects, they usually break them into smaller iterations (and sub-projects) where developers can get along without much modeling activity. although the study participants of petre (2014) mostly believe that uml is a "lingua franca" in companies and have theoretical knowledge about this type of modeling, they end up not using it frequently. the results of fernández-sáez et al. (2015) revealed that software developers using uml diagrams end up experiencing difficulties with reading them; therefore, most surveyed companies use the "most understandable" uml diagrams. maintainers do not always use the available documentation and often work directly with the source code; even when documentation with models is available, it is not typically used. summary of rq5: the project context matters. depending on the project and process, more or less formalism might help uml use. also, the ability to continuously update diagrams together with continuously changing code in specific projects is another influencing factor. finally, whether it is possible to adapt modeling practices to specific project needs affects uml use. 4.7 rq6: how do practitioners view uml modeling? figure 7 summarizes data regarding rq6. we explored three possible issues related to practitioners' views on adopting uml modeling. not interested in modeling. figure 7(a) shows that 41% totally agree, 33% partially agree, and 13% are neutral regarding the statement that developers are not interested in modeling tasks. additionally, out of the 20 participants interviewed, 13 stressed that developers like and understand the importance of modeling; however, the factors discussed in rq1 and rq2 limit its adoption. in petre (2013), uml is considered "unnecessarily complex" by several participants of that study, who reported variations in understanding and interpretation among developers, resulting in problems such as challenges in formal language semantics. others noted that the complexities of the notation limited its usefulness (or required targeted use) in discussions with stakeholders (including highly technical stakeholders). lack of modeling pattern. figure 7(b) indicates that 15% fully agree and 37% partially agree that there is a lack of modeling patterns and modeling guidance; in other words, the open-ended nature of uml makes it less attractive. according to the interviewees, this lack of guidance on creating models correctly and effectively prevents developers from using uml modeling. "not all project participants will understand modeling, there is no pattern. there are no people qualified to generate uml" (p5). hutchinson et al. (2011b,a) found that people use various modeling languages in projects following model-driven engineering (mde). companies using mde tend to develop domain-specific languages (dsls), which have a very product/implementation-focused notion. general model. figure 7(c) shows that 19% fully agree and 39% partially agree that the lack of a general diagram providing a system big picture with structural and behavioral elements makes uml adoption less attractive. most of the interviewed participants (16) reinforced the difficulty of modeling structural and behavioral aspects of complex software in a single "big picture view." fernández-sáez et al.
(2018) sought to provide a comprehensive and systematic view of the main challenges in software modeling and to understand their different categories, together with discussions of the concrete challenges professionals may face in each category. in their study, they raised eight different types of challenges, including managing the complexity of the language, extensive modeling languages, domain-specific modeling environments, developing formal modeling languages, analyzing models, separation of concerns, transforming models, and model management. figure 7. practitioner view (rq6). summary of rq6: most developers do not demonstrate an interest in modeling, which can be explained by crucial factors such as the absence of standard modeling guidance and the difficulty of bringing upfront design aspects into the software development lifecycle. new modeling approaches are required to facilitate modeling and bring developers closer to it, making the process simpler, more dynamic, and motivating. 5 additional discussion in section 5.1 we provide reflections and future directions based on the obtained results. section 5.2 discusses issues related to the adoption of continuous modeling. section 5.3 outlines some discussions on gamified software modeling as a way to enhance the adoption of uml models. section 5.4 discusses the need for new approaches to assess uml diagrams in the context of modeling education. section 5.5 draws implications from our findings. 5.1 summary of reflections time constraints and lack of knowledge. the study results point to time constraints as one of the main factors that affect the use of uml. although participants recognize the importance and benefits of creating uml diagrams, the short time allotted to projects leads professionals not to use uml or to use it in a limited manner. in addition, the lack of in-depth knowledge about uml diagrams is an impediment, since the cost of promoting proper understanding among people with different levels of education/experience and viewpoints is high. the ability to evaluate uml model quality is another considerable challenge. academic vision. one factor that interviewees consistently pointed out was the impression that uml tends to be more academic than industrial/practical and that new teaching approaches need to be adopted in academic programs that include uml in their curriculum. software engineering education with uml needs to be accompanied by real problems from the industry, which reinforces the findings from neto et al. (2021). regardless of whether uml is a dominant representation in practice, there is evidence that it plays an important role in software engineering teaching (petre, 2014). uml provides a common representation from which to direct the system design discussion and build a shared model of the problem. it provides a means of "model-based thinking" for students who do not yet have a repertoire of representations and reasoning tools. the typical use of uml in education introduces key concepts and gives attention and structure to student exploration and practical involvement with problems and design. one can argue that the value of uml in education lies in intellectual development rather than in mirroring industry practice. company culture and agility. from the responses we got, we identified that the culture of agility in companies conflicts with the use of uml.
preparing and maintaining uml diagrams are manual activities requiring knowledge and time. therefore, the popularity of informal modeling (e.g., whiteboard sketches) has grown as an attempt to improve collaboration and communication effectiveness. informal and lower-cost models (in the sense of being more straightforward and faster to draw) are also more flexible, since learning them is simpler. usually, working with the representation of abstractions (i.e., modeling) has not proved to be a popular choice in the context of an agility culture. delivering quickly (without major planning) and treating failures as a natural part of the process of arriving at the final software product have proved to be the priority. in this context, the multi-view modeling proposed by uml does not find application space, although it is recognized as important. selective use of diagrams and complexity. when asked about the benefits they perceived when using uml, most participants responded that using uml diagrams selectively (i.e., using only a few diagrams) helps to minimize complexity, avoids problems of inconsistency between diagrams, and helps form a common understanding between developers. this conclusion was also verified by dzidek et al. (2008). the generality and freedom that enable uml to meet this wide range of purposes are also the source of its weakness. uml has no formal semantics, which poses a problem when people use a uml model for different purposes. because one of uml's main objectives is to communicate software design, different ways of using uml are potential causes of communication problems (lange et al., 2006). 5.2 adoption of continuous modeling companies seek not only to streamline their processes but mainly to find continuity throughout the software development cycle (rubert and farias, 2022; chen, 2015; elazhary et al., 2021; chen, 2017; laukkanen et al., 2017; fitzgerald and stol, 2017). fitzgerald and stol (2017) argue that achieving flow and continuity throughout the software development cycle is, in the first instance, much more important than velocity. since companies increasingly prioritize continuous delivery practices (chen, 2015), to benefit from uml adoption (bucchiarone et al., 2021; chaudron et al., 2012; dzidek et al., 2008), companies must put effort into involving uml modeling practices throughout the software development cycle. however, this requires significant process changes, for example, augmenting the ci/cd pipeline, giving rise to continuous software modeling (which poses a significant challenge). technical challenges. robust uml modeling approaches, tools, and good practices that work out of the box and are highly adaptable to companies' realities are lacking. the absence of such an approach has led to the isolated, as opposed to continuous, adoption of uml models throughout the continuous delivery pipeline. modeling tools that fill this gap can bring the already documented benefits of uml models to the reality of companies, such as improved traceability between models and code so that documentation and modeling do not fall out of sync, not to mention potential resource savings. when building a continuous modeling platform, different tools and technologies can be used as building blocks for the continuous delivery pipeline (chen, 2015, 2017). however, companies should not be locked in by such tool suppliers.
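to make the pipeline idea concrete, here is a small sketch of a check a ci job could run; it is ours, not a tool from the literature, and the diagram path and the regeneration step are hypothetical placeholders. the job fails whenever the committed diagram source no longer matches the one regenerated from the code, the kind of continuous-modeling guardrail discussed above:

    import pathlib
    import sys

    def regenerate_diagram() -> str:
        """placeholder for a real code-to-diagram generator
        (e.g., the reflection-based sketch shown earlier)."""
        return "class Order {\n  +total: float\n}\n"

    def check_model_sync(path: str = "docs/order.puml") -> int:
        """return 0 if the committed diagram matches the regenerated one."""
        committed = pathlib.Path(path)
        if not committed.exists():
            print(f"missing diagram: {path}")
            return 1
        if committed.read_text() != regenerate_diagram():
            print(f"diagram out of sync with code: {path}")
            return 1  # a nonzero exit code fails the ci job
        return 0

    if __name__ == "__main__":
        sys.exit(check_model_sync())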
the scientific community should propose widely accepted modeling guidelines and good practices applicable to the organizational needs companies typically experience, define open apis (software modeling as a service), and build an ecosystem of tools for constructing a continuous software modeling pipeline. nowadays, software development iterations are short, aiming to deliver newly requested features rapidly and to establish a continuous feedback cycle. large, monolithic models therefore need to be characterized and rethought as feature-oriented uml models. modeling practices must fit iterative processes (with very short release cycles) that are typically driven by incremental feature development. rather than designing a colossal set of uml diagrams upfront, it is recommended that software design with uml follow the same iterative approach driven by incremental feature development, which may ease the adoption of software modeling in agile teams. this poses the significant challenge of implementing feature-oriented continuous modeling approaches, as well as producing empirical evidence about the advantages and disadvantages of adopting continuous uml modeling. solving this challenge will require close collaboration between researchers and practitioners and will bring the benefits of uml modeling to the reality of more companies.

process challenges. xavier et al. (2019) pointed out that people still associate uml modeling with traditional process practices (e.g., rup), while uml is not explicitly integrated with agile practices. our results indicate that agile teams tend not to adopt uml modeling. one of the participants reported: "if the preparation of uml models requires, for example, three days before it is ready for use by developers, this period will be responsible for much of the sprint time, for example" (p12). it is important to highlight that agile methodologies do not prohibit the use of uml; another participant states: "we work with scrum and with some uml diagrams, but few and only in the project phase. the system is giant to meet a bank's demands, there are many requests for functionality changes and improvements on the part of the customer and we usually fit the demands into weekly sprints" (p11). there are research gaps in the search for alternatives that align business processes, agile development practices, and uml modeling.

documentation and legacy monolithic systems. promoting large-system modeling practices without processes that support documentation has been a challenge for decades. there may also be a cultural tendency to assume that the status quo is the only possible path. the absence of design documentation complicates restructuring legacy monolithic systems into highly distributed systems, such as those following the microservice architecture. legacy systems typically have dozens of tightly coupled subsystems that interact to provide different services for internal and external customers within companies. fitzgerald and stol (2017) point out that, in the absence of documentation, only the tacit knowledge of the software engineers who work in different teams remains. modeling legacy systems by creating a "big picture" view is still hard to implement due to their size, usually hundreds of thousands of lines of code. continuous updates to these models can be very challenging. the multi-view modeling of uml allows updating complementary models, such as class diagrams and sequence diagrams.
this can lead to inconsistencies between such models (kretschmer et al., 2021; khelladi et al., 2019; reder and egyed, 2013).

5.3 gamification of modeling software

gamification can be defined as "the use of game design elements in non-game contexts" (deterding et al., 2011; huotari and hamari, 2017; liu et al., 2017). this technique uses the philosophy, elements, and mechanics of game design in non-game environments, aiming to bring all the positive aspects they provide. the current literature recognizes the benefits of applying gamification in software engineering practice. however, how to design and use gamification in the context of modeling applied to industrial needs is still an open question. as far as we know, only a few studies on the application of gamification in software engineering practices are available, most of which are related to broader contexts (porto et al., 2020; pedreira et al., 2015; ren et al., 2020).

due to the related theoretical and practical difficulties, learning to use the full potential of uml can be a complex task, which makes developers feel discouraged and less engaged over time. this scenario could lead, for example, to the development of incomplete, decontextualized, and poor-quality models. lange et al. (2006) reinforce that this issue brings potential risks of misinterpretation and miscommunication, thus reducing software quality. therefore, finding configurations that favor developer practices, generate engagement, and consequently yield increasingly effective uml models is one of the main challenges encountered in industry today. given this scenario, gamification emerges as a possible alternative to mitigate these problems, enhancing the adoption of uml, improving the models generated by developers, and contributing to high-quality software.

there is no clear and commonly accepted taxonomy of game elements (pedreira et al., 2015). shpakova et al. (2016) proposed a unified view of the different classifications, which summarizes gamification in three dimensions: components, mechanics, and dynamics. components are the basic building blocks of gamification. they represent the objects that users see and interact with, such as badges, levels, and points. mechanics define the game as a rules-based system, specifying how everything behaves and how the player can interact with the game. dynamics are the top level of gamification elements. they include all aspects of the game that cannot be implemented and managed directly and are related to users' emotional responses (e.g., progression, exploration). the success of gamifying a particular non-game context depends heavily on the gamification design choices for these three dimensions.

several research efforts have focused on identifying the phases that make up a gamification project (mora et al., 2015; webb, 2013). however, as with the taxonomy of gamification elements, there are no commonly accepted phases; they vary in number and in the terminology used. in software development, developers' performance concerning productivity or quality may relate to the number of artifacts developers produce and how good those artifacts are. however, while performance is often a quantitative and objective metric for assessing the impact of gamification on users' activities in the out-of-game context, in software development performance is usually subjective.
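to make the three dimensions more tangible, the toy sketch below encodes them in python. the element names, point values, and badge rule are invented for illustration and do not come from shpakova et al. (2016).

```python
# a toy encoding of the three gamification dimensions (components,
# mechanics, dynamics); all names and values here are illustrative.
from dataclasses import dataclass, field

@dataclass
class Component:           # what the user sees: badges, levels, points
    name: str
    points: int

@dataclass
class Mechanic:            # the rule system: how an action maps to a reward
    action: str
    reward: Component

@dataclass
class Player:
    name: str
    score: int = 0
    badges: list = field(default_factory=list)

    def perform(self, mechanic: Mechanic) -> None:
        # dynamics (progression, exploration) emerge from repeated interactions
        self.score += mechanic.reward.points
        if mechanic.reward.points >= 50:
            self.badges.append(mechanic.reward.name)

complete_diagram = Mechanic("submit a complete class diagram",
                            Component("modeler badge", 50))
dev = Player("p1")
dev.perform(complete_diagram)
print(dev.score, dev.badges)   # 50 ['modeler badge']
```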
this article conjectures that inserting gamification techniques, such as feedback, progress, and challenges, into software modeling could help mitigate the issues of adopting uml modeling. for example, the incompleteness of uml models is a critical problem (lange et al., 2006; fernández-sáez et al., 2018); gamification techniques, such as challenges, points, feedback, and progress, could motivate developers to create more complete models in exchange for points. a ranking system could be created to rank software teams by the quality of the models they create. in addition, constant feedback during model editing could foster learning and stimulate modeling. researchers can carry out empirical studies to analyze the integration between gamification and software modeling based on the factors mentioned in rq1 and rq2. that would increase the perception of benefits by practitioners (rq3) and the frequency of use (rq4). therefore, the use of gamification techniques can motivate developers, enhance the quality of the created uml models, and foster learning.

5.4 assessing and grading uml diagrams

before using uml models, practitioners need to learn the structural and behavioral diagrams available in uml. also, students (or practitioners under training) submit their diagrams for assessment and grading in educational/training contexts. university courses worldwide teach uml modeling, to some extent, as the standard language for modeling software. additionally, uml is still a well-known language when practitioners need to model software systems. moreover, universities are increasingly adopting a learning-by-doing approach and offering online classes with large numbers of students. in this context, students need to practice through hands-on exercises and real-world tasks. instructors must find an efficient mechanism to assess student projects and assignments fairly and equitably. in addition, assessments must enable rapid feedback and provide learners with instructions on how to overcome their deficiencies or limitations.

imagine that an instructor needs to train 120 people in geographically distributed teams. the instructor provides an exercise in which each learner needs to design 10 uml class diagrams. the instructor then needs to provide feedback on the 1,200 uml class diagrams two days after delivery. the short time to evaluate such a high number of diagrams makes the teaching and learning process difficult. therefore, the manual assessment of uml models proves to be a very costly and subjective activity, creating friction in the practice-assessment-learning feedback loop involving students and instructors. this reality is not exclusive to universities; on the contrary, it is found anywhere the teaching-learning cycle of uml models needs to happen quickly and with a relatively high number of learners.

some tools and approaches (vesin et al., 2018; bian et al., 2019; stikkolorum et al., 2019) have been proposed in recent years. for example, sdmetrics6 presents a set of metrics for uml models but does not compute the difference between the rubric and the uml model created by the learner. the modelguru approach7 goes a little beyond sdmetrics by computing students' grades using object-oriented measures of design size, coupling, and complexity. vesin et al. (2018) came up with a new integrated tool to support the evaluation of uml models produced by students. bian et al.
(2019) introduced a grading process based on syntactic, semantic, and structural matching that computes grades by comparing students' models with the desired model. in a different approach, stikkolorum et al. (2019) presented an exploratory study on machine learning for grading uml diagrams. however, a streamlined approach for grading uml diagrams based on syntactic, semantic, and structural criteria is still lacking. the use of machine learning has also emerged as a trend and a new avenue to be explored. lastly, we outline the need for the scientific community to pursue three objectives (farias and silva, 2020): (1) provide a tool to streamline the process of managing rubrics for grading uml diagrams; (2) allow students to get faster, more objective, and itemized feedback on their submissions; and (3) ultimately, enhance the practicing-grading-learning feedback loop associated with designing uml diagrams.

5.5 practical implication

when software development teams constantly change source code and revise uml models to keep them up to date, the effort engineers put in can make the difference between adopting uml models throughout the development process and abandoning them. from our findings, updating and synchronizing models with source code appears to be one of the major impediments to the broader use of uml modeling. rather than being easy and intuitive, model update and synchronization are pointed out by study participants as a highly time-consuming and error-prone process.

6 sdmetrics: https://www.sdmetrics.com/
7 modelguru: http://modelguru.snotra.com.br/

still, the need to update and synchronize uml models attracts the spotlight as organizations increasingly adopt devops and agile practices in globally distributed development teams. therefore, updating and synchronizing ("upsync") uml models with source code emerges as a critical requirement for leveraging uml adoption in real-world settings. the ability to upsync uml models can be seen as the means by which modern development teams (versed in devops and agile practices) update uml structural and behavioral models to accommodate new design decisions or requirement changes. we conjecture that the better the upsync, the better the quality of the source code. this paves the way for the scientific community to propose friendly round-trip engineering approaches (existing uml models can be transformed into source code and then converted back) integrated with the development environments used by development teams. from that perspective, updating and synchronizing models helps improve the software system under maintenance. previous empirical studies (dzidek et al., 2008) have shown that using uml models improves source code quality and reduces bugs. for this, not only robust round-trip engineering approaches are needed, but also improvements that span the agile development process as a whole. for example, scrum-based development processes can have automated tasks at the end of each sprint to update and synchronize uml models, as sketched below.

practical research implication: our findings highlight that the adoption of uml modeling in practice is hampered by the difficulty of updating and synchronizing models with the source code. currently, development processes adopt source code as the primary artifact, demanding that new cost-effective updating and synchronization approaches be proposed.
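a minimal sketch of such a sprint-end task follows, assuming pyreverse (the diagram generator shipped with pylint) is installed and that the package name "myapp" is a placeholder; this is one possible automation, not a prescription from the survey.

```python
# a minimal sketch of a sprint-end automation task that regenerates
# structural uml models from code; assumes pyreverse (from pylint) is
# installed and that "myapp" is the (placeholder) package under study.
import subprocess

def regenerate_class_diagrams(package: str = "myapp") -> None:
    """re-derive class and package diagrams so models track the latest code."""
    subprocess.run(
        ["pyreverse", "-o", "png", "-p", package, package],
        check=True,  # fail loudly so a broken model build blocks the sprint close
    )

if __name__ == "__main__":
    regenerate_class_diagrams()
```

a real setup would also commit the regenerated diagrams and diff them against the versioned models, so divergence between design and code surfaces at every sprint boundary.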
although upsync models sound like a promising trend, the scientific community needs to evaluate future proposed techniques and carry out empirical studies to investigate their impact on the quality of uml models and source code, as well as on practitioners' satisfaction in real-world settings.

6 threats to validity

this section discusses the possible threats to the study's validity.

internal validity. internal validity is related to issues that may affect the causal relationship between treatment and outcome. threats to internal validity include instrumentation and selection threats. the main points affecting our study's internal validity refer to the participants' profiles and experiences. when analyzing the profile of the participants, as presented in section 4.1, around 30% of them have low (up to 4 years) general experience, low experience with software modeling, and low experience with software development. this is probably because the education level of about half of that group is low compared to the others, or because they are still pursuing undergraduate degrees. in addition, many of these participants may not yet have studied uml in their undergraduate programs. also, the survey question offered no option corresponding to 2 to 3 years of experience. however, given the sample size and the complementary interviews we conducted, we believe the collected data are not affected by this threat. another internal threat is linked to the random process of selecting participants for the interviews, which may have caused a similarity in the profiles of the interviewed participants; such selection bias could threaten the validity of our conclusions. although the interviewed participants work on software development in the fields of education, agribusiness, e-commerce, government, trade, product export, and finance, we recognize that the qualitative data could be further explored with greater participation of professionals from other sectors. still, the survey participants cover a wider variety of sectors.

external validity. external validity concerns the ability to generalize the results beyond the actual study. although the demographic data of our sample are diversified, generalizing the results to the entire population may not be adequate. in our study, participants were geographically diverse and worked in companies of different domains and sizes. however, we cannot be sure that this sample is representative of the sector in general. we understand that these threats are always present in industrial research.

reliability. reliability focuses on the replicability of results by other researchers. this study provides a repository with the collected data and the online form, both freely accessible.

7 conclusions and future work

this article presented an exploratory survey on how practitioners have used uml modeling in the brazilian industry. in total, 376 employees from 210 information technology companies answered an online questionnaire about the factors affecting use, difficulty and frequency of use, perceived benefits, and contextual factors that prevent the adoption of uml models. in addition, we interviewed 20 randomly chosen participants from the survey pool using a semi-structured interview protocol as a follow-up investigation to triangulate with the survey data.
in summary, the results show that 74.8% of the participants do not use uml frequently. participants who reported not using uml models attributed this to factors such as continuous delivery practices, time constraints, lack of knowledge about modeling, company culture, and the ever-present difficulty of keeping the models up to date and synchronized with each other and with the source code. the results of this research reinforce evidence already found in the literature concerning the use of uml (gorschek et al., 2014; petre, 2014): in general, most people know uml but do not use it in their projects. these results can help professionals understand how to invest to avoid increased development spending and provide a foundation to motivate software developers to design uml diagrams throughout development cycles, which would facilitate, for example, maintenance tasks. future work should focus on investigating more aspects of uml practice, such as the possibilities of using uml in agile teams/organizations, whether teaching methodologies in academia influence practices in the software industry, and how gamification can be applied to software modeling practices. finally, we hope that the issues outlined throughout the article will encourage other researchers to replicate our study in different circumstances and that this work represents a solid step in a more ambitious agenda to improve software engineering practices.

references

akdur, d., demirörs, o., and garousi, v. (2017). characterizing the development and usage of diagrams in embedded software systems. in 2017 43rd euromicro conference on software engineering and advanced applications (seaa), pages 167–175. ieee.
akdur, d., say, b., and demirörs, o. (2021). modeling cultures of the embedded software industry: feedback from the field. software and systems modeling, 20(2):447–467.
bian, w., alam, o., and kienzle, j. (2019). automated grading of class diagrams. in 2019 acm/ieee 22nd international conference on model driven engineering languages and systems companion (models-c), pages 700–709. ieee.
böhm, w., junker, m., vogelsang, a., teufl, s., pinger, r., and rahn, k. (2014). a formal systems engineering approach in practice: an experience report. in proceedings of the 1st international workshop on software engineering research and industrial practices, pages 34–41.
bucchiarone, a., ciccozzi, f., lambers, l., pierantonio, a., tichy, m., tisi, m., wortmann, a., and zaytsev, v. (2021). what is the future of modeling? ieee software, 38(2):119–127.
chaudron, m. r., heijstek, w., and nugroho, a. (2012). how effective is uml modeling? software & systems modeling, 11(4):571–580.
chen, l. (2015). continuous delivery: huge benefits, but challenges too. ieee software, 32(2):50–54.
chen, l. (2017). continuous delivery: overcoming adoption challenges. journal of systems and software, 128:72–86.
cicchetti, a., ciccozzi, f., and carlson, j. (2016). software evolution management: industrial practices. in me@models, pages 8–13. citeseer.
ciccozzi, f., malavolta, i., and selic, b. (2019). execution of uml models: a systematic review of research and practice. software & systems modeling, 18(3):2313–2360.
deterding, s., dixon, d., khaled, r., and nacke, l. (2011). from game design elements to gamefulness: defining "gamification". in 15th int. academic mindtrek conference: envisioning future media environments, pages 9–15.
dori, d. (2002).
why significant uml change is unlikely. communications of the acm, 45(11):82–85.
dzidek, w. j., arisholm, e., and briand, l. c. (2008). a realistic empirical evaluation of the costs and benefits of uml in software maintenance. ieee transactions on software engineering, 34(3):407–432.
elazhary, o., werner, c., li, z. s., lowlind, d., ernst, n. a., and storey, m.-a. (2021). uncovering the benefits and challenges of continuous integration practices. ieee transactions on software engineering.
farias, k., garcia, a., whittle, j., von flach garcia chavez, c., and lucena, c. (2015). evaluating the effort of composing design models: a controlled experiment. software & systems modeling, 14(4):1349–1365.
farias, k., gonçales, l., bischoff, v., da silva, b. c., guimarães, e. t., and nogle, j. (2018). on the uml use in the brazilian industry: a state of the practice survey (s). in seke, pages 372–371.
farias, k. and silva, b. c. d. (2020). what's the grade of your diagram? towards a streamlined approach for grading uml diagrams. in 23rd acm/ieee international conference on model driven engineering languages and systems: companion proceedings, pages 1–2.
fernández-sáez, a. m., caivano, d., genero, m., and chaudron, m. r. (2015). on the use of uml documentation in software maintenance: results from a survey in industry. in 2015 acm/ieee 18th int. conf. on model driven engineering languages and systems (models), pages 292–301. ieee.
fernández-sáez, a. m., chaudron, m. r., and genero, m. (2018). an industrial case study on the use of uml in software maintenance and its perceived benefits and hurdles. empirical software engineering, 23(6):3281–3345.
fitzgerald, b. and stol, k.-j. (2017). continuous software engineering: a roadmap and agenda. journal of systems and software, 123:176–189.
forward, a. and lethbridge, t. c. (2008). problems and opportunities for model-centric versus code-centric software development: a survey of software professionals. in proceedings of the 2008 international workshop on models in software engineering, pages 27–32.
gorschek, t., tempero, e., and angelis, l. (2014). on the use of software design models in software development practice: an empirical investigation. journal of systems and software, 95:176–193.
heldal, r., pelliccione, p., eliasson, u., lantz, j., derehag, j., and whittle, j. (2016). descriptive vs prescriptive models in industry. in proceedings of the acm/ieee 19th international conference on model driven engineering languages and systems, pages 216–226.
ho-quang, t., hebig, r., robles, g., chaudron, m. r., and fernandez, m. a. (2017). practices and perceptions of uml use in open source projects. in 39th icse: software engineering in practice track, pages 203–212. ieee.
huotari, k. and hamari, j. (2017). a definition for gamification: anchoring gamification in the service marketing literature. electronic markets, 27(1):21–31.
hutchinson, j., rouncefield, m., and whittle, j. (2011a). model-driven engineering practices in industry. in proceedings of the 33rd international conference on software engineering, pages 633–642.
hutchinson, j., whittle, j., rouncefield, m., and kristoffersen, s. (2011b). empirical assessment of mde in industry. in proceedings of the 33rd international conference on software engineering, pages 471–480.
jackson, d. (2019). alloy: a language and tool for exploring software designs. commun. acm, 62(9):66–76.
júnior, e., farias, k., and silva, b. (2021).
a survey on the use of uml in the brazilian industry. in brazilian symposium on software engineering, pages 275–284.
khelladi, d. e., kretschmer, r., and egyed, a. (2019). detecting and exploring side effects when repairing model inconsistencies. in 12th acm int. conf. on software language engineering, pages 113–126.
kitchenham, b. a. and pfleeger, s. l. (2008). personal opinion surveys. in guide to advanced empirical software engineering, pages 63–92. springer.
kobryn, c. (2002). will uml 2.0 be agile or awkward? communications of the acm, 45(1):107–110.
kretschmer, r., khelladi, d. e., lopez-herrejon, r. e., and egyed, a. (2021). consistent change propagation within models. software and systems modeling, 20(2):539–555.
kuhn, a., murphy, g. c., and thompson, c. a. (2012). an exploratory study of forces and frictions affecting large-scale model-driven development. in int. conf. on model driven engineering languages and systems, pages 352–367. springer.
lange, c. f., chaudron, m. r., and muskens, j. (2006). in practice: uml software architecture and design description. ieee software, 23(2):40–46.
laukkanen, e., itkonen, j., and lassenius, c. (2017). problems, causes and solutions when adopting continuous delivery—a systematic literature review. information and software technology, 82:55–79.
liebel, g., marko, n., tichy, m., leitner, a., and hansson, j. (2018). model-based engineering in the embedded systems domain: an industrial survey on the state-of-practice. software & systems modeling, 17(1):91–113.
liu, d., santhanam, r., and webster, j. (2017). toward meaningful engagement: a framework for design and research of gamified information systems. mis quarterly, 41(4).
mora, a., riera, d., gonzalez, c., and arnedo-moreno, j. (2015). a literature review of gamification design frameworks. in 2015 7th international conference on games and virtual worlds for serious applications (vs-games), pages 1–8. ieee.
neto, j. c., bento, l. h. t. c., oliveirajr, e., and souza, s. d. r. s. (2021). are we teaching uml according to what it companies need? a survey on the são carlos-sp region. in anais do simpósio brasileiro de educação em computação, pages 34–43. sbc.
omg (2017). uml: infrastructure specification. https://www.omg.org/spec/uml/2.5.1/pdf.
ozkaya, m. and erata, f. (2020). a survey on the practical use of uml for different software architecture viewpoints. information and software technology, 121:106275.
pedreira, o., garcía, f., brisaboa, n., and piattini, m. (2015). gamification in software engineering–a systematic mapping. information and software technology, 57:157–168.
petre, m. (2013). uml in practice. in 2013 35th international conference on software engineering (icse), pages 722–731. ieee.
petre, m. (2014). no shit or oh, shit!: responses to observations on the use of uml in professional practice. software & systems modeling, 13(4):1225–1235.
porto, d., jesus, g., ferrari, f., and fabbri, s. (2020). initiatives and challenges of using gamification in software engineering: a systematic mapping. arxiv preprint arxiv:2011.07115.
reder, a. and egyed, a. (2013). determining the cause of a design model inconsistency. ieee transac. on software engineering, 39(11):1531–1548.
ren, w., barrett, s., and das, s. (2020). toward gamification to software engineering and contribution of software engineer. in 4th int. conf. on management engineering, software engineering and service sciences, pages 1–5.
rubert, m. and farias, k. (2022).
on the effects of continuous delivery on code quality: a case study in industry. computer standards & interfaces, 81:103588.
scanniello, g., gravino, c., genero, m., cruz-lemus, j., and tortora, g. (2014). on the impact of uml analysis models on source-code comprehensibility and modifiability. acm tosem, 23(2):1–26.
shpakova, a., dörfler, v., and macbryde, j. (2016). gamification and innovation: a mutually beneficial union. in bam 2016: 30th annual conference of the british academy of management.
stikkolorum, d. r., van der putten, p., sperandio, c., and chaudron, m. (2019). towards automated grading of uml class diagrams with machine learning. in bnaic/benelearn.
störrle, h. (2017). how are conceptual models used in industrial software development? a descriptive survey. in 21st int. conf. on evaluation and assessment in software engineering, pages 160–169.
thomas, d. (2004). mda: revenge of the modelers or uml utopia? ieee software, 21(3):15–17.
usman, m., felderer, m., unterkalmsteiner, m., klotins, e., mendez, d., and alégroth, e. (2020). compliance requirements in large-scale software development: an industrial case study. in int. conf. on product-focused software process improvement, pages 385–401. springer.
vesin, b., klašnja-milićević, a., mangaroska, k., ivanović, m., jolak, r., stikkolorum, d., and chaudron, m. (2018). web-based educational ecosystem for automatization of teaching process and assessment of students. in proceedings of the 8th international conference on web intelligence, mining and semantics, pages 1–9.
webb, e. n. (2013). gamification: when it works, when it doesn't. in international conference of design, user experience, and usability, pages 608–614. springer.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer science & business media.
xavier, a., martins, f., pimentel, r., and carvalho, d. (2019). aplicação da uml no contexto das metodologias ágeis. in anais do vi encontro nacional de computação dos institutos federais. sbc.

journal of software engineering research and development, 2023, 11:8, doi: 10.5753/jserd.2023.2671 this work is licensed under a creative commons attribution 4.0 international license.
identification and management of technical debt: a systematic mapping study update

maría isabel murillo [ university of costa rica | maria.murilloquintana@ucr.ac.cr ]
gustavo lópez [ university of costa rica | gustavo.lopezherrera@ucr.ac.cr ]
rodrigo spínola [ salvador university | rodrigo.spinola@unifacs.br ]
julio guzmán [ university of costa rica | julio.guzman@ucr.ac.cr ]
nicolli rios [ federal university of rio de janeiro | nicolli@cos.ufrj.br ]
alexia pacheco [ university of costa rica | alexia.pacheco@ucr.ac.cr ]

abstract

technical debt is a concept used to describe the lack of good practices during software development, leading to several problems and costs. identification and management strategies can help reduce these difficulties. in a previous study, alves et al. (2016) analyzed the research landscape of such strategies from 2010 to 2014. this paper replicates and updates their study to explore the evolution of the technical debt identification and management research landscape over a decade, covering literature from 2010 until 2022. we analyzed 117 papers from the acm digital library, ieee xplore, science direct, and springer link. newly suggested strategies include automatically identifying admitted debt in comments, commits, and source code. between 2015 and 2022, more empirical evaluations have been performed, and the general research focus has shifted to a more holistic approach. therefore, the research area has evolved and reached a new level of maturity compared to the previous results from alves et al. (2016). not only are code aspects considered for technical debt, but other aspects have also been investigated (e.g., models for the development process).

keywords: technical debt management, technical debt identification, software development process.

1 introduction

technical debt (td) is the consequence of taking shortcuts during the software development process, providing short-term benefits but potentially bringing more difficulties and costs in later stages (izurieta et al., 2012). when developers take these shortcuts, deficiencies may be introduced. the cost of fixing previous work increases as development continues because correcting the defects becomes more complex when technical debt is not paid in time (akbarinasaji & bener, 2016). the interest is the additional cost that may have to be assumed because of the delayed payment. the principal, on the other hand, is the amount over which interest is paid; in technical debt, the principal is the original cost of fixing the software (ampatzoglou et al., 2015). when developers cannot pay the existing technical debt, bankruptcy may occur (akbarinasaji & bener, 2016).

several activities can help manage debt during the software development process. management activities may include measuring, prioritizing, preventing, monitoring, documenting, communicating, and paying the debt (li et al., 2015). the purpose of performing these actions is to avoid major problems that may lead to significant consequences, such as the failure of software projects. management strategies can help determine the appropriate time to pay the debt before the interest becomes very costly. consequently, it is possible to make faster deliveries in a controlled manner (freire et al., 2020). strategies even make it possible to recognize whether the debt needs to be paid at all: there may be cases in which there is no need to pay it, for example, when it is certain that a module will not change in the future (guo et al., 2014).
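a back-of-the-envelope reading of the principal/interest metaphor can make the repayment decision concrete; the numbers in the sketch below are invented purely for illustration.

```python
# a toy cost model of the principal/interest metaphor; all figures are
# invented for illustration and do not come from the cited studies.
def cost_if_deferred(principal: float, interest_per_iteration: float,
                     iterations: int) -> float:
    """total repair cost after deferring payment: principal plus accrued interest."""
    return principal + interest_per_iteration * iterations

principal = 10.0   # person-days to fix the shortcut today
interest = 0.5     # extra person-days each iteration the debt lingers
print(cost_if_deferred(principal, interest, 0))    # 10.0 -> pay now
print(cost_if_deferred(principal, interest, 20))   # 20.0 -> waiting doubled the cost
# if the module will never change again, interest never accrues, which is why
# some debt may legitimately remain unpaid (guo et al., 2014).
```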
technical debt management is complex because there may be uncertainty during software development. also, many factors must be considered for its management, such as present and future costs, as well as the implied risks (guo et al., 2014). the identification of td comprises the activities or actions taken to detect the presence of debt in software artifacts. technical debt identification is the first step toward managing it and avoiding its possibly high costs (guo et al., 2014). in short, td identification is essential to prevent the unwanted consequences of debt.

alves et al. (2016) investigated the technical debt identification and management landscape between 2010 and 2014 by analyzing 100 primary studies. they found that strategies mainly addressed types of technical debt associated with source code. nevertheless, few empirical evaluations demonstrated the proposals' actual benefits, limitations, and applicability. alves et al. (2016) also presented an initial taxonomy of technical debt types and a list of indicators for their identification. in that study, td management was understood as the activities that follow its identification. the findings of alves et al. (2016) provided valuable contributions for both researchers and practitioners and characterized the state of the art in the research area.

in this paper, we update the mapping study of alves et al. (2016) to find proposals made between 2015 and 2022 for managing and identifying technical debt. additionally, this paper provides a comparison with the previous results obtained by alves et al. (2016) and an analysis of the research landscape spanning more than a decade. keeping the results updated is essential because it helps to understand the evolution of the research topic and new findings (nepomuceno & soares, 2019). we applied the same research questions, search string, search strategy, sources, and inclusion and exclusion criteria as the previous systematic mapping. we answer the same research questions (without adaptations), since changing them could make this a new mapping instead of an update (nepomuceno & soares, 2019). likewise, we considered td management as the activities following debt identification, to remain conceptually consistent. the main difference in the protocol is the time-frame delimitation, since our update only includes papers published after the original study's inclusion period. we also provide a more detailed definition than the original study of what we considered "general" technical debt papers for their classification. furthermore, two of the original authors assisted in the update process to ensure compatibility between the concepts, procedures, and results of both systematic studies.

the high-level research question we aim to answer is:

• what strategies have been proposed to identify or manage technical debt in software projects?

similarly, the complementary research questions are:

• rq1. what are the types of technical debt found in the literature?
• rq2. what are the strategies proposed to identify technical debt?
o rq2.1. which empirical evaluations have been performed?
o rq2.2. which artifacts and data sources have been proposed to identify technical debt?
o rq2.3. which software visualization techniques have been proposed to identify technical debt?
• rq3.
what strategies have been proposed for the management of technical debt?
o rq3.1. which empirical evaluations have been performed?
o rq3.2. which software visualization techniques have been proposed to manage technical debt?

we analyzed 117 primary studies and identified new proposals and indicators published between 2015 and 2022. empirical evaluations in the analyzed papers include case studies, controlled experiments, and action research, but more evaluations are still required. also, we found that technical debt visualization is still an area that researchers have not extensively studied. this is a relevant finding since visualization techniques may aid decision-making for td. this paper's results benefit researchers since we provide knowledge about the state of the art and open problems that constitute future research opportunities. it is also helpful to practitioners since we present identification, management, and visualization strategies applicable to software projects to prevent the unwanted consequences of technical debt. future research opportunities include investigating new ways to use developers' knowledge about debt (not only through commits and comments) and exploring new strategies with a less technical approach (such as incentives and td guilds). moreover, analyzing the applicability of strategies in different contexts, such as public or private organizations, is a future research opportunity.

the structure of this article is as follows. section 2 describes previous literature related to this work. section 3 presents the methodology used to perform the literature review. section 4 presents the results obtained. section 5 includes the discussion, section 6 the threats to validity, and section 7 the conclusions.

2 related work

this section presents previous works by other authors, particularly those that have addressed the identification and management of technical debt during software development. table 1 gives an overview of these authors' contributions.

rios et al. performed a tertiary study to identify the state of the art regarding technical debt between 2012 and 2018 (rios et al., 2018). the authors studied the understanding of the technical debt concept and the research efforts on its identification and management. they found nine secondary studies about technical debt management and two regarding its identification. until 2018, there was little knowledge about the benefits and limitations of the proposed management strategies and indicators.

another systematic mapping studied the concept of technical debt and its management activities and tools (li et al., 2015). the authors analyzed 94 studies published between 1992 and 2013 and identified activities and tools for technical debt. some of the mentioned tools are checkstyle, debtflag, sonarqube, codevizard, and findbugs. likewise, the activities include code analysis, cost categorization, calculation models, code metrics, and the portfolio approach. also, the authors proposed a classification of technical debt types and argued that more literature is needed about what should not be considered technical debt.

the work of fernández-sánchez et al. consisted of a systematic mapping to identify the elements to consider for the management of technical debt, based on the literature until 2015 (fernández-sánchez et al., 2017). the authors identified the main aspects of technical debt management.
they found that the business organizational perspective has not been addressed much in the literature, while research has focused more on the technical point of view.

another systematic literature review focused on technical debt in the digital government area (nielsen et al., 2020). this paper aimed to discover which fields of technical debt management are being studied and the focus of the performed research. the authors analyzed 31 pieces, a third of which proposed a tool, method, technique, or model for technical debt management. the authors found several gaps, including a lack of research on the public sector and a limited abstraction level of the analyses. they conclude that technical debt management is mainly studied either in open software projects or in the private sector.

macit et al. performed a systematic mapping study regarding methods for identifying architectural debt, based on the analysis of 28 papers published between 2011 and 2020 (macit et al., 2020). the authors mention that architectural debt identification has been increasingly investigated in recent years; code mining and expert opinion are common methods.

alfayez et al. performed a systematic literature review on technical debt prioritization (alfayez et al., 2020). the authors aimed to identify the current prioritization approaches, the decision factors, and the artifacts on which these approaches are based. a total of 23 papers published between 1992 and 2018 were analyzed. as a result, 24 strategies were found for technical debt prioritization. these approaches mainly addressed code, general, and design technical debt.

lenarduzzi et al. performed a literature review regarding strategies and tools for technical debt prioritization (lenarduzzi et al., 2021). in this study, they analyzed 44 primary studies published until 2020. code, architecture, and design were the most frequent types of technical debt addressed. the authors found a lack of consensus on the factors to consider when prioritizing and measuring technical debt. they also show a lack of validated and reliable tools for technical debt prioritization.

alves et al. present a systematic mapping regarding technical debt identification and management (alves et al., 2016). in that study, the authors analyzed 100 primary studies, discussed a taxonomy of td types, presented a list of the strategies found in the literature, and created a list of indicators that can help identify technical debt.

table 1. contributions by other authors.

li et al., 2015 | technical debt management | 1992–2013
• analyzed the technical debt concept in 94 existing research efforts.
• proposed a classification of ten technical debt types.
• identified the quality attributes compromised by technical debt.
• determined activities and tools for technical debt management.

alves et al., 2016 | technical debt identification and management | 2010–2014
• analyzed 100 papers and determined a classification for technical debt types.
• listed strategies to identify or manage technical debt.
• determined the empirical evaluations, artifacts, and data sources cited in the literature for technical debt identification and management.

fernández-sánchez et al., 2017 | elements to manage technical debt | 2010–2015
• provided a taxonomy of elements for technical debt management by analyzing 63 papers.
• identified the proposed methods and techniques to manage technical debt.
• analyzed technical debt management elements from the perspective of stakeholders.

rios et al., 2018 | technical debt | 2012–2018
• studied 13 secondary studies and their td research topics.
• proposed a taxonomy of technical debt types.
• identified activities, strategies, and tools to support technical debt management.

nielsen et al., 2020 | technical debt management in digital government | 2017–2020
• analyzed 31 papers about technical debt management research in the public sector.
• determined a research agenda for the digital government area.

alfayez et al., 2020 | technical debt prioritization | 1992–2018
• identified approaches and decision factors for technical debt prioritization by studying 23 papers.
• analyzed the type of human involvement and artifacts needed for technical debt prioritization.

lenarduzzi et al., 2021 | technical debt prioritization | 2011–2020
• determined the prioritization strategies for technical debt by analyzing 44 primary studies.
• analyzed factors and measures considered for technical debt prioritization.
• identified tools for technical debt prioritization.

three previous studies analyzed and proposed a classification of technical debt types (alves et al., 2016; li et al., 2015; rios et al., 2018). however, there is still no consensus on these taxonomies. this paper does not aim to provide a consensus but to find out whether new td types have been mentioned recently and should be considered for new or refined taxonomies. more recent efforts in the research area were those made by nielsen et al. (2020) and lenarduzzi et al. (2021); their studies focused on technical debt management in the digital government area and on technical debt prioritization, respectively. this paper focuses on technical debt identification and management, a related but different scope. alves et al. (2016) and li et al. (2015) made previous efforts specifically about technical debt identification or management; however, they analyzed literature published between 1992 and 2014. our study aims to replicate and update the work of alves et al. (2016) and to integrate the obtained results by including papers published between 2015 and 2022. this delimitation is the main difference between this work and previous contributions made by other authors.

the relevance of performing this study is justified by applying the framework proposed by mendes et al. (2020) for updating systematic literature reviews:

• does the previous study still address a current question? the high-level research question of this paper is: what strategies have been proposed to identify or manage technical debt in software projects? any software may contain technical debt issues, regardless of the developing company's size or resources. the consortium for information and software quality (cisq) reports that, in 2022, the cost of poor software quality in the us was at least $2.41 trillion and the accumulated technical debt principal was about $1.52 trillion (consortium for information & software quality, 2022). therefore, td remains an expensive issue. identifying and managing td is still a major problem in the software development industry. thus, investigating these topics is relevant for both practitioners and researchers.
• has the previous study had good access or use? the work of alves et al.
(2016) was published in the information and software technology journal and is fully available through the science direct library. by march 2023, the paper had 589 reads and 238 citations (according to researchgate metrics). thus, the previous study has good access and use.
• has the previous study used valid methods and was it well conducted? alves et al. (2016) based their methods on the standard process for conducting systematic mapping studies by petersen et al. (2008). they provide a full explanation of the study implementation (research questions, search strategy, selection criteria, and classification scheme). the study presents a clear view of each step's outcome. hence, it provides sufficient details and data to replicate the procedures. moreover, two of the original authors participated in the update process.
• are there any new relevant studies, methods, or new information? research on technical debt is constantly being published in different venues, such as conferences and journals. for example, the international conference on technical debt (techdebt) has been held annually since 2018, two years after the publication of the previous study. consequently, there is plenty of new work on td.
• will the inclusion of new studies/information/data change the findings, conclusions, or credibility? since the publication of the previous study in 2016, the concepts and focus of the td research area have evolved. in this paper, we discuss these changes in detail. one of the most important aspects is the increase in research efforts that address technical debt management from a more holistic perspective.

by updating the previous study by alves et al. (2016), we provide the following contributions:

• an analysis of the technical debt identification and management research landscape between 2010 and 2022;
• an analysis of the previously proposed technical debt types and identification of new potential types recently mentioned in the literature;
• a list of the strategies or techniques for technical debt identification, management, and visualization;
• an analysis of the empirical evaluations of the proposed methods, including the artifacts, programming languages, and data sources used;
• a discussion of technical debt concepts and their evolution from 2010 to 2022.

the contributions presented herein provide insights to both practitioners and researchers regarding the most recent proposals for identifying and managing technical debt. this may help further industry application of new proposals and reveal new research opportunities. the following section presents the methodology applied to perform this study.

3 research method

this paper aims to analyze the research landscape on technical debt identification and management. this section details the search strategy, study selection process, and synthesis methods.

3.1 research questions

in this section, we present the rationale and importance of the research questions.

• rq1. what are the types of technical debt found in the literature? this question aims to determine whether the literature describes new technical debt types different from
we aim to analyze the evolution of these concepts between 2010 and 2022 by integrating our results and those provided by alves et al. (2016). however, the intent of this paper is not to establish a consensus but to find out the td types mentioned in the literature. • rq2. what are the strategies proposed to identify technical debt? this research question aims to determine new artifacts or data sources mentioned in the literature. also, we aim to know which artifacts and data sources are the most cited. this allows us to determine trends or changes in recent years. we also aim to analyze the empirical evaluations of previously mentioned strategies since alves et al. (2016) describe the need for more assessments to determine the applicability of such strategies. visualization techniques for technical debt identification are also crucial because they may help communication between developers and stakeholders, affecting decisionmaking during software development. • rq3. what strategies have been proposed for the management of technical debt? this research question aims to determine the strategies for technical debt management and how they have been empirically tested to determine their applicability. also, we aim to identify the visualization techniques proposed for technical debt management. alves et al. (2016) found few visualization strategies. this study analyzes how this specific research topic has evolved from 2010 to 2022. 3.2 search strategy we retrieved papers from the databases acm digital library, ieee xplore, science direct, and springer link. we also consulted engineering village, scopus, citeseer, and dblp, but no papers were included from these libraries. since this paper updates the previous work of alves et al. (2016), we used the same search string: (“technical debt”) and (“software”) this search string was used in all the sources, restricting the results to publications between 2015 and 2022. 3.2.1 inclusion criteria we considered papers that met the following inclusion criteria: • address the identification or management of technical debt in the context of software development; • explain one or more strategies, techniques, or activities for identifying or managing technical debt; • the year of its publication is between 2015 and 2022 since the previous work of alves et al. (2016) included papers from earlier years. we also considered papers that address technical debt in a general manner or focus on a specific type of debt. moreover, we included those that either provided empirical proof of their proposal or only a theoretical description. 3.2.2 exclusion criteria only the most recent paper was considered when several pieces reported the same study, and each study was considered separately when multiple studies were contained in a single paper. also, we applied the next exclusion criteria: • papers that do not specify how to identify or manage technical debt with a strategy, activity, or technique. therefore, we excluded exploratory studies of technical debt management; • papers in progress (incomplete) or those that do not provide full-text access; • papers published before the year 2015; • duplicate papers; • papers published in a language different than english. moreover, papers in the form of powerpoint presentations, reports, and abstracts only were not considered. 
3.3 study selection

the study selection comprised the following steps:

• search: the first step of the process was to perform the search using the defined search string on the databases (acm digital library, ieee xplore, science direct, engineering village, springer link, scopus, citeseer, and dblp). as a result, we found 2517 papers in total.
• identification (filter 1): the second step was removing duplicate papers and applying the exclusion and inclusion criteria. in total, 466 duplicate studies were identified, leaving 2051 articles (without duplicates).
• screening (filter 2): the next step was screening the articles. we read each of the 2051 titles and abstracts to find those that comply with the eligibility criteria, identifying 209 studies and excluding 1842 papers that did not fully comply with the inclusion criteria. this is because the search string is generic and returned many articles that are not relevant to this study.
• inclusion and analysis (filter 3): all 209 papers were read in full at this stage. after reading them, only 111 were selected using the eligibility criteria. at this stage, we also extracted data from the selected papers as described in subsection 3.4 (synthesis methods).
• backward snowballing: during the final stage, we reviewed the references of each of the studies. as a result, we included seven more papers, which were analyzed and combined with the results of the study selection.

one researcher performed the search, identification, screening, and inclusion for every paper. later, two other researchers were each randomly assigned a set of papers to independently review and extract data from. the results were compared and discussed in case of disagreement. the process was performed in mid-march 2022.
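the funnel above is mechanical enough to express in code. the sketch below replays the filtering logic over a hypothetical list of records with pre-computed screening flags; it illustrates the selection process rather than reproducing the authors' actual tooling.

```python
# schematic replay of the selection funnel (2517 -> 2051 -> 209 -> 111);
# the record fields stand in for human screening decisions and are invented.
def select(records: list[dict]) -> list[dict]:
    # filter 1: remove duplicates by normalized title
    seen, unique = set(), []
    for r in records:
        key = r["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # filter 2: title/abstract screening against the eligibility criteria
    screened = [r for r in unique if r["passes_screening"]]
    # filter 3: full-text reading against the inclusion/exclusion criteria
    return [r for r in screened if r["meets_inclusion_criteria"]]

papers = [
    {"title": "a td study", "passes_screening": True, "meets_inclusion_criteria": True},
    {"title": "A TD Study", "passes_screening": True, "meets_inclusion_criteria": True},   # duplicate
    {"title": "off-topic", "passes_screening": False, "meets_inclusion_criteria": False},
]
print(len(select(papers)))  # 1
```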
a new indicator was created when these previous indicators did not fit what was mentioned in a paper. we also collected data on how these indicators were empirically tested, including the artifact in which they are identified and the data sources.
• management strategies: we extracted the management strategies described in each included paper. we used the same criterion as alves et al. (2016): to be considered a management strategy, it must support decision-making about technical debt items. this definition includes activities for measuring, prioritizing, preventing, monitoring, documenting, and paying the debt. each strategy and its definition were collected as mentioned in each paper.
• evaluation studies: evaluations are needed to determine the feasibility of the proposed strategies. there are several types of evaluation studies. we classified them into case studies, controlled experiments, or ethnographic studies with the same criteria as in the previous study (alves et al., 2016). also, we documented the artifact considered, the programming language used, and the data sources used in each paper that performed an empirical evaluation.
• visualization techniques: several visualization techniques help understand the potential problems of technical debt in software projects. we extracted the visualization techniques for technical debt identification or management described in each included paper, as alves et al. (2016) did.
the aforementioned research method is based on the procedure performed by alves et al. (2016) in their study. in this paper, we aim to answer the same research questions by applying the same study selection and synthesis methods. however, this update’s protocol has two main differences from the original study methodology. one is the time-frame delimitation: we only considered publications between 2015 and 2022, a restriction that was added to the search strategy criteria. moreover, we provide the definition of the “general” technical debt classification. the previous study refers to “type not specified”, while we classify these papers as “general” technical debt to provide more clarity to the reader; both labels refer to the same set of papers (as described in the synthesis methods).
table 2. data collection variables and their purpose.
data collection variable | purpose
title | demographic characterization
author | demographic characterization
type of publication (workshop, conference, journal) | demographic characterization
year of publication | demographic characterization
digital library (database) | demographic characterization
research topic (identification or management) | demographic characterization
technical debt type | rq1
indicators | rq2
artifact considered (identification studies) | rq2
data source (identification studies) | rq2
management strategy (management studies) | rq3
evaluation type (if applicable: case studies, controlled experiments, ethnographic studies, action research) | rq2 and rq3
visualization techniques | rq2 and rq3
4 results
this section presents the integration of our results and those obtained by alves et al. (2016), which included 100 papers published between 2010 and 2014. our study analyzes 117 additional articles dating from 2015 to 2022 (see appendix a1). figure 1 shows the number of studies included by publication type and year.
figure 1. number of studies by year and publication type.
we searched the same databases using the same search string and applied the established selection criteria. papers were published in symposia, journals, magazines, workshops, and conferences.
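as a small illustration of how the collected metadata (table 2) can be tabulated into demographic summaries such as figure 1, the sketch below counts records by year and publication type; the records are invented examples, not data from the review.

```python
# illustrative only: tabulating included papers by year and publication
# type, in the spirit of figure 1; these records are invented examples.
from collections import Counter

included = [
    {"year": 2016, "type": "workshop"},
    {"year": 2019, "type": "conference"},
    {"year": 2019, "type": "conference"},
    {"year": 2021, "type": "journal"},
]
counts = Counter((p["year"], p["type"]) for p in included)
for (year, pub_type), n in sorted(counts.items()):
    print(f"{year} {pub_type}: {n}")
# 2016 workshop: 1 / 2019 conference: 2 / 2021 journal: 1
```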
from 2010 to 2014, workshops and conferences were the most common publication types. between 2015 and 2022, the most common publication types were conferences and journals. the decline in workshop publications and the rise of conference publications suggest that the theme has developed a certain maturity over the years. the number of publications on technical debt identification and management has been irregular during the last decade. overall, the number of articles included from conferences has been rising: in 2010, only two papers were from conferences, while this number increased to 17 in 2019. however, the coronavirus pandemic may have had an impact on the publications from 2020 and 2021.
in this study, we performed the search on acm digital library, ieee xplore, science direct, springer link, engineering village, scopus, citeseer, and dblp. overall, since 2010 most papers have been published in ieee xplore and the acm digital library. however, the number of papers on springer link has increased considerably since 2015. figure 2 shows the number of studies by digital library.
4.1 technical debt types (rq1)
alves et al. (2016) proposed a taxonomy of technical debt that includes: design, architecture, documentation, test, code, defect, requirements, infrastructure, people, test automation, process, build, service, usability, and versioning debts. from 2010 to 2014, the most common technical debt types studied in the literature were design, architecture, and documentation. also, a high concentration of studies addressed test, code, and defect debt. between 2015 and 2022, 77 studies did not focus on a particular type but addressed the topic in a general manner. in contrast, other papers focused on a specific technical debt type, such as code, design, or architecture. consequently, technical debt is increasingly studied with a holistic approach rather than as distinct kinds of debt that need to be managed differently. figure 3 shows the number of papers included by type of technical debt.
of the 117 included papers (2015–2022), 34 addressed self-admitted technical debt, a concept commonly mentioned in the literature. self-admitted technical debt (satd) refers to situations in which developers are aware and admit that technical debt has been incurred. these scenarios differ from those in which there is no awareness that debt is present. when satd exists, the issues may correspond to various technical debt types, such as code, architecture, or documentation. for this reason, the papers that addressed satd were classified into the general technical debt (td) category.
figure 2. number of included papers by digital library.
figure 3. number of included papers by technical debt type and year.
forty out of the 117 papers addressed a specific technical debt type, as described by alves et al. (2016). from 2015 to 2022, architecture, code, and design debt were the most common types. however, there has been a significant reduction in studies focused on these types over the last seven years. for example, there was a reduction of nearly half of the publications on architecture and code debts, while design debt went from 42 publications to only five compared to the previous period (2010–2014).
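the type-coding scheme from section 3.4 (direct, indirect, general) can be pictured as a small rule-based classifier. the sketch below is a simplification under assumed phrase rules; the actual coding in this study was performed manually by the researchers.

```python
# illustrative sketch of the type-coding scheme from section 3.4:
# "direct" papers name the type explicitly ("<type> debt"), "indirect"
# papers only describe symptoms of a type, and everything else falls
# into "general" td. the phrase table is hypothetical and incomplete.
import re

TD_TYPES = ["design", "architecture", "documentation", "test", "code",
            "defect", "requirements", "infrastructure", "people",
            "process", "build", "service", "usability", "versioning"]

INDIRECT_PHRASES = {
    "issues on the documentation": "documentation",
    "outdated comments": "documentation",
}

def code_td_type(text: str) -> tuple[str, str]:
    lowered = text.lower()
    for td_type in TD_TYPES:
        if re.search(rf"\b{td_type}\s+debt\b", lowered):
            return ("direct", td_type)
    for phrase, td_type in INDIRECT_PHRASES.items():
        if phrase in lowered:
            return ("indirect", td_type)
    return ("general", "general td")

print(code_td_type("we study architecture debt in microservices"))
# -> ('direct', 'architecture')
```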
we found no articles about documentation, people, build, service, or usability debt during the last seven years. these types of technical debt have not been as extensively studied as others, so they represent potential areas for further investigation in future work. as a result of the literature review, we found two new types mentioned in the literature: security and elasticity debt. security debt refers to security issues in the software, such as vulnerabilities or exploitable weaknesses (izurieta et al., 2018). elasticity debt describes non-effective or non-efficient resource provisioning resulting from the lack of dynamical adaptation to resource consumption (mera-gómez et al., 2016). these two types of technical debt have been mentioned in only a few studies; consequently, they cannot yet be considered widely accepted types of technical debt. both may be subtypes of requirements debt (security) and infrastructure debt (elasticity), and we classified the corresponding papers as such when performing the literature review.
4.2 technical debt identification (rq2)
an essential step for technical debt management is its identification. identification comprises activities or actions to detect the presence of debt in software artifacts. out of the 117 included papers, 47 addressed technical debt identification. we extracted the indicators and type of technical debt associated with each paper. indicators are symptoms that help identify technical debt items (alves et al., 2016). from 2010 to 2014, forty-five indicators were found and presented by alves et al. (2016). in this study, we found 11 indicators mentioned in the literature between 2015 and 2022. table 3 shows these indicators together with the top 5 most common indicators presented previously (alves et al., 2016): code smells, documentation issues, software architecture issues, violation of modularity, and automatic static analysis issues. these indicators were either just mentioned or analyzed in the included papers.
the results show significant differences between both periods. code smells were the most common indicator in previous years, while comments and commits were the most mentioned during the last seven years. this is due to the considerable number of papers (34 in total) that addressed self-admitted technical debt in recent years, which used several strategies to analyze comments or commits to identify different types of technical debt, not only those related to source code. this represents a more holistic view, in which not only code issues are intended to be identified. authors have recently studied satd through natural language processing, neural networks, deep learning, and machine learning. satd may be identified by applying these approaches to commits, comments, and issue trackers, to be further prioritized and managed (a toy sketch of the comment-based variant is given at the end of this subsection).
4.2.1 evaluation studies
most studies on technical debt identification have performed empirical evaluations through case studies in recent years, and there has been an increase in this type of study during the last seven years. a possible explanation is that the knowledge consolidated before 2015 gave the necessary foundations to perform empirical evaluations, such as case studies. the execution of case studies helps provide more information about the context in which the different identification strategies are applicable. the growth in the number of case studies is relevant because it is vital to have multiple sources of empirical data to generalize results. we also found a significant increase in the number of controlled experiments. figure 4 shows the number of papers by type of empirical evaluation performed.
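as promised above, here is a minimal, self-contained sketch of comment-based satd detection: it flags source comments matching a handful of admission patterns. published studies use curated pattern sets or trained classifiers rather than this toy keyword list.

```python
# minimal illustration of comment-based satd detection: flag source
# comments that contain admission patterns. the patterns below are
# examples only, not a validated set from any cited study.
import re

SATD_PATTERNS = [r"\btodo\b", r"\bfixme\b", r"\bhack\b",
                 r"\bworkaround\b", r"\btemporary\b"]

def find_satd_comments(source: str) -> list[str]:
    comments = re.findall(r"#(.*)", source)  # python-style line comments
    return [c.strip() for c in comments
            if any(re.search(p, c, re.IGNORECASE) for p in SATD_PATTERNS)]

code = """
x = fast_path(data)  # HACK: temporary workaround until the api is fixed
y = x + 1  # explained in the docs
"""
print(find_satd_comments(code))
# -> ['HACK: temporary workaround until the api is fixed']
```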
figure 4. empirical evaluations on technical debt identification studies.
table 3. indicators organized by technical debt (td) type and period (— indicates no papers in that period).
indicator | 2010–2014: # papers (td types) | 2015–2022: # papers (td types)
code smells | 52 (code, design) | 1 (general td, architecture)
documentation issues | 17 (documentation) | —
software architecture issues | 9 (architecture) | 9 (architecture)
violation of modularity | 9 (architecture) | —
automatic static analysis issues | 9 (code, design) | 3 (architecture, code, general td)
comments | 1 (documentation) | 26 (code, requirements, general td)
uncorrected known defects | 6 (defect, test) | 1 (general td)
immature software | — | 1 (general td)
feature usage and maintenance costs | — | 1 (general td)
insufficient resource provisioning | — | 1 (infrastructure)
low external/internal quality | 1 (design) | 1 (general td)
software design issues | 4 (design) | 1 (design)
4.2.2 artifacts and data sources
we extracted the data source and artifact considered in each paper that performed an empirical evaluation. figure 5 shows the number of studies by artifact. from 2010 to 2014, the most common artifact was source code. the obtained results show that source code remains the primary artifact used to perform empirical evaluations; this may be because static analysis tools can help for these purposes. however, the number of studies considering source code decreased from 58 between 2010 and 2014 to 39 from 2015 to 2022.
figure 5. number of studies by artifact considered for technical debt identification.
in recent years, researchers have started mining software repositories to extract metadata about technical debt. alves et al. (2016) identified four different data sources, including cms (configuration management systems), software repositories, and bug tracking; the cms were the most used in that period. in contrast, we found six different data sources from 2015 to 2022. software repositories predominated, which makes sense since the most common artifact was source code. figure 6 shows the number of papers by data source.
4.2.3 visualization techniques
only two papers on technical debt identification mentioned a visualization technique: the assessment graph (shapochka & omelayenko, 2016) and the coupling probability matrix (l. xiao et al., 2016). the proposed techniques are not yet mature because there is little validation. therefore, the visualization of technical debt is still an area that requires further investigation.
figure 6. number of papers by data source for technical debt identification.
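to make the repository-mining direction of section 4.2.2 concrete, here is a tiny sketch that scans the commit messages of a local clone for debt-related terms; the keyword list is invented, and the studies cited above rely on much richer pipelines (issue trackers, trained classifiers, etc.).

```python
# illustrative sketch: mining a local git repository for commits whose
# messages mention debt-related terms. assumes `git` is installed and
# `repo_path` points at a clone; keywords are examples only.
import subprocess

KEYWORDS = ("technical debt", "refactor", "workaround", "cleanup")

def debt_related_commits(repo_path: str) -> list[str]:
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%h %s"],
        capture_output=True, text=True, check=True).stdout
    return [line for line in log.splitlines()
            if any(k in line.lower() for k in KEYWORDS)]

# example usage: print(debt_related_commits("/path/to/clone"))
```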
4.3 technical debt management (rq3)
the management of technical debt includes several activities to control debt during the software development process. these activities aim to avoid bankruptcy situations in which the debt becomes uncontrollable. out of the 117 included papers, 70 addressed technical debt management. this section presents the strategies proposed by authors in the literature, the evaluation studies performed, and the visualization techniques mentioned.
4.3.1 strategies for managing technical debt
the first step for technical debt management is to identify its presence. then, several strategies could be used for its timely administration to reduce the impact of interest. table 4 shows the complete list of management strategies found during the literature review. the top 5 most studied strategies between 2015 and 2022 were the following:
• automated analysis of code issues: recent studies mention several tools to aid td management: sonargraph for analyzing software architecture (von zitzewitz, 2019), teamscale to analyze software quality based on data from version control systems, issue trackers, and other tools (haas et al., 2019), sonarqube to analyze code and obtain several code metrics (baldassarre et al., 2020), and codescene to perform behavioral code analysis, which can be helpful for debt prioritization and communication with stakeholders (tornhill, 2018). these papers report code, architecture, test, and general technical debt management supported by tools that automatically identify code issues. the generated metrics or reports may be used by developers and stakeholders to prioritize, monitor, and perform the necessary management actions. one of the advantages of such tools is that they require little human intervention to measure several code issues while creating awareness that td exists.
• calculation of technical debt (td) interest: interest is the additional cost that will have to be assumed because of the delayed payment. authors have proposed methods for its calculation to prioritize technical debt items according to the interest that will have to be assumed (chatzigeorgiou et al., 2015; falessi & reichel, 2015). this allows decision-making about the appropriate moment to pay the debt, depending on the acceptable costs of each scenario.
• portfolio approach: in finance, a portfolio comprises the assets that an investor has. portfolio management is carried out to decide what investments to make with the assets, considering risks and return on investment. a td portfolio approach brings financial concepts to td and considers it a potential investment whose final goal is to get more gains than losses (guo & seaman, 2011). the td portfolio approaches are based on financial portfolio theory and consider the principal, interest, or correlations with other td items to help decision-making. some authors proposed a glossary of financial technical debt concepts (akbarinasaji & bener, 2016), while others presented frameworks that consider portfolio theory (nielsen & skaarup, 2021; rindell et al., 2019). papers that consider more than only interest calculation and reference portfolio theory were classified into portfolio approaches, while those that only mention interest formulas were classified as calculation of td interest.
• prioritization approach: authors suggest different methods to prioritize td items. the purpose of prioritization is to determine the order in which technical debt will be paid. the proposed strategies include code smell ranking through automated tools (alfayez & boehm, 2019; vidal et al., 2016), backlogs managed considering risks and business needs (besker et al., 2019), and approaches that focus on the business perspective (stochel et al., 2020).
• satd removal approach: authors have also suggested management strategies for self-admitted technical debt removal. for example, natural language processing can be used to analyze source code comments and later compare their evolution among different versions of each file (da maldonado et al., 2017). it is also possible to use deep neural networks to provide recommendations for satd removal (zampetti et al., 2020).
some of the included papers addressed strategies or techniques identified in previous years (alves et al., 2016). these proposals are the portfolio approach, options analysis, calculation of the principal and interest, and td management in database schemas. however, the number of empirical evaluations performed on each strategy is still small. overall, authors have proposed their own strategies and tested them empirically instead of validating or comparing them to previous proposals.
4.3.2 evaluation studies
case studies have been the most frequent type of empirical evaluation performed on technical debt management. this is true for both periods, as shown in figure 7. nevertheless, the number of such studies more than doubled between 2015 and 2022. also, ten papers presented action research and controlled experiments in recent years, adding some diversity to the types of evaluation studies. from 2010 to 2014, few empirical studies were performed in real settings. in contrast, subsequent years show more case studies and action research in real settings. the number of these evaluations is still small for every management strategy. still, it is essential to highlight that researchers have started to acknowledge the need for empirical testing.
figure 7. number of papers by type of study.
4.3.3 visualization techniques
only four papers on technical debt management mentioned visualization techniques: dynamic graphic (pacheco et al., 2018), line chart (falessi & reichel, 2015), portfolio matrix (plösch et al., 2018), and probabilistic cause-effect diagrams (rios et al., 2019). each technique was only mentioned once. therefore, they still require further research to determine their applicability.
5 discussion
this paper studied the technical debt identification and management research landscape from 2015 to 2022 and integrated our results with previous investigation efforts that analyzed the period 2010–2014 (alves et al., 2016). this section presents a discussion of the obtained results.
5.1 technical debt types (rq1)
technical debt as an analogy with financial debt is well known among authors in the academic literature. overall, there is a common understanding of the technical debt concept itself as taking shortcuts during software development, leading to several future costs. however, different technical debt classifications exist, and there is no clarity on which are the accepted types. alves et al. (2016) proposed a taxonomy that includes: design, architecture, documentation, test, code, defect, requirements, infrastructure, people, test automation, process, build, service, usability, and versioning debts. still, other studies mention different classifications, and there is a lack of consensus on some technical debt types. to the best of our knowledge, besides the proposal of alves et al. (2016), only three other papers address technical debt types or propose a classification (li et al., 2015; rios et al., 2018; tom et al., 2013). one of these classifications was presented as a result of a non-academic literature review and interviews with people in the software development industry (tom et al., 2013). the others were derived from a systematic mapping and a tertiary study of academic literature (alves et al., 2016; rios et al., 2018). some types of technical debt are presented in all three studies: code, design, architecture, and test debt. in fact, we found that from 2015 to 2022, the most addressed types correspond to design, architecture, code, and test debts. we observed that authors use these terms consistently, agreeing with their general meaning. therefore, these particular types may be considered accepted technical debt types. likewise, the concept of self-admitted technical debt (satd) is overall consistent among papers and referred to as a technical debt type.
other types are much less established in the literature. for example, between 2010 and 2022, process and people debts were only mentioned in three papers each, while usability, service, build, and versioning debts were only cited in two papers each. there is also another new concept mentioned in the literature: variability debt. it was not identified through the performed review because the papers mentioning it do not meet the acceptance criteria proposed in this study; however, it may be considered for future research. variability debt refers to debt arising from the lack of software characteristics that allow it to adapt (create variants) to different needs (wolfart et al., 2021). these concepts are still not widely accepted since not much literature is available on them. in some cases, they may not even represent technical debt categories themselves but subcategories. the same may be true for security and elasticity debts, which could be subcategories of other types of debt. another relevant aspect is a position in the literature that considers defect and process debt as non-technical debt (li et al., 2015). however, this does not imply that the elements addressed by these types of technical debt lack importance.
figure 8 presents a heatmap showing the number of publications by technical debt type and year, including papers from 2010 to 2022. the lack of clarity on some technical debt types and the number of existing categories may have influenced how authors categorize their work. notably, the number of papers that presented typifications of technical debt dropped in 2016. authors may have inadvertently reached the consensus that technical debt is an issue to be managed without necessarily specifying its type. there was a turning point between 2014 and 2015, when authors set classification aside and began to focus their studies on technical debt management.
5.2 technical debt identification (rq2)
technical debt identification comprises actions to detect debt presence; it is the first step necessary for its management.
in recent years, source code has been the most common artifact for technical debt identification, since several techniques, algorithms, or tools can be applied to it to detect debt automatically. however, other artifacts may be used, such as test cases. figure 9 summarizes the findings on technical debt identification between 2010 and 2022. when technical debt exists, several indicators show symptoms of its presence. identification approaches help find indicators through several artifacts and data sources for further management.
comments in source code were the most common indicator from 2015 to 2022. analyzing comments helps to identify self-admitted technical debt. the increasing number of studies on satd suggests that there is valuable information that developers themselves can provide through comments. nevertheless, exploring how to take advantage of developers’ knowledge of code issues in other ways, beyond comments or commit messages, is a future research opportunity.
figure 8. heatmap showing the number of publications by technical debt type and year.
the automatic detection of technical debt yields a variety of quantitative measurements. additionally, organizations may be interested in also having qualitative measures of technical debt, which have not been much explored in the literature and constitute a future work opportunity. depending on every project’s business objectives and needs, organizations may identify the measurements that can help them further manage technical debt in their contexts.
in the academic literature, the number of empirical evaluations on technical debt identification has increased in recent years, which is beneficial for both researchers and practitioners because such evaluations help discover the indicators’ applicability in different contexts. however, research has concentrated on identification based on source code, while other artifacts and data sources may be further investigated.
few studies proposed visualization techniques for technical debt identification between 2010 and 2022. this is still an open issue and research opportunity. technical debt visualization is important because it may support communication between developers and stakeholders while aiding decision-making on further technical debt management and prevention.
5.3 technical debt management (rq3)
technical debt management comprises actions or activities performed to control debt once it has been identified. authors in the literature have proposed several strategies for debt management. in general, papers present new proposals and test them empirically instead of testing others previously described in the literature. however, the number of empirical evaluations, especially case studies, has increased during the last seven years, along with the number of proposals. table 4 shows the complete list of strategies found in this replication study.
from 2015 to 2022, many strategies proposed for technical debt management were supported by automatic tools applied to source code, such as sonargraph, codescene, sonarqube, and teamscale (baldassarre et al., 2020; haas et al., 2019; tornhill, 2018; von zitzewitz, 2019). the measurements obtained through such tools help to prioritize and support decision-making.
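as a rough illustration of this kind of tool support, the sketch below uses the open-source radon library (a stand-in chosen here, not one of the tools cited) to flag high-complexity functions as possible td items; the threshold and file name are arbitrary assumptions.

```python
# illustrative sketch in the spirit of the automated-analysis tools cited
# above, using the open-source radon library instead; the threshold of 10
# is a common but arbitrary choice made for this example.
from radon.complexity import cc_visit
from radon.metrics import mi_visit

def td_candidates(source: str, threshold: int = 10):
    """flag functions whose cyclomatic complexity exceeds the threshold."""
    return [(block.name, block.complexity)
            for block in cc_visit(source)
            if block.complexity > threshold]

with open("module_under_review.py") as f:  # hypothetical file
    code = f.read()
print("maintainability index:", mi_visit(code, multi=True))
print("possible td items:", td_candidates(code))
```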
other authors compared penalty and gamification techniques for technical debt using automated tools in educational contexts, showing that rewards may be a suitable option for td management (crespo et al., 2021). other papers presented novel frameworks or models for managing technical debt during the software development process with a more holistic perspective, including several process elements or phases. for example, some authors propose creating a guild, a group of people that helps address td management and guide its payment (detofeno et al., 2021). moreover, another paper mentioned encouraging and rewarding incentives for developers to manage technical debt (besker et al., 2022). other authors evaluate a business prioritization approach that allows an alignment between business and technical stakeholders for prioritizing td items (reboucas de almeida, 2019), while additional research efforts report using td tickets that allow td management and prevention (wiese et al., 2021). nevertheless, few papers specifically address the human resources involved during software development, which is essential because it is known that people issues can also lead to technical debt (rios et al., 2020).
between 2010 and 2014, twenty-two papers on technical debt management described a visualization technique. in contrast, we only found four visualization techniques in papers about technical debt management from 2015 to 2022. this shows a significant decrease in research effort, even though only a few studies address this topic.
although it is not part of the research questions of this paper, it is worth mentioning that there are different perspectives regarding the definition of td management in the literature. in this paper, we used the same definition of td management as alves et al. (2016) to be conceptually consistent. however, some authors consider that td management includes its recognition, analysis, monitoring, and measurement (izurieta et al., 2016), while others consider its identification, assessment, and remediation (griffith et al., 2014). furthermore, li et al. (2015) present eight activities for technical debt management: identification, measurement, prioritization, prevention, monitoring, repayment, documentation, and communication.
as the several definitions of td management suggest, there are plenty of actions that help to control debt during software development. however, the concepts mentioned by the different authors agree that the first step for managing td is to identify or recognize the presence of debt and start measuring (quantifying) it. these two activities alone are not the solution for td issues; subsequent strategies are needed to take effective actions toward td management. it may be necessary to prioritize, monitor, repay, and document debt (li et al., 2015). prioritization includes deciding the order of importance or urgency in which to pay debt items. monitoring refers to supervising several aspects related to td, such as historical costs and resolution times. it is not possible to monitor debt if metrics have not been established and measured, and the progress on debt issues cannot be tracked if there is no monitoring. moreover, debt repayment or remediation is the resolution of a td item. also, documentation and communication with stakeholders may be needed. lastly, organizations may be interested in establishing prevention actions.
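a toy numerical sketch of the interest-based reasoning described in section 4.3.1 (in the spirit of chatzigeorgiou et al., 2015, and falessi & reichel, 2015): assume each td item has a known principal (cost to repay now) and a recurring interest (extra effort per iteration while the item is unpaid); items can then be ranked by how soon the accumulated interest overtakes the principal. all figures below are invented.

```python
# toy model: rank td items by break-even time, i.e. the number of
# iterations after which accumulated interest exceeds the repayment
# cost (principal). figures are illustrative, not from any cited study.
from dataclasses import dataclass

@dataclass
class TDItem:
    name: str
    principal: float  # effort to repay now (person-days)
    interest: float   # extra effort per iteration while unpaid

    def break_even(self) -> float:
        return self.principal / self.interest

items = [
    TDItem("duplicated pricing logic", principal=5.0, interest=0.8),
    TDItem("monolithic build script", principal=12.0, interest=3.0),
    TDItem("missing api docs", principal=3.0, interest=0.2),
]
for item in sorted(items, key=TDItem.break_even):
    print(f"{item.name}: breaks even after {item.break_even():.1f} iterations")
# the monolithic build script (4.0 iterations) would be repaid first.
```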
technical debt is a context-dependent issue (fernández-sánchez et al., 2017). therefore, the context must be well understood to take appropriate actions for debt management. gathering and analyzing data (not only about td) may be useful for establishing a td management plan. for example, debt management may be different in an agile organization than in a traditional one. also, the team size and the type of software being developed are variables that may be considered. moreover, determining the main debt issues perceived by the software developers could be a starting point. regardless of which definition of td management is used, the appropriate strategies will depend solely on the specific needs, issues, and objectives of the organizational context. furthermore, the selected strategies may vary or be adapted over time depending on the obtained outcomes. the following sections present the threats to validity and conclusions of this paper.
figure 9. technical debt identification concept map.
table 4. management strategies proposed in the academic literature from 2015 to 2022.
strategy proposed | number of papers | references
automated analysis of code issues | 7 | (anderson et al., 2019; baldassarre et al., 2020; fontana et al., 2016; haas et al., 2019; lahti et al., 2021; sharma, 2019; tornhill & ab, n.d.; von zitzewitz, 2019)
calculation of td interest | 6 | (ampatzoglou, ampatzoglou, avgeriou, et al., 2015; ampatzoglou et al., 2018; chatzigeorgiou et al., 2015; falessi & reichel, 2015; kontsevoi et al., 2019; martini & bosch, 2016, 2017a)
portfolio approach | 5 | (akbarinasaji & bener, 2016; aldaeej & seaman, 2018; nielsen & skaarup, 2021; plösch et al., 2018; rindell et al., 2019)
prioritization approach | 5 | (alfayez & boehm, 2019; besker et al., 2019; de lima et al., 2022; stochel et al., 2020; vidal et al., 2016)
satd removal approach | 3 | (da maldonado et al., 2017; t. xiao et al., 2021; zampetti et al., 2020)
approach for technical debt decision making | 3 | (codabux & williams, 2016; pacheco et al., 2018; ribeiro et al., 2017)
model for td alignment with business | 3 | (reboucas de almeida, 2019; reboucas de almeida et al., 2018, 2019)
calculation of td principal | 3 | (akbarinasaji et al., 2016; kontsevoi et al., 2019; kosti et al., 2017)
process framework for managing td | 2 | (oliveira et al., 2015; ramasubbu & kemerer, 2019)
model for optimizing technical debt | 2 | (perez et al., 2019; yli-huumo et al., 2016)
strategic td management model | 2 | (ciancarini & russo, 2020; martini et al., 2016)
framework for td management | 2 | (borup et al., 2021; wiese et al., 2021)
continuous architecting framework for embedded software and agile (caffea) | 1 | (martini & bosch, 2017b)
automated identification of refactoring candidates | 1 | (tornhill, 2018)
automated refactoring | 1 | (mohan et al., 2016)
automatic identification and interactive monitoring | 1 | (fernandez-sanchez et al., 2017)
benchmarking-based model | 1 | (mera-gómez et al., 2016)
continuous/extensive testing | 1 | (trumler & paulisch, 2016)
estimation approach | 1 | (lenarduzzi et al., 2019)
linear-predictive lifecycle/incremental-predictive lifecycle application | 1 | (fairley & willshire, 2017)
maintainability model | 1 | (di biase et al., 2019)
managing td in database schemas | 1 | (albarak et al., 2020)
metric for managing architectural technical debt | 1 | (kouros et al., 2019)
model-driven development (preemptive) | 1 | (izurieta et al., 2015)
model of maintenance cost growth | 1 | (snipes & ramaswamy, 2018)
propagation model | 1 | (holvitie et al., 2016)
real options analysis | 1 | (abad & ruhe, 2015)
td enhanced backlog | 1 | (martini, 2018)
visual thinking | 1 | (chicote, 2017)
td cause-effect analysis | 1 | (rios et al., 2019)
normative process framework | 1 | (de leon-sigg et al., 2020)
td predictive model | 1 | (aversano et al., 2021)
conceptual model for holistic debt management | 1 | (malakuti & heuschkel, 2021)
automated identification of deprecation in metamodels | 1 | (iovino et al., 2020)
td management guild | 1 | (detofeno et al., 2021)
encouraging and rewarding incentives | 1 | (besker et al., 2022)
6 threats to validity
the results presented in this systematic mapping may have been affected by the following threats to validity:
• publication bias: relevant studies may not have been returned when performing the literature search. to address this threat, several databases were consulted, and we performed a backward snowballing process to find as many studies as possible.
• search string: it is possible that some papers in the literature propose a way to manage or identify technical debt but do not explicitly mention that they are suggesting an approach for technical debt. therefore, these papers may have been left out when performing the search. still, our focus was on literature regarding technical debt strategies.
also, since this was an update to a previous systematic mapping study, other limitations and threats to validity include:
• consistency in integrating the results: this paper updates the previous work by alves et al. (2016). different researchers performed the data extraction than in the original study, and we cannot rule out that this caused some differences in the updated results. however, our research method is based on the procedure performed by alves et al. (2016) in their study.
to address this risk, two of the original authors contributed to the elaboration of this paper and reviewed the results obtained from the data extraction to ensure there was no misunderstanding of concepts between the two sets of primary sources. lastly, we performed the paper selection process in march 2022, so the results for that year are not fully complete. these aspects are the limitations of this study.
7 conclusion
this paper explored the evolution of the technical debt identification and management research landscape over a decade. we searched for studies on eight databases and analyzed academic literature published between 2015 and 2022. by applying the defined search string and the inclusion criteria, we found 117 papers. we integrated our results with a previous study (alves et al., 2016) that analyzed literature from 2010 to 2014.
in addition to the technical debt types mentioned in the taxonomy by alves et al. (2016), there are three new terms in the literature: security, elasticity, and variability debt. the security type refers to security issues in the software, such as vulnerabilities or exploitable weaknesses (izurieta et al., 2018). elasticity debt refers to non-effective or non-efficient resource provisioning resulting from the lack of dynamical adaptation to resource consumption (mera-gómez et al., 2016). lastly, variability debt comprises the lack of software characteristics that allow it to adapt (create variants) to different needs (wolfart et al., 2021).
unlike the previous mapping, most of the included papers addressed technical debt without focusing on specific types. this shows that the technical debt phenomenon is analyzed more holistically. still, the papers that focused on specific types of technical debt studied those identifiable or measurable through code. the most frequent artifacts and data sources are source code and repositories; this may be because there are various code and repository data analysis tools, while there are no comparably abundant tools for analyzing other types of debt, such as documentation, people, and infrastructure debt.
over the years, several proposals have been developed for technical debt management. however, as in the previous systematic mapping, there is a need for more research to validate the effectiveness of the proposals and their applicability in different contexts. another finding was that only a few studies included in the update proposed a visualization strategy. therefore, the topic of technical debt visualization continues to be a future research opportunity.
the automatic identification of debt through the analysis of comments, commits, and source code is among the main proposals found in the literature published between 2015 and 2022. several evaluations have been performed through case studies, controlled experiments, and action research. the number of evaluations has been rising through the years, which is particularly important for consolidating the knowledge gained in the research area. however, more evaluations are still required to generalize the obtained results.
the most relevant findings of this paper were the following:
• investigations on technical debt identification and management have increasingly changed their focus to a more holistic perspective, considering technical debt as a global problem during the software development process instead of analyzing it as a set of different isolated problems. however, a significant number of investigations still focus on technical debt types closely related to source code.
• the number of empirical evaluations performed on each strategy is still small. in most cases, authors have proposed their own strategies and tested them empirically instead of testing previous proposals.
• recent research on technical debt has focused on its management, while the proposal of new types has decreased dramatically since 2016. creating new categories seems unnecessary, and authors may have inadvertently reached the consensus that technical debt is an issue to manage without specifying its type.
• overall, authors agree on the general meaning of code, design, architecture, and test debt, which suggests that these are widely accepted technical debt types.
likewise, future work possibilities include the following:
• research on how to use developers’ knowledge of existing technical debt, not only focused on their comments or commit messages. it is an opportunity to explore this knowledge as a valuable asset for technical debt identification.
• creating tools for analyzing certain types of debt, such as documentation, people, and infrastructure, is a potential research opportunity since there is a lack of such tools.
• there is still a small number of proposals regarding technical debt visualization; this is a future research opportunity, particularly considering that visualization techniques can help to better communicate with stakeholders.
• few studies have explored strategies with a less technical approach, focused on human resources, such as creating guilds, communities of practice, and rewards or incentives. therefore, performing such investigations is a future opportunity.
• there is a need to analyze which strategies are best in specific contexts (for example, public or private organizations).
the next steps in the research could address how technical debt can be used as a competitive advantage, generating value rather than bringing undesired and costly consequences.
acknowledgments
the authors thank dr. carolyn seaman for her valuable suggestions and comments. this work was partially supported by citic at the university of costa rica. grant no. 834-b4412.
appendices
a1. complete list of included papers
the complete bibliography of the 117 papers analyzed in the full-text review is available at: https://drive.google.com/file/d/1g8thuunysvuhdwbr5a_scvltxwccnta/view?usp=sharing
a2. list of included papers about technical debt identification
the complete list of included papers about technical debt identification and the artifact considered is available at: https://drive.google.com/file/d/1txn8sv6og_n59dhzkjktcmmazcl4ud3e/view?usp=sharing
references
abad, z. s. h., & ruhe, g. (2015). using real options to manage technical debt in requirements engineering. 2015 ieee 23rd international requirements engineering conference, re 2015 proceedings, 230–235. https://doi.org/10.1109/re.2015.7320428
akbarinasaji, s., & bener, a. (2016). adjusting the balance sheet by appending technical debt. proceedings 2016 ieee 8th international workshop on managing technical debt, mtd 2016, 36–39. https://doi.org/10.1109/mtd.2016.14
akbarinasaji, s., bener, a. b., & erdem, a. (2016). measuring the principal of defect debt. proceedings 5th international workshop on realizing artificial intelligence synergies in software engineering, raise 2016, 1–7. https://doi.org/10.1145/2896995.2896999
albarak, m., bahsoon, r., ozkaya, i., & nord, r. l. (2020). managing technical debt in database normalization. ieee transactions on software engineering. https://doi.org/10.1109/tse.2020.3001339
aldaeej, a., & seaman, c. (2018). from lasagna to spaghetti, a decision model to manage defect debt. proceedings international conference on software engineering, 67–71. https://doi.org/10.1145/3194164.3194177
alfayez, r., alwehaibi, w., winn, r., venson, e., & boehm, b. (2020). a systematic literature review of technical debt prioritization. proceedings 2020 ieee/acm international conference on technical debt, techdebt 2020, 10, 1–10. https://doi.org/10.1145/3387906.3388630
alfayez, r., & boehm, b. (2019). technical debt prioritization: a search-based approach. proceedings 19th ieee international conference on software quality, reliability and security, qrs 2019, 434–445. https://doi.org/10.1109/qrs.2019.00060
alves, n. s. r., mendes, t. s., de mendonça, m. g., spinola, r. o., shull, f., & seaman, c. (2016). identification and management of technical debt: a systematic mapping study. information and software technology, 70, 100–121. https://doi.org/10.1016/j.infsof.2015.10.008
ampatzoglou, a., ampatzoglou, a., avgeriou, p., & chatzigeorgiou, a. (2015). a financial approach for managing interest in technical debt. lecture notes in business information processing, 257, 117–133. https://doi.org/10.1007/978-3-319-40512-4_7
ampatzoglou, a., ampatzoglou, a., chatzigeorgiou, a., & avgeriou, p. (2015). the financial aspect of managing technical debt: a systematic literature review. information and software technology, 64, 52–73. https://doi.org/10.1016/j.infsof.2015.04.001
ampatzoglou, a., michailidis, a., sarikyriakidis, c., chatzigeorgiou, a., & avgeriou, p. (2018). a framework for managing interest in technical debt: an industrial validation. proceedings of the 2018 international conference on technical debt, 10. https://doi.org/10.1145/3194164
anderson, p., kot, l., gilmore, n., & vitek, d. (2019). sarif-enabled tooling to encourage gradual technical debt reduction. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 71–72. https://doi.org/10.1109/techdebt.2019.00024
aversano, l., bernardi, m. l., cimitile, m., & iammarino, m. (2021). technical debt predictive model through temporal convolutional network. proceedings of the international joint conference on neural networks, 2021-july. https://doi.org/10.1109/ijcnn52387.2021.9534423
baldassarre, m. t., lenarduzzi, v., romano, s., & saarimäki, n. (2020). on the diffuseness of technical debt items and accuracy of remediation time when using sonarqube. information and software technology, 128, 106377. https://doi.org/10.1016/j.infsof.2020.106377
besker, t., martini, a., & bosch, j. (2019). technical debt triage in backlog management. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 13–22. https://doi.org/10.1109/techdebt.2019.00010
besker, t., martini, a., & bosch, j. (2022). the use of incentives to promote technical debt management. information and software technology, 142, 106740. https://doi.org/10.1016/j.infsof.2021.106740
borup, n. b., christiansen, a. l. j., tovgaard, s. h., & persson, j. s. (2021). deliberative technical debt management: an action research study. lecture notes in business information processing, 434 lnbip, 50–65. https://doi.org/10.1007/978-3-030-91983-2_5/tables/3
chatzigeorgiou, a., ampatzoglou, a., ampatzoglou, a., & amanatidis, t. (2015). estimating the breaking point for technical debt. 2015 ieee 7th international workshop on managing technical debt, mtd 2015 proceedings, 53–56. https://doi.org/10.1109/mtd.2015.7332625
chicote, m. (2017). startups and technical debt: managing technical debt with visual thinking. proceedings 2017 ieee/acm 1st international workshop on software engineering for startups, softstart 2017, 10–11. https://doi.org/10.1109/softstart.2017.6
ciancarini, p., & russo, d. (2020). the strategic technical debt management model: an empirical proposal. ifip advances in information and communication technology, 582 ifip, 131–140. https://doi.org/10.1007/978-3-030-47240-5_13
codabux, z., & williams, b. j. (2016). technical debt prioritization using predictive analytics. proceedings international conference on software engineering, 704–706. https://doi.org/10.1145/2889160.2892643
consortium for information & software quality. (2022). cost of poor software quality in the u.s.: a 2022 report cisq. https://www.it-cisq.org/the-cost-of-poorquality-software-in-the-us-a-2022-report/
crespo, y., gonzalez-escribano, a., & piattini, m. (2021). carrot and stick approaches revisited when managing technical debt in an educational context. proceedings 2021 ieee/acm international conference on technical debt, techdebt 2021, 99–108. https://doi.org/10.1109/techdebt52882.2021.00020
da maldonado, e. s., abdalkareem, r., shihab, e., & serebrenik, a. (2017). an empirical study on the removal of self-admitted technical debt. proceedings 2017 ieee international conference on software maintenance and evolution, icsme 2017, 238–248. https://doi.org/10.1109/icsme.2017.8
de leon-sigg, m., vazquez-reyes, s., & rodriguez-avila, d. (2020). towards the use of a framework to make technical debt visible. proceedings 2020 8th edition of the international conference in software engineering research and innovation, conisoft 2020, 86–92. https://doi.org/10.1109/conisoft50191.2020.00022
de lima, b. s., garcia, r. e., & eler, d. m. (2022). toward prioritization of self-admitted technical debt: an approach to support decision to payment. software quality journal, 1–27. https://doi.org/10.1007/s11219-021-09578-7/figures/10
detofeno, t., malucelli, a., & reinehr, s. (2021). technical debt guild: when experience and engagement improve technical debt management. xx brazilian symposium on software quality. https://doi.org/10.1145/3493244
di biase, m., rastogi, a., bruntink, m., & van deursen, a. (2019). the delta maintainability model: measuring maintainability of fine-grained code changes. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 113–122. https://doi.org/10.1109/techdebt.2019.00030
fairley, r. e., & willshire, m. j. (2017). better now than later: managing technical debt in systems development. computer, 50(5), 80–87. https://doi.org/10.1109/mc.2017.124
falessi, d., & reichel, a. (2015). towards an open-source tool for measuring and visualizing the interest of technical debt. 2015 ieee 7th international workshop on managing technical debt, mtd 2015 proceedings, 1–8. https://doi.org/10.1109/mtd.2015.7332618
fernández-sánchez, c., garbajosa, j., yagüe, a., & perez, j. (2017). identification and analysis of the elements required to manage technical debt by means of a systematic mapping study. journal of systems and software, 124, 22–38. https://doi.org/10.1016/j.jss.2016.10.018
fernandez-sanchez, c., humanes, h., garbajosa, j., & diaz, j. (2017). an open tool for assisting in technical debt management. proceedings 43rd euromicro conference on software engineering and advanced applications, seaa 2017, 400–403. https://doi.org/10.1109/seaa.2017.60
fontana, f. a., roveda, r., & zanoni, m. (2016). tool support for evaluating architectural debt of an existing system: an experience report. proceedings of the acm symposium on applied computing, 04-08-april-2016, 1347–1349. https://doi.org/10.1145/2851613.2851963
freire, s., rios, n., mendonça, m., falessi, d., seaman, c., izurieta, c., & spínola, r. o. (2020). actions and impediments for technical debt prevention: results from a global family of industrial surveys. proceedings of the acm symposium on applied computing, 1548–1555. https://doi.org/10.1145/3341105.3373912
griffith, i., taffahi, h., izurieta, c., & claudio, d. (2014). a simulation study of practical methods for technical debt management in agile software development. proceedings of the winter simulation conference 2014. https://doi.org/10.1109/wsc.2014.7019961
guo, y., & seaman, c. (2011). a portfolio approach to technical debt management.
guo, y., spínola, r. o., & seaman, c. (2014). exploring the costs of technical debt management – a case study. empirical software engineering, 21(1), 159–182. https://doi.org/10.1007/s10664-014-9351-7
haas, r., niedermayr, r., & juergens, e. (2019). teamscale: tackle technical debt and control the quality of your software. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 55–56. https://doi.org/10.1109/techdebt.2019.00016
holvitie, j., licorish, s. a., & leppanen, v. (2016). modelling propagation of technical debt. proceedings 42nd euromicro conference on software engineering and advanced applications, seaa 2016, 54–58. https://doi.org/10.1109/seaa.2016.53
iovino, l., di salle, a., di ruscio, d., & pierantonio, a. (2020). metamodel deprecation to manage technical debt in model co-evolution. proceedings 23rd acm/ieee international conference on model driven engineering languages and systems, models-c 2020 companion proceedings, 306–315. https://doi.org/10.1145/3417990.3419625
izurieta, c., ozkaya, i., seaman, c., kruchten, p., nord, r., snipes, w., & avgeriou, p. (2016). perspectives on managing technical debt: transition point and roadmap from dagstuhl. ceur workshop proceedings, 1771, 84–87.
izurieta, c., rice, d., kimball, k., & valentien, t. (2018). a position study to investigate technical debt associated with security weaknesses. proceedings of the 2018 international conference on technical debt. https://doi.org/10.1145/3194164
izurieta, c., rojas, g., & griffith, i. (2015). preemptive management of model driven technical debt for improving software quality. proceedings of the 11th international acm sigsoft conference on quality of software architectures. https://doi.org/10.1145/2737182
izurieta, c., vetrò, a., zazworka, n., cai, y., seaman, c., & shull, f. (2012). organizing the technical debt landscape. 2012 3rd international workshop on managing technical debt, mtd 2012 proceedings, 23–26. https://doi.org/10.1109/mtd.2012.6225995
kontsevoi, b., soroka, e., & terekhov, s. (2019). tetra, as a set of techniques and tools for calculating technical debt principal and interest. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 64–65. https://doi.org/10.1109/techdebt.2019.00021
kosti, m. v., ampatzoglou, a., chatzigeorgiou, a., pallas, g., stamelos, i., & angelis, l. (2017). technical debt principal assessment through structural metrics. proceedings 43rd euromicro conference on software engineering and advanced applications, seaa 2017, 329–333. https://doi.org/10.1109/seaa.2017.59
kouros, p., chaikalis, t., arvanitou, e. m., chatzigeorgiou, a., ampatzoglou, a., & amanatidis, t. (2019). jcaliper: search-based technical debt management. proceedings of the acm symposium on applied computing, part f147772, 1721–1730. https://doi.org/10.1145/3297280.3297448
lahti, j. r., tuovinen, a. p., & mikkonen, t. (2021). experiences on managing technical debt with code smells and antipatterns. proceedings 2021 ieee/acm international conference on technical debt, techdebt 2021, 36–44. https://doi.org/10.1109/techdebt52882.2021.00013
lenarduzzi, v., besker, t., taibi, d., martini, a., & arcelli fontana, f. (2021). a systematic literature review on technical debt prioritization: strategies, processes, factors, and tools. journal of systems and software, 171, 110827. https://doi.org/10.1016/j.jss.2020.110827
lenarduzzi, v., martini, a., taibi, d., & tamburri, d. a. (2019). towards surgically-precise technical debt estimation: early results and research roadmap. maltesque 2019 proceedings of the 3rd acm sigsoft international workshop on machine learning techniques for software quality evaluation, co-located with esec/fse 2019, 37–42. https://doi.org/10.1145/3340482.3342747
li, z., avgeriou, p., & liang, p. (2015). a systematic mapping study on technical debt and its management. journal of systems and software, 101, 193–220. https://doi.org/10.1016/j.jss.2014.12.027
macit, y., giray, g., & tüzün, e. (2020). methods for identifying architectural debt: a systematic mapping study. 2020 turkish national software engineering symposium, uyms 2020 proceedings. https://doi.org/10.1109/uyms50627.2020.9247070
malakuti, s., & heuschkel, j. (2021). the need for holistic technical debt management across the value stream: lessons learnt and open challenges. proceedings 2021 ieee/acm international conference on technical debt, techdebt 2021, 109–113. https://doi.org/10.1109/techdebt52882.2021.00021
martini, a. (2018). anacondebt: a tool to assess and track technical debt. proceedings of the 2018 international conference on technical debt. https://doi.org/10.1145/3194164
martini, a., besker, t., & bosch, j. (2016). the introduction of technical debt tracking in large companies. proceedings asia-pacific software engineering conference, apsec, 0, 161–168. https://doi.org/10.1109/apsec.2016.032
martini, a., & bosch, j. (2016). an empirically developed method to aid decisions on architectural technical debt refactoring: anacondebt. proceedings international conference on software engineering, 31–40. https://doi.org/10.1145/2889160.2889224
martini, a., & bosch, j. (2017a). the magnificent seven: towards a systematic estimation of technical debt interest. proceedings of the xp2017 scientific workshops. https://doi.org/10.1145/3120459
martini, a., & bosch, j. (2017b). revealing social debt with the caffea framework: an antidote to architectural debt. proceedings 2017 ieee international conference on software architecture workshops, icsaw 2017: side track proceedings, 179–181. https://doi.org/10.1109/icsaw.2017.42
mendes, e., wohlin, c., felizardo, k., & kalinowski, m. (2020). when to update systematic literature reviews in software engineering. journal of systems and software, 167, 110607. https://doi.org/10.1016/j.jss.2020.110607
mera-gómez, c., bahsoon, r., & buyya, r. (2016). elasticity debt: a debt-aware approach to reason about elasticity decisions in the cloud. proceedings of the 9th international conference on utility and cloud computing. https://doi.org/10.1145/2996890
mohan, m., greer, d., & mcmullan, p. (2016). technical debt reduction using search based automated refactoring. journal of systems and software, 120, 183–194. https://doi.org/10.1016/j.jss.2016.05.019
nepomuceno, v., & soares, s. (2019). on the need to update systematic literature reviews. information and software technology, 109, 40–42. https://doi.org/10.1016/j.infsof.2019.01.005
nielsen, m. e., østergaard madsen, c., & lungu, m. f. (2020). technical debt management: a systematic literature review and research agenda for digital government. lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), 12219 lncs, 121–137. https://doi.org/10.1007/978-3-030-57599-1_10
nielsen, m. e., & skaarup, s. (2021). it portfolio management as a framework for managing technical debt. 14th international conference on theory and practice of electronic governance. https://doi.org/10.1145/3494193
oliveira, f., goldman, a., & santos, v. (2015). managing technical debt in software projects using scrum: an action research. proceedings 2015 agile conference, agile 2015, 50–59. https://doi.org/10.1109/agile.2015.7
pacheco, a., marín-raventós, g., & lópez, g. (2018). designing a technical debt visualization tool to improve stakeholder communication in the decision-making process: a case study. lecture notes in business information processing, 327, 15–26. https://doi.org/10.1007/978-3-319-99040-8_2
perez, b., correal, d., & astudillo, h. (2019). a proposed model-driven approach to manage architectural technical debt life cycle. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 73–77. https://doi.org/10.1109/techdebt.2019.00025
petersen, k., feldt, r., mujtaba, s., & mattsson, m. (2008). systematic mapping studies in software engineering. 12th international conference on evaluation and assessment in software engineering, ease 2008. https://doi.org/10.14236/ewic/ease2008.8
plösch, r., bräuer, j., saft, m., & körner, c. (2018). design debt prioritization: a design best practice-based approach. proceedings of the 2018 international conference on technical debt, 18. https://doi.org/10.1145/3194164
ramasubbu, n., & kemerer, c. f. (2019). integrating technical debt management and software quality management processes: a normative framework and field tests. ieee transactions on software engineering, 45(3), 285–300. https://doi.org/10.1109/tse.2017.2774832
reboucas de almeida, r. (2019). business-driven technical debt prioritization. proceedings 2019 ieee international conference on software maintenance and evolution, icsme 2019, 605–609. https://doi.org/10.1109/icsme.2019.00096
reboucas de almeida, r., kulesza, u., treude, c., cavalcanti feitosa, d., & lima, a. h. g. (2018). aligning technical debt prioritization with business objectives: a multiple-case study. proceedings 2018 ieee international conference on software maintenance and evolution, icsme 2018, 655–664. https://doi.org/10.1109/icsme.2018.00075
reboucas de almeida, r., treude, c., & kulesza, u. (2019). tracy: a business-driven technical debt prioritization framework. proceedings 2019 ieee international conference on software maintenance and evolution, icsme 2019, 181–185. https://doi.org/10.1109/icsme.2019.00028
ribeiro, l. f., alves, n. s. r., de mendonca neto, m. g., & spinola, r. o. (2017). a strategy based on multiple decision criteria to support technical debt management. proceedings 43rd euromicro conference on software engineering and advanced applications, seaa 2017, 334–341. https://doi.org/10.1109/seaa.2017.37
rindell, k., bernsmed, k., & gilje jaatun, m. (2019). managing security in software or: how i learned to stop worrying and manage the security technical debt. acm international conference proceeding series. https://doi.org/10.1145/3339252.3340338
rios, n., mendonça neto, m. g. de, & spínola, r. o. (2018). a tertiary study on technical debt: types, management strategies, research trends, and base information for practitioners. information and software technology, 102, 117–145. https://doi.org/10.1016/j.infsof.2018.05.010
rios, n., spinola, r. o., de mendonça neto, m. g., & seaman, c. (2019). supporting analysis of technical debt causes and effects with cross-company probabilistic cause-effect diagrams. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 3–12. https://doi.org/10.1109/techdebt.2019.00009
rios, n., spínola, r. o., mendonça, m., & seaman, c. (2020). the practitioners’ point of view on the concept of technical debt and its causes and consequences: a design for a global family of industrial surveys and its first results from brazil. empirical software engineering, 25(5), 3216–3287. https://doi.org/10.1007/s10664-020-09832-9
shapochka, a., & omelayenko, b. (2016). practical technical debt discovery by matching patterns in assessment graph. proceedings 2016 ieee 8th international workshop on managing technical debt, mtd 2016, 32–35. https://doi.org/10.1109/mtd.2016.7
sharma, t. (2019). how deep is the mud: fathoming architecture technical debt using designite. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 59–60. https://doi.org/10.1109/techdebt.2019.00018
snipes, w., & ramaswamy, s. (2018). a proposed sizing model for managing 3rd party code technical debt. proceedings of the 2018 international conference on technical debt, 18. https://doi.org/10.1145/3194164
stochel, m. g., cholda, p., & wawrowski, m. r. (2020). continuous debt valuation approach (codva) for technical debt prioritization. proceedings 46th euromicro conference on software engineering and advanced applications, seaa 2020, 362–366. https://doi.org/10.1109/seaa51224.2020.00066
tom, e., aurum, a., & vidgen, r. (2013). an exploration of technical debt.
journal of systems and software, 86(6), 1498–1516. https://doi.org/10.1016/j.jss.2012.12.052 tornhill, a. (2018). assessing technical debt in automated tests with codescene. proceedings 2018 ieee 11th international conference on software testing, verification and validation workshops, icstw 2018, 122– 125. https://doi.org/10.1109/icstw.2018.00039 tornhill, a., & ab, e. (n.d.). prioritize technical debt in large-scale systems using codescene. proceedings of the 2018 international conference on technical debt, 18. https://doi.org/10.1145/3194164 trumler, w., & aulisch, . ( ). how “ pecification by ample” and test-driven development help to avoid technial debt. proceedings 2016 ieee 8th international workshop on managing technical debt, mtd 2016, 1–8. https://doi.org/10.1109/mtd.2016.10 vidal, s., vazquez, h., diaz-pace, j. a., marcos, c., garcia, a., & oizumi, w. (2016). jspirit: a flexible tool for the analysis of code smells. proceedings international conference of the chilean computer science society, sccc, 2016-february. https://doi.org/10.1109/sccc.2015.7416572 von zitzewitz, a. (2019). mitigating technical and architectural debt with sonargraph. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 66–67. https://doi.org/10.1109/techdebt.2019.00022 wiese, m., riebisch, m., & schwarze, j. (2021). preventing technical debt by technical debt aware project management. proceedings 2021 ieee/acm international conference on technical debt, techdebt 2021, 84–93. https://doi.org/10.1109/techdebt52882.2021.000 18 wolfart, d., assunção, w. k. g., & martinez, j. (2021). variability debt: characterization, causes and consequences. sbqs ’21: proceedings of the xx brazilian symposium on software quality. https://doi.org/10.1145/3488042.3488048 xiao, l., cai, y., kazman, r., mo, r., & feng, q. (2016). identifying and quantifying architectural debt. proceedings of the 38th international conference on software engineering. https://doi.org/10.1145/2884781 xiao, t., wang, d., mcintosh, s., hata, h., kula, r. g., ishio, t., & matsumoto, k. (2021). characterizing and mitigating self-admitted technical debt in build systems. ieee transactions on software engineering, 1–1. https://doi.org/10.1109/tse.2021.3115772 yli-huumo, j., maglyas, a., smolander, k., haller, j., & törnroos, h. (2016). developing processes to increase technical debt visibility and manageability – an action research study in industry. lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), 10027 lncs, 368–378. https://doi.org/10.1007/978-3-319-49094-6_24 zampetti, f., serebrenik, a., & di penta, m. (2020). automatically learning patterns for self-admitted technical debt removal. saner 2020 proceedings of the 2020 ieee 27th international conference on software analysis, evolution, and reengineering, 355–366. https://doi.org/10.1109/saner48275.2020.9054868 journal of software engineering research and development, 2019, 7:8, doi: 10.5753/jserd.2019.155 this work is licensed under a creative commons attribution 4.0 international license.. 
on the contributions of non-technical stakeholders to describing ux requirements by using proto-persona+
eduardo pinheiro [ universidade federal de são carlos | edu.g.pinheiro@gmail.com ]
larissa lopes [ universidade federal de são carlos | larii.albano@gmail.com ]
tayana conte [ universidade federal do amazonas | tayana@icomp.ufam.edu.br ]
luciana zaina [ universidade federal de são carlos | lzaina@ufscar.br ]
abstract
context: the requirements elicitation phase of software development investigates both functional and user experience (ux) requirements. proto-persona is a technique that encourages attention to the needs of a group of users. usually, the elaboration of proto-personas is done by software specialists and technical stakeholders without the participation of non-technical stakeholders. however, non-technical stakeholders often have deep knowledge about the target users. objective: this work investigates the contributions that non-technical stakeholders bring to the specification of ux requirements when they use the proto-persona+ technique. to achieve our objective, we extended the original proto-persona technique, creating proto-persona+. we also explored the construction of proto-persona+ artifacts and their use in prototyping solutions. method: we conducted an empirical study in two rounds, wherein we analyzed and compared the contributions of technical and non-technical stakeholders to the specification of ux requirements. in the first round, 8 non-technical and 5 technical stakeholders built proto-personas+. in the second round, 36 software developers worked in pairs to create low-fidelity prototypes using the information provided by the proto-persona+ artifacts. for both rounds, we conducted a qualitative analysis exploring which ux requirements were described and used. results: the results revealed that although both types of stakeholders wrote details of ux requirements on the artifact, they did so from different and complementary perspectives. we also observed that the proto-persona+ artifacts produced by both types of stakeholders were used in the prototyping activity. conclusion: our study indicates that non-technical stakeholders are able to contribute to the specification of ux requirements and that proto-persona+ is a suitable artifact to promote such activity. the details described by non-technical stakeholders brought new and different contributions compared to those described by the technical stakeholders. from the results of the first round, we concluded that the non-technical stakeholders elicited requirements that impact accessibility and fun. from the findings of the second round, we concluded that the ux requirements provided by both types of stakeholders allowed the developers to build more comprehensive and minimalist user interface prototypes.
keywords: non-technical stakeholder, proto-personas, requirement engineering, ux requirements
1 introduction
requirements elicitation is widely discussed in software engineering. the challenges of this important area of software development range from technical aspects (e.g., use of appropriate tools) to human aspects (e.g., the types of stakeholders involved in the process), sharma and pandey (2014); aranda et al. (2016); abelein et al. (2013); hadar et al. (2014).
some works have highlighted that the involvement of end-users in the elicitation process can bring important contributions to software construction and that, consequently, this affects user satisfaction with the software, berti et al. (2004); maceli and atwood (2011). additionally, these authors stated that the process of requirements elicitation can be enriched not only by the participation of end-users but also by including different stakeholders in the process. non-technical stakeholders are recognized as those who are not part of the software team, hadar et al. (2014). these stakeholders can be, for instance, professionals who have close contact with the end-users. they often have considerable knowledge about the audience and the domain of the application, aranda et al. (2016). during the elicitation process, a diversity of types of requirements can arise. non-functional software requirements, such as usability and user experience, are linked to quality-related requirements; therefore, they can impact software acceptance by end-users, de la vara et al. (2011); palomares et al. (2017). nielsen and norman (2013) define user experience (ux) from a holistic perspective: "user experience encompasses all aspects of the end-users interaction with the company, its services, and its products". in a more pragmatic definition, garrett (2010) states that for a product to provide a good user experience, the software developers have to pay attention to what the product does and how it does it. considering both definitions above, we can affirm that the elicitation of ux requirements involves gathering aspects and characteristics of the end-user and the product. these requirements should assist the technical stakeholders (i.e., software experts) in designing and developing software that has good acceptance and brings value to end-users, brown et al. (2011); kashfi et al. (2017). technical stakeholders can be supported by several techniques and methods for eliciting ux requirements. for this purpose, questionnaires, interviews, as well as techniques and methods from the human-computer interaction (hci) area can be applied, garcia et al. (2017); brown et al. (2011). personas are artifacts that have been applied to support software teams in both activities, the elicitation and the use of ux requirements, ferreira et al. (2015). the technique to create personas follows a process that analyzes end-user data. the persona artifact generated from the technique consists of a fictional character that represents a group of real users of the system and their relevant characteristics within a given software domain, gothelf (2012); grudin (2006); cooper et al. (2014). additionally, personas are useful for establishing an empathy relationship between technical stakeholders and end-users, grudin and pruitt (2002); billestrup et al. (2014). however, the application of personas can be onerous and costly to the team. by the classical definition, a persona is created by analyzing a significant amount of data regarding end-users, which requires extensive research and data collection, billestrup et al. (2014).
gothelf proposes a new approach to elaborating personas called proto-persona (also known as lean persona), gothelf (2012); gothelf and seiden (2013). rather than using the classical technique for creating personas, gothelf's proposal considers the prior knowledge that stakeholders have about end-users and the software domain in question. the technique to construct proto-personas recognizes that these stakeholders are able to build a sketch of a persona from assumptions based on their knowledge about a given domain. the technique of constructing proto-personas provides a practical way to gather the knowledge that the stakeholders have about end-users. however, the author recommends that the proto-persona artifact be validated later by conducting end-user research, gothelf (2012). usually, technical stakeholders work on the development of a diversity of software, which can make it difficult to obtain in-depth knowledge about different software domains. furthermore, non-technical stakeholders (i.e., those who are not part of the software team) are the ones who have knowledge about a given domain and can provide relevant information about end-users and the aspects of their interaction with the software. considering the aforementioned discussion, we decided to study the use of proto-personas to elicit ux requirements from the perspective of non-technical stakeholders. our study focused on investigating how non-technical stakeholders contribute to the requirements elicitation activity. to do this, we selected the proto-persona technique, which produces a lean artifact that can be easily used by this type of stakeholder. the intention of this study is not to compare different persona techniques, but to collect evidence about the potential of the proto-persona technique for the purpose of eliciting requirements. to support our study, we extended the proto-persona technique proposed by gothelf (2012) and gothelf and seiden (2013), creating the proto-persona+. developers frequently report that they struggle with how to arrange the information of a persona, billestrup et al. (2014). considering the difficulties that the participants would have in handling the proto-persona technique, in our extension we included a new template and guide questions that support the individuals who will use the proto-persona+. the construction of the proto-personas is supported by the template and the questions, which guide the participants in writing the personas. following the basis of gothelf's proposal, gothelf (2012), the template outlines the important points that individuals should consider during the design of the proto-personas. to conduct our study, we defined three research questions (rqs): (rq1) which ux requirements do non-technical stakeholders describe while using the proto-persona technique?; (rq2) what is the acceptance of the proto-persona+ technique by these stakeholders?; and (rq3) which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces?. we conducted two rounds of experimental studies. first, we explored the use of the technique to construct the proto-persona+ artifact with the participation of 8 non-technical stakeholders and 5 technical stakeholders; from this round, we could answer rq1 and rq2.
to answer rq3, we invited 36 software developers to design user interface prototypes using the proto-persona+ artifacts created in the first round. the participants worked in pairs and produced 18 user interface prototypes in total. from this second round, we were able to examine the use of the proto-personas+ previously developed by the different stakeholders (i.e., technical and non-technical). this paper presents these two rounds in detail and discusses the results. in this paper, we extend our previous results presented at the brazilian symposium on software engineering in 2018. in the earlier version, we discussed the results regarding rq1 and rq2. in this version, we added a new perspective of analysis (related to rq3) that enriched our findings regarding the contributions of non-technical stakeholders and the potential of proto-persona to support the elicitation of ux requirements. the process of selecting individuals to participate in the study as non-technical stakeholders was directly related to the domain that our research focused on. our domain was defined as applications to support e-learning, and consequently, pedagogues were the non-technical stakeholders (in brazil, pedagogues are professionals who are responsible for the education of children in elementary schools; they obtain their degrees by attending a pedagogy course). as our research group has experience in the development of applications in the e-learning domain, we had created a network of contacts with pedagogues (i.e., potential non-technical stakeholders), which was the key factor in our choice. the study allowed us to observe how the stakeholders described ux requirements by applying the proto-persona+ technique and how these artifacts were used to design software solutions. our main contribution is the discussion of the feasibility of introducing the non-technical stakeholder as an active agent in the specification of ux requirements through the use of the proto-persona technique. our study examines not only the construction of the artifacts but also their use in the elaboration of software. the rest of the paper is organized as follows: section 2 presents the fundamentals and related work; the proto-persona+ artifact is presented in section 3; the domain selected for the study and the scenario we applied are explained in section 4; the details of the first round of investigation are discussed in section 5 and its results in section 6; the second round of investigation and its results are presented in section 7 and section 8, respectively; in section 9 we return to our research questions to point out the important results and compare them with the literature; the main limitations of our study are pointed out in section 10; and finally, section 11 presents the conclusion and future work.
2 fundamentals and related work
requirements elicitation can be considered a complex task that often requires the participation of different stakeholders. these stakeholders contribute different knowledge to this process, fernandez and wagner (2015). recently, the identification of ux requirements during requirements elicitation has become a trend, castro et al. (2008); ferreira et al. (2015, 2018a); choma et al. (2016a,b).
personas allow the production of artifacts in which ux-related issues, such as personal characteristics, needs, and restrictions of end-users, are described, cooper et al. (2014). personas are recognized as important artifacts by academics and practitioners alike, billestrup et al. (2014). they can support teams during software development by providing important insights about end-users, ferreira et al. (2018b). another benefit of this technique is to place the user at the center of the development process, which keeps the teams informed about end-users' requirements. frequently, software teams have personal assumptions about end-users' characteristics that may differ from the users' needs in real life, jansen et al. (2017). the team can predict user behavior from a more pragmatic perspective by using personas in their activities. therefore, personas play the role of developing the empathy of developers toward end-users, cooper et al. (2014); grudin (2006). alves and ali (2018) applied goal-oriented requirements engineering (gore) together with the personas technique to enrich the specification of human-factor requirements. gore focuses on fulfilling demands regarding business goals. the authors stated that by including personas in the process, they could improve the specification of the users' needs in the software with more assertive and specific details. consequently, they could satisfy the needs of groups of real end-users. gothelf (2012) proposes proto-personas as a technique in which the domain-specific knowledge that specialists have about the audience is used to describe personas. the technique runs as a series of brainstorming sessions, osborn (1979), wherein each participant (i.e., a specialist) proposes personas individually. in the next step, these initial proposals are refined by all the participants in the session until they produce a maximum of four personas that represent the target audience. afterwards, the software teams apply these sketches of personas during software development. these sketches can be validated in future development cycles. the proto-persona technique has the main goal of capturing the knowledge of the experts and using it in the writing of the proto-persona artifact. this artifact can aid teams in kicking off a discussion about the user in the early development phases (e.g., the design phase). in the work of anvari et al. (2015), the traditional persona technique was used to capture the emotional characteristics of users. the authors' intention was to verify whether the developers could see the differences among the characteristics of the personas and whether these differences had some influence during the software design. results revealed that most participants noticed the variance in the details of those personas and reported that the artifacts helped them in designing the software. ferreira et al. (2015) proposed pathy, a technique that adapts the empathy map to the construction of personas. the empathy map provides a different way of building personas, wherein the focus is on establishing an empathy relationship between end-users and developers. pathy provides a set of questions that drive the software engineer in elaborating the artifact. the technique includes the specification of user characteristics as well as of other software features. subsequently, ferreira et al. (2018b) investigated the feasibility of combining the pathy technique with user stories to support software development.
the results suggested that pathy helps the team in understanding the context of use, identifying potential software requirements, and integrating personas into the design and development process. bhattarai et al. (2016) applied the proto-persona technique to the construction of user profiles. the experience was conducted in several sessions with the participation of different developers. the findings showed that proto-persona supports teams in aligning their points of view about the software with a set of testable hypotheses about consumers or end-users. kortbeek (2016) presented an experience of using the gothelf technique to build and communicate hypotheses about a user in an industry context. later, in order to verify whether the hypotheses reflected information about the end-users, interviews were conducted with users who had the same characteristics found in the proto-persona. unlike previous works, this paper not only presents the application of the proto-persona+ technique but also discusses the contribution that the non-technical stakeholder brings to the specification of ux requirements while using the proto-persona technique. to the best of our knowledge, no prior research has investigated the contribution of non-technical stakeholders in elaborating personas. however, there are other works regarding the participation of non-technical stakeholders in different requirements engineering tasks, mainly in end-user development or end-user software engineering contexts. berti et al. (2004) discuss how scenarios and sketches can be used to capture informal input from end-user developer stakeholders. faily (2008) presents a case study where end-user developers obtained practical benefit by adopting professional requirements engineering practices. maceli and atwood (2011) claim that people need to be involved in software design, not just as workers, but as people who bring their entire life experience into the design. they identified some principles for participatory co-design, and they described guidelines to help achieve these principles.
3 proto-persona+
we chose gothelf's proto-persona technique, gothelf (2012) and gothelf and seiden (2013), to conduct our study. however, in this work we made some improvements to gothelf's original proposal that resulted in a new version of proto-personas, which we named proto-persona+. proto-persona+ extends the original proto-persona by adding a set of guideline questions that aid the stakeholders in producing the proto-persona artifacts. we considered this adaptation fundamental to support non-technical stakeholders. the main difference between the traditional persona, cooper et al. (2014), and the proto-persona is the order in which the construction steps are performed. the building of a traditional persona begins with wide demographic research about end-users. in contrast, the proto-persona elaboration is not driven by data collected directly from users; it is constructed based on the knowledge that specialists have of the domain, gothelf and seiden (2013). according to gothelf and seiden (2013), the design of proto-personas starts with assumptions about potential personas, and the validation of these assumptions is performed afterwards. additionally, the whole team contributes to the process of proto-persona creation by providing their premises about the end-users.
as the team members participate actively, this process becomes an effective way to create a shared understanding of the end-users' needs and characteristics. as a consequence, a feeling of empathy toward the end-users develops among the team members. proto-persona produces a lean artifact, which is seen as one of the advantages of the technique. the artifact focuses on delivering only the relevant information about end-users, gothelf and seiden (2013). after examining different proto-persona templates, we concluded that by joining different parts we could provide a better way to use the artifact. the mix of templates aids the stakeholders in describing ux requirements while keeping the concise format of the proto-personas. we considered two templates proposed by gothelf, wherein the information is reported in four quadrants. proposal (a) has two quadrants in which demographic information and the characterization of users (e.g., what the user looks like, the individual's name, and attributes that define the user) are described. the other two quadrants refer to attitudes (e.g., life history, routine, habits) and needs (e.g., what motivates them, what they do daily), gothelf (2012). in proposal (b), the first quadrant outlines the persona's name and their role in the software, the second describes basic demographic information, the third informs the needs and frustrations of users about a product, and the fourth reports potential solutions that can fulfill the needs of the users, gothelf and seiden (2013). after analyzing the similarities and differences in both of gothelf's proposals, we rearranged the quadrants to give a new shape to proto-persona+. table 1 presents its objectives and their relationship with gothelf's models. different from other templates, proto-persona+ provides a set of guideline questions to aid the stakeholder during its elaboration. we decided to add guideline questions because professionals claim that persona is a difficult technique to handle, billestrup et al. (2014). those responsible for the creation of proto-personas+ fill the template by answering the guideline questions. however, it is not mandatory to answer all the questions to use our proposal. finally, the proto-persona+ proposal is flexible, allowing the guideline to be extended with other questions in future research. some questions may be more or less related to the domain in which the study is run. the flexibility of adapting the set of questions can improve the potential of proto-persona+ to capture relevant knowledge from the different types of stakeholders in a particular domain.
[figure 1. proto-persona+: template and guideline questions]
table 1. proto-persona+: purpose of the quadrants
(q1) objective: provides the persona characterization and relevant information about the individuals that impacts the software development. relation to gothelf's proposals: joins the two demographic quadrants of proposal a and quadrants 1 and 2 of proposal b.
(q2) objective: provides details of what users need to reach their objective while using the software. relation: based on the quadrant about user needs in proposal a and on parts of quadrant 3 of proposal b.
(q3) objective: points out how users like to accomplish the steps to fulfill their objectives, with a description that focuses on the content and the interaction types that they prefer. relation: based on the quadrant about attitude from proposal a and on some parts of quadrant 4 of proposal b.
(q4) objective: describes the difficulties faced by the user while interacting with the software and identifies the potential frustrations that could arise during software use. relation: refined from quadrant 3 of proposal b.
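to make the template concrete, here is a minimal sketch of a proto-persona+ as a data structure: four quadrants, each carrying its guideline questions (taken from table 3) and optional free-text answers. the sketch is ours, not tooling from the study (the template was filled on paper); the labels for q3 and q4 are our shorthand, and the sample persona is invented for the virtual museum scenario.

```python
from dataclasses import dataclass, field

@dataclass
class Quadrant:
    """one quadrant of the proto-persona+ template."""
    objective: str                     # what the quadrant should capture (table 1)
    questions: list[str]               # guideline questions (table 3)
    answers: dict[str, str] = field(default_factory=dict)  # question -> free text

@dataclass
class ProtoPersonaPlus:
    name: str
    q1_demographic_data: Quadrant
    q2_objectives_and_necessities: Quadrant
    q3_attitudes: Quadrant             # q3/q4 labels are our shorthand
    q4_difficulties_and_frustrations: Quadrant

    def unanswered(self) -> list[str]:
        """answering every question is optional; list the ones left blank."""
        quads = (self.q1_demographic_data, self.q2_objectives_and_necessities,
                 self.q3_attitudes, self.q4_difficulties_and_frustrations)
        return [q for quad in quads for q in quad.questions if q not in quad.answers]

persona = ProtoPersonaPlus(
    name="student visiting the virtual museum",
    q1_demographic_data=Quadrant(
        "persona characterization and relevant information about the individual",
        ["who are they?", "what are their ages?", "what are their school levels?"],
        {"what are their ages?": "9 to 11 years old"}),
    q2_objectives_and_necessities=Quadrant(
        "what users need to reach their objective while using the software",
        ["what do they want to accomplish?",
         "what do they need to reach their objective?"]),
    q3_attitudes=Quadrant(
        "how users like to accomplish the steps that fulfill their objectives",
        ["what do they like?", "what are they better at doing?",
         "how do they like to do it?"]),
    q4_difficulties_and_frustrations=Quadrant(
        "difficulties and potential frustrations during software use",
        ["what are the difficulties they can face?", "what frustrates them?",
         "what are the known issues that affect their interaction?"]),
)
print(persona.unanswered())  # 10 of the 11 guideline questions still open
```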
4 study context
before starting our study, we decided to focus on a particular domain area. our research group has worked on software development to support the e-learning area. consequently, we have several contacts with non-technical stakeholders in this field. e-learning is the term that defines the use of electronic systems in the context of learning, applied both to in-class and to distance courses, clark and mayer (2007). in our study, we took the m-learning area, which is a subset of the e-learning domain. m-learning applications allow the interaction of students and teachers in a learning environment through the use of mobile devices and the internet, dodero et al. (2014). although several companies around the world have demonstrated interest in the development of applications for educational purposes, software teams often face difficulties in the m-learning domain, filho and barbosa (2013); chimalakonda and nori (2013). in addition to the common issues that arise during software development for mobile devices, filho and barbosa (2013), the m-learning domain demands close work with different domain stakeholders (e.g., teachers, government regulators, designers of learning contents) to capture the knowledge they have, dodero et al. (2014). for our study, we used a scenario of an application within the m-learning domain. we chose an application of a virtual museum that would aid in the learning of history and arts. it was part of a project that the research group was developing. the scenario is described as follows: "an interactive museum is adopted by an elementary school to support the learning of students aged 9 to 11 in history and arts. the museum's collection comprises several galleries that deliver the artworks in different formats (e.g., games, videos, images, texts). access to the museum will be facilitated through a mobile application that should provide a variety of options for student interaction (e.g., speech recognition, touchscreen, and recognition of gestures) with the aim of being comprehensive to the public."
5 first round: using proto-persona+
5.1 planning
the first round of our study had the goal of answering rq1 and rq2. therefore, we analyzed whether non-technical stakeholders could describe ux requirements by using the proto-persona technique (i.e., proto-persona+). additionally, we verified the acceptance of this technique. to do this, we compared the artifacts produced by technical and non-technical stakeholders, i.e., software engineers and pedagogues, respectively, looking for evidence of ux requirements. our analysis focused on exploring qualitative data by examining the descriptions presented in the proto-persona+ artifacts. quantitative descriptive data were used only to illustrate the acceptance of the artifacts from the perspective of the participants. the first round was conducted in five steps. before the conduction, the participants filled in (i) a profile questionnaire. then, we carried out (ii) a training session presenting the key concepts of the study to level the participants' knowledge before performing the activity.
to complement the training, (iii) a hands-on exercise was applied using an m-learning scenario different from the scenario of the study. then, (iv) the activity of elaborating the proto-personas+ was performed. finally, the participants (v) completed a questionnaire on the acceptance of using the proto-persona+. a set of artifacts to support the steps above was prepared. besides demographic information, the profile questionnaire (i) had questions to capture the participants' prior knowledge about m-learning applications. a consent form, wherein the participants agreed to the use of their data for academic purposes, was also prepared. a set of slides presenting concepts about personas and m-learning was designed to be used in the 15-minute training session (ii). for the hands-on exercise (iii), a scenario of an m-learning application was used. from this exercise, the participants could have contact with the proto-persona+ template. after performing these steps, the experiment to construct the proto-persona+ artifacts was conducted within a period of 40 minutes (iv). upon completion, the participants answered the acceptance questionnaire (v) on the proposal of proto-persona+, indicating their opinions and suggestions.
5.2 execution
the experiment was performed on two different days for the groups of technical and non-technical stakeholders. the study followed the steps that were planned and was conducted in the same physical space, a classroom at ufscar sorocaba. all the participants signed the consent form and declared that they had used e-learning software at least once. a total of thirteen stakeholders participated: eight undergraduate students of a pedagogy course, who represented the non-technical stakeholders (i.e., pedagogues, ped), and five students of computer science courses, four bachelor's students and one postgraduate, who represented the technical stakeholders (i.e., software engineers, eng). participants built the proto-personas+ individually. each participant generated at least one and at most four artifacts. in total, 22 proto-personas+ were designed, 11 created by pedagogues and 11 by software engineers. the participants did not receive any recommendations or restrictions about the number of proto-personas they should produce. participants were encouraged to construct as many proto-personas+ as they considered appropriate to characterize the end-users in the virtual museum scenario.
5.3 analysis
a qualitative analysis was performed in two stages on the 22 artifacts produced by the participants. first, the proto-personas+ were evaluated to identify whether they reported ux requirements. then, we conducted an analysis of the results of the first stage to find out the focus of these requirements. as ux has several definitions in the literature, the researchers could have different interpretations regarding what constitutes a ux description. to avoid different interpretations, the authors of this article decided to create an instrument to guide the data analysis. the instrument was based on a compilation of a set of ux dimensions. the works of winckler et al. (2013) and ardito et al. (2006) gave us the grounds to select and compile the ux dimensions. we selected these works as the basis of our dimensions because they discuss ux in the two areas our study focused on: mobile, with the work of winckler et al. (2013), and e-learning applications, with the work of ardito et al. (2006).
to define the dimensions, we examined the similarities between the dimensions described in the two works and selected those that attended to the particularities of our study domain. the dimensions of stimulus and value were selected from winckler et al. (2013). the work of ardito et al. (2006) presents a set of heuristics for evaluating e-learning applications and a methodology for using such heuristics. from this work, four dimensions were chosen: access, media, organization, and interaction. as a result, six dimensions were outlined and considered in our analysis. these dimensions focused our search on the types of ux requirements we had to look for in the proto-personas. to keep the researchers' attention on the same ux definitions, we wrote in the guide the meaning of each dimension in detail. the access dimension covers the aspects of technology and its quality of use; media specifies which media support the communication, considering the e-learning context; organization focuses on how learning contents and navigation are arranged; stimulus examines the motivations that lead the participants to engage in the interaction, and encompasses impressions and opportunities for use; value explores what the use of the product brings to the students' learning; and interaction focuses on the results that each type of interaction delivers to the student. considering the dimensions, first, four researchers searched for evidence of ux requirements in each proto-persona+. this first step was carried out by three master's students in software engineering (se) and human-computer interaction (hci) and an undergraduate student in computer science with experience in hci. after examining an artifact, the researcher had to assign labels to it. the labels indicated to what degree the ux dimensions were fulfilled, considering the description found in the artifact. these degrees were classified into three levels (fulfilled completely, fulfilled widely, and fulfilled partially). besides, the researchers took notes to justify their rationale for assigning one or another classification to each dimension. when the researchers did not assign any degree, they did not make notes. the researchers examined each artifact as a whole because the information in one quadrant of the proto-persona+ was complementary to the others. each researcher analyzed 11 artifacts: 2 researchers evaluated 5 proto-personas+ of pedagogues and 6 of software engineers, and 2 others evaluated 6 artifacts of pedagogues and 5 of engineers. in a second pass, two senior researchers in se and hci revisited the data and refined the results. taking into account the results of the first stage, the first author of this article performed a new qualitative analysis. for this, the open coding technique was used, strauss and corbin (1998). open coding relates codes to chunks of text. these codes receive denominations that give a certain significance to the chunks of text they refer to, strauss and corbin (1998). subsequently, these codifications were revisited and grouped when patterns of information were identified. for instance, the code "interface" could be assigned to chunks of text that report information on the user interface. during the coding process, codes were assigned to parts of the notes written by the researchers. then, this set of codes was re-analyzed to search for patterns of information.
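the two analysis stages can be pictured as a small data pipeline: per-dimension degree labels with justification notes (stage 1), then open coding over those notes (stage 2). the sketch below is illustrative only; the records, the code names, and the keyword rule are invented for the example (in the study, the coding was done manually by the researchers):

```python
from collections import defaultdict

DIMENSIONS = ("access", "media", "organization", "stimulus", "value", "interaction")
DEGREES = ("fulfilled completely", "fulfilled widely", "fulfilled partially")

# stage 1: one record per (artifact, dimension) a researcher labeled,
# with the note justifying the chosen degree (None = no degree, no note)
stage1 = [
    ("ped-03", "access", "fulfilled widely", "mentions hardware for visual impairment"),
    ("eng-07", "stimulus", "fulfilled partially", "easiness of use to avoid frustration"),
    ("ped-05", "media", None, None),
]

# stage 2 (open coding): assign codes to chunks of the notes, then group
# the codes to surface patterns per dimension and stakeholder type
keyword_to_code = {"impairment": "accessibility", "frustration": "frustration"}
grouped = defaultdict(list)
for artifact, dimension, degree, note in stage1:
    if degree is None:
        continue  # no degree assigned -> no note to code
    for keyword, code in keyword_to_code.items():
        if keyword in note:
            grouped[code].append((artifact, dimension))

print(dict(grouped))
# {'accessibility': [('ped-03', 'access')], 'frustration': [('eng-07', 'stimulus')]}
```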
the results of these two steps were verified by two senior researchers in the areas of se and hci. the coding was performed using the nvivo 11 tool (http://www.qsrinternational.com/nvivo), and a total of 26 codes related to the ux dimensions were generated in this process.
5.4 threats to validity
an internal threat could relate to the tiredness of the participants. this could happen because participants spent a long time concentrating on the activity of the experiment. to mitigate this, we scheduled a break between the hands-on training and the activity of proto-persona+ elaboration. an external threat relates to the use of students as participants. however, salman et al. (2015) provide evidence that there are few differences in the performance of students and practitioners when they perform an activity with which they have no previous experience. even with greater practical expertise, the fact that professionals do not know a new technique such as proto-persona+ allows us to compare them to students. salman et al.'s results allow us to state that our findings obtained from students using proto-persona+ can be extended to more experienced professionals who have never used the proto-persona technique. the construct threat was mitigated by the training and hands-on exercise, in which the participants had the opportunity to request clarification about the technique and the template. consequently, we consider that our sample of artifacts has good quality. additionally, all participants were prior users of e-learning applications. we handled the conclusion threat by using a common definition of ux based on dimensions. all the researchers inspected the artifacts using this guide, avoiding different interpretations of the meaning of ux. a bias in the conclusion could be introduced in the study by the fact that there were no limits on how many personas each participant could create. as a consequence, a participant could produce more personas than others and, therefore, become more representative within his/her group. however, our goal was not to verify how much information each participant offered individually. rather, our focus was on the contributions that arose from the different types of stakeholders. besides, this study analyzes two groups with the same number of artifacts, which mitigates the problem of comparing unbalanced groups. nevertheless, we consider this an issue that other researchers should be aware of if they decide to run a similar study.
6 findings of the first round
the profile questionnaire showed that, out of the 13 participants, 84.5% used mobile devices 5 or more days a week, 61.5% preferred to access the internet through their mobile phones, and 77% had participated in an online course in the last two years. the findings of the first round helped us answer rq1 and rq2. we present the results in the following sections.
6.1 ux requirements
to answer (rq1) which ux requirements do non-technical stakeholders describe while using the proto-persona technique?, we observed the codes generated from the open coding process. figure 2 presents the codes associated with the artifacts of each type of stakeholder. our analysis did not have the purpose of quantifying the occurrence of a code. rather, the qualitative analysis concentrated on exploring the evidence of ux issues that arose from the data.
[figure 2. codes assigned to the artifacts of each type of stakeholders]
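the convergence/divergence analysis described next boils down to set comparisons over the assigned codes. a toy sketch, with code names taken from the findings in this section but with their per-group membership simplified for illustration:

```python
# codes assigned to the artifacts of each stakeholder group
# (membership shown here is illustrative, not the full study data)
ped_codes = {"satisfaction", "fun", "accessibility", "application complexity"}
eng_codes = {"satisfaction", "focus on use", "easiness of use", "interaction mode"}

common = ped_codes & eng_codes     # convergent codes across both groups
ped_only = ped_codes - eng_codes   # contributions unique to pedagogues
eng_only = eng_codes - ped_codes   # contributions unique to software engineers

print(sorted(common))   # ['satisfaction']
print(sorted(ped_only)) # ['accessibility', 'application complexity', 'fun']
print(sorted(eng_only)) # ['easiness of use', 'focus on use', 'interaction mode']
```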
in this analysis, we observed the convergence or divergence of the codes assigned to the artifacts of the different stakeholders. we can see which codes are common and which are distinct, considering the artifacts built by pedagogues and software engineers. we will concentrate our discussion on the codes in bold, which represent the most relevant findings. both types of stakeholders described characteristics of enjoyment, stimulus, and satisfaction to highlight the importance of building an enjoyable experience that holds the students' attention during the learning process. however, by observing the codes, we can see that this objective was expressed in different ways. the pedagogues described a learning process that should be fun (see the code in bold), thereby showing the intention of organizing lessons from this perspective. on the other hand, the software engineers pointed out requirements regarding the focus on use and easiness of use, with the intention of avoiding user frustration. these examples are shown in figure 3. the examples highlight the parts where we see how each type of stakeholder describes a way of maintaining students' interest in learning.
[figure 3. two ways of working on student engagement: examples of technical and non-technical stakeholders]
the following examples show the different contributions provided by the stakeholders. in addition to focusing on distinct user information and user characteristics, each type of stakeholder provided specific user profile details. two non-technical stakeholders specified requirements for visual impairment or attention deficit that can serve users with special needs. two technical stakeholders delineated the characteristics of users who like to learn by participating in interactive spaces where they can interact with their colleagues. these examples can be seen in the two artifacts in figure 4.
[figure 4. definitions of different end-user profiles: examples of technical and non-technical stakeholders]
figure 5 shows the codes that were found in common or not in the proto-personas+, per dimension and per type of stakeholder.
[figure 5. codes per ux dimensions and per type of stakeholders]
in the access dimension, we see that the pedagogues' artifacts had codes related to accessibility, referring to the availability of hardware that would meet the special needs of each user, as well as the forms of interaction that could serve this audience. only the artifacts of the technical stakeholders provided distinct codes in the media dimension. while reporting the use of different types of media (e.g., video), the software engineers showed concern about media that could provide interactions; therefore, the "interaction mode" code was assigned to the artifacts of these stakeholders. an example of this is interaction with text on the small screen of a mobile device, which can introduce barriers for users to perform their actions. in the organization dimension, the pedagogues focused on how to structure the learning path for a given student profile. the codes assigned to this dimension were application complexity, focus on learning process, user restrictions, and student objective. additionally, in the proto-personas+ created by the software engineers, we could see the focus on building applications that could motivate the students to interact by providing different media. both types of stakeholders were concerned with stimulating the students by offering an enjoyable application (see figure 3). this can be seen in the stimulus dimension, which had the codes satisfaction and fun related to the proto-personas+ created by the pedagogues. in turn, the software engineers considered that care with aspects that bring frustration would encourage the student to continue using the application.
in the value dimension, the codes media, device, and user restrictions demonstrated the concern with enriching the user experience during the learning process. finally, in the interaction dimension, we could identify the contributions that the pedagogues made by observing their artifacts. the accessibility and application complexity codes showed that these stakeholders concentrated their attention on delivering a more personalized interaction in accordance with the users' profiles. consequently, these issues can bring stimulus and value to the ux.
6.2 acceptance of proto-persona+
to answer (rq2) what is the acceptance of the proto-persona+ technique by these stakeholders?, three different analyses were performed: (i) the importance that the stakeholders perceived in the template's quadrants for performing the activity; (ii) the usage and relevance the stakeholders saw in the guideline questions for completing the quadrants; and (iii) the perception of usefulness and ease-of-use regarding the proto-persona+. the participants answered the questions after finishing the elaboration of the proto-personas+. given the small size of our sample, we analyzed the data from a descriptive perspective. the results are presented in detail in the following subsections.
6.2.1 importance of quadrants
we explored the importance of the quadrants (figure 1) in relation to the description of the proto-persona+ from the perspective of the participants. for this, the participants had to classify each quadrant into one of the following categories: very important (vi), important (imp), unimportant (ui), or irrelevant (irr). table 2 presents these classifications in two complementary representations: the count of classifications for each quadrant in parentheses and the percentage of the participants who chose that classification. in table 2, it can be seen that the quadrants were almost solely classified as very important or important. quadrant (q2) objectives and necessities was considered very important by all the stakeholders. although all the quadrants seemed to have similar importance to the stakeholders, an exception was observed for quadrant (q1) demographic data: one software engineer (i.e., a technical stakeholder) pointed out q1 as unimportant. comparing the classifications for q1, it could be seen that the software engineers mostly rated this quadrant as important, while the pedagogues (i.e., non-technical stakeholders) indicated it as very important. the personas technique focuses on developing empathy between developers and end-users; therefore, we can conclude that the non-technical stakeholders can contribute to characterizing the end-users. in contrast, the technical stakeholders were less concerned with these aspects.
table 2. degree of importance of the quadrants
              q1          q2          q3          q4
eng (5)
  vi          20% (1)     100% (5)    60% (3)     60% (3)
  imp         60% (3)     0% (0)      40% (2)     40% (2)
  ui          20% (1)     0% (0)      0% (0)      0% (0)
  irr         0% (0)      0% (0)      0% (0)      0% (0)
ped (8)
  vi          75% (6)     100% (8)    62% (5)     62% (5)
  imp         25% (2)     0% (0)      38% (3)     38% (3)
  ui          0% (0)      0% (0)      0% (0)      0% (0)
  irr         0% (0)      0% (0)      0% (0)      0% (0)
total (13)
  vi          53.8% (7)   100% (13)   61.5% (8)   61.5% (8)
  imp         38.5% (5)   0% (0)      38.5% (5)   38.5% (5)
  ui          7.7% (1)    0% (0)      0% (0)      0% (0)
  irr         0% (0)      0% (0)      0% (0)      0% (0)
6.2.2 usage and relevance of the guideline questions
we examined the participants' answers regarding the perceived relevance and use of the guideline questions. an open question also asked the participants for suggestions to improve the proto-persona+ template. table 3 presents the results as percentages and absolute numbers of "yes" answers. this double representation provides a more realistic overview, considering that we had a small sample of technical stakeholders; the percentages alone might not clearly indicate the differences and similarities between the two types of stakeholders.
table 3. usage and perception of relevance of the guideline questions
                                                               uses                  relevance
     guideline question                                  eng (5)   ped (8)     eng (5)   ped (8)
q1   who are they?                                       100% (5)  100% (8)    80% (4)   100% (8)
     what are their ages?                                100% (5)  100% (8)    100% (5)  100% (8)
     what are their school levels?                       100% (5)  100% (8)    100% (5)  100% (8)
q2   what do they want to accomplish?                    100% (5)  100% (8)    100% (5)  100% (8)
     what do they need to reach their objective?         80% (4)   88% (7)     100% (5)  88% (7)
q3   what do they like?                                  100% (5)  75% (6)     100% (5)  100% (8)
     what are they better at doing?                      80% (4)   38% (3)     80% (4)   50% (4)
     how do they like to do it?                          40% (2)   75% (6)     80% (4)   100% (8)
q4   what are the difficulties they can face?            80% (4)   100% (8)    100% (5)  100% (8)
     what frustrates them?                               100% (5)  63% (5)     100% (5)  88% (7)
     what are the known issues that affect their interaction?  80% (5)  100% (8)    80% (4)   100% (8)
in table 3, we can see that the software engineers used and considered relevant the q3 question what are they better at doing?. in contrast, most of the pedagogues did not show the same results. an inversion was observed for the q3 question how do they like to do it?, which was little used by the software engineers but had great application among the pedagogues. finally, the q4 question what frustrates them? presented a considerable difference in the responses: while all the software engineers used it and found it relevant, the pedagogues used it less often, although they found it to be a relevant question. these differences in the perceptions of both types of stakeholders restate that both have the potential to give different contributions. by exploring the stakeholders' written notes for q3, it can be observed that the questions motivate different points of view. the pedagogues focused on encouraging students to overcome their barriers. they reported the need to perform activities that helped students develop new skills, not only improve something at which they were already considered good. in contrast, the software engineers emphasized what the student already knows in an attempt to stimulate such student behavior during the use of the application.
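the "percentage (count)" cells in tables 2 and 3 are plain tallies over the questionnaire answers. a minimal sketch of that computation follows; the sample input is invented so as to reproduce only the eng/q1 row of table 2:

```python
from collections import Counter

def summarize(responses):
    """render a 'percentage (count)' summary like tables 2 and 3."""
    n = len(responses)
    counts = Counter(responses)
    return {option: f"{100 * counts[option] / n:.0f}% ({counts[option]})"
            for option in counts}

# illustrative input: the five engineers' importance ratings for q1
eng_q1 = ["vi", "imp", "imp", "imp", "ui"]
print(summarize(eng_q1))  # {'vi': '20% (1)', 'imp': '60% (3)', 'ui': '20% (1)'}
```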
among the 13 participants, only one pedagogue and one software engineer gave suggestions through the open question; both concerned the q1 demographic data. the pedagogue suggested adding the question “do users have any deficiencies or restrictions?”, which focuses on the individual characterization of users. in contrast, the software engineer suggested a more technological question: “do users have access to mobile devices?”.

table 4. preferences for each guideline question (a dash indicates that both types of stakeholders showed the same result)

     guideline questions                                          uses   relevance
q1   who are they?                                                -      ped
     what are their ages?                                         -      -
     what are their school levels?                                -      -
q2   what do they want to accomplish?                             -      -
     what do they need to reach their objective?                  ped    eng
q3   what do they like?                                           eng    -
     what are they better at doing?                               eng    eng
     how do they like to do it?                                   ped    ped
q4   what are the difficulties they can face?                     ped    -
     what frustrates them?                                        eng    eng
     what are the known issues that affect their interaction?     ped    ped

we examined the number of times each question was answered by the participants and identified the questions that were more important to the different stakeholders. the relevance column in table 4 indicates which type of stakeholder presented more answers for each question. the software engineers demonstrated greater interest in the use of the q3 questions. in contrast, q4 was the quadrant most used by the pedagogues. the quadrants q1 and q2 were answered in a similar manner by the two types of stakeholders.

6.2.3 perception of usefulness and ease-of-use

this analysis was based on the responses to the technology acceptance model (tam) questionnaire, conceived by davis (1989), which aims to analyze the acceptance of a given technology by a group of participants (dias et al. 2011). we included a question regarding the ease of memorizing the technique, based on the work of steinmacher et al. (2015). table 5 lists the questions. for each question, the participants chose the option that best represented their degree of agreement. the options available were “fully agree”, “largely agree”, “partially agree”, “partially disagree”, “largely disagree”, and “fully disagree”.

figure 6. perception of usefulness and ease-of-use

table 5. questions used from the tam questionnaire

usefulness
u1 by using the persona technique, i was able to describe the user characteristics more quickly.
u2 by using the persona technique, i was able to enhance my ability to describe the user characteristics.
u3 by using the persona technique, i was able to enhance my efficiency during user characteristics description.
u4 by using the persona technique, i was able to more effectively describe the user characteristics.
u5 by using the persona technique, i was able to improve my perception about the good practices for describing user characteristics.
u6 i consider the persona technique useful in describing the user characteristics.

ease-of-use
f1 it was easy to learn to use the persona technique.
f2 i was able to use the technique in the way i intended to.
f3 the orientations of use for the persona technique were easy to understand.
f4 i understand what happened during my interaction with the persona technique.
f5 it was easy to gain ability to use the persona technique.
f6 the persona technique allows flexibility to describe the user profile using the quadrants.
f7 it is easy for me to remember how to use the persona technique.
by observing the percentages of both types of stakeholders, it can be seen that a great number of questions was answered as “largely agree”. few exceptions could be found. the difference in agreement in question f5, about the easiness to gain ability to use the technique, was high: 60% (3 of 5) of the software engineers answered “partially agree”, and 20% (1 of 5) answered “partially disagree” with the statement that it was easy to gain ability with the technique. revisiting the notes in the proto-persona+ artifacts, we found that the software engineers struggled in describing the proto-persona+, which can explain the low “easy to gain ability” perception for this question. the pedagogues, in turn, showed a lower perception regarding question u5: the majority of their answers was “partially agree” (50%, 4 of 8). however, 60% (3 of 5) of the software engineers indicated that they “largely agree” that the proto-persona+ improved their efficiency in describing the users (question u3). overall, only the software engineers pointed out some degree of disagreement (“partially disagree”). moreover, “fully agree” prevailed in the pedagogues’ responses, which reiterates the perception that the technique was useful to describe end-users.

7 second round: using the proto-personas+ in design

7.1 planning

after exploring the creation of the proto-persona+ artifacts, we decided to investigate whether these artifacts could support developers during the prototyping of solutions. the results of this investigation helped answer (rq3) which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces?. the objective of this second round was to analyze whether the information from the proto-persona+ artifacts contributed to the design of low fidelity prototypes.

the participants constructed low fidelity prototypes by using the storyboard technique. a storyboard shows people’s interaction with an application, often delivering a view complementary to static drawings of user interfaces; it simulates the flow that users can follow from one part of the interface to another (rogers et al. 2015). in this round, our subjects were software developers. in our study, the storyboard artifacts were drawn on paper. the participants could enrich their proposals by adding stickers around the interface elements. these stickers contained supplementary textual information, such as actions associated with buttons and the navigation flow between the screens. additionally, in the stickers, the developers also reported which parts of the proto-personas+ and of the scenario had aided them in making their choices regarding the design. through these textual justifications, we could analyze what information they used and from which proto-persona+ it originated.

this second round was conducted in four steps: (i) a pre-analysis of the participants’ knowledge about the hci techniques used in the activity; (ii) a training session on hci techniques and mobile development; (iii) a hands-on exercise to prototype a user interface using a scenario; and (iv) the construction of the storyboards considering the proto-personas+. artifacts to support the steps were prepared.
the pre-questionnaire on participants’ profiles (i) covered both personal information and the participants’ knowledge about personas, prototyping techniques, and the nielsen heuristics (nielsen 1995). the training session (ii) consisted of a two-hour class that presented the techniques required for the development of the storyboards. for the hands-on exercise (iii), a two-hour activity was planned, wherein some proto-personas+ and an example scenario were made available, so that the participants experienced the same kind of artifact they would use in the study. for the construction of the storyboards (iv), a consent form was distributed to the participants to indicate their agreement on the use of their data for the purpose of academic research; here, the same scenario used in the first round was applied.

7.2 execution

thirty-six undergraduate students in computer science at ufscar participated in the study, henceforth referred to as developers. they answered the pre-questionnaire, and in the pre-analysis we were able to gauge their knowledge about hci techniques (see figure 7). we noticed that 78% of the developers “did not know the technique of persona”; 67% “did not know the nielsen heuristics”; and, regarding the prototyping technique, 22% “did not know” it and 47% “knew, but had never used” it.

figure 7. participants’ knowledge about hci techniques

from the questionnaire results, we separated the developers into 18 pairs, balancing the pair composition based on their knowledge of the techniques (a sketch of one possible pairing rule is shown at the end of this subsection). as noticed, the participants did not have practical knowledge of the techniques we planned to use (i.e., personas and storyboards). to mitigate this, we conducted the training session in two steps to leverage the participants’ knowledge. first, a senior professor in se and hci carried out a two-hour class covering personas, storyboards, and how the nielsen heuristics could help them apply good design practices. later, on the same day, two master’s students conducted a two-hour hands-on in which the participants built a storyboard based on a new scenario and on examples of proto-personas+ (these artifacts were different from those used later).

a week later, the study was conducted in a 3-hour session, wherein the 18 pairs of developers constructed storyboards by using the scenario of the study (i.e., the same used to construct the proto-personas). we also requested the pairs to select only two proto-personas+ to support their work. this decision to limit the choice to two artifacts was taken so that the participants did not have to deal with a large diversity of user profiles. the pairs received the 22 proto-personas+ shuffled into a different order for each pair, preventing the same artifact from always being placed in the same position of presentation and, thus, avoiding biases in the selection of the proto-personas+. the pairs built the storyboards and attached post-it stickers to report their design decisions. the participants were instructed to explain through the stickers which parts of the scenario and of the proto-personas+ they used to gain insights into the design. each pair generated five user interfaces on average (ranging from three to nine).
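the paper does not detail how the knowledge levels were turned into balanced pairs, so the snippet below is only one plausible reading of that step: it assumes a single numeric knowledge score per developer (an assumption of ours) and pairs the strongest with the weakest.

```python
# a minimal sketch, not the study's actual procedure: pair developers so that
# each pair mixes a lower-knowledge and a higher-knowledge participant.
def balanced_pairs(scores: dict[str, int]) -> list[tuple[str, str]]:
    ranked = sorted(scores, key=scores.get)   # weakest first
    return [(ranked[i], ranked[-(i + 1)])     # weakest paired with strongest
            for i in range(len(ranked) // 2)]

# hypothetical scores: 0 = never heard of the techniques, 3 = has used them
developers = {"d1": 0, "d2": 3, "d3": 1, "d4": 2}
print(balanced_pairs(developers))  # [('d1', 'd2'), ('d3', 'd4')]
```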
7.3 analysis

we performed the analysis in two phases. the first one examined which proto-personas+ were selected and applied by the developers in the construction of the storyboards. in the second phase, through a more in-depth analysis, we explored which parts of the proto-personas+ were used. from this second analysis, we intended to understand how the information found in the proto-personas+ aided the developers’ work in building the solutions. we first identified the most chosen proto-personas; then, by considering the developers’ notes about the use of the proto-personas, we could identify which parts of the proto-personas+ were used most.

the first phase followed the same procedure as the first round, wherein the ux dimensions were applied (see the definitions in section 5.3). differently from the first round, in this phase the storyboards were the targets of the evaluation. twelve software engineers with different profiles attended this session: two undergraduate students in computer science from ufscar (campus sorocaba); five master’s students, of which four were from the graduate program at ufscar (campus sorocaba) and one from the graduate program at unicamp; two graduates working for more than three years in software companies; and three masters in computer science. all had experience in hci and a background in computer science. none of these evaluators had participated in the previous evaluation of the proto-persona+ (i.e., the first round described in this work).

the storyboards were distributed among the evaluators. each storyboard produced by a pair of developers had five low fidelity prototypes on average; therefore, the division of what each evaluator would explore considered the following factors: (i) we made a uniform distribution so that each participant received the same number of prototypes to evaluate, and (ii) each storyboard was evaluated by two participants; this redundancy was intended to enrich the analyses. however, the same pair of evaluators did not analyze the same set of storyboards. considering the ux dimensions, each evaluator examined fifteen low fidelity prototypes. as a result, each evaluator took notes to justify whether a given ux dimension was being applied or not in a prototype. none of the participants had seen the proto-personas+ used to create the prototypes they were evaluating.

afterwards, we proceeded to the second phase, wherein the open coding process happened in two iterations. the first author of this article inspected the notes that the evaluators took in the first phase based on the 24 codes that were previously generated. later, the fourth author refined the findings, and 23 new codes were generated at the end of this round. this generated a total of 49 codes in the two experimental rounds combined.

7.4 threats to validity

to deal with the bias in the developers’ preference for the selected proto-personas+, which is an internal threat, we presented the 22 artifacts in a random order to the participants. in our arrangement, a proto-persona+ did not appear more than twice in the same ordinal position of the list. with the order changed for each group, the threat of a possible false preference was mitigated and the results became more reliable for the inferences.
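the ordering constraint above (no artifact in the same ordinal position more than twice) is straightforward to enforce by rejection sampling. the sketch below is our own illustration of one way to do it, not the procedure the authors actually used; all names in it are ours.

```python
# a minimal sketch: draw random presentation orders, rejecting any order that
# would place an artifact in the same ordinal position more than max_repeat
# times across the groups.
import random
from collections import Counter

def presentation_orders(n_artifacts=22, n_groups=18, max_repeat=2):
    seen = [Counter() for _ in range(n_artifacts)]  # per-position usage
    orders = []
    while len(orders) < n_groups:
        order = random.sample(range(n_artifacts), n_artifacts)
        if all(seen[pos][art] < max_repeat for pos, art in enumerate(order)):
            for pos, art in enumerate(order):
                seen[pos][art] += 1
            orders.append(order)
    return orders

orders = presentation_orders()
print(len(orders), "orders generated")  # 18 orders generated
```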
another threat to internal validity refers to the motivation of the participants during the experiment, since the workshop was applied during a compulsory course in computer science. we collected the participants’ opinions about the activity at the end of the study. the participants’ feedback showed that they considered the activity important, e.g., “i found the activity very interesting”, and opinions such as “[the proto-persona] was useful to the achievement of my goal...”. the feedback showed that the participants felt motivated to participate in the study.

a threat to external validity was the fact that the storyboards were constructed by participants who had had no prior contact with proto-persona+ and storyboards. to mitigate this threat, we conducted a training about the proto-persona+ and storyboard techniques and a hands-on exercise using them. also regarding this threat, we arranged the developers in balanced pairs with complementary knowledge. similarly to the first round, the subjects here were also students; salman et al. (2015) provide evidence that students and experienced professionals perform equally in new activities. although storyboarding and prototyping are widely applied techniques, in our case we changed the traditional application of both: by using a scenario and the proto-personas+, we provided a method that mitigated the developers’ lack of experience, since it differs from the usual prototyping.

8 findings of the second round

using the results of the second round, we answered (rq3): which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces? the details are presented in the following two subsections.

8.1 developers’ preferences

firstly, we identified the proto-personas+ that the developers chose and used, considering that they should select only 2 of the 22 available proto-personas+. we organized this result in three groups of proto-personas: group (i) presents the proto-personas+ that were widely used, being the most chosen; group (ii) comprises the proto-personas+ that were chosen a number of times equal to the average of the distribution of choices; and group (iii) comprises the proto-personas+ that were chosen at least once by the pairs (i.e., the developers). table 6 summarizes these groups and indicates the id of each proto-persona, which stakeholder created it, and some features of the artifacts. one of the goals of proto-personas+ is to promote empathy between developers and users; therefore, the use of an image to represent the persona could be important.

we also obtained direct and indirect findings about the use of the artifacts. the direct analysis comprises the absolute number of references that each proto-persona+ received from the developers. the indirect one, in turn, comprises the results of the authors’ analysis regarding the preference between the two proto-personas+ chosen by each pair: we analyzed which of the two was most emphasized during the construction of the storyboard by counting the number of references to the parts of each artifact. the indirect analysis resulted in two cases: (1) equal interest, wherein both artifacts obtained the same number of references, and (2) different interest among the proto-personas+ (classifying the artifacts into primary or secondary personas).
the classification into primary and secondary personas happens when more than one user profile will use the application, but one of them should be considered with higher priority for being the primary user of the application (cooper et al. 2014). a primary persona is defined as a profile that represents the main target user of the application and will therefore have its needs prioritized; a secondary persona refers to a user profile that will also use the application, but whose needs are not a priority for the application. based on these definitions, the proto-personas+ that fit case (2) were classified: the proto-persona+ with the highest number of referenced parts was classified as primary, whereas the other was classified as secondary.

table 6. proto-personas selected for the construction of the storyboards

                                                    direct        indirect
group   proto-persona+ id   stakeholder   image     references    primary   secondary   equal
i       9                   pedagogue     x         10            5         2           3
        22                  engineer      x         8             4         0           4
ii      8                   pedagogue     x         3             1         1           1
        21                  engineer      x         3             0         1           2
        14                  engineer                3             0         3           0
iii     2                   pedagogue               2             0         1           1
        20                  engineer                2             1         1           0
        18                  engineer                1             1         0           0
        11                  pedagogue     x         1             0         0           1
        1                   pedagogue               1             0         1           0
        7                   pedagogue     x         1             0         1           0
        19                  engineer                1             0         1           0

table 6 shows some relevant results. all the proto-personas+ that had an image in q1 (i.e., demographic data) were chosen at least once by the pairs of developers. additionally, of the five most chosen artifacts (i.e., groups i and ii), four had an image associated with the proto-persona. this fact reinforces the idea that persona is a technique that stimulates empathy in the developer: the use of an image to represent the target audience is a way to instigate the developers to think about and associate their ideas with those of the user represented in the persona (grudin 2006).

it can be seen that two artifacts, ids 9 and 22, obtained the highest numbers of references (group i) in both the direct and indirect analyses. they were designed by a pedagogue and a software engineer, respectively. in the direct analysis, we observed that proto-persona+ 9 was chosen by 10 of the 18 pairs, whereas 22 was chosen by 8 of the 18. considering that the other 10 chosen proto-personas+ received indications from at most 3 pairs and that 10 artifacts were not chosen by any pair, we can see a clear preference for artifacts 9 and 22 to support the construction of the storyboards. additionally, proto-personas+ 9 and 22 were classified as primary personas in most cases: examining the data, we observed that proto-persona+ 9 was used 5 times as primary, 22 was used 4 times as primary, and all other artifacts obtained at most one indication as a primary persona. this reinforces the results found in the direct analysis.

to explain the preference for proto-personas+ 9 and 22, the 4 authors of this article conducted a qualitative analysis of the content of the quadrants of these proto-personas. the results demonstrated that both artifacts had a clearer definition of the users they represented: they provided richly detailed information on who the end-user is, with these details being evident in quadrant 2 (objectives and needs).

fisher’s exact test (fisher 1922) was applied to analyze the existence of statistical significance between the proto-personas+ produced by the pedagogues and by the software engineers.
by running the same test, we also checked the influence that an image has on the choice of an artifact. fisher’s exact test is recommended for small samples of categorical data and calculates the exact significance of the deviation from a null hypothesis using the p-value. the statistical analysis was conducted over certain scenarios and proto-personas+ groups, with their respective null (h0) and alternative (h1) hypotheses. to conduct the tests, we defined the hypotheses considering that a characteristic c1 could influence a result c2 (see table 7). taking these assumptions into account, we could represent the null and alternative hypotheses respectively as (h0) there is no influence of c1 on c2 and (h1) there is an influence of c1 on c2.

table 7. fisher exact test results

c1                                             c2                                                                       p-value
stakeholder that created the proto-persona+    classification of the artifact as a primary persona                      1
stakeholder that created the proto-persona+    classification of the artifact as a secondary persona                    1
stakeholder that created the proto-persona+    classification of the artifact as “equal interest”                       1
stakeholder that created the proto-persona+    number of references of the artifact in the prototypes equal to or       1
                                               greater than 3
presence of a representative image             number of references of the artifact in the prototypes equal to or       0.2424242
                                               greater than 3

we ran the tests using the r software environment (https://www.r-project.org/), assuming a significance level of 0.05 in the analysis. table 7 shows the final p-value obtained after performing each fisher exact test. the p-values do not indicate any statistical significance to refute the null hypothesis in any of the analyzed pairs of elements (an illustration of this kind of test appears at the end of this subsection). statistically, the proto-persona+ creator (i.e., pedagogue or engineer) could not be related to how the proto-persona+ was used. similarly, we could see that the presence of an image in a proto-persona+ did not affect the number of times that artifact was referenced in the prototypes.

finally, we explored which proto-personas+ were chosen from the perspective of who created them. table 8 presents a mapping between each storyboard and the type of stakeholder who authored the artifacts used in the construction of the solution.

table 8. type of proto-persona selected vs storyboards

proto-persona+ used       storyboard id                                        number of times
only pedagogue            s1, s9, s12, s16                                     4
only software engineer    s4, s10, s11, s13                                    4
mix of both               s2, s3, s5, s6, s7, s8, s14, s15, s17, s18           10

only four pairs used proto-personas+ created solely by the pedagogues, and the same was seen for those built solely by the software engineers. from this, it was confirmed that the developers mostly opted to build their solutions considering the proto-personas+ of the two specialties. we must restate that the set of artifacts was delivered in a random order and without any indication of which of the two stakeholders had elaborated them. the results showed that the combination of artifacts from different stakeholders aided the developers in most cases.
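to make the testing procedure concrete, the sketch below reproduces the kind of 2x2 analysis behind table 7, but in python (scipy) rather than the r environment the authors report, and with hypothetical counts: it does not reproduce the study’s actual contingency tables.

```python
# a minimal sketch of a fisher exact test on a 2x2 contingency table; the
# counts are illustrative assumptions, not the paper's data.
from scipy.stats import fisher_exact

#                  >= 3 references   < 3 references
table = [[4, 2],   # artifacts with a representative image
         [1, 5]]   # artifacts without an image
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
# with alpha = 0.05, a p-value above the threshold means the null hypothesis
# (no influence of c1 on c2) cannot be refuted, as in table 7.
```

in r, the equivalent result is obtained by calling fisher.test on the same matrix.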
8.2 application of ux requirements

the codes that emerged in the analysis of the storyboards were related to the codes found in the analysis of the proto-personas’ descriptions (see section 6). to support our presentation of the results, we discuss the codes of the storyboards in comparison with the codes used in the first round of the analysis and presented in figure 5. to illustrate the discussion, we show figures in which the codes were split into three groups: group a represents the five most recurrent codes for a dimension; group c represents the codes that appeared only once in that dimension; and group b represents the codes that arose more than once in a dimension but not often enough to be among the top five (a sketch of this grouping rule is shown after the dimension walkthrough below).

figure 8. codes of access dimension found in the storyboards

in figure 8, it can be seen that concerns regarding the physical devices and the access infrastructure were the main focus of the participants. this was noted through the recurring codes hardware (a), internet (a), and characteristics of device (a). comparing the codes found in this analysis with the ones uncovered in the proto-personas+ analysis, we found the interaction mode (a) code among the most present for this dimension. this code was identified in the proto-personas+ produced by the software engineers and appeared in several ux dimensions in the previous analysis (figure 5). this result demonstrates that the concern with these forms of interaction was carried into the prototypes of the storyboards to meet users’ needs. we also found the codes universal accessibility (c) and social interaction (c), which refer to the two profiles built by the pedagogues and the software engineers, respectively, as shown in figure 4. this finding illustrates how the knowledge of different stakeholders contributed to enriching the description of end-user details.

figure 9. codes of media dimension found in the storyboards

in the media dimension (see figure 9), the image (a) and game (a) media of interaction were the major codes mentioned. the code interaction mode had been found several times in the analysis of the proto-personas+ created by the software engineers. in this context, the focus reiterates the results found in the first round, which raised the concern about which media could affect the users’ learning process and, consequently, their user experience. considering the common points between the pedagogue and software engineer stakeholders (see figure 5), we noticed that they concentrated on focus on learning process (b), student preferences (b), and student objective (b). finally, the concern about a misleading (b) view of how a medium works or what it stands for also emerged as a code, demonstrating how app overall organization (a) problems and frustration (c) can affect the student learning process.

figure 10. codes of organization dimension found in the storyboards

simplicity (a), easiness of use (a), navigation (a), app overall organization (a), and confusion (a) were the codes that arose in the organization dimension (see figure 10). these codes indicate that applications in this domain should not introduce complex ways of interaction, providing instead a simple manner of use.
the applications should have stimulus (b) and a pleasant (c) experience as their goals for when the user learns with and uses the application. by observing the previous results (see figure 5), we noticed that the artifacts developed by the pedagogues had the codes user restrictions (b), application complexity (b), and focus on learning process (a) assigned to them; this demonstrates the importance of providing a learning application in which users can have an easy journey.

figure 11. codes of stimulus dimension found in the storyboards

in the stimulus dimension (see figure 11), the codes focus on media (a), stimulus (a), and focus on learning process (a) were present. by looking at figure 5, it can be seen that both pedagogues and software engineers focused on the same points during the construction of the proto-personas+. regarding the fun and satisfaction codes that were assigned to the pedagogues’ proto-personas+, we noticed that the low fidelity prototypes had similar codes associated with them (i.e., fun (b), curiosity (b), and enjoyment (b)); this is evidence that the developers tried to keep an exciting experience for the students. considering the codes related to the software engineers’ proto-personas+, frustration (b) and student objective (b) were identified in the prototypes, demonstrating the concern that these stakeholders had with encouraging students to use the application. lastly, focusing on the app overall organization (a), the prototypes provided a way for students to customize their learning process and consequently improve their experience.

figure 12. codes of value dimension found in the storyboards

a prevailing occurrence of the codes media (a), focus on learning process (a), and game (a) could be found in the value dimension (see figure 12). these three codes had already been found for both types of stakeholders (i.e., pedagogues and software engineers) in the results of the proto-personas’ analysis (see figure 5). while exploring the proto-personas+ of both stakeholders, we saw their concerns about the learning process, the user experience, and the use of suitable channels of interaction. codes such as stimulus (b), user experience (b), satisfaction (b), fun (b), and pleasant (c) demonstrate that the developers who constructed the storyboards were able to capture such concerns. considering the code app overall organization (a), we noticed that only the software engineers’ proto-personas+ had this code assigned; this provides evidence that these stakeholders worried about the integration of the different resources and features of the application.

figure 13. codes of interaction dimension found in the storyboards

finally, in the interaction dimension, focus on learning process was the main common code. accessibility (b) and social interaction (c) are codes that were pointed out in the pedagogues’ and the software engineers’ proto-personas+, respectively, and that appeared again in the analysis of the storyboards. these codes allowed us to reaffirm the different contributions that both types of stakeholders provide to the design of solutions. by observing the differences between the two types of stakeholders, we see that interaction mode (a) and stimulus (b) were found in the proto-personas+ of the software engineers and of the pedagogues, respectively. these codes clearly demonstrate that the software engineers were more concerned with technical aspects of interaction, whereas the pedagogues were worried about keeping students motivated to learn.
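as noted above, here is a small sketch of the a/b/c grouping rule used in figures 8 to 13. it is our own illustration of the stated rule, not the authors’ analysis tooling, and the example occurrences are hypothetical.

```python
# a minimal sketch of the grouping rule: group a = the five most recurrent
# codes of a dimension, group c = codes seen exactly once, group b = the rest.
from collections import Counter

def group_codes(occurrences: list[str]) -> dict[str, str]:
    counts = Counter(occurrences)
    top_five = {code for code, _ in counts.most_common(5)}
    return {code: "a" if code in top_five else ("c" if n == 1 else "b")
            for code, n in counts.items()}

# hypothetical code occurrences for one dimension
occ = (["hardware"] * 4 + ["internet"] * 3 + ["characteristics of device"] * 3
       + ["interaction mode"] * 2 + ["misleading"] * 2 + ["stimulus"] * 2
       + ["universal accessibility"])
print(group_codes(occ))
# {'hardware': 'a', 'internet': 'a', 'characteristics of device': 'a',
#  'interaction mode': 'a', 'misleading': 'a', 'stimulus': 'b',
#  'universal accessibility': 'c'}
```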
9 discussion

this study investigated the effect of the non-technical stakeholders’ participation on ux requirement specification. it differs from other works, wherein non-technical stakeholders provide information through a passive participation; in our investigation, we considered these stakeholders as active members during the elicitation of ux requirements by using proto-personas+. the findings showed that the non-technical stakeholders brought important contributions to the elaboration of ux requirements: they could point out requirements that describe ux from a perspective different from that provided by the technical stakeholders. ux requirements are strongly context-dependent, and this context undergoes constant changes over time (kashfi et al. 2017); the non-technical stakeholders are the ones that have the knowledge about the context. from the findings, it could be noticed that, although the technical stakeholders had experience in the domain, the non-technical ones demonstrated attention to the aspects that can directly influence the acceptance of the software (hadar et al. 2014).

by looking at the steps that the technical and non-technical stakeholders followed, we can summarize the findings as follows. the main point in the first round was the preparation of the stakeholders to be able to apply the proto-persona+ technique correctly. firstly, the non-technical stakeholders took part in a training covering the presentation of the proto-persona+ technique, its benefits, and its purposes of use. additionally, a scenario about the domain of the application was presented to these stakeholders so that they had a clear view of the scope of the application. subsequently, a hands-on exercise about the use of the technique in practice was run; this step allowed the participants to clarify their doubts and, consequently, avoided misunderstandings and misuse of the proto-persona+. finally, the proto-persona+ artifacts were constructed by using the template with the guideline questions, supported by the information presented in the previous steps.

in the second round, the focus moved to the use of the information described in the proto-personas+. the artifacts produced in the previous round were explored, and the information in them supported the construction of the user interface prototypes. to make suitable use of the information provided by the non-technical stakeholders, we conducted some actions so that the participants (i.e., the developers) acquired the expertise to use the artifact. first, the developers took part in a training session about the concepts of proto-personas+ and how to use these artifacts in practice. the scenario used in the previous round was presented to keep the same scope of the application. afterwards, a hands-on exercise was run with the purpose of making the participants acquainted with the proto-persona+ artifacts; this hands-on focused on reading the details available in the proto-persona+ and then extracting the information the developers considered relevant. an example artifact was delivered to the developers, who should read and explore it, as well as ask questions to clarify their doubts. afterwards, all the proto-personas+ produced in the first round were offered to the developers, who could select the ones they considered to provide useful information for their activity of prototyping the user interfaces.
we must mention that the ux requirements raised here are relative to a rather minimalist application in the m-learning area. this enables them to be reused within the same scope; however, their reuse in other e-learning applications should be explored to verify it. it is also relevant to discuss the scope of the answers to our research questions. since this is a first study about the contribution that non-technical stakeholders bring to the specification of ux requirements, we tried to understand this phenomenon by asking exploratory questions (easterbrook et al. 2008), aiming at characterizing the non-technical stakeholders’ contributions. the answers to our research questions are, however, context-dependent: different stakeholders would describe different ux requirements. nevertheless, the answers to these questions result in a clearer understanding of the phenomenon, since they show that the non-technical stakeholders bring a valid contribution to the specification of ux requirements.

considering (rq1): which ux requirements do non-technical stakeholders describe while using the proto-persona technique?, we could answer that the non-technical stakeholders elicited different ux requirements when compared to the technical stakeholders. by exploring the artifacts that both types of stakeholders produced, we could affirm that they contributed from different perspectives. the first round showed that both types of stakeholders described the ux requirements differently, e.g., in the different ways their proto-personas characterized how to keep the student using the application. while the pedagogues pointed out that students would be encouraged by enjoyable features that would make the interaction fun, the software engineers preferred to address student motivation by dealing with student frustration. these approaches reflect the requirements that an e-learning application should have to deliver fun in a learning space (gomes et al. 2018). another evidence of the different contributions brought by both types of stakeholders was seen in the user profiles they described: the pedagogues suggested profiles in which accessibility issues were at the center, whereas the software engineers described profiles associated with developing work in groups. therefore, it can be inferred that the knowledge of both is complementary. our findings reinforce the need for an interdisciplinary participation of various stakeholders (fernandez and wagner 2015).

concerning (rq2): how is the acceptance of the use of the proto-persona+ technique by these stakeholders?, we concluded that the technique proved suitable for use by both types of stakeholders, with some different perspectives on its use. the pedagogues attributed greater importance to the use of the demographic quadrant. this represents an important result for the description of end-users, as this quadrant reports an individual’s personal information that can contribute to building a picture of the end-users and, consequently, boost the development of empathy between the developers and the audience (billestrup et al. 2014; ferreira et al. 2018b).
regarding the guideline questions, it was noticed that the two types of stakeholders demonstrated different perceptions for each question. as a result, these different perceptions can provide complementary viewpoints on the audience, thereby enriching the details about the end-user. by observing the perceptions of ease-of-use and usefulness, the findings showed that the non-technical stakeholders found the technique easy to use. these results revealed that the proto-persona+ is a suitable technique to be handled by non-technical stakeholders for the purpose of eliciting ux requirements. other techniques could be considered for eliciting ux requirements; however, personas are artifacts that stimulate in-depth discussion about the end-users’ needs, and the proto-persona+ provided an adaptation of the proto-persona with the aim of being easier for the stakeholders to use. by answering rq2, we could verify that the proto-persona+ was suitable to capture the particular knowledge of the different types of stakeholders. this revealed that the different types of stakeholders can contribute to describing different ux requirements.

by answering (rq3): which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces?, we identified the sets of ux requirements presented in the proto-personas+ that supported the developers in the prototyping of solutions. by comparing the ux requirements present in the storyboards, we saw that the proto-personas+ of both types of stakeholders (i.e., pedagogues and software engineers) provided information that supported the developers in the design of solutions. additionally, the findings from the storyboard analysis reaffirmed that both stakeholders provided complementary information (fernandez and wagner 2015).

10 study limitations

considering all the steps of our study, we can highlight some limitations, which we discuss below. the proto-persona is an approach that focuses on providing a sketch of the representative group of people in a specific domain. by conducting workshops, the proto-persona technique allows the participants (i.e., stakeholders) to achieve a shared understanding about the audience. one of its advantages is that the technique offers a practical way to gather the specialists’ knowledge and discuss their inputs about the end-users. however, as the proto-persona is built from assumptions about the end-users, it presents some limitations regarding their validation. differently from the proto-persona, the traditional persona is constructed by using data gathered from the audience. to mitigate the problem of not collecting data from real end-users, gothelf (2012) proposes that proto-persona validation should be carried out later. we did not perform this validation; it could be conducted in another study.

we can point out as another limitation the fact that this study was conducted with a specific group of stakeholders in a specific city in brazil. further studies are necessary to establish the proposed methodology as a generalized approach to capture non-technical stakeholder knowledge in other contexts. up to now, our research did not compare the results of the proto-personas+ with different approaches that use traditional personas, or even with approaches that use no personas to elicit requirements from non-technical stakeholders. therefore, we do not claim that applying proto-personas+ leads to a better result than using traditional persona approaches.
we also do not claim that the proto-personas+ results are better than not applying any persona approach at all. further comparative studies are needed to fully understand the effectiveness of the proto-persona approach. our results must not be generalized to all scenarios, and the particularities of our study must be considered. proto-persona construction should be seen as a tool to encourage the sharing and discussion of stakeholder knowledge. this study investigated whether the proto-persona is suitable for use by both technical and non-technical stakeholders to support ux requirement elicitation.

11 conclusions and future work

this paper presented an experimental study that aimed to explore whether a non-technical stakeholder contributes to the description of ux requirements. to conduct the study, we applied the proto-persona+ technique. the results showed that the non-technical stakeholders contributed by giving details about the end-users in a view complementary to that of the technical stakeholders. considering the types of ux requirements the participants described, we noticed that the non-technical stakeholders raised different ones: fun and accessibility issues were found exclusively in the proto-personas+ created by these stakeholders. accessibility issues are fundamental to meet the needs of a wide range of end-users in the domain we explored in this study. in addition, by taking fun issues into account, these stakeholders demonstrated their concern with motivating the users to keep engaged in the application. we could conclude that, by describing these types of ux requirements, the non-technical stakeholders made an important contribution to eliciting requirements that have a great impact on the experience of the end-users.

the results of our second round revealed that the user interface prototypes produced by the developers encompassed different ux requirements in a complementary way. we could see that the prototypes presented a diversity of details about ux, and we could conclude that designing from the proto-personas+ of the different stakeholders allowed the developers to build more comprehensive prototypes while providing minimalist solutions.

to sum up, our study provided two important contributions. first, our investigation brought the discussion of how a non-technical stakeholder can contribute to the elicitation of requirements that are linked to the end-users’ characteristics; our findings revealed that the non-technical stakeholder can be a co-participant in the elicitation process and not just a provider of information. in addition, we extended the proto-persona technique by creating the proto-persona+ and showing that our proposal is suitable for the purpose of including the non-technical stakeholder in the process of eliciting ux requirements. our work also presented as a contribution the structuring of a qualitative analysis that can be replicated in other studies on ux requirements.

as future work, we intend to carry out studies on the quality of the low fidelity prototypes by conducting a usability inspection on them. we also intend to evaluate the quality of the storyboards from the perspective of domain experts, which in our case are the pedagogues.

12 acknowledgements

we thank the financial support of the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001.
we also thank the grant #2013/25572-7, são paulo research foundation (fapesp), and the support of cnpq (311494/2017-0).

references

abelein, u., sharp, h., and paech, b. (2013). does involving users in software development really influence system success? ieee software, 30(6):17–23.
alves, c. and ali, r. (2018). a persona-based modelling for contextual requirements. in requirements engineering: foundation for software quality: 24th international working conference, refsq 2018, utrecht, the netherlands, march 19-22, 2018, proceedings, volume 10753, page 352. springer.
anvari, f., richards, d., hitchens, m., and babar, m. a. (2015). effectiveness of persona with personality traits on conceptual design. in proceedings of the 37th international conference on software engineering - volume 2, pages 263–272, florence, italy. ieee press.
aranda, a. m., dieste, o., and juristo, n. (2016). effect of domain knowledge on elicitation effectiveness: an internally replicated controlled experiment. ieee transactions on software engineering, 42(5):427–451.
ardito, c., costabile, m. f., marsico, m. d., lanzilotti, r., levialdi, s., roselli, t., and rossano, v. (2006). an approach to usability evaluation of e-learning applications. universal access in the information society, 4(3):270–283.
berti, s., paterno, f., and santoro, c. (2004). natural development of ubiquitous interfaces. communications of the acm, 47(9):63–64.
bhattarai, r., joyce, g., and dutta, s. (2016). information security application design: understanding your users. in international conference on human aspects of information security, privacy, and trust, pages 103–113. springer.
billestrup, j., stage, j., nielsen, l., and hansen, k. s. (2014). persona usage in software development: advantages and obstacles. in the seventh international conference on advances in computer-human interactions, achi, pages 359–364, barcelona, spain. citeseer.
brown, j. m., lindgaard, g., and biddle, r. (2011). collaborative events and shared artefacts: agile interaction designers and developers working toward common aims. in 2011 agile conference, pages 87–96.
castro, j. w., acuña, s. t., and juristo, n. (2008). integrating the personas technique into the requirements analysis activity. in 2008 mexican international conference on computer science, pages 104–112.
chimalakonda, s. and nori, k. v. (2013). what makes it hard to apply software product lines to educational technologies? in 4th international workshop on product line approaches in software engineering.
choma, j., zaina, l. a. m., and beraldo, d. (2016a). userx story: incorporating ux aspects into user stories elaboration. in human-computer interaction. theory, design, development and practice - 18th international conference, hci international 2016, toronto, on, canada, july 17-22, 2016, proceedings, part i, pages 131–140.
choma, j., zaina, l. a. m., and da silva, t. s. (2016b). softcoder approach: promoting software engineering academia-industry partnership using cmd, dsr and ese. j. software eng. r&d, 4:8.
clark, r. c. and mayer, r. e. (2007). e-learning and the science of instruction: proven guidelines for consumers and designers of multimedia learning. pfeiffer, 2nd edition.
cooper, a., reimann, r., and cronin, d. (2014). about face 2.0: the essentials of interaction design. john wiley & sons.
davis, f. d. (1989). perceived usefulness, perceived ease of use, and user acceptance of information technology. management information systems research center, 13(3):319–340.
de la vara, j. l., wnuk, k., svensson, r. b., sanchez, j., and regnell, b. (2011). an empirical study on the importance of quality requirements in industry. in 23rd international conference on software engineering and knowledge engineering, pages 438–443. seke.
dias, g. a., da silva, p. m., no junior, j. b. d., and de almeida, j. r. (2011). technology acceptance model (tam): evaluating the technological acceptance of the open journal systems (ojs). informação & sociedade, 21(2):133–149.
dodero, j. m., garcía-peñalvo, f.-j., gonzález, c., moreno-ger, p., redondo, m.-a., sarasa-cabezuelo, a., and sierra, j.-l. (2014). development of e-learning solutions: different approaches, a common mission. ieee revista iberoamericana de tecnologias del aprendizaje, 9(5):72–80.
easterbrook, s., singer, j., storey, m.-a., and damian, d. (2008). selecting empirical methods for software engineering research. in guide to advanced empirical software engineering, chapter 11, pages 285–311. springer.
faily, s. (2008). towards requirements engineering practice for professional end user developers: a case study. in proceedings of the 2008 requirements engineering education and training, pages 38–44. ieee.
fernandez, d. m. and wagner, s. (2015). naming the pain in requirements engineering: a design for a global family of surveys and first results from germany. information and software technology, 57(1):616–643.
ferreira, b., barbosa, s., and conte, t. (2018a). creating personas focused on representing potential requirements to support the design of applications. in proceedings of the 17th brazilian symposium on human factors in computing systems, page 15. acm.
ferreira, b., silva, w., barbosa, s. d. j., and conte, t. (2018b). technique for representing requirements using personas: a controlled experiment. iet software, 12(3):280–290.
ferreira, b., silva, w., jr., e. a. o., and conte, t. (2015). designing personas with empathy map. in the 27th international conference on software engineering and knowledge engineering, seke 2015, wyndham pittsburgh university center, pittsburgh, pa, usa, july 6-8, 2015, pages 501–505.
filho, n. f. d. and barbosa, e. f. (2013). a requirements catalog for mobile learning environments. in proceedings of the 28th annual acm symposium on applied computing, pages 1266–1271. acm.
fisher, r. a. (1922). on the interpretation of χ2 from contingency tables, and the calculation of p. journal of the royal statistical society, 85(1):87–94.
garcia, a., silva da silva, t., and selbach silveira, m. (2017). artifacts for agile user-centered design: a systematic mapping. in proceedings of the 50th hawaii international conference on system sciences (2017).
garrett, j. j. (2010). the elements of user experience: user-centered design for the web and beyond. new riders publishing, thousand oaks, ca, usa, 2nd edition.
gomes, t. c. s., falcão, t. p., and de azevedo restelli tedesco, p. c. (2018). exploring an approach based on digital games for teaching programming concepts to young children. international journal of child-computer interaction, 16:77–84.
gothelf, j. (2012). using proto-personas for executive alignment. uxmagazine.
gothelf, j. and seiden, j. (2013). lean ux: applying lean principles to improve user experience. o’reilly media.
grudin, j. (2006). why personas work: the psychological evidence. in the persona lifecycle, chapter 12, pages 642–663. elsevier inc.
grudin, j. and pruitt, j. (2002). personas, participatory design and product development: an infrastructure for engagement. in pdc’02, pages 144–152.
hadar, i., soffer, p., and kenzi, k. (2014). the role of domain knowledge in requirements elicitation via interviews: an exploratory study. requirements engineering, 19(2):143–159.
jansen, a., van mechelen, m., and slegers, k. (2017). personas and behavioral theories: a case study using self-determination theory to construct overweight personas. in proceedings of the 2017 chi conference on human factors in computing systems, pages 2127–2136. acm.
kashfi, p., nilsson, a., and feldt, r. (2017). integrating user experience practices into software development processes: implications of the ux characteristics. peerj computer science, 3:e130.
kortbeek, c. (2016). interaction design for internal corporate tools.
maceli, m. and atwood, m. (2011). from human crafters to human factors to human actors and back again: bridging the design time – use time divide. in end-user development. is-eud 2011. lecture notes in computer science, volume 6654, pages 76–91. springer.
nielsen, j. (1995). 10 usability heuristics for user interface design. https://www.nngroup.com/articles/ten-usability-heuristics/. online; accessed august 12, 2016.
nielsen, j. and norman, d. (2013). the definition of user experience.
osborn, a. f. (1979). applied imagination. new york: scribner.
palomares, c., quer, c., and franch, x. (2017). requirements reuse and requirement patterns: a state of the practice survey. empirical software engineering, 22(6):2719–2762.
rogers, y., sharp, h., and preece, j. (2015). interaction design: beyond human-computer interaction. john wiley & sons, united states, 4th edition.
salman, i., misirli, a. t., and juristo, n. (2015). are students representatives of professionals in software engineering experiments? in proceedings of the 37th international conference on software engineering - volume 1, pages 666–676. ieee press.
sharma, s. and pandey, s. k. (2014). requirements elicitation: issues and challenges. in 2014 international conference on computing for sustainable global development (indiacom), pages 151–155.
steinmacher, i., conte, t. u., treude, c., and gerosa, m. a. (2015). overcoming open source project entry barriers with a portal for newcomers. in icse ’16: proceedings of the 38th international conference on software engineering, pages 273–284, austin, united states.
strauss, a. and corbin, j. (1998). basics of qualitative research: techniques and procedures for developing grounded theory, volume 4. thousand oaks, ca: sage, 2nd edition.
winckler, m., bach, c., and bernhaupt, r. (2013). identifying user experience dimensions for mobile incident reporting in urban contexts. ieee transactions on professional communication, 56(2):97–119.
journal of software engineering research and development, 2022, 10:5, doi: 10.5753/jserd.2021.1992. this work is licensed under a creative commons attribution 4.0 international license.

first step climbing the stairway to heaven model: results from a case study in industry

paulo sérgio dos santos júnior [ federal institute of education, science and technology of espírito santo | paulo.junior@ifes.edu.br ]
monalessa perini barcellos [ federal university of espírito santo | monalessa@inf.ufes.br ]
rodrigo fernandes calhau [ federal institute of education, science and technology of espírito santo | calhau@ifes.edu.br ]

abstract

context: nowadays, software development organizations have adopted agile practices and data-driven software development aiming at a competitive advantage. moving from traditional to agile and data-driven software development requires changes in the organization’s culture and structure, which may not be easy. the stairway to heaven model (sth) describes this evolution path in five stages. objective: we aimed to investigate how systems theory tools, the gut matrix, and reference ontologies can help organizations in the first transition of sth, i.e., moving from traditional to agile development. method: we performed a participative case study in a brazilian organization that develops software in partnership with a european organization. we applied systems theory tools (systemic maps and archetypes) to understand the organization and identify undesirable behaviors and their causes. then, we used gut matrices to decide which ones should be addressed first, and we defined strategies to change the undesirable behaviors by implementing agile practices. we also used the conceptualization provided by reference ontologies to share a common understanding of agile and help implement the strategies. results: by understanding the organization, a decision was made to implement a combination of agile and traditional practices. the implemented strategies improved software quality and project time and cost. problems due to misunderstanding agile concepts were solved by using reference ontologies, process models, and other diagrams built based on the ontologies’ conceptualization, allowing the organization to experience the agile culture and foresee changes in its business model. conclusion: systems theory tools and the gut matrix aid organizations in moving from traditional to agile development by supporting a better understanding of the organization, finding leverage points of change, and enabling the definition of strategies aligned to the organization’s characteristics and priorities. reference ontologies can be useful to establish a common understanding about agile, enabling teams to be aware of and, thus, more committed to agile practices and concepts.
the use of process models and other diagrams can favor learning the conceptualization provided by the ontologies.

keywords: stairway to heaven, agile, systems theory, gut matrix, ontology

1 introduction

typically, fast-changing and unpredictable market needs, complex and changing customer requirements, and pressures of shorter time-to-market are challenges faced by organizations. to address these challenges, many organizations have started adopting agile development methods with the intention to enhance the organization's ability to respond to change. in emphasizing flexibility, efficiency, and speed, agile practices have led to a paradigm shift in how software is developed (williams and cockburn 2003)(olsson et al. 2012). different flavors of the agile methods have become the de facto way of working in the software industry (rodriguez et al. 2012). in allowing for more flexible ways of working with an emphasis on customer collaboration and speed of development, agile methods help organizations address many of the problems associated with traditional development (dybå and dingsøyr 2008). the adoption of agile practices has enabled organizations to shorten development cycles and increase customer collaboration. however, this has not been enough. there has been a need to learn from customers also after deployment of the software product. this requires practices that extend agile practices, such as continuous deployment (i.e., the ability to deliver software more frequently to customers and benefit from frequent customer feedback), which enables shorter feedback loops, more frequent customer feedback, and the ability to more accurately validate whether the developed functionalities correspond to customer needs and behaviors (olsson et al. 2012). therefore, organizations should evolve from traditional development towards data-driven and continuous software development. continuous software engineering (cse) aims to establish a continuous flow between software-related activities, taking into consideration the entire software life cycle. it seeks to transform discrete development practices into more iterative, flexible, and continuous alternatives, keeping the goal of building and delivering quality products according to established time and costs (fitzgerald and stol 2017). therefore, a continuous software engineering approach is based on agile and continuous practices driven by development and customer data. considering that organizations struggle with the changes to be made along the path and with the order in which to implement them, olsson et al. (2012) proposed the stairway to heaven model (sth), which describes the typical successful evolution of an organization from traditional to continuous and customer data-driven development. the model comprises five stages, where the first transition consists in moving from traditional to agile development. this transition requires a careful introduction of agile practices, a shift to small development teams, and a focus on features rather than components. in this paper, we report the experience of a brazilian organization (here called organization a for anonymity reasons) which decided to evolve from traditional to agile, continuous, and data-driven software development. for that, we have followed the sth model (olsson et al. 2012).
we selected this model because it represents in a simple way the main stages an organization should follow to move from a traditional to a continuous software engineering approach based on data-driven and agile development. moreover, sth does not prescribe the practices that should be performed at each stage; thus, there is flexibility to define and implement them according to the organization's characteristics and priorities. in this paper, our focus is on the first transition of the sth model. although there is an increasing number of organizations moving from traditional to agile, implementing the changes needed for the first transition prescribed in sth is not trivial, because it involves changes not only in the development process but also in the organization's culture. moreover, there is no single "right" way to implement agile practices in an organization, because each agile practice needs to be tailored to fit the business goals, culture, environment, and other aspects of the organization. therefore, organizations should find their own way to go through the path from traditional to agile (karvonen et al. 2015). organization a has a particular characteristic that needs to be considered when defining strategies to implement agile practices: its software projects are built in partnership with a european organization (here called organization b). in this partnership, organization b is responsible for the software requirements specification process, while organization a is responsible for the design, coding, testing, and deployment processes. furthermore, organization b is responsible for the communication between organization a and the project client. both organizations a and b work in traditional but often ad hoc manners. this way of working has brought problems, such as budget overruns, teams divided into disciplines (testers, architects, programmers, etc.), causing many intermediary delivery points in the organization and increasing delays between them, and long periods required to deploy new versions of the software products (williams and cockburn 2003)(olsson et al. 2012)(karvonen et al. 2015). organization a was in the first stage of sth and, in order to evolve, the first step was to move towards becoming an agile organization. two main challenges were faced in this context: (i) how to move from a traditional development culture to an agile culture, and (ii) how to implement agile practices in an organization that shares requirements-related activities with another organization and does not have direct access to the project client. to overcome these challenges, it would be necessary to get to know the organization so that it would be possible to define suitable strategies to implement agile practices. thus, we employed an approach that combined systems theory tools (mainly systemic maps and archetypes) (meadows 2008)(sterman 2010), gut matrix (kepner and tregoe 1981), and reference ontologies (guizzardi 2007) to identify the path to implement agile practices and get into an agile culture based on the organizational characteristics and context. systems theory tools were chosen because they allow understanding how different variables relate to each other in an organizational environment. thus, by using such tools, it is possible to understand how processes, practices, culture, and other factors affect the software development process and the produced results. this helps identify aspects that should be addressed in improvement actions.
the first and third authors have knowledge of and experience with systems theory and saw an opportunity to apply it in organization a. gut matrix was selected because it helps prioritize actions and was already known by organization a. finally, reference ontologies were used because they have been recognized as an important instrument to deal with knowledge-related problems, supporting communication and learning (guizzardi 2007). the authors have successfully experienced the use of ontologies as knowledge artifacts in different contexts (e.g., (ruy et al. 2017), (santos et al. 2019), (fonseca et al. 2016)). they developed the scrum reference ontology (sro) (santos jr et al. 2021a), which provides knowledge that aids in the understanding of scrum in a broader software engineering context and is suitable for meeting a learning need identified in the study addressed in this paper. as the main results perceived from the experience reported here, we highlight: (i) it was possible to understand the organization's behavior and identify behavior patterns and leverage points of change; (ii) strategies were defined to implement agile practices by changing undesirable behaviors and focusing on leverage points, taking the organization's characteristics into account; (iii) by implementing the strategies, organization a improved software quality, project time, and cost, and started to develop an agile culture; (iv) by using the conceptualization provided by reference ontologies, the team learned agile concepts and practices, which is useful to implement strategies aiming at the agile organization; and (v) a process based on systems theory to aid organizations in defining strategies to implement agile practices arose from the study. this work brings contributions to researchers and practitioners. the study can serve as an example for other organizations similar to organization a, and the process resulting from the study can be used by other organizations. moreover, the way ontologies were used to provide knowledge for the team can inspire others to make the most of this powerful instrument for knowledge structuring, representation, and sharing. furthermore, researchers can reflect on and provide advances in the use of systems theory to support the definition of strategies in the agile software development context. this paper extends (santos jr et al. 2020) mainly by exploring how reference ontologies were used to help the team learn about scrum concepts and practices in the case reported here. we also illustrate the roles of systems theory tools, gut matrix, and reference ontologies in the study, and present additional information about organizations a and b and a new systemic model produced during the study. the paper is organized as follows: section 2 presents the theoretical background; section 3 discusses related work; section 4 presents the study planning, execution, and results; section 5 discusses threats to validity; and section 6 presents our final considerations and future work.

2 background

2.1 stairway to heaven

traditional software development is organized sequentially, handing over intermediate artifacts (e.g., requirements, designs, code) between different functional groups in the organization. this causes many handover points that lead to
problems such as time delays between the handovers of different groups and large amounts of resources applied to creating intermediate artifacts that, to a large extent, are replacements for human-to-human communication (bosch 2014). in agile software development, the notion of cross-functional, multidisciplinary teams plays a central role. these teams have the different roles necessary to take a customer's need all the way to a delivered solution. moreover, the notion of small, empowered teams, the backlog, daily stand-up meetings, and sprints guide software development through shorter cycles and help bring software development closer to the client (bosch 2014). moving from traditional to agile development is the first transition prescribed in the stairway to heaven model (sth) (olsson et al. 2012). sth describes the evolution path organizations follow to successfully move from traditional to data-driven software development. it comprises five stages: traditional development, agile organization, continuous integration, continuous deployment, and r&d as an innovation system. in a nutshell, organizations evolving from traditional development start by experimenting with one or a few agile teams. once these teams are successful, agile practices are adopted by the organization. as the organization starts showing the benefits of working agile, system integration and verification become involved, and the organization adopts continuous integration. once it runs internally, lead customers often express an interest in receiving software functionality earlier than through the normal release cycle: they want continuous deployment of software. the final stage is where the organization collects data from its customers and uses its customer base to run frequent feature experiments to support customer data-driven software development (olsson et al. 2012). many organizations have moved from traditional to agile. there are many ways of doing that, and each organization should consider its business goals, culture, environment, and other aspects to find the best way to go through the path. in the experience reported in this paper, we have used systems theory tools, gut matrix, and reference ontologies, which are briefly introduced in the following.

2.2 systems theory

systems theory has been used in industry and academia to support the (re)design of organizations (sterman 1994)(meadows 2008)(sterman 2010). it sees an organization as a system, consisting of elements (e.g., teams, artifacts, policies) and interconnections (e.g., the relation between the development team, the software artifacts it produces, and the policies that influence their production) coherently organized in a structure that produces a characteristic set of behaviors, often classified as its function or purpose (e.g., the development team produces a software product aiming to accomplish its function in the organization) (meadows 2008). in the systems theory literature, there are several tools that support understanding the different elements and behaviors of a system, such as systemic maps and archetypes (meadows 2008)(sterman 2010). a systemic map (also known as a causal loop diagram) allows representing the dynamics of a system by means of the system borders, relevant variables, their causal relationships, and feedback loops.
a positive causal relationship means that two variables change in the same direction (e.g., an increase in the number of bad design decisions causes an increase in software defects), while a negative causal relationship means that two variables change in opposite directions (e.g., an increase in test efficacy causes a decrease in software defects). feedback loops are mechanisms that change variables of the system. there are two main types: balancing and reinforcing feedback loops. the former is an equilibrating structure in the system and is a source of stability and resistance to change. the latter compounds change in one direction with even more change. one beneficial effect of using systemic maps is that they help identify archetypes. an archetype is a common structure of the system that produces a characteristic pattern of behavior. for example, the archetype shifting the burden occurs when a problem symptom is "solved" by applying a symptomatic solution, which diverts attention away from a more fundamental solution (kim 1994). the archetype fix that fails, in turn, occurs when a fix that is effective in the short term creates side effects, a "fail", for the long-term behavior of the system (kim 1994). usually, fix that fails appears inside another, more complex archetype, such as shifting the burden. each archetype has a corresponding modeling pattern. therefore, by analyzing a systemic map, it is possible to identify archetypes by looking for their modeling patterns. archetypes and systemic maps can be useful to identify problems and possible leverage points to solve them. leverage points are points in the system where a small change can lead to a large shift in behavior (meadows 2008).
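to make the loop classification concrete, here is a minimal python sketch (ours, not part of this study's artifacts): a causal loop diagram is encoded as a signed directed graph, and a loop is classified by the product of its edge signs, following the standard system-dynamics rule that a loop with an even number of negative links is reinforcing and one with an odd number is balancing. the variable names echo the systemic map discussed later (figure 2).

# a causal loop diagram as a signed directed graph.
# edge sign +1: the variables move in the same direction; -1: opposite.
EDGES = {
    ("defects in software artifacts", "new urgent development activities"): +1,
    ("new urgent development activities", "defects in software artifacts"): -1,
    ("defects in software artifacts", "software quality techniques"): +1,
    ("software quality techniques", "defects in software artifacts"): -1,
    ("new urgent development activities", "software quality techniques"): -1,
}

def classify_loop(nodes: list[str]) -> str:
    """classify a feedback loop given as a node cycle [a, b, ..., a]:
    sign product +1 -> reinforcing, -1 -> balancing."""
    product = 1
    for src, dst in zip(nodes, nodes[1:]):
        product *= EDGES[(src, dst)]
    return "reinforcing" if product == +1 else "balancing"

# symptomatic-fix loop: defects -> urgent activities -> defects
print(classify_loop(["defects in software artifacts",
                     "new urgent development activities",
                     "defects in software artifacts"]))          # balancing

# urgent activities crowd out quality techniques, letting defects grow
print(classify_loop(["new urgent development activities",
                     "software quality techniques",
                     "defects in software artifacts",
                     "new urgent development activities"]))      # reinforcing

detecting an archetype such as shifting the burden then amounts to searching the graph for its modeling pattern, e.g., a balancing loop sharing variables with a reinforcing loop of side effects.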
2.3 gut matrix

the gut matrix allows prioritizing the resolution of problems, considering that the resources to solve them are limited (kepner and tregoe 1981). the prioritization is based on: gravity (g), which describes the impact of the problem on the organization; urgency (u), which refers to how much time is available to address the problem; and tendency (t), which measures the predisposition of a problem to get worse over time.

2.4 reference ontology

ontologies have been recognized as important instruments to solve knowledge-related problems. an ontology is a formal, explicit specification of a shared conceptualization (studer et al. 1998). ontologies can be developed for communication purposes (reference ontologies) or for computational solutions (operational ontologies). a reference ontology is a special kind of conceptual model representing a model of consensus within a community. it is a solution-independent specification with the aim of making a clear and precise description of the domain for the purposes of communication, learning, and problem-solving (baskerville 1997). in the work described in this paper, we used the scrum reference ontology (sro) (santos jr et al. 2021a), which addresses the main aspects of scrum, such as ceremonies, activities, roles, artifacts, and so on. the first and second authors of this paper are also authors of sro. it is a reference ontology of the software engineering ontology network (seon) (ruy et al. 2016), whose specification is available at http://nemo.inf.ufes.br/en/projects/seon/. seon is an ontology network that contains several integrated ontologies describing various subdomains of the software engineering domain (e.g., software requirements, software process, software measurement, software quality assurance, software project management, etc.). by providing a comprehensive and consistent conceptualization of the software engineering domain, seon has been successfully used to solve knowledge-related and interoperability problems in that domain (e.g., (fonseca et al. 2017)(ruy et al. 2017)(bastos et al. 2018)(santos jr et al. 2021a)). sro reuses concepts from other seon ontologies, namely: the software process ontology (spo) (briguente et al. 2011), the enterprise ontology (eo) (ruy et al. 2014), and the reference software requirements ontology (rsro) (duarte et al. 2018). by doing that, sro connects scrum concepts to more general software engineering concepts, enabling a better understanding of scrum in a broader software development context. sro was developed by following the sabio method (falbo 2014) and was evaluated through verification and validation activities. detailed information about sro, including its conceptual models, descriptions, and a study in which we used sro for semantic interoperability purposes, can be found in (santos jr et al. 2021a).

3 related work

some works have reported the use of systems theory in the agile development context. for example, vidgen and wang (2009) proposed a framework based on systems theory that identifies enablers and inhibitors of agility and discusses capabilities that should be present in an agile team. gregory et al. (2016) discuss challenges to implementing agile and suggest some organizational elements that could be used to do that. considering the sth context, karvonen et al. (2015) used bapo categories (business, architecture, process, and organization) to identify some practices for each sth step. however, they do not discuss how to understand the organization to establish proper strategies to implement them. considering scenarios involving more than one organization to produce software, de sousa et al. (2016) discuss agile transformation in brazilian public institutions. different from organizations a and b, which work together to produce software for the client, brazilian public institutions hire software organizations to develop software (i.e., the public institution is a client of the hired organization). moreover, different from the scenario discussed in (de sousa et al. 2016), in our study organization a needed to develop skills, processes, and culture that enabled it to deal with multicultural issues, because organization a, organization b, and the clients are in different countries and have different cultures. none of the aforementioned works use systems theory tools, gut matrix, and reference ontologies to help organizations define strategies to implement agile practices, as we did in our study. some works address aspects related to developing software with distributed teams (jim et al. 2009)(prikladnicki and audy 2010). they show that there are many challenges related to communication, knowledge management, coordination, and requirements management caused by differences in location, time, and culture. aiming to address these issues, l'erario et al. (2020) propose a framework that provides some concepts, a structure, and a flow of communication in distributed software projects. ali and lai (2018), in turn, focus on requirements communication and propose to use a requirements graph combined with a software requirements specification document to help the stakeholders establish a better understanding of software requirements.
similar to our work, the aforementioned works aim to support organizations in which the software development process is distributed. however, differently from our work, those works consider software development geographically distributed among several development teams of the same organization. as we previously discussed, our work considered two organizations working as one in the projects, with two teams in different countries and each team controlling part of the software development process. we propose to use systems theory tools, gut matrix, and reference ontologies to create strategies that minimize the impact caused by culture, time, and distance and, sometimes, turn them into a competitive advantage. we believe that our work can contribute to organizations that work with geographically distributed teams by providing useful knowledge to create tailored strategies. for example, they can be inspired by our strategy to communicate requirements, which uses bdd (behavior driven development) (wynne et al. 2017) as a protocol to specify, communicate, and validate requirements.

4 case study: planning, execution, and results

a participative case study was selected as the research method in this study because two researchers acted as consultants in organization a and, thus, were participants in the process being observed (baskerville 1997). together with the other participants, they gathered information to understand the organization and defined strategies to implement agile practices. thus, the researchers had some control over some intervening variables.

4.1 study design

4.1.1 diagnosis

organization a is a brazilian software development organization that works together with a european organization (organization b) to develop software products for european clients. it has 30 developers organized in teams managed by tech leaders. organization b elicits requirements with clients, and organization a is in charge of developing the corresponding software. as a consequence of the increasing number of projects and team members, added to the lack of flexible processes, some problems emerged, such as late and over-budget projects, an increase in software defects, overloading of the teams due to rework on software artifacts, and communication issues among the client, organization a, and organization b. aiming to minimize these problems, in the first semester of 2019 organization a decided to implement scrum practices, but without success. according to the directors, the main difficulties were due to non-direct communication with the client and included: difficulty in defining the product backlog, selecting a product owner, and carrying out scrum ceremonies that need the client's feedback. furthermore, they pointed out that agile culture demands knowledge, and its clients, business partners, and developers were not prepared for it. other factors that harmed the scrum implementation were: (i) teams without self-management characteristics, (ii) difficulties in internal communication, (iii) lack of a feedback culture, and (iv) lack of openness and other scrum values. moreover, organization a has had many systemic issues, such as: (a) directors much focused on operational and technological issues, (b) lack of management professionals, (c) focus on short-term issues instead of long-term ones, and (d) lack of focus on applying strategic and systemic thinking.
in addition, the first and third authors noticed that organization b has had a traditional culture based on linear and non-adaptive processes and methods. this scenario indicated to us that the particular characteristics and context of the organization had not been considered in the first attempt to implement agile practices. therefore, at the beginning of 2020, we proposed to use sth as a reference model to evolve organization a from traditional to data-driven software development in a long-term process improvement program. the first step: move from traditional to agile. considering the peculiar scenario of organization a, we decided to use systems theory to understand the organization in a systemic way. then, we used gut matrix to support the prioritization of problem resolution, and reference ontologies to provide common knowledge about agile development.

4.1.2 planning

the study goal was to analyze the use of systems theory tools (particularly systemic maps and archetypes), gut matrix, and reference ontologies to help define strategies to implement agile practices when the organization is moving from traditional to agile development. by strategies, we mean actions or plans established to implement agile development. aligned with this goal, the following research question was defined: are systems theory, gut matrix, and reference ontologies useful to define suitable strategies for an organization to move from traditional to agile development? the expected outcomes were: (i) a view of important aspects of the organization by means of systemic maps; (ii) prioritization of problems and causes to be addressed; (iii) strategies to address problems and implement agile practices; (iv) artifacts built based on reference ontologies that help the team learn agile concepts and practices; and (v) a systems theory-based process to define strategies to move from traditional to agile. figure 1 illustrates how systems theory tools (particularly systemic maps and archetypes), gut matrix, and reference ontologies (blue circles in figure 1) were used in the study. reference ontologies and systems theory tools were used in the problem domain (represented by the yellow region in figure 1). ontologies provide the conceptual perspective, while systemic maps and archetypes afford a dynamic perspective. in other words, the former supports understanding the domain itself (agile) by providing structural knowledge, while the latter helps understand the organization in which the problems manifest and how they manifest. gut matrix, in turn, was used in the solution domain (represented by the green region in figure 1) as a means to prioritize the problems to be addressed, providing, this way, a problem-solving perspective.

figure 1. overview of the approach used in this work.

to be more specific, ontologies were used to provide a common conceptualization to support communication among the organizations and their employees in the software development context. systemic maps, in turn, aimed to make explicit the variables and relations present in the dynamics of the system formed by the organizations. finally, gut matrix was used to support the decision-making process that guided the solution process. the study participants who directly took part in the interviews for data collection and results evaluation were the two directors (software development director and sales director), one tech leader, and two developers. the first and third authors worked as consultants in organization a and, thus, also participated in the study.
working together with the other participants, they were responsible for creating the systemic maps and gut matrices, as well as for defining the strategies to be implemented to move from traditional to agile software development. once these artifacts were created, they were validated with the team. for example, the systemic maps were created based on information provided by the team. then, the team evaluated them in meetings and provided feedback so that we reached the maps shown in the next section. the second author did not interact directly with organization a. she worked as an external reviewer, evaluating the produced artifacts and helping the other authors improve them.

4.2 study execution and data collection

4.2.1 data collection

data collection involved interviews, the development of systemic maps and the gut matrix, and the definition of strategies to implement agile practices.

a. initial interviews

data collection started with interviews to gather general information about the organization. six interviews were conducted: four with the directors and two with the developers and the tech leader. participants were told to feel free to talk as much as they wanted to. each interview lasted about 90 minutes. the funnel questions technique was used, i.e., the interview started with general questions (e.g., "what kind of software does the organization develop?", "how is the software development process?") and then went deeper into more specific points (e.g., "tell me more about the software test activity"). the interviews were recorded, transcribed, and validated with each participant. the interviews with the directors aimed to get information about the following aspects: organizational environment, culture, rules of relationship with partners, future plans, software development process, software development issues, and agile knowledge. among the information provided by the directors, they pointed out that some problems were caused by misunderstood software requirements or a project scope that was not clearly defined. according to them, organization b did not describe requirements in a consistent and clear way. the interviews with the tech leader and developers aimed at understanding software development problems from their perspective and how familiar they were with agile methods and practices. the problems mentioned by the directors were also reported by the tech leader and developers. when asked about team organization, they pointed out that the teams were not self-organized. on the contrary, tech leaders were responsible for allocating tasks, coordinating team members, establishing deadlines, and monitoring projects. moreover, the team's knowledge of agile was limited.

b. systemic maps

information obtained in the interviews was used to build systemic maps. figure 2 shows a fragment of one of the developed systemic maps. the elements in blue in the figure form a modeling pattern that reveals the presence of the archetype shifting the burden.

figure 2. fragment of systemic map (1).

as previously said, organization b is responsible for eliciting requirements with the client, specifying them, and sending them for organization a to develop the software.
the development teams of organization a often misunderstand the requirements that describe the software, component, or functionality to be developed, since organization b produces poorly specified requirements, neither adopting a technique nor following a pattern to describe them. misunderstood requirements contribute to increasing the number of defects in software artifacts, since design, code, and tests are produced based on the requirements informed by organization b. defects in software artifacts make organization a mobilize (and often overload) the development team to fix defects by performing new urgent development activities, which decrease the number of defects in software artifacts. these urgent activities are performed as fast as possible, aiming not to delay other activities. thus, they do not properly follow good software quality practices. moreover, they contribute to increasing the project cost and time (late and over-budget project). defects in software artifacts increase the need for using software quality techniques that, when used, lead to fewer defects in software artifacts. this causal relationship has a delay, since the effect of using software quality techniques can take a while to be perceived. as shown in figure 2, the archetype shifting the burden is composed of two balancing feedback loops and one reinforcing feedback loop. the balancing feedback loops (between new urgent development activities and defects in software artifacts, and between defects in software artifacts and software quality techniques) mean that the involved variables influence each other in a balanced and stable way (e.g., the higher/lower the number of defects in software artifacts, the more/fewer new urgent development activities are performed). in the reinforcing feedback loop, new urgent development activities are a symptomatic solution that leads to defects fixed through rework, a side effect, because once urgent development activities fix the defects in software artifacts, organization a feels like the problem was solved. this, in turn, decreases the need for using software quality techniques, which is the more fundamental solution. as a result, software artifacts continue to be produced with defects, overloading the development team with new urgent development activities. shifting the burden is a complex behavior structure because the balancing and reinforcing loops move the system (organization a) in a direction (new urgent development activities) usually other than the one desired (software quality techniques). new urgent development activities contribute to increasing project cost and time (late and over-budget project) because these activities were not initially planned in the project. when organization b does not properly define the project scope (scope poorly defined), organization a may allocate a team not suitable for the project, contributing to defects in software artifacts and to changes in the project team during the project. usually, when the team is changed, the new members need to acquire knowledge about the project. moreover, often the new members are more experienced and thus more expensive, which contributes to a late and over-budget project. to change the project team, members can be moved from one project to another, causing a deficit in other project teams. furthermore, there is a balancing loop between changes in the project team and defects in software artifacts. the former may cause the latter due to the instability inserted into the team.
the latter, in turn, contributes to the former because defects in software artifacts may lead to the need to change the team. there is a delay in this relationship because it can take a while to notice defects and the need to change the team. finally, a poorly defined scope causes unrealistic deadlines, which contributes to late and over-budget projects. figure 3 illustrates another fragment of the developed systemic maps, showing variables related to different organizational levels. as observed in figure 3, organization b is responsible for the direct communication with the client, i.e., organization a depends on organization b to obtain information from the client. this causes, in organization a, a lack of contact with the final client, which contributes to low commitment from the team to the project's goals, since the team is not empowered and loses motivation. this, in turn, leads to non-self-organized teams, because the team members do not have the opportunity to implement values and practices to become self-organized, which keeps the team away from the final client. these three variables create a reinforcing loop that prevents the organization from having more proactive and committed teams.

figure 3. fragment of the systemic map (2).

non-self-organized teams and low commitment from the team to the project's goals contribute to a high involvement of the directors at the operational level, because they need to support the teams in solving problems (e.g., scope poorly defined, unrealistic deadline, and late and over-budget project, shown in figure 2). as a consequence, they do not have enough time to attend to the tactical and strategic levels, which causes damage to organization growth, because the directors do not have time to plan and implement strategies that allow getting new clients, reducing costs, etc. the previous paragraph describes an example of the archetype fix that fails that impacted the operational, tactical, and strategic levels of organization a. the archetype fix that fails is composed of a balancing feedback loop that is intended to achieve a particular result or fix a problem, and a reinforcing feedback loop of the unintended consequences. the balancing feedback loop occurs when there is a high involvement of the directors at the operational level trying to resolve problems of projects because of the low commitment from the team to the project's goals. the reinforcing feedback loop, in turn, occurs when the directors do not have enough time to attend to the tactical and strategic levels because there is a high involvement of the directors at the operational level, resulting in damage to organization growth. this loop affects different organizational levels, from operational to strategic, and hampers the organization's evolution and growth.

c. gut matrix

after getting a comprehensive view of the organization and how it behaves, we reflected on the behaviors on which the strategies should be focused. thus, we created a gut matrix to identify and prioritize behaviors of the system that are not fruitful, i.e., undesirable behaviors. they were identified mainly from the systemic maps. for example, from the fragment depicted in figure 2, based on the positive causal relationship between misunderstood requirements and defects in software artifacts, the following undesirable behavior was identified: software artifacts are developed based on misunderstood requirements.
from the shifting the burden archetype, we identified: software quality techniques are not often applied to build software artifacts. to complement the information provided by the systemic maps, we used information from the interviews to look for behaviors the literature points out as desirable in organizations moving to agile (e.g., self-organized teams) (leffingwel 2016). after identifying the undesirable behaviors, the study participants validated and prioritized them considering the gut dimensions. each dimension was evaluated considering values from 1 (very low) to 5 (very high). 13 undesirable behaviors were identified. table 1 shows a fragment of the gut matrix.

table 1. fragment of gut matrix.

#    undesirable behavior                                                     g  u  t  g×u×t
ub1  software artifacts are developed based on misunderstood requirements    5  5  5  125
ub2  software quality techniques are not often applied to build artifacts    5  5  4  100
ub3  projects are late and over budget                                       5  5  4  100
ub4  organization has inconsistent knowledge of agile methods                5  5  4  100
ub5  teams are not self-organized                                            5  4  4  80
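as an aside, the prioritization arithmetic behind table 1 is simple enough to script. the sketch below (ours, using the scores reported in table 1) ranks the undesirable behaviors by the product g×u×t:

# gut prioritization: each behavior is scored 1-5 on gravity (g),
# urgency (u) and tendency (t), and ranked by the product g*u*t.
behaviors = {
    "ub1: artifacts built on misunderstood requirements": (5, 5, 5),
    "ub2: quality techniques not often applied":          (5, 5, 4),
    "ub3: projects are late and over budget":             (5, 5, 4),
    "ub4: inconsistent knowledge of agile methods":       (5, 5, 4),
    "ub5: teams are not self-organized":                  (5, 4, 4),
}

ranked = sorted(behaviors.items(),
                key=lambda kv: kv[1][0] * kv[1][1] * kv[1][2],
                reverse=True)
for name, (g, u, t) in ranked:
    print(f"{name} -> {g * u * t}")  # ub1 -> 125, ub2-ub4 -> 100, ub5 -> 80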
for each undesirable behavior, we analyzed the systemic maps and the interviews and identified its causes. (ub1) software artifacts are developed based on misunderstood requirements because (c1) requirements are not satisfactorily described and (c2) there is poor communication between the client and the development team. c1 was identified directly from the systemic map. c2 was based on information about the procedure followed by organization a to communicate with the client: when there was any doubt about requirements, the contact was made mainly through email or comments on issues in the project management system. only organization b has direct contact with the client. c1 and c2 are also causes of (ub2) software quality techniques are not often applied to build software artifacts, since the lack of well-defined requirements and of direct contact with the client impacts verification and validation activities. moreover, there is a (c3) lack of clear and objective criteria to evaluate results and (c4) large deliverables, which make it difficult to evaluate results. as can be noticed in figure 2, projects are late and over budget (ub3) mainly because of c1 and (c5) unstable scope and deadline. moreover, (c6) unsuitable team allocation and c4 also affect project cost and time: the former because low productivity impacts project time and, thus, cost; the latter because it is difficult to estimate large projects. regarding (ub4) organization has inconsistent knowledge of agile methods, some members of the organization had previous experience with agile methods in other companies, others had a previous unsuccessful experience in organization a, and others had not experienced agile methods at all. most of the members were not sure about agile concepts and practices. therefore, this undesirable behavior is caused by (c7) organization's members had different experiences with agile and (c8) agile concepts and practices are not well-known by the organization. finally, teams are not self-organized (ub5) due to the (c9) traditional development culture that produces functional and hierarchical teams. after identifying the causes of the undesirable behaviors, the study participants validated them. table 2 shows the identified causes and the respective undesirable behaviors.

table 2. causes of undesirable behaviors.

#   cause                                                         ub1  ub2  ub3  ub4  ub5
c1  requirements are not satisfactorily described                  x    x    x
c2  poor communication between client and development team         x    x
c3  lack of clear and objective criteria to evaluate results            x
c4  large deliverables                                                  x    x
c5  unstable scope and deadline                                              x
c6  unsuitable team allocation                                               x
c7  organization's members had different experiences with agile                   x
c8  agile concepts and practices are not well-known                               x
c9  traditional development culture                                                    x

d. strategies

the causes of undesirable behaviors and the prioritization made in the gut matrix showed us leverage points of the system, i.e., points that, if changed, could change the system behavior. therefore, we defined strategies to help organization a move towards the second stage of sth by changing leverage points of the system and thus creating new behaviors in the system in that direction. we started by defining strategies to change the undesirable behaviors at the top of the gut matrix and the causes related to more than one undesirable behavior. after we had defined the strategies, we presented them to the team in a meeting, and they provided feedback that helped us make the strategies more suitable for the organization. next, we present four strategies defined to address the causes presented in table 2. considering organization a's characteristics, mainly its partnership with organization b, the strategies combined agile and traditional practices. agile approaches bring the culture of self-organized teams, shorter development cycles, user stories, and smaller deliverables, among other notions (karvonen et al. 2015)(leffingwel 2016). traditional approaches were used to complement agile practices; after all, agile methods usually do not detail how to manage some aspects of a software project, such as costs and risks. the first strategy, new procedure to communicate requirements (s1), consisted of establishing a new procedure to be followed by organizations a and b regarding requirements and communication, aiming to address c1 and c2. due to business agreements, a big change in organization b was not possible. for example, we could not change the fact that only organization b could directly contact the project client. hence, it was defined that requirements would be sent from organization b to the project tech leader, who would rewrite the requirements as user stories and validate them with organization b. by representing requirements as user stories, the project tech leader also needs to represent their acceptance criteria, which helps address c3. moreover, to properly define the acceptance criteria, the tech leader needs to obtain detailed information about the requirement, stimulating organization b to get such information from the client, which indirectly improves communication with the client. only user stories written according to the defined template and validated with organization b proceed to the next development activities. we also suggested the use of a template based on bdd (behavior driven development) (wynne et al. 2017) and gherkin syntax (binamungu et al. 2020), describing business rules, acceptance criteria, and scenarios, to serve as a protocol to communicate requirements among organizations a and b and the client (a sketch of such a story is shown below). it is worth mentioning that we were not allowed to ask organization b to write the requirements itself following the new guidelines, because this change was beyond the partnership agreements. in this strategy, we designated organization b to play the product owner role. this way, it is not only a business partner, but it also represents the client's interests and has responsibilities in this context.
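to illustrate the gate in s1, the sketch below shows a user story in the suggested bdd/gherkin-flavored template together with a hypothetical check (the field names and the checking rule are ours, not the organizations' actual agreement) that only lets a story proceed when it carries the story sentence, acceptance criteria, and at least one gherkin-style scenario:

# a user story in a bdd/gherkin-flavored template (illustrative content).
STORY = """\
as a shop customer, i want to pay by credit card
so that i can finish my order online.

acceptance criteria:
- only valid, non-expired cards are accepted.

scenario: paying with an expired card
  given a customer with an expired credit card
  when the customer confirms the payment
  then the order is rejected with an explanatory message
"""

def ready_for_development(story: str) -> bool:
    """return True only if the story carries all mandatory template parts."""
    text = story.lower()
    has_story = all(k in text for k in ("as a", "i want", "so that"))
    has_criteria = "acceptance criteria" in text
    has_scenario = all(k in text for k in ("given", "when", "then"))
    return has_story and has_criteria and has_scenario

print(ready_for_development(STORY))  # True: the story may enter development

in the actual process this check was a human validation step between the tech leader and organization b; the point of the sketch is only to make the template's mandatory parts explicit.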
with this strategy, we also aimed to minimize the symptomatic solution (new urgent development activities) indicated in the shifting the burden archetype identified in the systemic map. according to meadows (2008), the most effective strategy for dealing with a shifting the burden structure is to employ the symptomatic solution while developing the fundamental solution. thus, it is possible to resolve the immediate problem and also work to ensure that it does not return. by improving requirements descriptions and defining clear acceptance criteria, software quality techniques (e.g., verification and validation), which are the fundamental solution identified in the shifting the burden archetype, can be properly applied. another strategy, budget and time globally and locally managed through short development cycles (s2), focused on changing the undesirable behavior ub3 (projects are late and over budget). again, to change that, organization a depended on changes in organization b. therefore, it was established that, at the beginning of a project, organizations a and b should agree on the project scope, deadline, budget, and involved risks. the project characteristics (e.g., technologies, domain of interest, platform, etc.) should also be clearly established. the project team would not be allocated before this agreement. by properly aligning information about the project between organizations a and b, it would be possible to allocate a development team with skills and maturity suitable for the project. by doing that, c5 and c6 would be minimized. complementarily, it was decided to change the development process as a whole. in organization a's business model, when a project is contracted by a client, there is usually a cost and time associated with it. this prevented us from using a pure agile development process, where costs are dynamically established. as a strategy to implement tailored agile practices, it was defined that, after the requirements are validated, the development team (tech leader and developers) selects the requirements to be developed in a short development cycle (i.e., a sprint), defines tasks, and estimates the time and costs related to them. this information is aligned between organizations a and b. this way, organization b manages time and budget at the project level, while organization a manages time and budget in the sprint context. once a week, monitoring meetings are performed to check time and budget performance. during the sprint, meetings based on the scrum ceremonies are carried out in a flexible way. for example, if the team informs that there is nothing to report on a given day, the daily meeting is not performed. meetings that depend on the client's feedback should be carried out with organization b (in the product owner role). by breaking the development process into shorter cycles, c4 is addressed, since the product is also decomposed into smaller deliverables. this strategy also contributes to treating c9, as it changes the traditional development culture. aiming to change the way teams are organized in organization a (ub5) and thus address c9, the strategy self-organized teams (s3) was defined to implement the squad and guild concepts (leffingwel 2016). a squad is a team with all the skills and tools needed to develop and release a project.
it is self-organized and can make decisions about its way of working. for example, a squad can define the project development timebox (sprint) and how to implement some practices of strategies s1 and s2 (e.g., the use of bdd and how flexible the scrum ceremonies can be in the project). the members are responsible for creating and maintaining one or more projects. a squad is composed of developers and a tech leader, who is responsible for communicating with organization b, mainly regarding aspects related to budget, time, and requirements. a guild is a team responsible for defining standards and good practices to be used by all squads. a guild is composed of members with expertise in the subject of interest (e.g., a senior programmer can define good programming practices). its purpose is to record and share good practices among the squads in the organization, aiming at achieving a homogeneous level of quality in the projects. to address c7 and c8, which cause the organization to have inconsistent knowledge of agile methods (ub4), we defined agile common conceptualization (s4) as a strategy to use reference ontologies to provide a common conceptualization about the software engineering domain as a whole, and about the agile development process in particular. we used ontologies from seon (ruy et al. 2016) to extract the view relevant to understanding agile development. it contains a conceptual model fragment, axioms, and textual descriptions that provide an integrated view of agile and traditional development, defining concepts in a clear, objective, and unambiguous way. we suggested the use of seon because its ontologies have been developed based on the literature and several standards, providing a consensual conceptualization. moreover, as we discussed in section 2, we have successfully used it in several interoperability and knowledge-related initiatives. the seon view used in the study focuses on the scrum reference ontology (sro) and can be seen in (santos jr et al. 2021a). to make it easier for the teams to learn and apply the conceptualization provided by the ontology, the authors created complementary artifacts that combine graphical and textual elements. we show some of the produced artifacts in section 4.3.2. table 3 summarizes the defined strategies, the leverage points (causes) addressed by them, and the main agile concepts involved. it is worth noticing that some agile concepts were indirectly addressed. for example, although we did not directly use the product backlog in s1, the set of requirements agreed with organization b works as such. similarly, in s2, when the team selects the requirements to be addressed in a development cycle, we are applying the sprint backlog notion. we decided not to use some of the original terms because organization a had a previous bad experience trying to implement agile practices by following scrum "by the book", which did not work and provoked resistance to certain practices. thus, we tried to give some flexibility even to the practices' names, to avoid bad links with the previous experience.
table 3. strategies, causes and agile concepts.

#   strategy                                      agile concepts                                    causes
s1  new procedure to communicate requirements     user story, bdd, product owner, product backlog   c1, c2, c3
s2  budget and time globally and locally managed  sprint, sprint backlog, scrum meetings,           c4, c5, c6, c9
    through short development cycles              small deliverables
s3  self-organized teams                          squad and guild                                   c9
s4  agile common conceptualization                concepts related to agile software development    c7, c8

after defining and validating the strategies with the team, they were executed by the organization in two projects under the supervision of the first and third authors. the first project started and finished during this study. the second project started before the study and was still ongoing at the time we wrote this paper. the new practices started to be used in early february 2020. about four months later, we conducted an interview to obtain feedback. at that point, one of the projects had already been concluded and the other was ongoing.

4.3 study analysis, interpretation and lessons learned

in this section, we present the results from the interviews that helped us answer the research question, the resulting systems theory-based process that arose from this study, and some lessons learned.

4.3.1 results

to answer the research question, we carried out an interview with the software development director and the tech leader, aiming to obtain their perception of the use of systems theory tools, gut matrix, and reference ontologies, as well as to get information about the results obtained from the use of the defined strategies. they were interviewed together in a single session. the director said that, in his opinion, systems theory tools provided means to understand how different organizational aspects (e.g., business rules and software quality practices) are interrelated and influence each other, and how these aspects and interrelations produce desirable and undesirable behaviors. for example, he said that "the systemic maps allowed me to understand how poorly specified requirements can negatively impact different parts of the project and of the organization". moreover, according to him, "systems theory helped create strategies to change undesirable behaviors, since it provided a comprehensive understanding of the organization behavior and supported identifying causes of undesirable behaviors". for example, by knowing the impacts of poorly specified requirements, "i perceived the need to implement practices to guarantee the quality of the requirements and that development tasks should only start if the developer truly understood the requirement". regarding the gut matrix, the director stated that he found it easy to use and important for prioritizing the undesirable behaviors to be changed first. according to him, using these tools "was easier and clearer when compared to ishikawa and pareto diagrams, because systemic maps allow more comprehensive and freer views and gut matrix has a simple way of prioritization." concerning reference ontologies, he reported that they were useful to create a common basis for communication among project stakeholders and business partners, eliminating some misunderstandings not only about agile practices but also about software engineering in general.
for example, the director said that "by using the conceptualization provided by the ontology, the team truly understood the 'done' concept", commonly used in agile projects, in the sense that a software item (e.g., a functionality, a component) is done (i.e., ready to be delivered to the client) only if it meets all the acceptance criteria established for the user stories materialized in that software item. the tech leader commented that "by using the ontology conceptualization, it became clearer which information a requirement description should contain so that it can be properly understood." an interesting aspect pointed out by the interviewees was that the conceptualization provided by the reference ontologies was used by the development teams as a basis for quality rules in the projects (e.g., when a software item is done) and also for business rules in new business contracts (e.g., acceptance criteria need to be defined). the director and tech leader informed us that the first project in which the strategies were implemented was considered a successful experience and served as a pilot. in similar projects, organization a used to be 30% to 50% over time and budget due to spending extra resources on new urgent development activities to fix defects. by adopting the defined strategies, the project delivered a better product (at the moment of the interview, the client had not reported any defect in the production environment). however, the project was about 15% over budget and time due to changes in the agreed requirements. this may suggest that strategies s1 and s2 need adjustments. although they seek to give some agility features to the development process, the project had its scope predefined by organization b, which established it together with the client and set cost and time considering that scope. as organization a started to develop the agreed requirements, organization b noticed that some of them needed to change to better satisfy the client's needs. although the project was late and over budget, the deviation from the agreed cost and time was smaller than in similar projects that did not follow the strategies. the director pointed out that being able to show this difference to organization b, indicating the causes that contribute to increasing or decreasing it, was an important result that can even be used to motivate organization b to be more involved in the changes to improve the software development process as a whole. this would make it possible, for example, to adjust strategies s1 and s2 to make requirements elicitation and cost and time estimation more flexible. the tech leader reported that using the strategies reduced misunderstandings about software requirements among the stakeholders and enabled better management of budget and time locally, in short development cycles. moreover, according to him, in the second project adopting the strategies (the ongoing project), the development team spent only 45 hours on new urgent development activities out of a total of about 2,000 hours of performed development activities. he also highlighted the use of user stories and bdd as an effective way to communicate requirements in this project. in addition, the interviewees said that a self-organization culture has been developed in the teams and that the use of squads has been very helpful. the use of guilds was still in progress.
finally, they commented that, although the proposed strategies were used to address some undesirable behaviors by applying agile practices and concepts, they felt that "changing the entire traditional culture can be a complex work", mainly because it requires changing mental models, processes and culture that also involve the organization's partners (particularly organization b) and clients.

aiming to obtain quantitative data to complement the feedback provided by the software development director and the tech leader and to help us identify the effects of the adopted strategies, we collected data from the two projects (one finished and another ongoing) where the strategies were implemented and from other projects that did not use the strategies. data was extracted from jira, which is used by organization a to support part of the software development process. considering that the strategies were applied in the projects at different moments (the first project adopted the strategies from its beginning to its end, while the second adopted the strategies when it was already ongoing), we decided to analyze them separately. first, we collected data regarding the tasks performed in the first project and in 22 other projects that did not adopt the strategies and were carried out in the same time-box as our study. the tasks were classified into development tasks, which create new features, and bug-fixing tasks, which fix problems (found by the quality assurance team or by the client) in the developed features. for each project, we calculated the percentage of effort spent on tasks dedicated to developing new features and the percentage spent on tasks performed to fix bugs. then, we calculated the median of the obtained values for the 22 projects that did not use the strategies, so that we could compare the resulting value with the project where the strategies were adopted. table 4 shows the results.

table 4. effort spent on development and bug-fixing tasks in different projects.
task | project that adopted the strategies | projects that did not adopt the strategies
development | 97.62% | 81.07%
bug-fixing | 2.38% | 18.93%

as can be observed in table 4, when compared with the other projects developed in the same time-box, the development team of the project that adopted the defined strategies spent more effort on developing new features (97.62%) than on fixing problems (2.38%). this corroborates the interviewees' perception that the proposed strategies improved product and process quality. aiming to verify changes caused by the strategies within the same project, we also collected data from the beginning of the second project (i.e., jan/2019) until the last month of our study. our purpose was to compare the effort spent on each type of task before and after applying the strategies. table 5 presents the obtained values.

table 5. effort spent on development and bug-fixing tasks before and after applying the strategies in the project.
task | before the strategies | after the strategies
development | 62.15% | 88.21%
bug-fixing | 37.85% | 11.79%

as can be noticed, after applying the strategies, there was an increase in the effort spent on developing new features and a reduction in the effort to fix bugs, which is consistent with the interviewees' perception. it is worth noticing that there was more time spent on the project before applying the strategies (about one year) than after that (about four months).
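for readers who want to reproduce this kind of analysis, the per-project effort split and the cross-project median take only a few lines of code. the sketch below is ours, not part of the study's tooling: it assumes task records exported from jira as (project, task_type, hours) tuples, and the numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import median

def effort_split(tasks):
    """percentage of effort per project spent on development vs. bug-fixing,
    from (project, task_type, hours) records."""
    totals = defaultdict(lambda: {"development": 0.0, "bug-fixing": 0.0})
    for project, task_type, hours in tasks:
        totals[project][task_type] += hours
    split = {}
    for project, t in totals.items():
        all_hours = t["development"] + t["bug-fixing"]
        split[project] = {k: 100 * v / all_hours for k, v in t.items()}
    return split

# hypothetical records extracted from the issue tracker
tasks = [
    ("p1", "development", 820), ("p1", "bug-fixing", 20),
    ("p2", "development", 650), ("p2", "bug-fixing", 150),
    ("p3", "development", 700), ("p3", "bug-fixing", 160),
]
split = effort_split(tasks)
# compare the pilot project against the median of the others
baseline = [s["bug-fixing"] for p, s in split.items() if p != "p1"]
print(round(split["p1"]["bug-fixing"], 2), round(median(baseline), 2))
```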
this difference in elapsed time should be considered together with the obtained data (e.g., we do not know whether the share of effort spent on each type of task changes significantly over time).

4.3.2 using reference ontologies to learn scrum

although reference ontologies are a good way to structure and represent knowledge, it may not be easy for some people to capture and internalize the conceptualization represented in the ontology. thus, in the case reported in this paper, we used some complementary artifacts to help in this matter. first, we asked the team which artifacts they were used to. based on their answers, we decided to use mainly textual descriptions and process models, since the team considered them user-friendly and they were present in its daily activities. we also used other diagrams to illustrate scrum concepts and a kanban board to map scrum concepts to concepts already familiar to the team. the seon extract addressing agile aspects and connecting them to traditional aspects provided the common conceptualization and knowledge about the domain of interest. for example, the ontology makes it explicit that only deliverables (i.e., software items, such as a functionality or a component) that meet all the acceptance criteria established for the user stories they materialize can be added to the sprint deliverable (e.g., a software module) and, thus, to the project deliverable (e.g., a software product). the complementary artifacts, in turn, present the conceptualization to the team by using alternative representations. as we previously said, the seon extract used in this study focuses mainly on the scrum reference ontology (sro) and can be found in (santos jr et al. 2021a). table 6 summarizes some concepts from sro used in this study.

table 6. some concepts from sro.
concept | description
scrum project | software project that adopts scrum in its process.
sprint backlog | artifact that contains the requirements of the product to be developed in the scrum project.
planning meeting | ceremony performed in a sprint, in which the development team plans the sprint.
user story | requirement artifact (i.e., a requirement recorded in some way) that describes requirements in a scrum project. it indicates a goal that the user expects to achieve by using the system and, thus, represents value for the client. a user story can be an atomic user story, when it is not decomposed into others, or an epic, when it is composed of other user stories.
acceptance criteria | requirements established for a user story that must be met when the user story is materialized. thus, they are used to verify whether the user story was developed correctly and meets the client's needs.
intended scrum development task | development task planned to be performed in a sprint.
performed scrum development task | development task performed in a scrum project.
deliverable | software item that materializes user stories.
accepted deliverable | deliverable that conforms to all the acceptance criteria established for the user stories materialized by that deliverable.
not accepted deliverable | deliverable that fails to conform to at least one acceptance criterion established for the user stories materialized by that deliverable.
sprint deliverable | accepted deliverable resulting from a sprint.

figure 4 illustrates the relationship between the reference ontologies and the complementary artifacts. as a result of this approach, we shortened the distance between the team and the conceptualization provided by the ontologies, improving domain understanding and communication.
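the acceptance rule that the ontology makes explicit — a deliverable is accepted only if it meets every acceptance criterion of every user story it materializes — can also be stated operationally. below is a minimal sketch in python; the class and field names are ours, chosen for illustration, and are not part of sro.

```python
from dataclasses import dataclass, field

@dataclass
class UserStory:
    description: str
    acceptance_criteria: list[str] = field(default_factory=list)

@dataclass
class Deliverable:
    """software item that materializes one or more user stories."""
    name: str
    stories: list[UserStory]
    met_criteria: set[str] = field(default_factory=set)

    def is_accepted(self) -> bool:
        # accepted only if every acceptance criterion of every
        # materialized user story has been met
        return all(criterion in self.met_criteria
                   for story in self.stories
                   for criterion in story.acceptance_criteria)

story = UserStory("log in as a registered user",
                  ["valid credentials grant access",
                   "invalid credentials are rejected"])
module = Deliverable("auth module", [story],
                     met_criteria={"valid credentials grant access"})
print(module.is_accepted())  # False: one criterion is still unmet
```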
figure 4. reference ontologies and complementary representation artifacts.

to address behavioral aspects of scrum (e.g., activities, the flows between them and the objects they manipulate), we created process models. for that, we first mapped concepts from seon to constructs of bpmn (omg 2013), which was the modeling language used to represent the process models. put simply, we identified which bpmn constructs should be used to represent seon concepts or their instances. for example, the activity bpmn construct should be used to represent performed scrum development task, ceremony, planning meeting and other seon concepts referring to activities, tasks or processes. the actor bpmn construct, in turn, should be used to represent seon concepts referring to people or roles, such as developer and product owner. then, we represented and complemented the knowledge provided by the reference ontologies by creating process models like the one illustrated in figure 5. in the process models, following the approach suggested in (guizzardi et al. 2016), we used the event construct to represent the state of affairs (i.e., a situation) caused by the execution of an activity or by the start or end of a temporal constraint. the process model shown in figure 5 was used to illustrate the creation of the sprint backlog in the planning meeting ceremony, the selection of user stories to be implemented in performed scrum development tasks and materialized by deliverables, and the validation of the deliverables that, if accepted (accepted deliverable), are integrated into the sprint deliverable. if not accepted (not accepted deliverable), they must be addressed in new tasks. the bold terms above refer to the seon concepts addressed in the process model presented in figure 5. the process complements the conceptualization provided by the ontology by making explicit some activities, the flow between them and the states of affairs resulting from the execution of the activities.

figure 5. example of process model created based on seon conceptualization.
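a fragment of the concept-to-construct mapping described above can be written down directly; the dictionary below is our illustration of the idea, covering only the examples mentioned in the text, and is not the complete mapping used in the study.

```python
# seon concept -> bpmn construct used to represent it (illustrative fragment)
seon_to_bpmn = {
    # concepts referring to activities, tasks or processes
    "performed scrum development task": "activity",
    "ceremony": "activity",
    "planning meeting": "activity",
    # concepts referring to people or roles
    "developer": "actor (pool/lane)",
    "product owner": "actor (pool/lane)",
    # state of affairs caused by executing an activity,
    # or by the start/end of a temporal constraint
    "situation": "event",
}

def construct_for(concept: str) -> str:
    return seon_to_bpmn.get(concept, "unmapped: extend the mapping")

print(construct_for("planning meeting"))  # activity
```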
in addition to process models, we also used some diagrams to better illustrate some concepts. for example, to help the team visualize that (i) an epic is a complex user story composed of others, (ii) user stories must have acceptance criteria established for them, and (iii) in the sprint backlog, tasks are planned (i.e., intended scrum development tasks) to implement the user stories, we used the diagram shown in figure 6. organization a did not have a clear semantic distinction between epic, user story and task. these concepts were often treated in the same way, all regarded as simple issues by the developers. this lack of conceptual distinction caused problems in project management, estimation, requirements prioritization and communication with organization b and the client. by using the conceptualization provided by seon and a simple diagram (such as the one shown in figure 6), the team better understood these concepts and was able to properly use them in backlog management.

figure 6. diagram used to illustrate the relation among sprint backlog, epic, user story and intended task.

we also used a kanban board to illustrate some concepts. for example, figure 7 depicts a sketch where we explored tasks and deliverables, showing that if a card is moved to the "done" column, the deliverable produced by the corresponding task must have been evaluated (considering the acceptance criteria related to the respective user story) and accepted.

figure 7. kanban board illustration used to explore task and deliverable concepts.

to complement the created artifacts, we also created a dictionary of terms (similar to table 6) containing textual definitions of seon concepts and some constraints. the ontology and complementary artifacts were used in two workshops, in which the first and third authors presented the reference ontology and explained its conceptualization by using its conceptual model and the complementary artifacts.

4.3.3 systems theory-based process

an important result that arose from this study is a process that combines systems theory tools and gut matrix to aid organizations in moving from traditional to agile. figure 8 shows the process, and we briefly explain it next.

figure 8. process to aid defining strategies and implementing agile practices.

understand the organization: this consists of obtaining information to understand the organization as a whole, so that it will be possible to define strategies to implement agile practices in a way suitable for the organization, considering its culture, environment, business rules, software processes, agile experience and knowledge, people, and so on. information can be obtained by using techniques such as interviews, document analysis and observation, among others.

build a systemic view: this consists of using the information obtained in the previous step to build systemic maps to understand organization behaviors relevant in the agile development context. organization borders, relevant variables that drive organization behavior, causal relationships between them and feedback loops must be represented. archetypes describing behavior patterns must also be identified from the systemic maps.

identify leverage points: this involves analyzing systemic maps and archetypes to identify undesirable behaviors and their causes. at this point, desirable behaviors in agile organizations suggested in the literature can also be used to verify whether the organization fits them. undesirable behaviors should be prioritized by using a gut matrix, so that it is possible to identify which ones represent leverage points and will be addressed in the strategies (see the sketch after this list).

establish strategies: this consists of defining strategies (i.e., plans and actions) to implement agile practices focusing on the leverage points and considering the organization's culture, business rules, environment, people, etc.

implement strategies: this involves implementing the defined strategies. it is suggested to start with one or two projects. after that, if the strategies work, they can be extended to other projects and then to the entire organization.

monitor strategies: this consists of evaluating whether the undesirable behaviors changed as expected after the strategies were executed. the new behaviors caused by the strategies need to be evaluated and, depending on the results, the strategies can be extended to other projects, aborted or adjusted.
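as referenced in the "identify leverage points" step above, prioritization uses a gut matrix. a minimal sketch follows, assuming the usual 1–5 scales for gravity (g), urgency (u) and tendency (t) and priority computed as the product g × u × t; the behaviors and scores are invented for illustration.

```python
# gut matrix: each undesirable behavior scored 1-5 for gravity,
# urgency and tendency; priority = g * u * t (higher = address first)
behaviors = {
    "poorly specified requirements":    (5, 5, 4),
    "urgent rework to fix defects":     (5, 4, 4),
    "no shared vocabulary about agile": (3, 3, 3),
}

ranked = sorted(behaviors.items(),
                key=lambda item: item[1][0] * item[1][1] * item[1][2],
                reverse=True)
for name, (g, u, t) in ranked:
    print(f"{g * u * t:3d}  {name}")
```

behaviors at the top of the ranking are the candidate leverage points to be addressed first by the strategies.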
4.3.4 lessons learned

in this section, we discuss some lessons we learned in the study. in stating the lessons, we adopt terms such as "should" and "may" instead of mandatory terms such as "must", because we learned the lessons from a single case study. thus, we believe that other studies are needed to corroborate what we have learned.

systemic maps should be built with a goal in mind: since systemic maps allow representing a comprehensive view of how the organization behaves, and this may involve many aspects, it is important to focus on variables relevant to the goal to be achieved with the systemic maps. otherwise, the maps can become too complex and involve variables that do not provide meaningful information for the desired purpose.

the boundaries of the system should be clearly identified: to understand how external elements can influence organization behaviors, it is important to identify the organization's boundaries, as well as the elements that the organization controls and the ones controlled by external agents. this way, it will be possible to create suitable strategies considering both the organization and the external agents.

changes in leverage points may change the system as a whole: we noticed that when changes are made at leverage points, particularly those connected to undesirable behaviors with higher priority, the changes tend to provoke a meaningful shift in the behavior of the organization as a whole, changing existing behaviors and creating others. for example, by changing the way organizations a and b deal with project scope, time and budget, there were also changes in the way organization a allocates teams and selects requirements to be implemented, and the need for changes in the partnership rules with organization b was perceived.

strategies should be integrated into the software processes: for strategies to be performed as part of the organization's daily activities, it is important that they are incorporated into the processes performed by the organization. in the study, the strategies were incorporated into the organization's software process, involving development, management, and quality assurance activities.

strategies should be gradually implemented and start in relevant projects: implementing the changes gradually and starting with one or two projects was positive, and the obtained results encouraged the organization to keep its intention of expanding the changes to other projects. we selected projects in which the teams were interested in using agile practices and that were important for the organization, so that team commitment would be higher. this helped to minimize resistance to the new practices. once they experienced the benefits of following the strategies, team members became disseminators of the new practices and concepts, helping to extend the agile culture to other team members.

strategy results should be measurable: when defining the strategies, we did not define any indicator to measure their effectiveness. however, the tech leaders used some metrics in the projects (e.g., number of hours spent on new urgent development activities, budget deviation, etc.) that helped us to evaluate the strategies. thus, when defining strategies, it is important to define the indicators to be used to evaluate them.
using systems theory tools may be costly and not trivial: although systems theory tools were very useful to provide an understanding of the organization, they may be a costly choice, because they demand time, effort and knowledge of both the tools and the organization. hence, depending on the scope to be considered, it may be difficult or unfeasible to use them. other methods can be helpful in this context. considering this lesson, we created zeppelin (santos jr et al. 2021b), a diagnosis instrument that helps get a "big picture" of the organization by identifying the software practices it performs. thus, zeppelin can be used to provide initial knowledge about the organization scenario, making it possible to narrow the scope to be further investigated through systems theory tools.

representing the ontology conceptualization using process models, textual descriptions and simple diagrams can be more palatable than conceptual (structural) models: the reference ontologies of seon are represented by means of conceptual (structural) models, textual descriptions, and axioms. although the conceptual model of the seon view used in the study provides an abstract view showing all the relevant concepts and relations in a single model (santos jr et al. 2021a), we noticed that the team preferred textual descriptions and other representations to the seon conceptual model. thus, we prepared a document containing the concepts relevant to the study and their detailed descriptions, also including information about constraints and relationships. we also prepared complementary artifacts using process models and other diagrams to illustrate and complement the knowledge provided by seon. this way, the conceptualization provided by the ontologies was represented in a way more palatable to the team.

a consolidated and accessible body of knowledge may help achieve a common conceptualization: in the study, we used ontologies as a reference to establish a common conceptualization of agile development. we are very familiar with ontologies, and two of the authors are also authors of the ontologies used in the study, which were established based on the literature and standards and, thus, provide a consensual view of the domain of interest. considering organization a's needs and the participation of the authors in the study, the ontologies used were a perfect fit. however, we are aware that, for an organization not familiar with ontologies, using an ontology as the starting point to establish a common conceptualization can be challenging. we believe that organizations should use a body of knowledge suitable for their characteristics to establish a common conceptualization about the domain of interest. for example, some organizations may prefer to use textual references, such as the scrum body of knowledge (satpathy 2013).

changes involving business partners can be hard to implement and demand more flexibility and time: the way organization b works directly affects organization a. due to business arrangements, organization a does not have enough influence to make changes in organization b. it can suggest changes, but it cannot demand them. thus, it was necessary to define strategies that caused only small changes in organization b (e.g., helping to better describe requirements, allowing shared control of time and cost). by noticing improvements from the use of the proposed strategies, organization b may become more open to further changes.
squads should have autonomy to choose methods and tools: the organization can have a set of tools, techniques and methods to be adopted in the projects. guilds can help define that set. depending on the project team and its characteristics, some tools, methods and techniques may suit better than others. we noticed that the squad became more self-organized when its members could choose the techniques to solve the project problems. for example, in the study, one squad decided to adopt user stories and bdd to describe requirements, while the other used the complete user story template. in both cases, information about requirements was clear and complete. however, each squad chose the technique most suitable for the project and team characteristics.

agile-related human aspects need to be developed gradually: agile culture demands some soft skills (e.g., self-organization, proactivity, empathy) (lima and porto 2019) that are not common in a traditional plan-driven environment. we observed that some members had difficulty grasping what it means to be self-organized, proactive and empathetic, because they were used to the command-and-control style of the traditional culture. we noticed that, by using short-, medium- and long-term actions (what counts as short, medium and long is established by the organization), it is possible to gradually develop an agile culture. short-term actions should focus on understanding the needed skills (e.g., promoting debates about soft skills in software development) and practicing them in the projects. medium-term actions should empower the use of soft skills combined with hard skills (e.g., human-centered design (smith et al. 2012)). finally, long-term actions should institutionalize the soft and hard skills and truly change the whole organization culture.

5 threats to the validity of the study results

the validity of a study denotes the trustworthiness of its results. every study has threats and limitations that should be addressed as much as possible and considered together with the results. in this section, we discuss some threats considering the classification proposed in (runeson et al. 2012). the main threat in this study is related to the researchers who conducted the study. participative case studies are biased and subjective, as their results rely on the researchers (baskerville 1997). the first and third authors acted as consultants in organization a and were responsible for conducting the interviews, creating the systemic maps and gut matrix, and defining the strategies. moreover, the authors created the complementary artifacts used to share knowledge of scrum. since the authors were very familiar with seon, they did not have difficulties creating the artifacts. other people, less familiar with seon, could have difficulty creating the artifacts or could have created different ones. furthermore, to create the artifacts, the authors took the team's preferences into account (the team told us that process models and diagrams were a good choice for it). the researchers' participation affects internal validity, which is concerned with the relationship between results and the applied treatment; external validity, which concerns to what extent it is possible to generalize the results from the case-specific findings to different cases; and reliability, which refers to what extent data and analysis depend on the specific researchers.
aiming to reduce researchers' bias, the members of organization a that took part in the study (i.e., two directors, one tech leader and two developers) participated in the activities and validated the results. moreover, another researcher (the second author), external to the organization, evaluated data collection and analysis and was involved in discussing and reflecting on the study and its results. concerning construct validity, which is related to the constructs involved in the study, the main threat is that we did not define indicators to evaluate results. data collection was performed through interviews, which are subjective. to minimize this threat, we used some measures collected in the projects to evaluate the new behaviors caused by the proposed strategies. however, since the measures were not previously defined, they are too limited to enable a proper evaluation of the strategies and the effects caused by them. another threat concerns the notations used to create the complementary artifacts, since the team could misunderstand the represented concepts due to different semantics assigned to the constructs. to address this threat, the authors asked the team to choose the notations to be used and the types of artifacts to be created, so that it was possible to produce artifacts consistent with the team's knowledge. in case-based research, after getting results from specific case studies, generalization can be established for similar cases. however, the aforementioned threats constrain generalization. moreover, the study involved only one organization. thus, it is not possible to generalize the results to cases without researcher intervention or to organizations not similar to organization a.

6 conclusions, future work and implications

this paper presented a case study carried out in a brazilian organization towards the first transition in the path prescribed by the stairway to heaven (sth) model (olsson et al. 2012). organization a develops software in partnership with a european organization (organization b) and does not have direct contact with clients. after an unsuccessful attempt to implement agile practices "by the book", the organization started a long-term process improvement program. to support it, we have used sth to describe the evolution path to be followed by organization a. to aid in the first transition and the move from traditional to agile, we combined systems theory tools, gut matrix and reference ontologies. in summary, systems theory tools and gut matrix were helpful to better understand the organization, find leverage points of change and define strategies aligned with the organization's characteristics and priorities. reference ontologies were useful to establish a common understanding of agile methods, enabling teams to be aware of and, thus, more committed to agile practices and concepts. by using process models, textual descriptions and other diagrams, the conceptualization provided by seon, the software engineering ontology network (ruy et al. 2016), became more palatable to the team, helping achieve a common understanding. as a result of the initiative, the organization has implemented agile practices in a flexible way, combined with some traditional practices, which is more suitable for the organization's characteristics. due to the obtained results, the organization kept its intention to continue evolving by following the sth stages. in the first transition, it was not possible to propose big changes in the way organization b works.
however, organization a expects that, considering the positive results, organization b will be more willing to get involved in the evolution path. this will be crucial in the more advanced stages, where data from the clients are needed to support decision-making and identify new opportunities. regarding human aspects, we focused mainly on soft skills related to agile culture. strategy s3 is directly related to human aspects, being responsible for implementing squads and guilds. squads promoted self-organization, trust, leadership, and other skills important in agile organizations. guilds promoted the creation of processes and an organizational culture that enabled sharing and managing knowledge at individual, team, and organizational levels. this knowledge is valuable to the continuous improvement of organization a. by changing human aspects, s3 enabled organization a to create processes, vocabulary, and mindset, i.e., an organizational culture that supported the movement from traditional to agile. moreover, the soft skills developed through s3 supported the other strategies. for example, s1 and s2 were possible because s3 developed some soft skills (e.g., effective communication, self-organization and adaptability) that supported them.

as for the limitations of our approach, we highlight that it involves a lot of tacit knowledge and judgment. besides knowledge about systems thinking tools and gut matrix, it is necessary to have organizational knowledge to apply them (e.g., one must be able to properly identify problems, investigate causes, define strategies, etc.). moreover, the evaluation of our proposal was limited. we have used it only in the study reported in this paper, which involved the participation of the authors. furthermore, the evaluation was mainly based on qualitative data. thus, new studies are necessary to evaluate the proposal in other organizations and to quantitatively evaluate the effects of using it. as future work, we plan to add knowledge (e.g., by means of guidelines) to help others use our approach. we also intend to explore other systems theory tools and combine them with enterprise architecture models to connect system variables, undesirable behaviors and causes to elements of the organization architecture. concerning organization a, we plan to monitor the implemented strategies and extend them to other projects. once the new practices become solid, we plan to aid organization a in the next transitions, where continuous integration and continuous deployment are performed.

concerning the use of seon, we must point out that the authors were familiar with its conceptualization. in fact, as we previously said, the first and third authors are also authors of the scrum reference ontology (santos jr et al. 2021a), the seon ontology concerning scrum that provided the central concepts explored in the study. this made it easier to create the complementary artifacts and use them to share knowledge with the team to achieve a common understanding and conceptualization in organization a. it is also worth highlighting that, although the complementary artifacts are simple, the conceptualization behind them, provided by seon, is the key point to achieving a common conceptualization and understanding.
we have applied the portion of seon used in the case reported in this paper to integrate data from different applications and provide consolidated information to support decision-making in agile organizations, as we reported in (santos jr et al. 2021a). we intend to use seon for this purpose in organization a. since the team has learned the seon conceptualization, we believe that the first step towards this goal has already been taken. finally, the contributions of this paper have implications for practice and research. regarding implications for practice, this paper promotes the use of systems thinking tools as a means to identify leverage points relevant to moving an organization from traditional to agile development. furthermore, the proposed strategies can be used by practitioners and organizations to address problems similar to the ones of organization a. in addition, we showed how ontologies can be used to create artifacts and share a common conceptualization and understanding of agile development. others can be inspired by that to solve knowledge problems in agile and other contexts. the systems theory-based process also has implications for practice, since it can be used by other organizations to help the transition from traditional to agile development. concerning implications for research, this paper introduces the combined use of systems theory tools, gut matrix and reference ontologies to support the transition from traditional to agile development. their combined use and the proposed systems theory-based process can bring new research questions to be explored in further research. moreover, the successful use of ontologies to create artifacts more palatable to practitioners can be a starting point for new research aiming to make the most of this powerful instrument of knowledge structuring and representation. using reference ontologies in the industry is still a challenge. the use of operational ontologies is more common in this context, mainly due to the semantic web and to data and systems interoperability solutions. however, reference ontologies are also valuable artifacts and provide structured, common and well-founded knowledge useful for learning and communication. we believe that new research should be conducted to investigate how to make reference ontologies more palatable for the industry. in the study reported in this paper, we took the first step towards that. however, other advances are needed. in this sense, we believe that new research aiming to overcome the challenges of using ontologies in industrial settings is necessary.

references

ali n, lai r (2018) requirements engineering in global software development: a survey study from the perspectives of stakeholders. j softw 13:520–532. https://doi.org/10.17706/jsw.13.10.520-532
baskerville r (1997) distinguishing action research from participative case studies. j syst inf technol 1:24–43. https://doi.org/10.1108/13287269780000733
bastos ec, barcellos mp, falbo r (2018) using semantic documentation to support software project management. j data semant 7:107–132. https://doi.org/10.1007/s13740-018-0089-z
binamungu lp, embury sm, konstantinou n (2020) characterising the quality of behaviour driven development specifications. springer international publishing
bosch j (2014) continuous software engineering: an introduction. in: continuous software engineering. springer international publishing, cham, pp 3–13
bringuente ac, falbo r, guizzardi g (2011) using a foundational ontology for reengineering a software process ontology. in: journal of information and data management (jidm), vol. 2, pp 511–526
de sousa tl, venson e, figueiredo rmdc, et al (2016) using scrum in outsourced government projects: an action research. in: 2016 49th hawaii international conference on system sciences (hicss). ieee, pp 5447–5456
duarte bb, leal castro al, falbo r, guizzardi g, guizzardi rss, souza vs (2018) ontological foundations for software requirements with a focus on requirements at runtime. in: applied ontology, vol. 13, pp 73–105
dybå t, dingsøyr t (2008) empirical studies of agile software development: a systematic review. inf softw technol 50:833–859. https://doi.org/10.1016/j.infsof.2008.01.006
falbo r (2014) sabio: systematic approach for building ontologies. in: onto.com/odise@fois
falbo r, ruy f, guizzardi g, barcellos mp, almeida jpa (2014) towards an enterprise ontology pattern language. in: proceedings of the 29th acm symposium on applied computing (acm sac 2014)
fitzgerald b, stol k (2017) continuous software engineering: a roadmap and agenda. journal of systems and software 123:176–189. https://doi.org/10.1016/j.jss.2015.06.063
fonseca v, barcellos mp, falbo r (2017) an ontology-based approach for integrating tools supporting the software measurement process. sci comput program 135:20–44. https://doi.org/10.1016/j.scico.2016.10.004
guizzardi g (2007) on ontology, ontologies, conceptualizations, modeling languages, and (meta)models. in: proceedings of the 2007 conference on databases and information systems iv: selected papers from the seventh international baltic conference db&is'2006. ios press, nld, pp 18–39
guizzardi g, guarino n, almeida jpa (2016) ontological considerations about the representation of events and endurants in business models. in: 14th international conference, bpm 2016. rio de janeiro, pp 20–36
jiménez m, piattini m, vizcaíno a (2009) challenges and improvements in distributed software development: a systematic review. adv softw eng 2009. https://doi.org/10.1155/2009/710971
karvonen t, lwakatare le, sauvola t, et al (2015) hitting the target: practices for moving toward innovation experiment systems. in: international conference of software business (icsob 2015). springer, pp 117–131
kepner ch, tregoe bb (1981) the new rational manager. princeton research press, princeton, nj
kim dh (1994) systems archetypes i: diagnosing systemic issues and designing high-leverage interventions (toolbox reprint series). pegasus communications, cambridge, ma
l'erario a, gonçalves ja, fabri ja, et al (2020) cfdsd: a communication framework for distributed software development. j brazilian comput soc 26. https://doi.org/10.1186/s13173-020-00101-7
leffingwell d (2016) safe® 4.0 reference guide: scaled agile framework® for lean software and systems engineering
lima t, porto j (2019) análise de soft skills na visão de profissionais da engenharia de software. in: anais do iv workshop sobre aspectos sociais, humanos e econômicos de software. sbc, porto alegre, rs, brasil, pp 31–40
meadows dh (2008) thinking in systems: a primer. chelsea green publishing
olsson hh, alahyari h, bosch j (2012) climbing the stairway to heaven: a multiple-case study exploring barriers in the transition from agile development towards continuous deployment of software. in: 2012 38th euromicro conference on software engineering and advanced applications. ieee, pp 392–399
omg (2013) business process model and notation (bpmn), version 2.0.2. technical report, object management group
prikladnicki r, audy jln (2010) process models in the practice of distributed software development: a systematic review of the literature. inf softw technol 52:779–791. https://doi.org/10.1016/j.infsof.2010.03.009
rodriguez p, markkula j, oivo m, turula k (2012) survey on agile and lean usage in finnish software industry. in: proceedings of the acm-ieee international symposium on empirical software engineering and measurement. association for computing machinery, new york, ny, usa, pp 139–148
runeson p, host m, rainer a, regnell b (2012) case study research in software engineering: guidelines and examples, 1st edn. wiley publishing
ruy f, souza e, falbo r, barcellos m (2017) software testing processes in iso standards: how to harmonize them? in: proceedings of the 16th brazilian symposium on software quality (sbqs). pp 296–310
ruy fb, falbo r, barcellos mp, et al (2016) seon: a software engineering ontology network. in: lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). pp 527–542
santos la, barcellos mp, falbo r, reginaldo cc, campos pmc (2019) measurement task ontology. in: 12th seminar on ontology research in brazil (ontobras 2019)
santos jr ps, barcellos mp, calhau rf (2020) am i going to heaven? in: proceedings of the 34th brazilian symposium on software engineering. acm, natal, brazil, pp 309–318
santos jr ps, barcellos mp, falbo r de a, almeida jpa (2021a) from a scrum reference ontology to the integration of applications for data-driven software development. inf softw technol 136:106570. https://doi.org/10.1016/j.infsof.2021.106570
santos jr ps, barcellos mp, ruy fb (2021b) tell me: am i going to heaven? a diagnosis instrument of continuous software engineering practices adoption. in: evaluation and assessment in software engineering (ease 2021). acm, trondheim
satpathy t (ed) (2013) a guide to the scrum body of knowledge: sbok guide. scrumstudy, a brand of vmedu, inc
schwaber k, sutherland j (2013) the scrum guide - the definitive guide to scrum: the rules of the game
smith pj, beatty r, hayes cc, et al (2012) human-centered design of decision-support systems. in: jacko ja (ed) the human computer interaction handbook, 3rd edn. crc press, boca raton, fl, pp 589–622
sterman j (2000) business dynamics. irwin/mcgraw-hill
sterman jd (1994) learning in and about complex systems. syst dyn rev 10:291–330
studer r, benjamins vr, fensel d (1998) knowledge engineering: principles and methods. data knowl eng 25:161–197. https://doi.org/10.1016/s0169-023x(97)00056-6
williams l, cockburn a (2003) agile software development: it's about feedback and change. ieee comput 36:39–43
wynne m, hellesoy a, tooke s (2017) the cucumber book: behaviour-driven development for testers and developers. pragmatic bookshelf
journal of software engineering research and development, 2023, 11:6, doi: 10.5753/jserd.2023.2646
this work is licensed under a creative commons attribution 4.0 international license.

insights from the application of exploratory tests in the daily life of distributed teams: an experience report

jarbele c. s. coutinho [ federal rural university of the semi-arid | jarbele.coutinho@ufersa.edu.br ]
wilkerson l. andrade [ federal university of campina grande | wilkerson@computacao.ufcg.edu.br ]
patrícia d. l. machado [ federal university of campina grande | patricia@computacao.ufcg.edu.br ]

abstract

the exploratory testing (et) approach has been adopted in the context of agile development due to the effectiveness of its application. because of these benefits, the need arose to train agile professionals in the practical application of this type of test, to contribute to its incorporation into the daily work of teams. in this sense, the objective of this article is to investigate the contributions and limitations of adopting problem-based learning (pbl) and just-in-time teaching (jitt) in et teaching-learning, and the main aspects that favor or limit the incorporation of et into the day-to-day of agile teams. for this, we conducted a course in remote teaching format with agile professionals from a geographically distributed software development company. at the end of the course, data were collected through an online questionnaire and examined with quantitative and qualitative analysis. then, the et activities performed by the participants in their daily lives were monitored, and a brainstorming session was conducted to evaluate this experience. our main findings are that (1) the collaboration between participants and the adoption of a real problem, along with (2) activities and resources made available before the class, and (3) the existence of specific tool support for et sessions, optimized learning in the context of remote teaching. other main results refer to the planning and registration of et and the need for guidelines to steer the execution of et. therefore, integrating theory and practice in et is necessary for a better understanding of the effects of tests in the agile environment. additionally, it is necessary to investigate specific approaches and tools that contribute to the execution of et and, consequently, to the incorporation of this test into the daily lives of the teams.

keywords: software testing, exploratory testing, testing education, testing learning and teaching, active learning, jitt, just-in-time teaching, pbl, problem based learning

1 introduction

aligning theory and practice in the teaching of software engineering (se) is a persistent challenge, both in the academic context and in the industry (leite et al., 2020).
providing and stimulating experiences that contribute to the technical and non-technical training of students and professionals in this area requires actions to plan the curriculum and curricular components, articulate new teaching methodologies, and include innovative pedagogical elements (cheiran et al., 2017). in this context, the teaching of software testing (st) also stands out. for cheiran et al. (2017), st is one of the areas of se that presents challenges for teaching: it may be difficult and inefficient to teach st through lectures alone. additionally, the simplicity of the criteria is a factor that makes it possible for st contents to be part of non-specific subjects, such as se (paschoal and souza, 2018). moreover, st contents may be part of the training provided by companies when their employees do not know a given st practice or technique. among the existing st practices, we have exploratory testing (et). et emphasizes the responsibility and freedom of the tester to explore the system, allowing the tester to acquire knowledge of the program in parallel with the execution of the tests (costa et al., 2019; hendrickson, 2013; bach, 2003; whittaker, 2009), as there is no scripted planning nor test cases defined in advance in test plans (hendrickson, 2013). for bach (2003), et is the learning, design, and execution of tests performed simultaneously. as a way to meet the need for management and measurement of et, bach (2003) proposed (1) dividing the testing activities into sessions, which would be the basic unit of work, (2) stipulating a mission for each session, and (3) adopting time metrics related to testing activities (castro, 2018), thus giving rise to the session-based test management (sbtm) approach. although the problems associated with st teaching are being discussed with greater visibility by the academic and scientific community (paschoal and de souza, 2018; garousi et al., 2017, 2020; scatalon et al., 2019; aniche et al., 2019) and are producing more specific developments (cheiran et al., 2017; de andrade et al., 2019; martinez, 2018; coutinho and bezerra, 2018; paschoal and souza, 2018; paschoal et al., 2017; queiroz et al., 2019), few studies investigate the possibilities of streamlining the teaching and application of et in practice (costa et al., 2019; ferreira costa and oliveira, 2020). adopting more dynamic strategies that bring theory and practice closer together, to provide academic-professional training in the real scenario of the software industry, is not a trivial task, especially when this experience is conducted with geographically distributed teams that work in an agile environment. when conducting experiences like this, some challenges emerge, such as (1) integrating a team that works in a cross-functional way, due to the adoption of agile practices; (2) creating conditions for the flow of knowledge to develop, considering the different ways in which people assimilate information; and (3) dealing with contextual challenges, such as communication, time, internet connection, among others. there is a need to investigate ways to conduct et teaching for agile teams working with distributed software development (dsd).
therefore, our research question is: how to encourage practical et learning with geographically distributed agile teams, seeking integration among members and promoting active learning, in order to encourage its insertion in the daily work? in these circumstances, learning in a participatory way, from real problems and situations, can contribute to learning evolution. problem-based learning (pbl) is an active learning approach (bonwell and eison, 1991; mcconnell, 1996) that, through problem-solving, enables students to live experiences that portray the reality of the professional context within the academic environment (cheiran et al., 2017), and aims to encourage the collaborative resolution of challenges through research, reflection and the development of solutions. in an associated way, just-in-time teaching (jitt) (novak, 2011) also aims to contribute to student learning. based on activities carried out before class, jitt encourages the development of students' prior knowledge (novak, 2011; martinez, 2018), so that discussions about a given content can be further developed during class. this study aims to investigate the contributions and limitations of adopting pbl and jitt in et teaching-learning with agile dsd teams in a remote learning context, in order to encourage the incorporation of et practices into the daily lives of these teams. thus, it is expected to contribute to the mitigation of the main challenges mentioned above, faced in the execution of courses conducted in a dsd context, and to encourage the adoption of et among the st practices developed by agile teams. for this, we carried out an et course with agile professionals from a geographically distributed software development company. at the end of the course, data were collected through an online questionnaire and examined. then, the et activities performed by the participants in their daily lives were monitored and a brainstorming session was held to evaluate this experience. it is worth noting that this paper is an extended version of the award-winning paper "teaching exploratory tests through pbl and jitt: an experience report in a context of distributed teams", published in the proceedings of the 35th brazilian symposium on software engineering (sbes 2021), education track (coutinho et al., 2021).

in addition to this introductory section, this paper is structured as follows: section 2 discusses an overview of st teaching. section 3 describes the methodological procedure used in this study. section 4 presents the results obtained, in response to the defined research questions. section 5 discusses the perspectives, challenges, and limitations of this study, based on the results obtained. section 6 discusses the threats to the validity of this study. section 7 exposes the analysis of some related works. finally, section 8 presents the final considerations and perspectives for future work.

2 background

in this section, we discuss important aspects related to teaching software testing and exploratory testing. we then present and discuss some relevant concepts about two main approaches to active methodologies, pbl and jitt.

2.1 teaching software testing

st is an essential activity to guarantee the quality of software.
seeking to meet the need for teaching methods that make the learning of this activity more effective, some studies have been dedicated to investigating systematic approaches to contribute to teaching in this area of se (paschoal and de souza, 2018; garousi et al., 2017, 2020; scatalon et al., 2019; aniche et al., 2019). one of the most significant difficulties in teaching st is the need to apply the process in practice (paschoal and de souza, 2018; coutinho and bezerra, 2018). at university, the teaching of st is sometimes distributed across disciplines in the se area and does not provide an opportunity for st to be learned in depth. this aspect causes students to graduate with deficiencies in software testing skills (scatalon et al., 2019). on the other hand, the industry needs professionals with a more solid formation and training in testing. in practice, testing professionals (test analysts, test engineers, or testers) have been looking for options to improve the effectiveness and efficiency of testing (garousi et al., 2017), both to perform a more effective job and to find better positions in their professional careers. thus, university graduates and se professionals self-learn (self-train) st through books or online resources, or by participating in industry training and obtaining certification in the st area (garousi et al., 2020), such as that provided by the international software testing qualifications board (istqb), for example.

2.2 exploratory testing

one type of testing that has become widespread in the agile environment is et. in this method, test professionals can interact with the system the way they want and explore, without restriction, its functionality (suranto, 2015). in layman's terms, it can be said that et allows professionals to learn quickly, adjust their tests, and, in the process, encounter software problems that are often not anticipated in test plans or scripts. for bach (2003), et is the learning, design, and execution of tests performed simultaneously. thus, the test professional adapts to the system being tested, creating and improving the tests based on the knowledge acquired during the exploration of the system, without the aid of instructions about the system (castro, 2018). in et, test design and execution are performed at the same time (whittaker, 2009). however, we can perceive some disadvantages in the application of this type of test. for instance, the lack of preparation, structure, and guidance can lead to many unproductive hours (suranto, 2015). also, the same functionality may be tested more than once while others are not tested at all (castro, 2018), especially when multiple testers or test teams are involved. moreover, it can be difficult to track the progress of testing professionals (suranto, 2015; castro, 2018), among other issues. to overcome some of these disadvantages, and as a way of meeting the need for et management and measurement, bach (2003) proposed (1) dividing the testing activities into sessions, which would be the basic unit of work, (2) stipulating a mission for each session and (3) adopting time metrics related to testing activities, originating the sbtm strategy. the sbtm strategy is used to make et more effective and its goals clearer (castro, 2018).
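a minimal sketch of the sbtm work unit just described — a time-boxed session guided by a mission (charter), with simple time metrics — is shown below; the class and field names are ours, for illustration only, and do not come from bach's proposal.

```python
from dataclasses import dataclass, field

@dataclass
class ExploratorySession:
    """basic sbtm work unit: a time-boxed session guided by a mission."""
    charter: str                 # the mission: what to explore, not how
    timebox_minutes: int = 90
    minutes_on_mission: int = 0  # time spent testing the charter itself
    minutes_on_setup: int = 0    # time spent on setup/environment issues
    minutes_on_bugs: int = 0     # time spent investigating/reporting bugs
    notes: list[str] = field(default_factory=list)

    def metrics(self) -> dict:
        total = (self.minutes_on_mission + self.minutes_on_setup
                 + self.minutes_on_bugs)
        return {"total minutes": total,
                "on-mission %": 100 * self.minutes_on_mission / total
                                if total else 0}

session = ExploratorySession(charter="explore the checkout flow with invalid coupons")
session.minutes_on_mission, session.minutes_on_bugs = 60, 20
session.notes.append("expired coupon accepted when applied twice")
print(session.metrics())  # {'total minutes': 80, 'on-mission %': 75.0}
```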
for these reasons, too, et has gained greater popularity in the agile industry (suranto, 2015; raappana et al., 2016; garousi et al., 2017), requiring testing professionals to have at least some knowledge, experience, and skill with et. thus, although garousi et al. (2020) highlight that most courses offer little training on et, they also recommend broader et coverage in st education. ghazi (2017) highlights that an et session should start with a document, called a charter, which contains the mission described in a succinct way. the purpose is to ensure that the tester remains focused only on executing the session described in the charter. some guidelines are indicated to define the mission in the charter: (i) the mission must be neither too specific nor too generic; (ii) the mission determines what is to be tested (not how the test is to be carried out); (iii) at the end of the et session, new ideas, opportunities or problems found by the tester can be used to create new missions; (iv) after completion of the mission, it is important to have an evaluation of the session in order to discuss the results found. for hendrickson (2013), the mission format should be based on the following premise: define the mission and what should be explored. the mission of an et session can be defined with the estimation of test points. a test point corresponds to each piece of test work performed within the et mission. each mission can contain one or several test points that must be investigated during the time of the et session. it is important to note that the list of test points is dynamic, that is, new points can be added based on errors found and corrections (ghazi, 2017); and the points must be tested according to risk (high, medium or low), with the highest-risk points tested first.
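such a dynamic, risk-ordered list of test points can be sketched in a few lines; the risk weights and the items below are illustrative assumptions on our part, not part of ghazi's or hendrickson's proposals.

```python
import heapq

# risk levels mapped to weights; higher weight = explore first
RISK = {"high": 3, "medium": 2, "low": 1}

class TestPoints:
    """dynamic list of test points ordered by risk (highest first)."""
    def __init__(self):
        self._heap = []
    def add(self, description: str, risk: str):
        # negated weight turns heapq's min-heap into a max-heap by risk
        heapq.heappush(self._heap, (-RISK[risk], description))
    def next_point(self) -> str:
        return heapq.heappop(self._heap)[1]

points = TestPoints()
points.add("payment with expired card", "high")
points.add("profile picture upload", "low")
points.add("coupon applied twice", "medium")  # added mid-session after a bug
print(points.next_point())  # payment with expired card
```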
2.3 active learning

as a way to streamline teaching and offer students differentiated strategies that lead to effective learning, active methodologies emerge as an alternative to traditional teaching-learning approaches (bonwell and eison, 1991; mcconnell, 1996). currently, active methodologies are being adopted in teaching-learning across different areas of knowledge as a way to improve current techniques and involve students in this process (paiva et al., 2016), not limiting their learning to class time. active learning is characterized by stimulating students' autonomy and continuous participation in the learning process (bonwell and eison, 1991) through different teaching approaches, such as problem-based learning (pbl), team-based learning (tbl), the flipped classroom, and just-in-time teaching (jitt), among others. some other trends in active methodologies have emerged, such as peer instruction (pi) (crouch and mazur, 2001), design thinking (brown and katz, 2011), storytelling (andrews et al., 2009), and maker culture (milne et al., 2014), for example. among these modalities of active methodologies, pbl and jitt were adopted in a complementary way during the exploratory testing (et) course. as the course was conducted remotely, pbl contributed to initiating and motivating participants to learn through real-life problems and to encouraging group work skills and autonomous learning (bonwell and eison, 1991; coutinho and bezerra, 2018; de andrade et al., 2019). jitt fostered active participation in different activities before and during classes, encouraging participants to read the material and perform online tasks. for these reasons, these active methodology modalities were selected and applied in this study. next, we describe pbl and jitt separately.

2.3.1 problem-based learning (pbl)

pbl is a teaching method characterized by the use of problems to initiate and motivate the learning of concepts and to promote the skills and attitudes necessary for their solution (figuerêdo et al., 2011). in addition, pbl also aims to include the acquisition of an integrated and structured knowledge base around real-life problems, as well as promoting group work skills and autonomous learning (figuerêdo et al., 2011; de andrade et al., 2019; cheiran et al., 2017), through collaboration and ethics. pbl is considered a methodology strongly oriented to processes and accompanied by instruments that can assess its effectiveness (figuerêdo et al., 2011). therefore, the practical immersion promoted by pbl requires a teaching plan. this plan includes well-defined learning objectives, the structuring of a practical environment, the determination of roles for the subjects involved (teacher and student), and result evaluation strategies (figuerêdo et al., 2011; cheiran et al., 2017). in summary, pbl starts with the proposition of a problem and ends with its resolution. for this, some steps are indicated: (1) clarify terms that are difficult to understand; (2) list the problem(s); (3) discuss the problem(s); (4) summarize the discussion; (5) formulate learning objectives based on the problems; (6) seek information; and (7) integrate the information gathered to resolve the case. carrying out the pbl steps calls for the participation of a group of 10 to 12 students (with a coordinator and a secretary), a tutor, and the definition of a script with a description of the problem and, if necessary, a recommended bibliography or support material. pbl is suggested as a teaching-learning practice when there is a need to encourage the participation of students or professionals in the learning process, placing them as protagonists in this process and consequently removing them from the condition of passive receivers of knowledge.

2.3.2 just-in-time teaching (jitt)

jitt is a pedagogical strategy developed by novak (2011), whose essence is to connect activities inside and outside the class through warm-ups (martinez, 2018). in this approach, students are encouraged to read material about the content of the class and complete a small online task a few hours before the class takes place (martinez, 2018). this activity allows the teacher to plan the next class, or make considerations in class, according to the students' expectations or doubts (their answers). jitt also aims to encourage students to participate actively in different classroom activities, through greater control over their learning, motivation, and engagement (novak, 2011). with jitt, class time is used more effectively because less time is spent on material that students have learned from the reading, and more time is spent on more difficult subjects (martinez, 2018). in summary, the development of jitt encompasses three basic stages, centered on the student: (1) a warm-up exercise, in which the student is encouraged to read support materials and answer conceptual questions; from these answers, the teacher prepares the class; (2) class discussions on the reading tasks (rt), through the re-presentation of questions and (some) answers from some students, maintaining anonymity; and (3) group activities involving the concepts worked on in the rt and in the class discussion, which can be expository activities, fixation exercises, among others. jitt is indicated when one wants to stimulate, in the student or professional, the construction of prior knowledge about the content that will be discussed in class, and also to create the habit of studying before class.
other benefits involve oral and written communication skills and maximizing class effectiveness and time. jitt is mainly suggested for short courses or for content taught in a short class time.

3 methodology

this research examines the contributions of the combined use of the active methodologies pbl and jitt to assist in the teaching-learning process of et, during a course conducted remotely with members of geographically distributed agile teams. thus, the research is classified as an experience report (wohlin et al., 2012), as it precisely describes the planning (section 3.1.1), the execution (section 3.1.2), and the analysis procedures (section 3.1.3), as a way to contribute relevant considerations to the st teaching area, as well as to allow the replication of this experience in other se teaching contexts. in order to learn more about et, bibliographic research was initially conducted to understand the main approaches and tools that have been used to support the practice of et in the agile environment. this study culminated in an et course, aimed at agile professionals, to validate the practical application of the sbtm approach. to clarify the development phases of the experiment conducted, figure 1 illustrates the activities developed, from planning to evaluation.

3.1 study design

3.1.1 planning

the goals of this experience were defined following the guidelines of the goal question metric (gqm) paradigm. thus, we seek to analyze the pbl and jitt approaches in the teaching of exploratory testing, with the purpose of understanding their contributions, with respect to collaboration and integration between participants of a remotely conducted course, from the researchers' point of view, in the context of geographically distributed agile teams. to achieve this goal and conduct this research, we defined the following research question (rq): "how can practical et learning be encouraged with geographically distributed agile teams, seeking integration among members and promoting active learning, in order to encourage the insertion of et in daily work?". thus, the rq aims to identify the main contributions and limitations of implementing the active learning methodology pbl, associated with jitt, in an et course in remote format, regarding content learning, integration, collaboration between participants, practical activities, and other aspects inherent to solving problems based on real scenarios. to answer this rq, a course on et was planned and executed (see section 3.1.2) with an agile dsd team and, at the end, an online questionnaire was applied to collect the participants' feedback on the adopted teaching-learning methodology. as shown in figure 1, the planning phase of the et course consisted of four well-defined steps, described below.

step 1. define the course plan. in this stage, we defined, in a detailed manner and according to the adopted methodology, the course syllabus, the number of hours to be taught, the date of the course, the target audience, the objectives to achieve, the materials needed, and the classes to be produced. it is important to remark that the definition of this course plan was widely discussed, reviewed, and evaluated by two specialists in the st field.
moreover, we defined the tools to be used in the course, considering the context of remote learning, as follows: google meet, for video communication during classes; discord, for communication between participants during practical activities; google drive, for storing and sharing class materials and resources (documents, spreadsheets, and presentations); google forms, for the elaboration and availability of the evaluation questionnaire after the course; and the xray exploratory app (https://www.getxray.app/exploratory-testing-app/), for et planning and execution, only in the last practical activity. it is important to highlight the contributions of the xray exploratory app tool: (i) it has desktop and mobile versions; (ii) it is possible to integrate it with jira software, although it was not possible to apply this in this study; (iii) it assists in bug detection, while et sessions are recorded in video, audio, and/or screen capture format; (iv) et sessions are detailed and executed directly in the tool; and (v) when closing the session, a report is automatically attached to the test run, a strategy that provides quick feedback to testers. in summary, the xray exploratory app assists in the test report and in the documentation produced.

step 2. develop class materials. the classes in this course are intended to train participants on the subject of et in the agile context and to balance the level of knowledge among all agile professionals participating in the course. in this context, the content covered in the class materials was based on bach (2003); castro (2018); hendrickson (2013); whittaker (2009); crispin and gregory (2009); ghazi (2017) and on current lectures conducted by renowned experts in the field of et.

figure 1. development phases of this experience.

table 1. relation of the questions addressed in the questionnaire to the purpose of each section.
section | goal | questionnaire questions | question format
i | identify the profile of the professional participating in this research and their experience in the st area. | 01, 02, 03, 04, and 05 | all questions are objective.
ii | identify the organizational procedures and practices regarding st in the sprints of the projects developed by the agile teams, before the course was offered. | 06, 07, 08, 09, 10, 11, and 12 | all questions are objective, except question 12, which is subjective. questions 09 and 10 follow a likert-scale response format.
iii | identify the participants' perceptions about the teaching-learning obtained with the et course. | 13, 14, 15, 16, 17, 18, 19, 20, 21, and 22 | all questions follow a likert-scale response format, except question 22, which is objective.
iv | identify the contributions of the pbl and jitt approach, used in an associated way, in the et course. | 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, and 41 | all questions follow a likert-scale response format, except question 41, which is subjective.
it is important to highlight that we prepared (1) lecture notes (slides) with theoretical content and practical examples on et and (2) a handout with a detailed synthesis of the content covered in the course; we also selected (3) a list of tools that support et planning and execution, (4) a list of videos (tutorials and lectures) available on the web, and (5) a list of technical articles and books on et in the agile context. the class material adopted in the et course can be accessed at https://bityli.com/puehugfn.

step 3. develop practical activities. to exercise and reinforce learning about the content taught in each module of the course, examples and practical activities were prepared, based on the guidelines provided by the pbl and jitt methodologies. at this stage, the materials and resources needed to carry out these activities were defined and elaborated, for example: the selection of the web system to be tested; a guide with basic guidelines for each practical activity; templates of the test artifacts (such as charters, test points, and session reports) to optimize the time devoted to each activity; requirements artifacts (such as a system requirements specification document and a use case diagram); and an installation manual for the xray exploratory app. some of these materials and resources needed to be improved during the course to meet the doubts and needs of the students, diagnosed in advance (i.e., before the class) through the application of the jitt methodology. it is important to highlight that it was possible to follow all the stages of pbl and jitt in full (figuerêdo et al., 2011; novak, 2011), even though the course was carried out in a remote teaching format.

step 4. elaborate the evaluation questionnaire. to collect information about the experience and learning of the participants, an online questionnaire was created (available at https://cutt.ly/ym5vek2), with objective and subjective questions. a total of 41 (forty-one) questions were included, distributed between 39 (thirty-nine) multiple-choice questions and 02 (two) open questions, whose answers were optional for the participant. the questionnaire was designed in google forms and organized into four sections. the first section aimed to briefly characterize the professional profile of the respondents. the second section sought to identify the organizational procedures and practices regarding st in the sprints of the projects developed by the agile teams before the course was offered.

table 2. structure of the exploratory testing course.
topics | contents | practical activity (pa) | course workload (h)
module i: introduction | 1.1. what is et? 1.1.1. et characteristics; 1.2. what is not et? 1.2.1. randomness and ad hoc testing; 1.2.2. scripted tests; 1.3. when to use et? | pa1 goal: understand the product, create hypotheses, and plan test scenarios. | 02h
module ii: et in practice | 2.1. et heuristics; 2.2. et planning; 2.3. writing et cases: charters; 2.4. introduction to sbtm; 2.5. running tests based on sessions; 2.6. evaluation of a session | pa2 goal: investigate heuristics, run tests, and log failures. pa3 goal: apply task breakdown structure (tbs) metrics. | 04h
module iii: a little more about et | 3.1. problems, challenges, solutions; 3.2. et good practices; 3.3. et support tools | pa4 goal: practice using the xray exploratory app tool through the execution of an et session. | 02h
the third section sought to identify the respondents' perceptions about the teaching-learning obtained during the et course. finally, the fourth section aimed to identify the contributions of pbl and jitt in conducting the et course. table 1 relates the questions addressed in the questionnaire to the goal of each section (see section 3.1.1). it is important to highlight that, to answer the questionnaire, participants should: (1) have participated in all modules of the course; (2) have carried out the practical activities developed in each module; and (3) right at the beginning of the questionnaire, have agreed to a free and informed consent form (ficf) for the research. table 2 presents the structure of the course, together with the description of the topics and contents covered in the syllabus and the practical activities planned for the end of each class module. additionally, the workload defined for each module of the course is reported.

3.1.2 execution

the population of this study comprised twelve professionals from the software development industry who work with agile methodologies in the same organization. currently, these professionals work in geographically distributed locations due to the coronavirus disease (covid-19, caused by sars-cov-2) pandemic, which has affected the brazilian population since february 2020. for this reason, too, the et course was conducted in a completely remote teaching context. it is important to highlight that 50% of the course participants had already performed et, even without knowing the definition of the practice in detail. in general, the execution of the experience took place as internal training with agile teams of that organization and, as planned, in four virtual meetings of 02 hours each, on 06, 07, 12, and 13 april 2021. it is important to highlight that module ii was divided into two meetings, due to the extent of the content taught. at each meeting, the content was taught and participants were able to ask questions and resolve their doubts throughout the class. then, to exemplify the discussed theory, a demonstration was made with real examples, and participants were instructed to exercise the knowledge obtained through a practical activity based on a real web system. for this, some guidance on the activity was provided. participants were distributed in teams and encouraged to interact and collaborate through the dynamics of each activity. the resolution of a real problem also sought to encourage participants to research, reflect, and develop et relevant to the context analyzed in the activity. this strategy was based on the guidelines provided by pbl. at the end of each meeting, class materials and resources were made available to participants so that they had prior knowledge of the next content to be discussed in the course. this strategy, based on jitt, sought to encourage interaction between the teacher and the course participants, in addition to enabling more in-depth discussions during the class and anticipating feedback on the materials and resources adopted for the next meeting. to solve the proposed problem, the participants were monitored for approximately 40 to 50 minutes. the time was stipulated according to the complexity of the activity proposed in each module.
for that, the activities were defined in order to build knowledge about the execution of et sessions, and each activity involved a practice related to the content studied in the respective course module (see table 2). in each activity, a set of practices was defined that served as guidance for the execution of the et sessions (see the course material). at the end of the course, participants were instructed to fill out an online questionnaire, whose purpose was to collect information about the experience and the learning of et through the pbl and jitt practices.

3.1.3 analysis procedures

after data collection through the online questionnaire, individual reports were generated according to the objective of each section investigated in the questionnaire. it is worth noting that the information in these reports was anonymized to preserve the identity of the participants. thus, to analyze the data extracted from the content of the responses provided by the participants, a quantitative analysis was conducted (wohlin et al., 2012), mainly on the responses provided through the likert scale, with options from 1 to 5 (being: 1 totally disagree; 2 partially disagree; 3 neither agree nor disagree; 4 partially agree; 5 totally agree). in this sense, the answers were analyzed by class: disagreement, indecision, and agreement. additionally, a qualitative analysis was conducted on the answers to the subjective questions (in total, only two: questions 12 and 41), but, as they were optional or complementary to the objective questions, there was little need to apply this type of analysis. thus, when necessary, we synthesized and analyzed the responses using the open, axial, and selective coding of grounded theory (corbin and strauss, 2014).
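as an illustration of this quantitative step, the sketch below groups the five likert options into the three analysis classes named above and computes the percentage of answers per class for one question. the function name and the sample answers are hypothetical, chosen only to mirror a question answered by twelve participants; the 1-to-5 mapping follows the scale defined in the text.

```python
from collections import Counter

# map likert options 1-5 onto the three analysis classes used in the text:
# 1-2 -> disagreement, 3 -> indecision, 4-5 -> agreement
CLASSES = {1: "disagreement", 2: "disagreement",
           3: "indecision", 4: "agreement", 5: "agreement"}

def classify(responses):
    """return the percentage of answers in each class for one question."""
    counts = Counter(CLASSES[r] for r in responses)
    total = len(responses)
    return {c: round(100 * counts.get(c, 0) / total, 1)
            for c in ("disagreement", "indecision", "agreement")}

# hypothetical answers of twelve participants to a single question
print(classify([5, 4, 4, 5, 3, 5, 4, 5, 5, 4, 2, 5]))
# -> {'disagreement': 8.3, 'indecision': 8.3, 'agreement': 83.3}
```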
3.2 checking the use of et in practice

after the execution of the et course, the activities performed by the participants were monitored during their daily work with agile development. based on the guidelines learned in the course, et sessions were planned and executed. then, a brainstorming session was conducted with the professionals to understand the real advantages and difficulties experienced in this context.

3.2.1 brainstorming planning

brainstorming is a technique used in groups to generate innovative ideas or insights into a particular topic (bonnardel and didier, 2020). overall, brainstorming should (i) generate as many ideas as possible, (ii) extend the interpretation of ideas, (iii) present original ideas, and (iv) combine and improve existing ideas. to conduct the brainstorming, a main question was defined, that is, a problem to be solved, together with a set of activities to be followed. thus, the following question was defined: "what can be done to integrate exploratory testing, as a test practice, into the team's daily life?". from this main question, other specific questions were presented to guide and contribute to the generation of ideas (see table 3). regarding the set of activities followed in the brainstorming, the following were planned (see figure 2):

1. activity 1. brainstorming in silence. this activity consists of generating ideas, individually, to try to solve the presented problem. thus, participants must write their ideas on post-its.
2. activity 2. sharing ideas. this activity consists of presenting the ideas that were generated and transcribed on the post-its. other participants are allowed to ask questions or add any new information or ideas.
3. activity 3. filtering ideas. the objective of this activity is to discard ideas that are not aligned with the context of the problem or that generate disagreement.
4. activity 4. first vote. in this activity, all participants must select the ideas that best solve the exposed problem. only the 6 most-voted ideas move on to the next activity.
5. activity 5. improvement of ideas. the objective of this activity is to improve the most-voted ideas, adding important new information through more post-its, with details of artifacts, testing activities, documentation, platforms or et tools, and team organization, among others.
6. activity 6. second vote. finally, the participants vote for a second time on the idea most applicable to solving the presented problem.

figure 2. activities performed in brainstorming.

table 3. brainstorming specific questions.
1. how did the session-based testing strategy get in the way of et execution?
2. does something prevent et from being routinely practiced by the team? what actually prevents it? (process, tool, team, project, time, etc.)
3. what can we do to improve the execution of et?
4. which requirements strategy or artifact is most useful to assist in the realization of et?
5. what can be done to make these requirements clearer to the team?
6. what information is important to record/plan before performing the et, in addition to what was indicated?
7. what information is important to record during the execution of the et, in addition to what was indicated?
8. what information is important to record after performing the et, in addition to what was indicated?
9. which practices were most interesting?
10. which practices did you not find interesting?
11. what benefits for the team's day-to-day activities were observed in the course?
12. in light of what was learned, what was the most difficult thing to implement on a day-to-day basis?
13. has anything changed in the team's testing practice after the course? what has changed?
14. what do you see that would change in test practice after the course?
15. did the et course influence the incorporation of testing practices? what really influenced you?
16. is et useful as a testing practice in the context of remote work? what could be incorporated to contribute to remote work?

3.2.2 execution of brainstorming

after the execution of the et course, the participants were led to apply et sessions in the projects they develop. in total, nine et sessions were held, divided into two specific moments of the project. then the brainstorming took place, in a completely remote context, with the participants locally distributed. for this reason, the online tool lucidspark (https://lucidspark.com/pt) was adopted to facilitate the transcription of ideas and the collaboration between physically distant participants. after some initial orientations, the participants were led through the brainstorming activities. in total, the brainstorming lasted seventy-nine (79) minutes, longer than anticipated in the initial planning. figure 3 illustrates the execution of the post-its brainstorming, as presented by the lucidspark online tool. the results are discussed in section 4.2.

figure 3. execution of brainstorming.

4 results

after the experience was carried out, data were collected and analyzed. in total, the information provided by the twelve course participants was considered, as they all agreed to participate in this study, followed the discussions during classes, and performed all the practical activities provided at the end of each module. these results are discussed in section 4.1. then, the results of the brainstorming carried out with the participants after et insertion with agile teams are presented in section 4.2.
4.1 results of the experiment

next, sections 4.1.1 to 4.1.4 present the characterization of the participants, the most common agile st practices adopted by the participants before the course, the perception of et after the course, and the main contributions of pbl and jitt to the teaching-learning of et.

4.1.1 characterization of participants

initially, to characterize the participants' professional profiles, an analysis was made of each team member's attributions and professional experience. considering that the composition of agile teams is multidisciplinary, that is, each team member can perform different functions during the developed software project, we identified different attributions distributed among the team members (see table 4), among them: back-end developer and front-end developer, played by 50% of the participants; software engineer, 41.7%; project manager, 25%; database administrator and tester or quality analyst, 16.7% each; and software architect, scrum master, designer, mobile developer, and infrastructure engineer, 8.3% each. other attributions, such as analyst or business leader, analyst or requirements engineer, and product owner (po), among others, were not reported.

table 4. assignments of participants in agile teams.
assignments | answers (nº) | answers (%)
database administrator | 2 | 16.7%
architect | 1 | 8.3%
back-end developer | 6 | 50%
front-end developer | 6 | 50%
designer or human-computer interaction specialist | 1 | 8.3%
project or product manager | 3 | 25%
scrum master | 1 | 8.3%
software engineer | 5 | 41.7%
tester or quality analyst | 2 | 16.7%
mobile developer | 1 | 8.3%
infrastructure engineer | 1 | 8.3%

regarding the level of academic education of the participants, 58.3% have completed an undergraduate degree, 33.3% have a stricto sensu postgraduate degree at the master's level, and only 8.3% are still attending undergraduate studies. another factor observed was the professional experience of the participants:

1. working experience in the software industry: 50% have worked in this context between 1 and 2 years; 16.7%, between 3 and 5 years; 25%, between 6 and 10 years; and 8.3%, for more than 11 years. none of them reported little experience with software development, that is, less than 1 year of experience in the market.
2. working time with agile methodologies: 50% have worked in this context between 1 and 2 years; 25%, between 3 and 5 years; 16.7%, between 6 and 10 years; and 8.3%, for more than 11 years. none of them reported little experience (less than 1 year) or no experience with agile methodologies.
3. working time with agile st: 50% have performed tests between 1 and 2 years; 16.7%, between 3 and 5 years; and 16.7%, between 6 and 10 years. however, another 16.7% reported not working with testing at all.

4.1.2 common practices in agile st

additionally, to identify how tests are commonly conducted by the agile teams, an analysis of the main st organizational practices performed in the sprints of the projects was carried out.
generally, those responsible for testing the software or the software module developed are the back-end developer (41.7%), the front-end developer (25%), the product owner (po) (16.7%), the project or product manager (41.7%), the scrum master (8.3%), the software engineer (16.7%), the tester or quality analyst (25%) and, in some cases, everyone on the team (41.7%). we also identified that tests are usually performed throughout the software lifecycle (41.7%). in some phases, testing occurs with more emphasis, such as during (33.3%) or after coding the software (58.3%), and during (25%) or after the software integration phase (25%). in other phases, testing takes place with less intensity, such as during (8.3%) or after the software verification phase (16.7%); during (8.3%) or after the production of software documentation (8.3%); or during (25%) or after the software maintenance phase (16.7%). in agile st, test types are categorized into quadrants (crispin and gregory, 2009). considering this categorization, we notice that the tests performed most frequently by the participants are unit tests (50%), exploratory tests (50%), component/integration tests (41.7%), functional tests (41.7%), usability tests (41.7%), performance and load tests (41.7%), simulations (33.3%), scenarios (16.7%), user acceptance tests (16.7%), alpha/beta tests (8.3%), and examples (8.3%). to assess the participants' perception of the types of tests performed on their teams, with regard to their professional activities, we posed the following questions:

• question 09: "i believe that the software testing strategies adopted so far, and reported above, have been sufficient to detect bugs in the system".
• question 10: "i believe we need to extend and improve the software testing practices used so far to try to ensure higher quality in whatever product we develop".

these statements contained multiple-choice items, according to the likert scale detailed in section 3.1.3. table 5 presents the results of the answers to questions 09 and 10, with the likert options as follows: (1) totally disagree, (2) partially disagree, (3) neither agree nor disagree, (4) partially agree, and (5) totally agree.

table 5. results of responses to questions 09 and 10 ((1) totally disagree, (2) partially disagree, (3) neither agree nor disagree, (4) partially agree, (5) totally agree).
| (1) | (2) | (3) | (4) | (5)
question 09 | 8.3% | 25% | 8.3% | 50% | 8.3%
question 10 | 0% | 0% | 0% | 8.3% | 91.7%

regarding question 09, it was possible to observe a predominance of responses toward agreement, which may indicate that the participants consider the st strategies used to detect system bugs to be sufficient. however, in question 10, they unanimously agree that the st practices adopted by the teams until then need to be improved and expanded. these results indicate that although the participants consider the agile testing practices adopted by the team to be sufficient, they also perceive the need to add other st practices to try to ensure greater quality in the developed projects. to understand the main problems related to the execution of tests in the projects developed by the participants in the daily work of their teams, we also presented assertions as response options, with multiple-choice items according to the likert scale. these assertions can be consulted in the evaluation questionnaire (see section 3.1.1) and are described below. the results of the answers obtained can be seen in table 6.
thus, the assertions 'a' to 's', presented below, belong to question 11 of the questionnaire and comprise the participants' perception of the problems encountered in the practice of st. these assertions were represented by letters of the alphabet so as not to be confused with individual questions or statements in the questionnaire.

• assertion a: "a weak relationship between the client and the project leader".
• assertion b: "a weak relationship between the leader and other team members".
• assertion c: "constantly changing objectives, business process and/or requirements during the sprint".
• assertion d: "lack of collaboration between test analysts and developers (programmers)".
• assertion e: "failure to communicate within the development team (programmers) of the project".
• assertion f: "software requirements are purposely expressed in general terms, omitting specific implementation details".
• assertion g: "hidden, incomplete or inconsistent requirements".
• assertion h: "sprints too short".
• assertion i: "lack of knowledge about software testing practices and techniques".
• assertion j: "lack of training on specific software testing practices and techniques".
• assertion k: "there is no time to test as one should".
• assertion l: "there is no specific professional to run the tests within the team".
• assertion m: "trainings are time-consuming and tiring".
• assertion n: "there is too much effort to plan/design the tests".
• assertion o: "there is too much effort to run the tests".
• assertion p: "finding defects during production causes rework, delaying the completion of the sprint".
• assertion q: "the use of traditional testing practices in the agile environment does not favor the work developed during the sprint".
• assertion r: "a programmer tests their own code or the development team tests their own project".
• assertion s: "test cases are written only for valid and expected inputs".

the assertions with the highest agreement were those referring to the constant change of objectives, business process, and/or requirements during the sprint (assertion c); the existence of hidden, incomplete, or inconsistent requirements (assertion g); the lack of knowledge about software testing practices and techniques (assertion i); the lack of training in specific software testing practices and techniques (assertion j); insufficient time to perform the tests as one should (assertion k); the inexistence of a specific professional on the team to perform the tests (assertion l); and the effort to perform the tests (assertion o). most of the problems highlighted are typical of teams that work in an agile context (agile alliance, 2016) and can explain, for example, the need pointed out by the participants to expand and improve the adopted st practices.

table 6. results of responses on problems in st practice.
((1) totally disagree, (2) partially disagree, (3) neither agree nor disagree, (4) partially agree, (5) totally agree)
| (1) | (2) | (3) | (4) | (5)
assertion a | 50% | 25% | 16.7% | 0% | 8.3%
assertion b | 58.3% | 16.7% | 8.3% | 8.3% | 8.3%
assertion c | 16.7% | 8.3% | 8.3% | 50% | 16.7%
assertion d | 58.3% | 0% | 16.7% | 16.7% | 8.3%
assertion e | 58.3% | 16.7% | 0% | 16.7% | 8.3%
assertion f | 25% | 16.7% | 16.7% | 25% | 16.7%
assertion g | 33.3% | 8.3% | 0% | 33.3% | 25%
assertion h | 33.3% | 25% | 16.7% | 16.7% | 8.3%
assertion i | 0% | 16.7% | 25% | 25% | 33.3%
assertion j | 0% | 8.3% | 25% | 33.3% | 33.3%
assertion k | 0% | 25% | 8.3% | 41.7% | 25%
assertion l | 0% | 0% | 8.3% | 25% | 66.7%
assertion m | 25% | 25% | 16.7% | 16.7% | 16.7%
assertion n | 25% | 16.7% | 25% | 8.3% | 25%
assertion o | 25% | 8.3% | 16.7% | 16.7% | 33.3%
assertion p | 50% | 8.3% | 8.3% | 25% | 8.3%
assertion q | 33.3% | 25% | 16.7% | 16.7% | 8.3%
assertion r | 25% | 8.3% | 33.3% | 16.7% | 16.7%
assertion s | 25% | 16.7% | 33.3% | 16.7% | 8.3%

some participants also reported how the st process occurs on their teams. figure 4 highlights some of the reports made.

figure 4. st process executed in agile team projects, as described by the participants (question 12).

4.1.3 perception of et after the course

we also investigated the learning gained by the participants during the course by analyzing the information collected on some key topics of the et content covered. for this, we presented the following questions to the participants and asked that the answers be given according to the multiple-choice options of the likert scale.

• question 13. "i have come to understand the importance of using heuristics in exploratory testing".
• question 14. "i was able to understand that a list of heuristics to be adopted in exploratory tests helps in deciding how to test the functionality/module/system".
• question 15. "i have come to understand the usefulness and importance of test charters in exploratory tests".
• question 16. "i was able to realize that although it is not necessary to prepare a detailed test plan, simple planning helps with the execution of the exploratory test".
• question 17. "i managed to learn how to plan the exploratory test".
• question 18. "i was able to see that requirements artifacts, even if not very detailed, can contribute significantly to the planning (setup) of the session".
• question 19. "i was able to see that the test artifacts (charter, test point, and session report) generated while conducting the sbtm were useful for the execution of the exploratory test".
• question 20. "from the explanation about sbtm, i was able to apply this approach with ease in the practical activity".
• question 21. "i was able to understand the importance of the alignment meeting between the team to register possible failures, create possible formal test cases, create new missions, register possible requirements, and register new test points".

the results associated with questions 13 to 21 demonstrate a predominant agreement on the learning of all content and practices taught during the course.
among the questions presented in this evaluation criterion, the following stood out with more emphasis: the importance of simple planning for the execution of the et (question 16); that requirements artifacts, even less detailed ones, can contribute to session setup (question 18); the importance of the alignment meeting as a strategy to register possible failures, create possible formal test cases, create new missions, register possible requirements, and register new test points (question 21); the usefulness of the simple et artifacts generated in conducting the sbtm for the execution of the test (question 19); and the relevance of defining heuristics in et (question 13); among other relevant questions presented in table 7.

table 7. results of answers to questions 13 to 21 ((1) totally disagree, (2) partially disagree, (3) neither agree nor disagree, (4) partially agree, (5) totally agree).
| (1) | (2) | (3) | (4) | (5)
question 13 | 0% | 0% | 8.3% | 33.3% | 58.3%
question 14 | 0% | 0% | 25% | 25% | 50%
question 15 | 0% | 0% | 16.7% | 33.3% | 50%
question 16 | 0% | 0% | 0% | 25% | 75%
question 17 | 0% | 8.3% | 8.3% | 50% | 33.3%
question 18 | 0% | 8.3% | 8.3% | 16.7% | 66.7%
question 19 | 0% | 0% | 25% | 16.7% | 58.3%
question 20 | 8.3% | 0% | 25% | 50% | 16.7%
question 21 | 0% | 8.3% | 0% | 25% | 66.7%

the performance of the practical activities provided participants with a real experience of challenges common to et, such as: the limited domain knowledge and the necessary qualities of testers in the application of et make it difficult to carry out the tests (91.7%); the absence of an et plan results in the same functionality being tested several times, while an important functionality may not be tested or a serious error may go undetected (91.7%); the lack of defined test cases makes it difficult to reproduce the tests performed when necessary, such as in regression tests (75%); an incorrectly interpreted output can lead to defects that may remain in the system or be eventually detected in future tests (5%); as there is no detailed test guide or plan, and no artifacts more complete than the failure report are produced, it is difficult to know what has and has not been tested (50%); and et is not suitable for real-time systems (8.3%).

4.1.4 contributions of the pbl and jitt approach

to identify the contributions of pbl and jitt to the teaching-learning process applied in the et course, an analysis of the characteristics of these methodologies was carried out. from this perspective, a set of eighteen questions (23 to 40) was presented to the participants to be analyzed and answered through multiple-choice options, also following the likert scale. the questions are listed below:

• question 23. "the scenario (web system) worked on in the practical activities represented a real scenario of software development."
• question 24. "the scenario (web system) worked on in the practical activities had a high level of complexity."
• question 25. "practicing the theoretical content with a real web system helped me to better understand the concepts of exploratory testing."
• question 26. "through practical activities with a real web system, the course made it possible to learn, autonomously and independently, the main methods and techniques of exploratory testing."
• question 27. "through practical activities with a real web system, the course made it possible to work collaboratively in groups in order to broaden the discussions in the team about the theory learned."
• question 28. "through practical activities with a real web system, the course made it possible to work collaboratively in groups in order to deliver the project activities on time."
• question 29. "through practical activities with a real web system, the course made it possible to work collaboratively in groups in order to deliver the project activities with quality."
• question 30. "although physically separated, interacting with the team during practical activities was not difficult."
• question 31. "the use of conversation tools (such as discord) and collaboration tools (such as google sheets, google drive, and google docs) contributed to the team's interaction in practical activities, decreasing the physical distance."
• question 32. "i realized that giving my opinion (feedback) about the class, regarding the approach adopted, the exposed content, or the supporting material used (slides, pdf, artifacts, videos, etc.), contributed to the organization of the class and the instructor's practice in the next class."
• question 33. "i realized that giving my opinion (feedback) about the class, regarding the approach adopted, the exposed content, or the supporting material used (slides, pdf, artifacts, videos, etc.), helped the instructor to focus on the main difficulties that were expressed by the participants."
• question 34. "i realized that giving my opinion (feedback) about the class, regarding the approach adopted, the exposed content, or the supporting material used (slides, pdf, artifacts, videos, etc.), maximized efficiency and class time."
• question 35. "i realized that the practical activities were also aimed at stimulating my oral and written communication, through discussions with the team and elaboration of the test artifacts."
• question 36. "i realized that the practical activities were also aimed at stimulating group work skills, such as distributing the roles of each member, setting goals, understanding objectives, providing collaboration and communication, among other aspects."
• question 37. "i collaborated more with my team in practical activities 2 (investigating heuristics) and 3 (applying tbs metrics) than in practical activity 1 (creating hypotheses and planning test scenarios) because i felt more secure about the web system i was exploring only in these activities, as i didn't know the business scenario well before."
• question 38. "i felt more secure in carrying out the practical activities only after the course instructor provided more specific guidance on the task, as the guidance in the support material (slides) was not clear enough."
• question 39. "i felt more motivated in practical activities 2 (investigating heuristics) and 3 (applying tbs metrics) after the course instructor made the test artifact templates available."
• question 40. "i had problems collaborating on practical activities because i couldn't understand them."

to classify the answers more accurately, we grouped the questions correlated to the main practices of active methodologies in general, perceived in questions 23 to 25; of pbl, in questions 26 to 29 and 36; and of jitt, in questions 32 to 36. we highlight that questions 36 to 40 characterized practices common to both pbl and jitt. questions 30 and 31 sought to understand the participants' perception of the dynamics of the course in the remote setting. table 8 displays the answers given to the questions.

table 8. results of responses to questions 23 to 40 ((1) totally disagree, (2) partially disagree, (3) neither agree nor disagree, (4) partially agree, (5) totally agree).
| (1) | (2) | (3) | (4) | (5)
question 23 | 0% | 0% | 8.3% | 16.7% | 75%
question 24 | 25% | 25% | 33.3% | 8.3% | 8.3%
question 25 | 0% | 8.3% | 0% | 25% | 66.7%
question 26 | 8.3% | 16.7% | 0% | 33.3% | 41.7%
question 27 | 0% | 0% | 0% | 25% | 75%
question 28 | 8.3% | 16.7% | 16.7% | 25% | 33.3%
question 29 | 0% | 0% | 16.7% | 16.7% | 66.7%
question 30 | 0% | 0% | 8.3% | 25% | 66.7%
question 31 | 0% | 0% | 0% | 8.3% | 91.7%
question 32 | 0% | 0% | 0% | 16.7% | 83.3%
question 33 | 0% | 0% | 8.3% | 25% | 66.7%
question 34 | 0% | 0% | 33.3% | 16.7% | 50%
question 35 | 0% | 8.3% | 8.3% | 8.3% | 75%
question 36 | 0% | 0% | 8.3% | 16.7% | 75%
question 37 | 0% | 16.7% | 8.3% | 33.3% | 41.7%
question 38 | 0% | 8.3% | 16.7% | 25% | 50%
question 39 | 0% | 0% | 8.3% | 33.3% | 41.7%
question 40 | 25% | 33.3% | 0% | 25% | 16.7%

regarding the general practices guided by active methodologies, we investigated the participants' perception of the inclusion of real practical examples in the activities carried out in the course. it was possible to observe a predominance of agreement in the answers to questions 23 and 25, which refer, respectively, to "the scenario (web system) worked on in practical activities represented a real scenario of software development" and "practicing the theoretical content with a real web system helped to better understand the concepts of et". however, in question 24 we identified a majority of disagreement with the "high level of complexity of the scenario worked on in practical activities". the answers provided to questions 26 to 29 and 36, related to pbl practices, which deal with the inclusion of real problems as practical activities in the teaching of content, provide evidence of the efficiency of using this methodology, especially regarding: learning autonomously and independently about et (question 26); collaborative group work to expand team discussions on the theory learned (question 27) and to deliver project activities on time (question 28) and with quality (question 29); and the encouragement of group work skills, such as distributing roles to each member, setting goals, understanding objectives, and providing collaboration and communication, among other aspects (question 36). regarding statements 32 to 36, which deal with practices common to jitt, the results provided by the participants express a majority of agreement on the contributions of the feedback given about the class, based on prior access to the class contents and materials. thus, from the point of view of the participants, this strategy contributed to the organization of the class and the instructor's practice in the next class (question 32); helped the instructor to focus on the main difficulties that were expressed by the participants (question 33); and maximized efficiency and class time (question 34). another jitt characteristic observed in the questions, which also showed a predominance of agreement, was related to the objective of the practical activities, namely: the encouragement of oral and written communication, through discussions with the team and the preparation of test artifacts (question 35), and the encouragement of group work skills (question 36). we also investigated how collaboration within the team during practical activities stimulated the participants' learning.
agreement prevailed in questions 37, 38, and 39, which referred, respectively, to the confidence to collaborate more in the final practical activities than in the initial ones, as participants were by then better adapted to the business scenario provided as a real example in the activity; confidence in performing the activities after more specific instructions from the instructor; and motivation after the instructor provided templates for the test artifacts. a positive aspect was the predominant disagreement with statement 40: a large part of the participants disagreed that they had "problems in collaborating in practical activities because they could not understand them". this result can be explained by the aspects already confirmed in statements 37 to 39. finally, the benefits and difficulties of participating in the theoretical and practical activities of the course were investigated, given its implementation in a completely remote teaching context. table 9 presents the main testimonies of the participants regarding the perceived benefits and difficulties. according to the statements reported in the responses, we identified that the content, the main approaches, and the et tools, as well as the usefulness of this type of testing in agile methodologies, were not known by some of the participants. these aspects were pointed out as a benefit of the course for the work developed by the teams. regarding the reported difficulties, we found that the practices could have been conducted with products developed by the teams themselves, as a way to facilitate the understanding of the business scenario, and that the course workload was considered short for the extent of the content and practices developed.

table 9. benefits and difficulties of participating in an et course in remote format.
benefits
participant b: "the et scope and planning to deliver quality software."
participant h: "learn how to document the execution of exploratory tests."
participant i: "learning about the topic. although the team, which also works together on development, somewhat adopted what was proposed in the classes, it was clear how we could improve."
participant j: "the theoretical content and practical activities were of great importance for a more solid understanding of the et... the requirements document should also be detailed enough to enable the planning of the et by the responsible team... in addition, the et technique presented in the training can contribute a lot to the quality of the developed artifacts."
participant k: "through the course, i had my first contact with et, and participating in it made me learn a lot... what was seen seemed very attractive for the context of agile methodology."
participant n: "i found the use of heuristics in the tests interesting, i didn't know about it."
difficulties
participant b: "not having any type of st course in my graduation."
participant f: "the guidelines of the support material; sometimes it was not clear what should be done, generating doubt in the group."
participant h: "differentiating what information should be placed in each field of the template provided during et planning."
participant l: "it was a little difficult to think of scenarios for an initially unknown system. i think it would be more beneficial if the practice of the course had used a system developed and well-known by the team/class."
participant o: "i had difficulty participating due to the course schedule, as it wasn't my actual work schedule."

4.2 results of et insertion with agile teams

the six activities planned for the brainstorming were conducted. there was no maximum or minimum limit on the number of ideas to be expressed by the participants. thus, participants were encouraged to present their ideas within the time interval defined for each activity (see section 3.2.1). in activity 1, 37 answers were obtained to the specific questions listed in the preparation of the brainstorming. then, in activity 2, some questions were asked in order to clarify doubts related to the exposed ideas. in activity 3, some ideas had to be grouped together and others discarded; a total of 29 ideas that were not aligned with the main brainstorming context were disregarded. it is important to highlight that not all the answers obtained were considered viable ideas to be applied, as some were repeated, complemented each other, or were outside the research context. in activity 4, 06 ideas were voted on to be included in the next phase. in activity 5, the most-voted ideas were briefly discussed and improved, with the aim of supporting the vote to be carried out in the next activity. finally, activity 6 resulted in a single viable idea to implement: the definition of a more viable approach to be implemented in the daily life of the team.
the ideas presented in the brainstorming were related to the implementation of et in the participants' daily lives, following the application of the et course. thus, the exposition of some ideas was crucial to understanding the effectiveness and usefulness of the practices exercised and the artifacts generated. we categorized these ideas to explain what actually applies, and what does not apply, to the daily lives of agile teams. the following is a summary of the ideas exposed in the brainstorming that favored the incorporation of et in the daily life of agile teams:

• the registration of test points and the test report were considered important for the planning and execution of the et.
• regarding the benefits for the team's day-to-day activities, the importance of recording the et performed was highlighted, in order to make explicit the points that were tested and their pending issues.
• the registration of test points was seen as a significant contribution of the et sessions because, in practice, there was an improvement in the activity of recording the test to be done.
• considering the context of distributed work, adopting files or artifacts with permission for simultaneous collaboration, together with online tools, contributed to the et performed remotely.

below is a summary of the ideas exposed in the brainstorming that stood out as limiting factors to the incorporation of et in the daily life of agile teams:

• the minimum and maximum time limits for executing an et session do not apply in practice, as this factor is relative to the tested functionality; the same holds for recording the complete execution of an et session.
• the unavailability of time to plan et sessions was identified as a difficulty in implementing et in practice.
• having a well-defined et process could contribute to the insertion of et as a test practice in the team's daily life.
• the minimum time reserved for an et session, in sbtm, could be smaller for small features.
• the lack of experience with et makes the professional who performs the test dedicate a significant effort to the preparation of the et session as a whole.
• the absence of a well-defined process that optimizes the preparation time of an et session limits the insertion of et in the team, as does the difficulty of organizing and committing the entire team to find a common moment to perform the et.

in a complementary way, some other ideas emerged to facilitate the insertion of et in the daily lives of teams, such as:

• include a description and examples in the attributes of each artifact of the et session, to facilitate the understanding of the artifact or to exemplify the description of a test point.
• set the test points in advance.
• a requirements document or a use case model could serve as a base requirements artifact for the planning of et sessions.
• providing a brief description of the functionality to be tested in the et session could facilitate the planning and execution of the test.
• capturing screenshots or screen recordings that contain the functionality defects identified in the et session contributes to the record of the test performed.
• after running the test, it would be interesting to record the test execution step by step, the scenario tested, and any impediments or difficulties encountered by the tester.

5 discussion of results

in this section, we discuss the results obtained and presented in section 4, in order to expand the considerations about the et course applied with agile professionals (section 5.1) and about the monitoring of the incorporation of et in the daily work of these professionals (section 5.2).

5.1 overview on using pbl and jitt to promote active learning and integration among geographically distributed participants

during an et course in a remote learning format, the pbl and jitt approaches proved to be useful in stimulating hands-on learning in this context. according to the agreement of information and reports from the participants, some characteristics of pbl and jitt stood out, such as:

1. the use of a real software development scenario contributed to the practice of the et concepts covered in the course. the real scenario practiced encouraged the participants to further investigate possible failures of the analyzed system based on the heuristics and the planned et scenarios. this strategy allowed the quick identification of some bugs implemented in the system's functionalities: in total, there were 6 bugs in three different missions (test scenarios) tested in practical activity 2, and 4 bugs in three distinct missions tested in practical activity 3. we highlight that each mission was executed in a 30-minute session. the low level of complexity of the adopted web system also contributed to the understanding of its operation, since, in the first practical activities, more detailed requirements or business artifacts were available.
2. autonomous learning was stimulated through practical activities, simulating the participants' daily situations through the exploration of the web system; the advance study or reading of class materials and resources; the discussion of content and activities during classes; the elaboration of questions about the understanding of practical activities; and the construction of the generated et artifacts.
3. collaborative work stimulated different learning styles among the participants, such as distributing the roles of each member, setting goals, understanding objectives, and providing communication.
it also contributed to the expansion of discussions in the team, with the different points of view of the participants, and to the delivery of the activities on time (although, in some cases, additional time to complete the activity was necessary) and with quality (meeting the requirements of the activity). the use of online conversation and collaboration tools contributed to the team's interaction, narrowing the physical distance.
4. stimulation of additional skills, such as reading materials, using logical reasoning to understand the features of the web system during practical activities, discussions between teams, and exploring the system, among others. teamwork was also an encouraged practice, although the participants already act in this way in their daily work.
5. motivation. clarifications about technical terms, expressions, or et artifacts were useful to keep the participants motivated in carrying out the practical activities, as were the availability of et artifact templates and the socialization of the generated artifacts at the end of each practical activity.
6. feedback provided before, during, and after classes about the content, materials, resources, and methodology used contributed to the organization and practice of the instructor in the following class; helped the instructor focus on the main difficulties that were expressed by the participants; and maximized the effectiveness and time of the class.
7. the examples shown, as well as the way they were presented, contributed to improving the understanding of et, as all the examples referred to contexts of real systems. exemplifying the theory in this way helped the participants in the understanding and applicability of et, especially during the performance of the practical activities.

it is important to highlight that the likert scale helped to identify both the benefits and the limitations of the pbl and jitt approaches, through assertions that represent their main characteristics. however, the answers provided by the participants pointed to these characteristics more as benefits than as limitations to the use of these approaches in remote learning. although guidance and some clarifications were provided during the practical activities, some participants agreed that there was difficulty in collaborating in practical activities because they were unable to understand them well. perhaps this is explained by the absence of face-to-face contact to facilitate communication. in summary, it is also important to highlight that: (i) although the participants were geographically dispersed during the course, we did not address any specific dsd process.
the purpose of the course was to use strategies and tools that would make et viable in the context of isolated and remote work; (ii) the use of a specific tool (xray exploratory app) for planning, executing, and reporting bugs (with video recording, capturing and annotating screenshots, and annotations, among other aspects) in et sessions offers benefits not found in other tools that aid the execution of et; (iii) the experience of remote learning with geographically distributed participants is challenging, and factors such as stimulating participation, collaboration, and attention skills need to be considered for learning to actually happen; (iv) we noticed: the interest and engagement of the participants when practicing the theoretical content through a real problem adopted in the activities and discussions during the classes, mainly due to the knowledge pre-constructed through prior access to the classroom materials; the quality of the answers in the exercises, as a good part of the test artifacts generated were in accordance with the criteria suggested in the description of the activities; a motivation to use the xray exploratory app tool, due to the ease of creating, executing, and exploring the et performed; among other aspects already discussed in this section.
5.2 overview of insights in practice
the et course facilitated the participants' understanding of et concepts and practices. in order to facilitate the incorporation of this test practice in the daily lives of the teams, the participants were motivated to apply et sessions in their work context. then, a brainstorming session was conducted to evaluate the execution of the et. next, we discuss the information obtained from the brainstorming, based on the questions listed in table 3. some factors were perceived that favored the incorporation of et in the daily lives of the teams, such as:
1. the planning of test points and the definition of the degree of importance of each test point facilitated the understanding of what is a test priority for the moment.
2. documenting what was tested also contributed to the conduct of the alignment meeting with the team, considering that all the information inherent to the test execution, such as bugs, suggestions for improvements, or pending issues, was recorded.
on the other hand, some limitations were also noticed in the implementation of et sessions:
1. the artifacts adopted in the et sessions (charters, test points, and test reports) proved to be complex (i.e., difficult to understand) or to have underused fields for the test record. in this case, there is a need for guidelines or examples to clarify the effective use of each artifact.
2. there was a need to adapt the test planning and recording artifacts to better fit the agile context in which the et was conducted.
3. the guidance on minimum and maximum session duration: sometimes, the minimum time (30 minutes, indicated by sbtm) for the session was not appropriate because the functionality tested was very simple, and a session that could have been completed in less time had to be extended to reach the minimum. new guidance for situations like this needs to be considered.
4. the absence of a well-defined process or approach that is compatible with the real work context of agile teams. what is proposed in the literature, such as sbtm, is not always applicable in its entirety in the real context of agile professionals.
5. the experience of the professional who defines the test points: when this activity is performed by professionals who are unaware of the application to be tested, there is a risk of specifying a test point in an inconsistent or incomplete way, generating gaps in the et planning and difficulties in its subsequent execution.
to actually incorporate et as a testing practice in agile teams, it is still necessary to define an approach that fits the context of these professionals, considering practical application guidelines, more specific tools that address the particularities of et planning and execution, and simple, clear, and effective artifacts to be adopted. for this reason, there is a need for an approach that fits the needs of professionals working in agile development and that goes beyond the concept presented in the literature on et.
6 limitations and threats to validity
some potential threats to the validity of this study were perceived, such as threats to internal, external, construct, and conclusion validity. for this reason, some measures were taken to minimize them. to mitigate threats to construct validity, the course material and evaluation questionnaire were iteratively planned, updated, and validated by the authors, and were elaborated based on works related to the et area in the context of agile st (bach, 2003; hendrickson, 2013; suranto, 2015; whittaker, 2009; castro, 2018). to mitigate threats to internal validity and ensure the anonymity of responses, participant identification was optional, via email address. this allowed the data analysis to be performed in an impersonal way. other aspects inherent to the selection of individuals and the conduct of the experiment also contributed and are detailed in sections 3.1.2 and 3.1.3. threats to external validity were attenuated by the availability of resources and teaching materials to facilitate the application of the active methodologies mentioned. thus, the results can be valid for other course participants, either in a remote or a face-to-face teaching format. to mitigate threats to conclusion validity, only percentages were used to identify common patterns. complementarily, the questionnaire validation answers were discarded because of possible errors, such as answer format and the textual expressions used in the questions, among others. we tried to reduce bias by using likert scale data. thus, all the conclusions we draw in this study are strictly traceable to the data.
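to illustrate the percentage-based pattern identification mentioned above, the sketch below tallies likert answers for a single assertion; the response values are illustrative assumptions, not the study's data.

# minimal sketch of percentage-based likert tallying, assuming hypothetical
# answers to one assertion; the values below are not the study's data.
from collections import Counter

SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

answers = ["agree", "agree", "strongly agree", "neutral", "agree", "disagree"]

counts = Counter(answers)
for level in SCALE:
    pct = 100.0 * counts.get(level, 0) / len(answers)
    print(f"{level}: {pct:.1f}%")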
7 related works
although the literature shows interest in st teaching and seeks strategies to bring practical teaching closer to the real context of the software industry through approaches based on active learning (cheiran et al., 2017; de andrade et al., 2019; martinez, 2018; figuerêdo et al., 2011; see section 7.1), there are still few studies that present results on the teaching of et (costa et al., 2019; ferreira costa and oliveira, 2020; see section 7.2). some other studies have been dedicated to the investigation of et in the context of industry, in order to understand the impact or effectiveness of this testing practice in real projects (gebizli and sozer, 2017; mårtensson et al., 2019; pfahl et al., 2014; afzal et al., 2015; see section 7.3). a brief discussion of these works in relation to this study is presented in section 7.4.
7.1 teaching-learning of st using active methodology
cheiran et al. (2017) present an account of two experiences on the teaching of st using pbl in an undergraduate course of se at the federal university of pampa (unipampa). in total, 51 students participated (25 in the 1st edition and 26 in the 2nd edition of the course). data collection took place through questionnaires. to analyze the collected data, statistical and content analyses were adopted. the results point to evidence of students' maturity in the context of the curricular component and of the benefits and problems faced by integrating pbl and gamification elements.
de andrade et al. (2019) also conducted a study on st learning using pbl practices, with students from computer science at the university of são paulo (usp) and from information systems at the federal university of juiz de fora (ufjf). the results show that (i) classes with many students should have fewer presentations; (ii) courses with an average number of students can choose to keep weekly presentations more dynamic or hold fewer presentations; (iii) the pbl approach is not as effective for students who have less time for extra classes. in summary, it was noticed that the successful adoption of an active approach is not directly linked to infrastructural aspects.
figuerêdo et al. (2011) apply pbl to train test engineers. for this, an empirical study was carried out with two groups, each composed of five undergraduate students. each group had to test a case tool to support functional testing using et. two evaluations were made with the participants, one before and one after the execution. participants' knowledge, grades, and the number of bugs identified were evaluated. the results highlight that pbl promotes the engagement of participants and provides experience in scenarios that simulate real st situations.
martinez (2018) describes the results of an experience with jitt-based teaching in a graduate course in st over two semesters. the approach adopted was evaluated from the perspective of students, through a survey, and of teachers, through an assessment of strengths and limitations. the results show that a large majority of students (1) believe that their learning improved when they prepared for class by reading the material in advance and (2) consider jitt an adequate teaching strategy for the course. teachers highlighted that students became more involved and participatory in discussions during class.
7.2 et teaching
costa et al. (2019) use gamification as a motivating strategy in teaching and learning et. this dynamic consisted of a practical activity to apply et in the form of a game based on a "treasure hunt". an experience was carried out with students from an se discipline of an undergraduate course in computer science. the results indicate that the qualitative results converged with the quantitative results obtained, showing that gamification helped in the students' teaching and learning process.
in another work, ferreira costa and oliveira (2020) replicate the gamification strategy for teaching et discussed in costa et al. (2019), with a group of undergraduate students in computer science and students in a computer technician course. as a result, students achieved good overall performance. some reports highlight that gamification facilitated and significantly contributed to better performance, converging with the quantitative data obtained. this is evidenced mainly by the fact that both "runs" of the experience (classes) reached a percentage of achievement higher than 70%.
7.3 et in the context of industry
gebizli and sozer (2017) evaluate the impact of the education and experience level of testers on the effectiveness of et. for this, a case study was carried out with 19 industry professionals with different educational backgrounds and levels of experience. a digital tv system was tested, and the detected failures were categorized according to their severity. thus, the effectiveness of et was evaluated on two aspects: the criticality of the detected faults and the efficiency in terms of the number of faults detected per unit of time. the results show that et efficiency is significantly affected by training and educational experience.
mårtensson et al. (2019) conducted a study based on interviews to understand the success factors in the application of et in industry projects. for this, interviews were conducted with 20 professionals. finally, a list of key factors that enable the efficiency and effectiveness of et in large-scale software systems was presented. the nine factors identified are grouped into four themes: (i) testers' knowledge, experience, and personality; (ii) purpose and scope; (iii) ways of working; and (iv) registration and reporting.
pfahl et al. (2014) investigated how software engineers understand and apply the principles of exploratory testing, as well as the advantages and difficulties they experience. for this, an online survey was carried out among estonian and finnish software developers and testers. the main results indicate that the majority of testers, developers, and test managers who use et (1) apply et to a high degree to software that is critical in terms of usability, performance, and security; (2) use et very flexibly at all kinds of levels, activities, and phases; (3) perceive et as an approach that supports creativity during testing and is effective and efficient; and (4) feel that et is not easy to use and has few support tools. in addition, there was a need for more support for et users, such as guidelines and tools.
afzal et al. (2015) sought to quantify the effectiveness and efficiency of et vs. tests with documented test cases (tct). for this, four controlled experiments were carried out with a total of 24 professionals and 46 students. manual functional tests using et and tct were performed. the number of defects identified in the 90-minute test sessions, the difficulty of detection, the severity and types of defects detected, and the number of false defect reports were measured. the results show that et found a significantly higher number of defects. however, the two testing approaches did not differ significantly in terms of the number of false defect reports.
7.4 discussion of related works
we could not find works that apply jitt and pbl to et. in general, the application of jitt or pbl in st, as reported in the literature (cheiran et al., 2017; figuerêdo et al., 2011; martinez, 2018), achieved results that converge with ours in the sense that the adoption of these methodologies has provided positive gains related to motivation, engagement, collaboration, and content learning. we also emphasize that most of the works were developed in academic environments (with undergraduate students), and others in practical environments (with industry professionals). generally, the types of tests investigated are different, sometimes a more specific type of test and sometimes a more general context, such as defect detection only.
however, it is not always possible to identify in which development process the work was applied or which development methodology was adopted. in this way, this paper differs from the others in that it identifies and discusses the contributions of integrating the active methodologies pbl and jitt in teaching et in a remote learning course with geographically distributed agile professionals from the software industry. some strategies and guidelines seeking to optimize teaching-learning with pbl and jitt, as well as a discussion of some perceived challenges, were also highlighted. another differential is the monitoring of et execution in the daily agile development of industry professionals, in order to highlight the aspects that favor or limit the incorporation of et in the daily routine of agile teams.
8 conclusions
this work investigates the use of the pbl and jitt methodologies to teach et to a dsd team. based on a literature review and an evaluation of the resources available, we planned and performed a training course on et and analyzed the results obtained. teaching st has always been challenging; under the circumstances imposed by social distancing, where each team member works remotely and in isolation, teaching such a subject becomes even more challenging. next, we followed the incorporation of et into the daily lives of the teams that participated in the course and analyzed the application of this practice in the context of agile development. through brainstorming, ideas were raised about the characteristics that favored or limited the execution of et.
the use of these methodologies significantly contributed to the success of the course. they provided the grounds for adopting a real problem, assessing the students' needs with resources available before the class, adjusting the course to meet students' expectations and needs, and promoting collaboration. additionally, the existence of a support tool for et was key to optimizing remote learning. these aspects also favored the application of et by agile teams in their projects, so that both the practices and the artifacts were put to good use in the test execution. however, even with the support given in the course, some limitations were perceived, such as the absence of more specific support for the planning and execution of et, such as guidelines and tools.
as future work, we intend to (1) propose an approach that facilitates the implementation of et, considering the dsd scenario and the generation of simple and robust et artifacts, for the effective insertion of this test practice in the daily life of agile teams. we also intend to (2) validate the approach through experiments with professionals from the software development industry.
acknowledgements
the authors would like to thank the anonymous reviewers for their valuable comments. the second and third authors were supported by cnpq/brazil (processes cnpq 303773/2021-9 and 311215/2020-3).
references
afzal, w., ghazi, a. n., itkonen, j., torkar, r., andrews, a., and bhatti, k. (2015). an experiment on the effectiveness and efficiency of exploratory testing. empirical software engineering, 20:844–878.
alliance, a. (2016). agile glossary. url: https://www.agilealliance.org/agile101/agile-glossary/, accessed on august 13, 2020.
andrews, d. h., hull, t. d., and donahue, j. a. (2009). storytelling as an instructional method: descriptions and research questions. technical report, oak ridge inst for science and education tn.
aniche, m., hermans, f., and van deursen, a. (2019). pragmatic software testing education. in proceedings of the 50th acm technical symposium on computer science education, sigcse '19, page 414–420, new york, ny, usa. acm.
bach, j. (2003). exploratory testing explained. online: http://www.satisfice.com/articles/et-article.pdf.
bonnardel, n. and didier, j. (2020). brainstorming variants to favor creative design. applied ergonomics, 83:102987.
bonwell, c. c. and eison, j. a. (1991). active learning: creating excitement in the classroom. 1991 ashe-eric higher education reports. eric, sl.
brown, t. and katz, b. (2011). change by design. journal of product innovation management, 28(3):381–383.
castro, a. k. s. d. (2018). testes exploratórios: características, problemas e soluções. b.s. thesis, universidade federal do rio grande do norte.
cheiran, j. f. p., de m. rodrigues, e., de s. carvalho, e. l., and da silva, j. a. p. s. (2017). problem-based learning to align theory and practice in software testing teaching. in proceedings of the 31st brazilian symposium on software engineering, sbes'17, page 328–337, new york, ny, usa. acm.
corbin, j. and strauss, a. (2014). basics of qualitative research: techniques and procedures for developing grounded theory. sage publications.
costa, i., oliveira, s., cardoso, l., ramos, a., and sousa, r. (2019). uma gamificação para ensino e aprendizagem de teste exploratório de software: aplicação em um estudo experimental. xviii simpósio brasileiro de jogos e entretenimento digital (education track–short papers), 2019(1):1232–1235.
coutinho, e. f. and bezerra, c. i. m. (2018). uma avaliação inicial do jogo para o ensino de testes de software itestleaening sob a ótica de um software educativo. in congresso sobre tecnologias na educação, volume 3, pages 11–22, fortaleza, ce. sbc open library.
coutinho, j., andrade, w., and machado, p. (2021). teaching exploratory tests through pbl and jitt: an experience report in a context of distributed teams. in proceedings of the xxxv brazilian symposium on software engineering, sbes '21, page 205–214. association for computing machinery.
crispin, l. and gregory, j. (2009). agile testing: a practical guide for testers and agile teams. pearson education, sl.
crouch, c. h. and mazur, e. (2001). peer instruction: ten years of experience and results. american journal of physics, 69(9):970–977.
de andrade, s. a. a., de oliveira neves, v., and delamaro, m. e. (2019). software testing education: dreams and challenges when bringing academia and industry closer together. in proceedings of the xxxiii brazilian symposium on software engineering, sbes 2019, page 47–56, new york, ny, usa. acm.
ferreira costa, i. e. and oliveira, s. r. b. (2020). the use of gamification to support the teaching-learning of software exploratory testing: an experience report based on the application of a framework. in 2020 ieee frontiers in education conference (fie), pages 1–9, uppsala, sweden. ieee.
figuerêdo, c. d. o., dos santos, s. c., borba, p., and alexandre, g. (2011). using pbl to develop software test engineers. in international conference on computers and advanced technology in education, pages 305–322, cambridge, united kingdom. sn.
garousi, v., felderer, m., kuhrmann, m., and herkiloğlu, k. (2017). what industry wants from academia in software testing? hearing practitioners' opinions.
in proceedings of the 21st international conference on evaluation and assessment in software engineering, ease'17, page 65–69, new york, ny, usa. acm.
garousi, v., rainer, a., lauvås, p., and arcuri, a. (2020). software-testing education: a systematic literature mapping. journal of systems and software, 165:110570.
gebizli, c. s. and sozer, h. (2017). impact of education and experience level on the effectiveness of exploratory testing: an industrial case study. in 2017 ieee international conference on software testing, verification and validation workshops (icstw), pages 23–28.
ghazi, a. n. (2017). structuring exploratory testing through test charter design and decision support. phd thesis, blekinge tekniska högskola.
hendrickson, e. (2013). explore it!: reduce risk and increase confidence with exploratory testing. pragmatic bookshelf, sl.
leite, f. t., coutinho, j. c. s., and de sousa, r. r. (2020). an experience report about challenges of software engineering as a second cycle course. in proceedings of the 34th brazilian symposium on software engineering, sbes '20, page 824–833, new york, ny, usa. acm.
mårtensson, t., martini, a., ståhl, d., and bosch, j. (2019). excellence in exploratory testing: success factors in large-scale industry projects. in product-focused software process improvement, pages 299–314. springer international publishing.
martinez, a. (2018). use of jitt in a graduate software testing course: an experience report. in 2018 ieee/acm 40th international conference on software engineering: software engineering education and training (icse-seet), pages 108–115, gothenburg, sweden. ieee.
mcconnell, j. j. (1996). active learning and its use in computer science. in proceedings of the 1st conference on integrating technology into computer science education, pages 52–54, barcelona, spain. acm.
milne, a., riecke, b., and antle, a. (2014). exploring maker practice: common attitudes, habits and skills from vancouver's maker community. studies, 19(21):23.
novak, g. m. (2011). just-in-time teaching. new directions for teaching and learning, 2011(128):63–73.
paiva, m. r. f., parente, j. r. f., brandão, i. r., and queiroz, a. h. b. (2016). metodologias ativas de ensino-aprendizagem: revisão integrativa. sanare-revista de políticas públicas, 15(2):145–153.
paschoal, l. n. and de souza, s. d. r. s. (2018). a survey on software testing education in brazil. in proceedings of the 17th brazilian symposium on software quality, sbqs, page 334–343, new york, ny, usa. acm.
paschoal, l. n., silva, l., and souza, s. (2017). abordagem flipped classroom em comparação com o modelo tradicional de ensino: uma investigação empírica no âmbito de teste de software. in brazilian symposium on computers in education (simpósio brasileiro de informática na educação-sbie), page 476, recife, pe. sbc open library.
paschoal, l. n. and souza, s. r. (2018). planejamento e aplicação de flipped classroom para o ensino de teste de software. renote, 16(2):606–614.
pfahl, d., yin, h., mäntylä, m. v., and münch, j. (2014). how is exploratory testing used? a state-of-the-practice survey. in proceedings of the 8th acm/ieee international symposium on empirical software engineering and measurement, esem '14, new york, ny, usa. association for computing machinery.
queiroz, r., pinto, f., and silva, p. (2019). islandtest: jogo educativo para apoiar o processo ensino-aprendizagem de testes de software. in anais do xxvii workshop sobre educação em computação, pages 533–542, belém, pa. sbc open library.
raappana, p., saukkoriipi, s., tervonen, i., and mäntylä, m. v. (2016). the effect of team exploratory testing – experience report from f-secure. in 2016 ieee ninth international conference on software testing, verification and validation workshops (icstw), pages 295–304, chicago, il, usa. ieee.
scatalon, l. p., carver, j. c., garcia, r. e., and barbosa, e. f. (2019). software testing in introductory programming courses: a systematic mapping study. in proceedings of the 50th acm technical symposium on computer science education, sigcse '19, page 421–427, new york, ny, usa. acm.
suranto, b. (2015). exploratory software testing in agile project. in 2015 international conference on computer, communications, and control technology (i4ct), pages 280–283, kuching, malaysia. ieee.
whittaker, j. a. (2009). exploratory software testing: tips, tricks, tours, and techniques to guide test design. pearson education, sl.
wohlin, c., runeson, p., host, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer science & business media, sl.
journal of software engineering research and development, 2019, 7:4, doi: 10.5753/jserd.2019.15 this work is licensed under a creative commons attribution 4.0 international license.
on challenges in engineering iot software systems
rebeca campos motta [ universidade federal do rio de janeiro and lamih cnrs umr 8201 | rmotta@cos.ufrj.br ]
káthia marçal de oliveira [ université polytechnique hauts-de-france lamih cnrs umr 8201 | kathia.oliveira@uphf.fr ]
guilherme horta travassos [ universidade federal do rio de janeiro | ght@cos.ufrj.br ]
abstract
contemporary software systems, such as the internet of things (iot), industry 4.0, and smart cities, represent a technological change that offers challenges for their construction, since they call into question our traditional way of developing software. they are a promising paradigm for the integration of devices and communication technologies. they are leading to a shift from the classical monolithic view of development, where stakeholders used to receive a software product at the end (as we have been doing for decades), to software systems incrementally materialized through physical objects interconnected by networks and with embedded software to support daily activities. therefore, we need to revisit the traditional way of developing software and start to consider the particularities required by these new sorts of applications. since such software systems involve different concerns, this paper presents the results of an investigation towards defining a framework to support the software systems engineering of iot applications.
to support its representation, we evolved zachman's framework as an alternative for organizing the framework architecture. the filling of such a framework is supported by a) 14 significant concerns of iot applications, recovered from the technical literature, practitioners' workshops, and a government report; and b) seven structured facets that emerged from iot data analysis, which together represent the engineering challenges to be faced by both researchers and practitioners towards the advancement of iot in practice.
keywords: internet of things, iot, contemporary software systems engineering, empirical software engineering
1 introduction
the internet of things (iot) contributes to a new technological revolution affecting society. iot is a paradigm that allows composing systems from uniquely addressable objects (things) equipped with identifying, sensing, or acting behaviors and processing capabilities that can communicate and cooperate to reach a goal. from basic devices with simple software solutions to large-scale, high-performance software systems producing and analyzing massive amounts of data, iot is going to reach all areas of interest (jacobson et al. 2017). due to its far-reaching potential, iot can use all kinds of technologies available today and will drive the development of new software systems to solve new problems, some still unknown (atzori et al. 2010; jacobson et al. 2017).
software engineering, as a discipline, has gone through constant changes since its conception. several concepts, methods, tools, and standards have been proposed to support the development of software (ieee 2004; trappey et al. 2017), and the internet has brought a significant shift in the area. it makes more explicit the need to evolve the software technologies previously proposed, to support the building of systems with new features. systems engineering is a research area embracing multidisciplinarity, integrating different disciplines to reach successful systems according to their purposes, including software, which is essential for iot materialization. therefore, iot leads to an era where, rather than developing software, practitioners are going to engineer systems embedding much software into the systems' parts.
in this scenario, the initial problem of our research is to identify the concerns regarding the development of iot software systems, and whether the existing software technologies within the areas (facets) related to engineering such systems are enough to support their development. overall, this paper describes the results of investigations dealing with the road ahead for iot development. this road is paved by concerns captured through observations of the technical literature, from practitioners in specific workshops, and from a national initiative regarding iot in brazil. the filling of iot facets combined with the concerns is what we call engineering challenges, capturing the knowledge necessary to support a specific activity. the conceptual framework aims to contemplate all facets involved in iot and present the recovered concerns, simplifying and organizing their presentation. zachman's proposition for information systems architecture (zachman 1987) (subsection 2.2) was borrowed and tailored to compose such a framework.
our motivation to investigate and contribute to the iot paradigm is therefore supported by its relevance (cni 2016; lu 2017) and by the need for a holistic approach and a multidisciplinary view for the development of new software solutions (bauer and dey 2016; aniculaesei et al. 2018). this is reflected in a demand for technical competencies and skills held by different practitioners to engineer such software systems (desolda et al. 2017; de farias et al. 2017) and in the lack of specific software engineering methodologies to support iot (zambonelli 2016; larrucea et al. 2017; jacobson et al. 2017). some of the challenges are focused on interaction issues, whether between humans or things, which is essential for the complete establishment of the paradigm (motta et al.). in our proposal, we introduce a multifaceted conceptual framework as a step towards addressing some of these issues. this paper extends (motta et al. 2018), including more details of the studies, deepening the discussions, and providing a usage example of the proposed framework. the paper is organized as follows: section 2 presents the context of this research and zachman's framework. section 3 presents the research strategy, followed by the primary results: the iot concerns (section 4), the iot facets (section 5), and the definition of the framework (section 6) with an example of use. section 7 concludes the paper with final remarks regarding threats to validity and ongoing work.
2 conceptual background
this section starts by presenting the source of motivation for this research, the cactus project. next, it presents some basic concepts related to zachman's framework, which is the ground for the proposed conceptual framework organizing the results presented in this paper.
2.1 the cactus project
the cnpq cactus research project was carried out with the aim of understanding test strategies for the quality assessment of actor-computer interaction in context-aware systems, one of the chief characteristics of ubiquitous systems (spínola and travassos 2012; santos et al. 2017; matalonga et al. 2017). research teams from two brazilian universities (federal university of rio de janeiro and federal university of ceará) and one french university (université polytechnique hauts-de-france) worked together in the project. it started with the assumption that, in ubiquitous systems, interaction is not limited to humans and computers. it encompasses the interaction among different devices, such as sensors and actuators, as well as other systems, which is why we consider the term actor-computer interaction more adequate. from the results achieved in the project and the technological evolution of the area, we view iot as related to ubiquitous systems, sharing some characteristics and challenges (andrade et al. 2017). this work is one of the results of this change of perspective.
2.2 zachman's framework
zachman's framework (zachman 1987) was introduced in 1987 to comprehend the scope of control within an enterprise and to provide a holistic view of the enterprise architecture that may be used as a base for its management. it is still an essential reference for enterprise architecture and is supported by many types of modeling tools and languages (goethals et al. 2006).
zachman's motivation to develop the framework was that "with increasing size and complexity of the implementation of information systems, it is necessary to use some logical construct for defining and controlling the interfaces and the integration of all of the components of the system" (zachman 1987). the framework is suitable for working with complex systems, and despite its original purpose, its use is not limited to enterprise architecture. along with that, it has been used to assess the development process (de villiers 2001), for requirements engineering (de villiers 2001; technology 2015), for business process modeling (sousa et al. 2007), to instantiate an iec standard (panetto et al. 2007), and applied to systems of systems (bondar et al. 2017). also, zhang et al. used this framework for safety analysis in avionics systems (zhang et al. 2014). more evidence of framework use can be observed in different case studies (panetto et al. 2007; nogueira et al. 2013; aginsa et al. 2016), the latter claiming that "zachman's framework continues to represent a modeling tool of great utility and value since it can integrate and align the it infrastructure and business goals."
from the data recovered in our research, we realize that concepts and properties related to iot change according to the context and actors involved. this multifaceted view of iot shows once again that it is a multidisciplinary paradigm. for this reason, a representation of the concepts should be as comprehensive as possible to represent all aspects involved. the framework is primarily defined as a table crossing perspectives and interrogative questions, as presented in table 1 (zachman 1987; sowa and zachman 1992).
table 1. zachman framework with cells filled showing examples of description (sowa and zachman 1992).
perspective | what | how | where | when | who | why
planner | things important to the business | process performed | business locations of operations | events and cycles important to the business | organizations and agents important to the business | business goals and strategies
owner | semantic model | business process model | business logistic system | master schedule | workflow model | business plan
designer | logic data model | application architecture | distributed system architecture | process structure | human interface architecture | knowledge architecture
builder | physical data model | system design | technology architecture | control structure | presentation architecture | knowledge design
implementer | data definition | program | network architecture | timing definition | security architecture | knowledge definition
user | data | function | network | schedule | organization | strategy
the framework formalization and its conception were presented as a metaphor from building architecture to system architecture. the perspectives are therefore described as (sowa and zachman 1992):
planner – it corresponds to an executive summary for a planner or investor who wants a system scope estimate, what it would cost, and how it would perform.
owner – it relates to the enterprise business model, which constitutes the business design and shows the business entities and processes, and how they interact.
designer – it corresponds to the system model designed by a systems analyst who must determine the data elements and functions representing business entities and processes.
builder – it refers to the technology model, which must tailor the information system model to the details of the programming languages, i/o devices, or other technologies.
implementer – it relates to the detailed specifications that are given to programmers who code individual modules without being concerned with the overall context or system structure.
user – the user perspective was added in a later version and represents the view of the functioning building, or system, in its operational environment.
the framework presents six fundamental questions in the columns to outline each perspective and support answering the questions regarding:
what: some entity (which can be real-world objects, logical or physical data types).
how: some process.
where: some location.
who: some role played by a person or a computational agent.
when: time, a subtype such as a date, or a time that is coincident with some event.
why: some goal or subgoal that provides the reason that motivates the model for that row.
considering the extensive use of the zachman framework for representing different domains and technologies and its flexibility to be customized to represent the complexity of each context, we decided to take it as the basis of our work. to that end, we analyzed concerns and facets related to the multidisciplinarity of iot applications to be used as inputs of information and requirements for its first organization.
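to make the structure concrete, the sketch below encodes the framework as a matrix of perspectives and questions; the cell texts are taken from table 1 (planner row), while the dictionary encoding itself is only our illustration, not zachman's or the authors' tooling.

# minimal sketch of the framework's structure: a matrix crossing perspectives
# (rows) and interrogative questions (columns); filling a cell answers one
# question from one perspective. the planner row below is taken from table 1;
# this dictionary encoding is only an illustration, not the authors' tooling.
PERSPECTIVES = ["planner", "owner", "designer", "builder", "implementer", "user"]
QUESTIONS = ["what", "how", "where", "when", "who", "why"]

framework = {p: {q: None for q in QUESTIONS} for p in PERSPECTIVES}

framework["planner"].update({
    "what": "things important to the business",
    "how": "process performed",
    "where": "business locations of operations",
    "when": "events and cycles important to the business",
    "who": "organizations and agents important to the business",
    "why": "business goals and strategies",
})

print(framework["planner"]["why"])  # -> business goals and strategies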
2.3 related work
in this work, we propose a holistic engineering view based on the principles of systems engineering. in our search, we came across the work of patel and cassou, who propose a development methodology and framework to support the implementation of iot applications. their approach is designed to address essential challenges (lack of division of roles, heterogeneity, scale, different lifecycle phases) that differentiate iot applications from others (patel and cassou 2015). their methodology is based on the separation of concerns: domain, functional, deployment, and platform. each concern has specific steps to guide the development, implemented in a defined process. there are some similarities to our proposal. we highlight their strategy of attacking multidisciplinarity by using four concerns with a diverse set of skills performed by five different roles. however, our proposal differs from theirs because it offers a broader view of the concerns and focuses more on supporting the development team in moving out of the problem domain with an action plan stepping into the solution domain.
two other works, (alegre et al. 2016) and (sánchez guinea et al. 2016), are literature reviews focusing on engineering strategies to develop context-aware software systems (cass) and ubiquitous systems, respectively. in (alegre et al. 2016), the results are based on a literature review and a survey carried out with specialists in cass. it presents extensive work in the cass area, analyzing and characterizing the concept of context as well as its interaction types and main features. the most interesting part from the perspective of our work is that they search the literature for development techniques and methods that have been adapted from conventional systems to cass throughout the most common stages of a development process: requirements elicitation, analysis & design, implementation, and deployment & maintenance. none of the techniques presented fully meets the cass requirements, and the authors conclude the work by recommending a more holistic and unified approach for the development of cass, arguing that it should be different from the conventional software engineering approach for creating these systems (alegre et al. 2016).
another work is from costa et al. (costa et al. 2017). it presents more than just the requirements and needs of an iot application, focusing on its challenges and proposing an approach to support the requirements specification of iot systems named iot requirements modeling language (iot-rml). we share some of the motivations of this work, since it states that different perspectives and the heterogeneous nature of iot should be considered in the development of such software systems. their proposal comprises a domain model for the abstraction and a sysml profile for the specification. in their model, a stakeholder expresses a requirement as a proposition, and the requirement may influence or conflict with other requirements. their approach supports both functional and non-functional requirements, which is crucial in this scenario. through their solution, four requirements specification activities are supported: the elicitation of the system's requirements from the stakeholders, which generates an initial model in their tool; the analysis to identify influences and conflicts among requirements, updating the model representing them; the resolution of conflicts; and, as the last activity, the decision on a candidate solution containing the requirements to be addressed. a proof of concept is presented to illustrate the approach in the context of a smart building, focusing on employees' safety and energy efficiency. our proposal can somehow be related to the iot-rml approach (costa et al. 2017). however, we aim to address problem understanding in the conceptual phase, which focuses on a step before requirements specification, considering a multi-perspective and multidisciplinary strategy.
another related work is from aniculaesei et al., who argue that conventional engineering methods are not adequate to provide guarantees for some of the challenges specific to autonomous systems, such as dependability, the focus of their work (aniculaesei et al. 2018). some of the main points discussed are the possibility of adaptive behavior present in iot, as such systems adapt their behavior to better interact with other systems and people or to solve problems more effectively, and variations in the context: formerly closed and valid development artifacts may not capture the changes and may be inadequate, since the environment and the system behavior can no longer be fully predicted or described in advance (aniculaesei et al. 2018). in response to these challenges and gaps, the authors propose an approach based on the notion of dependability cages. their approach deals with external risks (uncertainties in the environment) and internal risks (system changing behavior), both at development and at operation time.
at the moment of preparing this manuscript, we observed a lack of more concrete proposals for the materialization of the iot paradigm. we aim to address the challenges presented in (alegre et al. 2016) and (sánchez guinea et al. 2016), filling the gaps from (patel and cassou 2015), (costa et al. 2017), and (aniculaesei et al.
2018), focusing on the issue of multidisciplinarity and providing support for decision-making in the initial development phase of problem understanding.
3 research strategy
figure 1 presents our investigation strategy. it is composed of three parts and involves performing different lines of investigation and studies. the first part of our investigation regards iot concerns. it aims at presenting concerns, issues, and difficulties frequently reported regarding the development of iot applications. to recover such concerns, we collected data from different sources of information, considering a literature review (subsection 4.1), discussions with practitioners (subsection 4.2), and the reading of a brazilian government report (subsection 4.3). based on the identified concerns, it was possible to observe research gaps and the main iot development issues that need effort in their understanding and evolution. these intermediate results can be useful to researchers looking for research opportunities and to practitioners planning the construction of iot applications.
the literature review also supported the identification of 29 iot definitions. from this set, we conducted a textual analysis, using coding procedures from grounded theory (see section 3.1), to assign concepts to portions of data (strauss and corbin 1990). the result was the identification of the iot facets (section 5) necessary for iot materialization, in the sense of being the set of parts composing an iot software system. we understand facets as "one side of something many-sided" (oxford dictionary), "one part of a subject, a situation that has many parts" (cambridge dictionary). these facets are the basis for tailoring the 6x6 matrix of zachman's framework (zachman 1987).
figure 1: research strategy.
the idea of investigating the facets from iot definitions came in the sense of finding a set of parts composing an iot software system. it does not try to be exhaustive because, due to its far-reaching potential, we do not know to what extent iot will meet or drive the development of new software technologies to solve new problems. we wanted to differentiate concerns from challenges. each application alone has a set of concerns that must be addressed with software technologies and other solutions for the development or construction of the software system. in the case of iot, we understand that it is a multidisciplinary, multidimensional, and multifaceted paradigm (gluhak et al. 2011; gubbi et al. 2013; jacobson et al. 2017). in this sense, this work presents the iot facets that must meet the concerns, this being the real engineering challenge (to fill a cell in the framework). the procedures and activities performed for each part are detailed, together with a broader discussion of the results. however, the concerns are somewhat related to the facets, and our next activity was to find a way to represent all the concepts that transparently emerged from the sources of information and that could guide the next research activities. thus, as the last step of our research and a result of the studies, we introduce a conceptual framework to organize the challenges of engineering iot software systems (section 6). our work focuses on discussing the iot perception and its central issues from three perspectives: technical literature, practitioners, and government, using data collected in different studies.
we briefly present the studies and dive into the analyzed challenges, from which we propose a conceptual framework to support the development of such software systems by considering different and complementary facets. this paper presents the research path that led us to the framework, not detailing each study, but instead informing how they inspired a structure composed of six questions, six perspectives, and seven facets, aiming to define an engineering strategy for iot development.
3.1 grounded theory (gt)
the gt methodology comes as a mechanism to deal with and understand research data and how they relate to each other, considering the iot domain and features. we rely on these procedures to analyze the data recovered from each study. the principles and procedures of gt according to (strauss; corbin, 1998) were used to assist us in developing and analyzing the concepts in this research, as presented below:
planning: initially, it identifies the area of interest and the process to be followed inside the gt paths. in our case, each study was planned individually, with the execution and analysis performed by the researchers.
data collection: initially, gt resorted to interviews. however, any method can be used, like focus groups, observations, artifacts, or texts. in our case, we rely on the data extracted from the articles resulting from the literature review, the data recovered from the discussions with practitioners, and all the textual documents from the brazilian government.
coding: at this step, the researcher should apply their sensibility to identify significant data and use the constant comparative analysis method: through iteration, going back and forth over the generated codes, observing and comparing to find adequacy, conformity, and coherence among the codes. in our case, the qda miner lite tool (provalisresearch.com/products/qualitative-data-analysis-software/) was used to support this part. all the matching from text to code was performed by one researcher and then revised by another. the procedure followed was to review each extraction and the respective code, contributing to the constant comparison until reaching an agreement on the coding.
reporting: writing memos, comments, and decision points during the coding phase can enhance the report. being able to narrate the process of abstraction and to describe the rationale behind the codes is the last challenge to sound analysis. in our case, this article comes as the report of the result, where portions of the extracted data lead to the coding that represents concerns for iot development.
this approach has been used in software engineering research (seaman 1999; carver 2007; badreddin 2013) and was selected since gt provides reference support for the procedures and is adequate to work with a large amount of information. considering that some concepts have different meanings, this methodology is suitable to establish the similarities and differences among them. the same analysis strategy was used throughout the study; a minimal sketch of the coding bookkeeping is shown below.
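the sketch below illustrates, under assumptions, the bookkeeping behind this coding step: excerpts are assigned codes and grouped by category so similar codes can be compared until the categories stabilize. the excerpts are shortened quotes from this paper; the data structures are our illustration, not qda miner lite's internals.

# minimal sketch of the open-coding bookkeeping described above, assuming a
# handful of shortened excerpts from this paper; the real analysis was done
# with qda miner lite, so this dictionary-based encoding is only illustrative.
from collections import defaultdict

# each excerpt extracted from a source is assigned one or more codes
excerpts = [
    ("finding a scalable, flexible, secure and cost-efficient architecture", ["architecture"]),
    ("make sense of data in any iot environment", ["data"]),
    ("plug n' play smart objects with an interoperable backbone", ["interoperability"]),
]

def tally(coded_excerpts):
    # group excerpts by code so similar codes can be compared and merged
    # (constant comparative analysis) until the categories stabilize
    categories = defaultdict(list)
    for text, codes in coded_excerpts:
        for code in codes:
            categories[code].append(text)
    return categories

for code, items in sorted(tally(excerpts).items()):
    print(f"{code}: {len(items)} excerpt(s)")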
4 iot concerns
being a multidisciplinary domain, iot covers many topics, from socio-technical to business. we conducted different studies to recover iot concerns. each study was planned considering a specific perspective on the subject. initially, we contemplate the academic perspective, recovered through a literature review. then we decided to broaden the range to represent two other perspectives, collected from practitioners and from a government report, contributing to a more comprehensive representation. although they represent different visions, they discuss the same topic. thus, they become complementary, giving us a more comprehensive view of the area.
4.1 inputs
4.1.1 from the literature review
the concerns presented in the technical literature were extracted from a literature review (our first empirical study). for the review, we followed recommendations well established in the literature (biolchini et al. 2008), focusing on secondary studies since there were already reviews on iot. the goal of this gqm-based (basili et al. 1994) review was defined as: analyze the internet of things domain with the purpose of characterization with respect to definitions, characteristics, and application areas from the point of view of software engineering researchers in the context of the available technical literature. the selected articles were secondary studies, as they rely on primary studies and survey other sources of information to present a bigger picture. table 2 presents a research protocol summary. the search engine was scopus, since it indexes several peer-reviewed databases and is well balanced regarding coverage and relevance. snowballing procedures can mitigate the lack of other search engines and complement the search strategy (motta et al. 2016; matalonga et al. 2017). to reduce bias, three researchers executed the review. the process was carried out between march and may 2017. the search in scopus returned eighty-one articles. after the execution of four trials, a selection by title and abstract according to the established criteria, and one level of backward/forward snowballing (wohlin 2014), 12 secondary studies composed the final set. the reviewers read the articles and extracted relevant information according to an extraction form, retrieving the following information from the secondary sources: reference information, abstract, iot definition, iot related terms, iot application features, iot application domain, development strategies for iot, study type, study properties, challenges, and article focus.
from the discussion of rq1, we extracted 34 iot definitions that led us to understand that iot is a paradigm allowing the composition of software systems from uniquely addressable objects equipped with identifying, sensing, or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal. regarding rq2, we recovered 29 different attributes, nine of which are discussed with clear evidence from the sources of information. considering that the results retrieved are from secondary studies, the characteristics represented reflect more than just the selected set, but rather the whole set of primary studies involved in them, which can strengthen these results. one contribution of the review is to present an organized perspective on the iot state of the art. besides, it allows observing which areas of application are making use of iot (rq3). all of these findings were related and summarized in a report to enrich the comprehension of the iot paradigm (see link in table 2). iot-related concepts such as cyber-physical systems (khaitan and mccalley 2015) and systems of systems (nielsen et al. 2015) are also discussed in the final report of the review.
the data for discussions and analysis came in part from what was extracted from the form, which we treat in this section as concerns. we based our analysis procedure on textual analysis, using codes to assign concepts to portions of data, identifying patterns from the similarities and differences emerging from the extracted data, based on the gt procedures (strauss and corbin 1990). it was conducted by two researchers, with cross-checking to achieve a consensus in the analysis and decrease potential misinterpretation and bias. the 12 papers provided 38 excerpts regarding iot challenges, which were organized into seven categories: architecture, data, interoperability, management, network, security, and social.
4.1.2 from practitioners
another perspective used to recover iot concerns was the practitioners' opinion. we performed qualitative studies during two scientific events in which all the participants were developers and/or researchers in the iot domain. for this reason, we considered them representative, insightful, and experienced in the topic. the following questions guided the discussions in both studies: a) regarding product quality between conventional software and iot: what is similar? what is different? what needs to be investigated? b) regarding the software technologies between conventional software and iot: what do we have that can be used directly? what do we have that needs adaptation to be used? what don't we have but need?
the first event (in august 2017) was the 1st qualityiot workshop at the brazilian symposium on software quality (sbqs). the 21 participants were divided by convenience into groups to deal with the mentioned questions from the following perspectives:
people – focused on the human end-user: the challenges and impact of this technology on our daily lives, such as social, legal, and ethical issues. group of five (5) participants.
product – focused on the iot products that can be generated, considering the inclusion of software and "smartness" in general objects and the possibilities of new products in this scenario. group of nine (9) participants.
process – focused on the software development process for the things, considering the big picture of organizing the things together. group of seven (7) participants.
the groups had one hour for discussion. a representative of each group wrote down the main points identified and later presented the ideas to all the workshop participants. the second event (carried out in september 2017) was a panel at the brazilian congress on software: theory and practice (cbsoft), conducted by the same moderator as the first event. in this panel, five (5) iot domain practitioners (experts from academia and industry) and the audience were motivated to discuss the same study questions. the moderator acted as the reporter of the panel discussion, gathering the central issues and producing a document reporting the notes. next, the notes from both events were collected and analyzed. besides, open coding procedures based on gt (strauss and corbin 1990) were used and allowed the identification of nine categories of iot concerns: architecture, interoperability, professionals, quality properties, requirements, scale, social, security, and testing.
table 2. protocol summary.
research questions: (rq1) what is internet of things? (rq2) which characteristics define iot applications? (rq3) which are the applications for iot?
search string: population ("*systematic literature review" or "systematic* review*" or "mapping study" or "systematic mapping" or "structured review" or "secondary study" or "literature survey" or "survey of technologies" or "driver technologies" or "review of survey*" or "technolog* review*" or "state of research") and intervention ("internet of things" or "iot")
search strategy: scopus (www.scopus.com) + snowballing (backward and forward)
inclusion criteria: provides an iot definition; or provides iot properties; or provides applications for iot.
exclusion criteria: does not provide an iot definition; and does not provide iot properties; and does not provide applications for iot; and studies in duplicity; and registers of proceedings.
study type: secondary studies
acceptance criteria: three distinct readers: all readers accept => paper is accepted; all readers exclude => paper is excluded; the majority accept, others in doubt => paper is accepted; else => discuss and reach consensus
technical report: detailed information about the planning and execution: https://goo.gl/czvvdc
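as an aside, the search string in table 2 is a conjunction of two or-groups (population and intervention); the sketch below assembles it programmatically. the term lists come from the table, while the helper itself is only our illustrative assumption.

# minimal sketch assembling the scopus search string from table 2: a
# conjunction of two or-groups (population and intervention). the term lists
# come from the table; this helper is only an illustration.
population = [
    "*systematic literature review", "systematic* review*", "mapping study",
    "systematic mapping", "structured review", "secondary study",
    "literature survey", "survey of technologies", "driver technologies",
    "review of survey*", "technolog* review*", "state of research",
]
intervention = ["internet of things", "iot"]

def or_group(terms):
    # quote each term and join the group with "or"
    return "(" + " or ".join(f'"{t}"' for t in terms) + ")"

query = or_group(population) + " and " + or_group(intervention)
print(query)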
table 2. protocol summary.
research questions: (rq1) what is internet of things? (rq2) which characteristics define iot applications? (rq3) which are the applications for iot?
search string: population ("systematic literature review" or "systematic* review*" or "mapping study" or "systematic mapping" or "structured review" or "secondary study" or "literature survey" or "survey of technologies" or "driver technologies" or "review of survey*" or "technolog* review*" or "state of research") and intervention ("internet of things" or "iot")
search strategy: scopus (www.scopus.com) + snowballing (backward and forward)
inclusion criteria: to provide an iot definition; or to provide iot properties; or to provide applications for iot
exclusion criteria: does not provide an iot definition; and does not provide iot properties; and does not provide applications for iot; and studies in duplicity; and registers of proceedings
study type: secondary studies
acceptance criteria: three distinct readers: all readers accept => paper is accepted; all readers exclude => paper is excluded; the majority accept, others in doubt => paper is accepted; else => discuss and reach consensus
technical report: detailed information about the planning and execution at https://goo.gl/czvvdc

4.1.3 from the governmental report

many initiatives from governments and organizations have demonstrated a growing interest in iot. in this context, the brazilian national bank for economic and social development (bndes) organized a study to promote economic and social development by analyzing and proposing public policies for iot. the idea is to obtain an overview of the impact of iot in brazil, understanding the country's competencies and defining initial aspirations for promoting iot in brazil, documented in a plan of action. the research has been conducted since the beginning of 2017, and the available material (https://goo.gl/nmfece) was used to recover iot concerns. these results were based on registered iot initiatives developed by 11 countries and the european union, initiatives developed in a global scope, and interviews with experts about the implementation of the area in brazil. it was also based on textual analysis of 28 available documents. reading the material allowed extracting information focused on the presented concerns, analyzing them, and organizing them similarly to the two previous information sources (the literature and practitioners). from this, seven categories of iot concerns emerged: data, interoperability, network, professionals, regulation, security, and things.

4.2 output: putting all together

extracting the perception and concerns regarding iot from different points of view was essential for the strengthening and direction of our research. for instance, it is possible to observe that, although there are different perspectives, they become complementary in representing the concerns to produce quality software systems. together, the three sources provided 14 different concerns, which must be met in favor of higher-quality iot software systems (figure 2). we present each of the 14 categories with a definition and some examples from the input sources to support their comprehension:
architecture – issues and concerns regarding design decisions, styles, and the structure of iot systems.
excerpt example: “finding a scalable, flexible, secure and cost-efficient architecture, able to cope with the complex iot scenario, is one of the main goals for the iot adoption.” (borgia 2014);
data – it refers to the management of a large amount of data, and how to recover, represent, store, interconnect, search, and organize the data generated by iot from so many different users and devices. excerpt example: “this new field offers many research challenges, but the main goal of this line of research is to make sense of data in any iot environment. it has been pointed out that it is always much easier to create data than to analyze them.” (gil et al. 2016);
interoperability – related to the challenge of making different systems, software, and things interact for a purpose. standards and protocols are also included as issues. excerpt example: “the end goal is to have plug n' play smart objects which can be deployed in any environment with an interoperable backbone allowing them to blend with other smart objects around them.” (gubbi et al. 2013);
management – the application of management activities, such as planning, monitoring, and controlling, in iot systems, where the interaction of different things raises new difficulties. excerpt example: “iot is a very complex heterogeneous network, which includes the connections among various types of networks through various communication technologies […]. addressing things management is still a challenge.” (xu et al. 2014);
network – technical challenges related to communication technologies, routing, access, and addressing schemes considering the different characteristics of the devices. excerpt example: “designing an appropriate topology, routing, and mac layer is critical for scalability and longevity of the deployed network” (gubbi et al. 2013);
professionals – investing resources in the training of engineers and other professionals can result in the creation of a strategic differential. however, the scenario is different: more than proficiency in lower-level programming languages, the professional who develops software for iot should be able to customize solutions already developed for specific demands;
quality properties – although some specific properties such as interoperability, privacy, and security are primarily discussed, several other quality attributes are considered different in the iot domain, such as capacity (device and network), installation difficulty, responsiveness, and context awareness. non-functional requirements should be contemplated by considering what the individual sees and feels and how the things can contribute to that;
regulation – governments are working on crucial issues that require significant investment and coordination between the public and private sectors. within regulatory issues, standardization is one of the most critical, and there is no single strategy to follow. in some cases, the creation of specific laws and institutions is necessary to regulate privacy and security issues, a topic debated today by all the countries mentioned in the report;
requirements – considering the iot nature, with a tendency for more innovation, mainly based on ideas, the requirements can be presented in a less structured form.
another concern is that the user can also be a developer, since the solutions reach different types of individuals and devices, and new features can be attached;
scale – developing, managing, and maintaining a large-scale software system is a concern. as the number of devices in the software system increases along with the number of relationships, new technologies are needed to maintain a software system at the required quality level;
security – issues related to several aspects of ensuring data security in iot systems. for that, a series of properties, such as confidentiality, integrity, authentication, authorization, non-repudiation, availability, and privacy, should be investigated. excerpt example: “security issues are central in iot as they may occur at various levels, investing technology as well as ethical and privacy issues […] this is extremely challenging due to the iot characteristics.” (borgia 2014);
social – concerns related to the human end-user, seeking to understand the situation of users and their appliances. excerpt example: “for a lay person to fully benefit from the iot revolution, attractive and easy to understand visualization have to be created.” (gubbi et al. 2013);
testing – iot will provide unprecedented universal access to connected devices. testbeds and acceptance tests are sophisticated, and there is a greater need for other types of tests, for example, usability, integrity, security, and performance;
things – for the devices, including their access and gateways, there are several non-functional restrictions inherent to iot that should be present in the products. these restrictions increase the total cost of the objects, such as an alternative for energy consumption when it is not possible to connect to the power grid.
it is interesting to notice that the concerns are usually interrelated, confirming the multidisciplinary nature of iot. for example: “for technology to disappear from the consciousness of the user, the internet of things demands software architectures and pervasive communication networks to process and convey the contextual information to where it is relevant” (gubbi et al. 2013); this excerpt is coded for an architectural issue and for network as well. another example is “central issues are making full interoperability of interconnected devices possible, providing them with an always higher degree of smartness by enabling their adaptation and autonomous behavior, while guaranteeing trust, privacy, and security.” (atzori et al. 2010), which was coded both for interoperability and for security issues. providing solutions to the issues presented here can be tricky due to the diversity of concerns and the variety of devices.
we can see that each source has its particularities, and some are consistent with its origin. it is expected that practitioners have a more technical and in-depth view, presenting more individual and software-oriented issues regarding iot software systems. the concerns with management and quality are transversal to the implementation of such software systems and can be observed from any point of view, but the practitioners have specific quality concerns, such as meeting non-functional requirements, which bring more specificity and definition to this issue. also, requirements and testing issues are still somewhat open on how to represent, describe, and integrate software systems.
these three aspects (requirements, testing, and quality) must be met in software systems regardless of their scale, which in iot software systems can reach the ultra-large scale, bringing its associated problems. these three concerns are affected by one aspect that we observed in the literature review: from the characteristics extracted, we could observe that iot properties and its characterization are not explicit, nor are the characteristics that can affect the development process of such applications. unclear characteristics can impair requirements, which in turn affect the testing, hindering the overall system quality. we consider that this difficulty is partially due to conceptual aspects, since iot and the related concepts are not yet established and not enclosed by a single definition, the concept being still under discussion (shang et al. 2016). considering the increasing number of interconnected devices, the size or scale of iot software systems can grow consistently. the systems can achieve a wider scale, coupled with complicated structure-controlling techniques, which brings new challenges to their design and deployment (huang et al. 2017). new solutions for architectural foundations, orchestration, and management are essential for dealing with scale issues, especially for ultra-large-scale systems such as smart cities and autonomous vehicles (roca et al. 2018). concerning regulation, some actions are being taken by governments (e.g., https://aioti.eu/ and https://ec.europa.eu/commission/priorities/digital-single-market_en) and other institutions (e.g., https://www.kiot.or.kr/main/index.nx and https://www.digicatapult.org.uk/) to form an adequate legal framework. prompt action is necessary to provide guidance and decisions regarding governance and how to operate iot applications in a lawful, ethical, socially and politically acceptable way, respecting the right to privacy and ensuring the protection of personal data (caron et al. 2016; almeida et al. 2018). for the devices, sensors, actuators, tags, smart objects, and all the things in the internet of things, or of everything, some of the aspects that should be taken into consideration are:
a) resources and energy consumption, since intelligent devices should be designed to minimize required resources as well as costs;
b) deployment, since they can be deployed one-time, incrementally, or randomly, depending on the requirements of the applications;
c) heterogeneity and communication: with different things interacting with others, they must be available, able to communicate, and accessible (madakam et al. 2015; li et al. 2015).

figure 2. iot concerns.

at the intersection between industry and literature, we have architectural and social issues. both concerns are open due to the area's novelty, in which there is still an uncovering of how to deal with them and what to expect. architecture is a recurrent issue in the literature, pointed out by liao et al. (2017) as one of the priority areas for action and reported by trappey et al. (2017) to be one of the official objectives of iso/iec jtc1. in general, the status is that there is still no consolidated standard nor well-established terminology to uniform advancements for architecture in iot. regarding social concerns, given that the objects, devices, and a myriad of things are likely to be connected to many others, with people being one of the actors as well (matalonga et al. 2017), it is necessary to explore the potential sociotechnical impacts of these technologies (whitmore et al. 2015).
using such devices to provide information about and for people is one of the applications. many challenges and concerns should be addressed to achieve the benefits aimed at with iot. facilitating this development requires the design of data dissemination protocols and the evolution of solutions for privacy, security, trust maintenance, and effective economic models (guo et al. 2012). as affirmed by dutton (2014), if not designed, implemented, and governed in an appropriate way, these new iot could undermine such core values as equality and individual choice. at the intersection between industry and government, we have the concern with professionals, represented by the preparation of their skills and knowledge, and by the teams, which should be multidisciplinary to meet iot premises. if requirements, testing, and other technical activities are under discussion, we need to think about the professional who will satisfy and perform such activities (yan yu et al. 2010). with the development of iot, different people, systems, and parties will have a variety of requirements; one of the abilities required is how to translate these requirements into new technologies and products. other skills are related to managing the frequency of information generated, managing the ubiquity and actors involved in interactions, and developing and maintaining privacy and security policies (tian et al. 2018). as the area is new, it is still defining the professionals and teams that will work on it, so it is essential to discuss the professional and to develop the skills and knowledge necessary for this new generation of innovators, decision-makers, and engineers (kusmin et al. 2017). connectivity, communication, network, and the multiple related concepts that enable the evolution of interconnected objects are a critical point for the materialization of iot (gubbi et al. 2013). one of the main challenges of this scenario is the vast amount of information identified, sensed, and acted upon, which must be processed mostly in real or near-real time and delivered unobtrusively and in a personalized manner, ensuring data availability and reliability, the channel between devices, and the channel between humans and devices (mihovska and sarkar 2018). many open challenges require new approaches to a quality network in this scenario; therefore, research should progress into practice to ensure the benefits for the users. together with network concerns, we have data issues. in a world with “anytime, anyplace connectivity for anyone and connectivity for anything” (conti 2006), we can see how quickly data can be generated and how vast amounts of information are created. some of the challenges are related to the continuous and unstructured creation of connection points (devices, things); the persistence of data objects; unknown scale; and data quality (uncertainty, redundancy, ambiguity, inconsistency, incompleteness) (gil et al. 2016). however, above these, security and interoperability concerns are at the center of all iot-related discussions. iot, for example, enables computing capabilities in the things around us, and interoperability is the attribute that enables the interaction among heterogeneous devices with the varied requirements of different applications. interoperability can range over different levels, such as technical, syntactical, semantic, and organizational, which vary according to the software system needs. complete interoperability is an open question for current software and essential for iot due to its comprehensive nature.
issues like encryption, trust, privacy, and any security-related concerns are of utmost importance, since iot systems are inserted into someone's personal life or into the industry. high-coverage procedures should guarantee software system security and trustworthiness.

5 iot facets

iot leads to an era where, rather than developing software, we need to engineer software systems embracing multidisciplinarity, integrating different areas for the realization of successful products according to their purposes. it means that software is one of the iot facets which, together with others, are necessary for iot materialization. aiming at identifying the different facets that characterize this multidisciplinarity, we performed an analysis of the iot definitions identified in the literature review (section 4.1.1). this analysis was based on gt procedures (strauss and corbin 1990). the 34 extracted iot definitions were organized in a table with one "code" field to assign an area, topic, or discipline (named here as a facet) related to a definition excerpt. this coding process was executed by three researchers separately, using separate and independent documents. an example of the document is presented in figure 3. it is composed of three columns: a) index: with the definition number; b) definition: where each definition is presented as extracted from the paper; c) code: with the codes associated with portions of the definition, with a color scheme to help their identification.

figure 3. example of a document filled with the definitions and marked with coding.

there were two rounds of discussions, first with two, then with all three researchers. this was done to discuss the similarities and differences in the coding, support the concepts, and reduce bias, until reaching a consensus. from this analysis, we sought a set of facets, based on the data we had so far, and to sort the most used ones so as to present a minimal set of areas that must be considered when building an iot software system. after merging the documents, meetings for discussion were held; part of the discussion regarded the coding granularity level. for example, network and telecommunication can all be part of a single facet called connectivity, aiming to encompass several concepts and keep the same level of abstraction. as a result of this process, we came to the consensus that an iot software system should consider seven different facets, which are defined below, including some examples related to the identified challenges and some potentially used technologies:
connectivity – the internet is a relevant concept naturally involved in the iot paradigm. we argue that it is necessary to have available a medium by which things can connect to materialize the iot paradigm. some form of connection, a network, is essential for the development of solutions, and our idea is not to limit connectivity to the internet only but to cover other media as well.
o related challenges example: one of the concerns for connectivity is traffic management and control to deal with the enormous amount of data generated by these devices and guarantee the quality of service (bera et al. 2017; li et al. 2018).
o related technologies example: it uses specific solutions according to the application domain and tries to re-use legacy cellular infrastructure and invest in novel communication solutions.
it is mostly based on wireless communication technologies that can be divided into short-range, long-range, and cellular-based.
things – in this sense, it means the things by themselves in iot: tags, sensors, actuators, and all hardware that can replace the computer, expanding the connectivity reach.
o related challenges example: dealing with heterogeneity and scale (rojas et al. 2017), distribution – devices geographically distributed and sometimes in inaccessible and critical regions (chen et al. 2018) – and mobility – iot devices are not static and tend to move between different coverage areas (bera et al. 2017) – are issues related to requirements to be covered in iot.
o related technologies example: many solutions were combined to build devices like sensors, actuators, smartphones, microcontrollers, interactables, cameras, communication and network enablers, and others. some systems give things a virtual representation, enabling remote access to and control of these devices.
behavior – the existence of things is not new, nor are their intrinsic capacities. what iot provides is the chance of enhancements in the things, extending their behaviors. in the beginning, the things in iot systems were objects attached to electronic tags, so these systems presented the behavior of identification. subsequently, sensors and actuators composing the software systems enabled the sensing and actuation behaviors, respectively. the use of software solutions, semantic technologies, data analytics, and other areas can be necessary to enhance the behavior of things.
o related challenges example: some emergent behavior cannot be attributed to a single system but results from the interplay of some or all systems in the network. therefore, each system involved must adjust its behavior according to the common goal, which is an open issue (brings 2017).
o related technologies example: the first and most common way to treat behavior is in stages, where the more significant behaviors are constituted by smaller ones; with this, it is possible to reduce the complexity of taking care of the behaviors. another way to manage behavior is through the use of state machines (jackson 2015; giammarco 2017).
smartness – smartness or intelligence is related to behavior, but as to managing or organizing it. it refers more to the orchestration associated with things and to what level of intelligence technology can evolve their initial behavior.
o related challenges example: what makes a system smart in many cases is not only the devices that are used and the decision-making process but the whole solution architecture as well (atabekov et al. 2015), which leads to an architectural challenge in order to achieve smartness.
o related technologies example: it uses actuators and decision-makers, acting autonomously according to the data collected and treated to perform some activity in the environment. it uses techniques from artificial intelligence, machine learning, neural networks, and fuzzy logic to deal with the data.
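to illustrate the staged, state-machine treatment of behavior mentioned above, the sketch below encodes the identification, sensing, and actuation behaviors as states. the state and event names are our illustrative assumptions, not taken from a specific iot platform.

class ThingBehavior:
    # (state, event) -> next state; identification is the most basic behavior,
    # later enhanced with sensing and actuation
    TRANSITIONS = {
        ("identified", "read_sensor"): "sensing",
        ("sensing", "threshold_exceeded"): "actuating",
        ("actuating", "action_done"): "identified",
    }

    def __init__(self):
        self.state = "identified"

    def on(self, event):
        # unknown events leave the state unchanged
        self.state = self.TRANSITIONS.get((self.state, event), self.state)
        return self.state

thing = ThingBehavior()
for event in ("read_sensor", "threshold_exceeded", "action_done"):
    print(event, "->", thing.on(event))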
problem domain – a problem domain is the area of expertise or application that needs to be examined to solve a problem. iot software systems are developed to reach a goal for a specific purpose. at this point, we are starting from a goal (problem domain) to reach a solution (software system). focusing on a problem domain means looking only at the topics of interest and excluding everything else. it, in general, directs the objective of that solution.
o related challenges example: iot applications development is a multi-disciplined process where knowledge from many concerns intersects. this development assumes that the individuals involved in the application development have similar skills, which is in apparent conflict with the varied set of skills required during the overall process involving this engineering (patel and cassou 2015).
o related technologies example: it varies, but the majority deals with software activities related to analysis, design, and activities to understand the problem domain.
interactivity – it refers to the involvement of actors in the exchange of information with things and the degree to which it happens. the actors engaged with iot applications are not limited to humans. therefore, beyond the sociotechnical concerns surrounding the human actors, we also have concerns with other actors, like animals, and with thing-thing interactions. the degree to which it happens works together with the medium through which things can connect (connectivity), so that in addition to being connected, they can understand each other (interoperability).
o related challenges example: the wide range of heterogeneity issues introduced among different iot devices. standardization, therefore, is a must but is not enough, as no single standard can cover everything, and some organizations (manufacturers, software companies) would like to follow different standards or even proprietary protocols (dalli and bri 2016).
o related technologies example: to guarantee communication: http, xmpp, tcp, udp, coap, mqtt, and others. to guarantee understanding: json, xml, owl, ssn ontology, coci, and others.
environment – the problem and the solution are embedded in a domain, an environment, or a context. this facet seeks to represent such an environment and how the context information can influence its use.
o related challenges example: things can be created, adapted, personalized, and rely on contextual data. the integration of things with the social and natural environment can contribute to improving this contextual data and is both a challenge and a research opportunity (davoudpour et al. 2015).
o related technologies example: in general, the environments are composed of sensors and actuators to sense and change an ambient state. technologies like iot, cloud, smart objects, middleware, wireless sensor networks, vehicular ad-hoc networks, edge computing, artificial intelligence, machine learning, and data mining can be employed in these systems.
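as a small illustration of the interactivity facet, the sketch below publishes a json reading over mqtt, two of the technologies listed above. it assumes the paho-mqtt package (1.x-style constructor; version 2.x additionally takes a callback api version argument); the broker address, topic, and payload fields are hypothetical.

import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("broker.example.org", 1883)  # hypothetical broker address

# a thing shares a reading; subscribers parse the json to "understand" it
reading = {"thing_id": "sensor-01", "temperature_c": 27.4}
client.publish("things/sensor-01/status", json.dumps(reading))
client.disconnect()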
section 6.1 presents the facets in the context of an example of a specific iot application. from the data recovered in our research, we realize that the concepts and properties related to iot change according to the context and the actors involved. for this reason, the representation should be as comprehensive as possible to cover all aspects involved, motivating the iot facets proposition.

6 defining the framework

as observed in our investigations, the iot scenario is covered by concerns (discussed in section 4) that are seen and treated according to facets (detailed in section 5), which leads to challenges for its development. in this context, strategic decisions are essential to the development and need to handle all factors involved in iot without prejudice to the original software life-cycle concerns with deadlines, costs, and quality levels of products and processes (pfleeger and atlee 1998; fitzgerald and stol 2017). in our proposal, we consider both concerns and facets of iot development. in the latest technologies, the software is only one of the components, since further development is necessary for requirements representation, data infrastructure, network configuration, and others (tang et al. 2004). our aim regards the conceptual organization of the recovered data, which should consider the requirements of different stakeholders and the activities in the different iot facets. with such a conceptual structure, we do not aim to guide the software system development but rather to organize the concepts more explicitly and support decision making when engineering iot software systems. with this goal in mind, we have identified the zachman framework (zachman 1987) as the structure that could support the organization of the concepts. for this reason, we tailored the framework, presented in section 2, as in figure 4. there are several definitions for iot, but the concept refers to a paradigm that allows composing software systems from uniquely addressable objects (things) equipped with identification, sensing or action behaviors and processing capabilities that can communicate and cooperate to reach a goal. this understanding encompasses the definitions recovered from the literature review and states the composition and characteristics of iot. the main difference from traditional software systems regards heterogeneity, scale, and the possibilities inherent to the iot paradigm. the zachman framework is generic and flexible enough to be used in different scenarios embedding different points of view, hence our choice to use it to organize the information we gathered. because of the meaning of iot, we will have the following demands to develop an iot software system:
a paradigm that allows composing systems: iot is not just the things by themselves. it represents a more substantial aggregate consisting of several parts. it implies that there is not a single iot solution, but a myriad of options that can derive from the things and other systems available. it will require some domain- and business-specific strategies.
from uniquely addressable objects (things): things should be able to be distinguished using unique ids, a unique identification for every physical object. it concerns the network solutions and hardware technologies required to devise the composing parts of the iot paradigm.
equipped with identifying, sensing or acting behaviors and processing capabilities: once the object is identified, it is possible to enhance it with personalities and other information and enable it to connect, monitor, manage, and control things. this understanding implies that, depending on the "smartness" degree required for a setting, a software solution can be more robust and involve other technical arrangements, such as artificial intelligence.
that can communicate and cooperate: the other part of the paradigm, alongside the things, is the internet. the internet (in a broader sense) is the connection channel of the available things. together with this network solution, things should be able to communicate, interchange, and share, among other issues. for this, a set of characteristics, such as interoperability, should also be present in the things.
to reach a goal: this whole scenario is set for a purpose, for a reason, motivated by something. this primary goal is what will guide the development.
this description leads us to tailor the zachman framework into a faceted scheme, each column representing a facet required for an iot software system (see figure 4). we argue that a solution for iot cannot be devised without considering all fundamental paradigm aspects, requiring multidisciplinary technologies and a diverse team to meet them. we consider the iot facets to address this multidisciplinarity. they were extracted from the literature review and cover a set of dimensions that need to be present, in different degrees, in an iot software system. this initial set can be extended if needed as we progress in the research, since it is limited to the set of sources dealt with in this research. alongside the facets we have perspectives and communication interrogatives, both evolved from the zachman framework (sowa and zachman 1992). the perspectives were divided into control (business, executive, and user) parts, which support the definition of the problem domain, and construction (architect, engineer, technician, and user) parts, which will specialize the facets to solve the problem. we are considering the user perspective as a hybrid because the future vision is that users will have active participation in the construction of iot solutions (singh and kapoor 2017). the framework considers all the perspectives involved in the planning, conception, building, usage, and maintenance activities of iot software systems:
executive perspective – it focuses on the system scope and management plans, and on how the system would relate to a particular context.
business perspective – it is concerned with the business models, which constitute the business design, how they relate, and how the system will be used.
architect perspective – this perspective translates the designed system model and determines the logic behind a system, considering data elements, process flows, and functions that represent the business entities and processes.
engineer perspective – it corresponds to the technology models, which must tailor the model to the details of programming languages, devices, or other required supporting technology.
technician perspective – the developer follows detailed specifications to build modules, sometimes without being concerned with the overall context or structure of the system.
user perspective – it concerns the functioning system in use.
from the guidelines provided in the zachman framework, we consider the questions as communication interrogatives for our context, since the answer to each question, in each perspective and each facet, will give us more direct information, leading an engineer closer to the solution specification. these are fundamental questions to outline each perspective:
what – referring to the information required for the understanding and management of a system. it begins at a high level and, as it advances through the perspectives, the data description becomes more detailed;
how – it relates to translating abstract goals into the definition of operations using software technologies (techniques, technologies, methods, and solutions);
where – it concerns the location of activities; it can be a geographical distribution or something external to the system;
who – it describes the roles involved with the systems to deal with the facet development, detailing the representation of each one as it advances;
when – it concerns the effects of time over the system, such as the life cycles, describing the transformations and states of the systems;
why – it concerns translating the motivation, goals, and strategies into what is implemented in the facet.
the perspectives in the framework should be mapped and updated according to each iot facet, since different stakeholders are concerned with each area. in the questions part, we seek to keep the original questions and adapt the definitions to be clear-cut for use in iot. for instance, the "what" is the final product to be delivered by each facet, which in turn can be composed of what each perspective delivers. in software, for example, we have the model built by the software architect and the code made by the developer; they are all part of the final product, the software product. the framework structure will be the combination of these concepts, filled in as in figure 4. each facet aggregates different knowledge areas, such as software and network. with the simple framework structure, we can organize existing knowledge of software technologies, observing gaps for possible research and development opportunities. the framework can be filled with knowledge using a bottom-up approach with studies from the technical literature, practitioners, and real cases. we aim to achieve a complete solution on a small scale, to be evolved incrementally and adjusted if necessary. we sought to make the protocols available at delfos (the observatory of the engineering of contemporary software systems, http://146.164.35.157/) to facilitate access, dissemination, re-execution, and evolution of the findings in order to keep the body of knowledge updated. the existing knowledge may or may not be enough to cover the iot paradigm demands, and this must be investigated for each facet to develop high-quality software systems. also, each facet will be responsible not only for meeting its original premises but also for covering the concerns and essential needs of iot related to that area, such as those presented in this paper. for instance, security and interoperability (the concerns common to all sources) are transversal concerns and must be addressed in the iot facets related to things, behavior, and connectivity. as we evolve the framework structure and deepen our knowledge of the iot facets, we will seek to provide software technologies to meet the concerns as well. the use of the framework can be performed in three steps. by aligning different stakeholders' perspectives, we want to characterize the problem domain (step 1). then, using the framework structure (figure 4), we aim to extract relevant information for the project (step 2), which supports the definition of a decision-making strategy (step 3). an example of use detailing the steps is presented in the following section.
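to make this structure concrete, the sketch below represents the framework as a facets x perspectives x interrogatives matrix whose cells hold evidence entries; empty cells expose the gaps discussed above. the single filled-in entry is an illustrative assumption.

FACETS = ["connectivity", "things", "behavior", "smartness",
          "problem domain", "interactivity", "environment"]
PERSPECTIVES = ["executive", "business", "architect",
                "engineer", "technician", "user"]
QUESTIONS = ["what", "how", "where", "who", "when", "why"]

# every cell starts empty; knowledge is added bottom-up from studies and practice
body_of_knowledge = {(f, p, q): []
                     for f in FACETS for p in PERSPECTIVES for q in QUESTIONS}

body_of_knowledge[("connectivity", "engineer", "how")].append(
    "wi-fi module publishing sensor data to a dashboard")

# empty cells may indicate research and development opportunities
gaps = [cell for cell, entries in body_of_knowledge.items() if not entries]
print(f"{len(gaps)} of {len(body_of_knowledge)} cells are still empty")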
6.1 exemplifying the usage of the framework

once we have filled the framework with relevant information regarding practices and technologies, considering the different facets and perspectives, we will use it as the basis to support a development strategy. project information is used to direct and specialize the framework, to present the concerns that should be taken into account and be used in the decision-making strategy for the specific project. with this proposal, the goal is not to replace defined activities that are common in the development of traditional software projects. instead, we hope to address the particularities of iot projects, since they present different and additional characteristics that can bring challenges to their engineering. this section aims to exemplify the use of the proposed framework. for this, we rely on the results of an iot software project carried out in the context of a last-year undergraduate software engineering discipline of the computer and information engineering course at the federal university of rio de janeiro. five final-period bachelor's students with previous knowledge in engineering conventional software systems formed the development team. the course is regularly offered to support the students in working on a real problem domain demand in a tool-based software engineering environment. as a case for practice, the development team should provide a software system solution to support the farming of freshwater shrimp. a claim from sebrae (the brazilian service to support micro and small enterprises) motivated it: "due to the complexity of the production process, and a large number of variables that must be constantly monitored, we suggest the acquisition of management software, which was not found on the market with enough fit to be indicated here. most companies that produce software can provide such a solution, provided that the software is customized."

figure 4. a framework for engineering iot applications.

a professor (the last author of this paper) and members of the experimental software engineering group mentored the developers. the software project was executed in the first semester of 2018, and the product (camarão iot) was deployed in july of 2018. therefore, the proposal was to idealize and build an iot software system to support freshwater shrimp farming (carciniculture) in brazil. based on the described motivation, we present a proof of concept of a solution organized in the structure of the proposed framework. this example enables us to show the different facets' arrangements in a basic solution. because the software system had been implemented and design decisions had been taken, we mapped the results to exemplify the use of the framework.

6.1.1 step 1 – define the problem characterization

given the lack of software solutions and the market opportunity for this product, the proposal was to idealize and build an iot software system to support freshwater shrimp farming in brazil. our intention with this exemplification was to take what the team accomplished and translate it into the proposed framework. in this characterization step, from the project context, the executive, business, and user perspectives proposed in the framework are used to support the identification of the different concerns and relevant information that must be considered in the solution to be developed. the different roles expressed their expectations regarding the system in the 5w1h structure. the information was condensed and mapped below:
executive perspective – the owner represents this perspective and desires a solution that will enable remote and real-time business management, as well as the ability to monitor the overall state of production. the owner wants to receive notifications of critical conditions and current status, and to receive periodic reports and estimations, anywhere.
business perspective – the manager represents this perspective and wants to receive quick and easy information at any time through the technologies used, to modernize production, and to have greater control to meet the foreign market. the manager needs to define deadlines and demands, receive information about the water tank, consult stock and production, receive notifications of critical conditions and current status, and receive periodic reports and estimations in real-time.
user perspective – different personas were established for the user perspective, representing the following roles:
installation overseer – s/he takes care of the installation and stock, reporting back to the manager when necessary. s/he needs something that can help the work with clear and direct visual information on when and what actions to take. the system can help to check stock status, receive notification of demands, and notify the manager about the need for purchases.
shrimp keeper – s/he is responsible for preparing the ration and feeding the shrimp. s/he wants a system that can make the documentation of the tanks and their characteristics simpler and easier to understand, which would make the job less stressful. another point that would help in the day-to-day professional life would be to facilitate the feeding process to avoid repetitive strain injuries. it would be useful to receive the feeding schedule, notify the biologist about shrimp status, and visualize tank and shrimp status.
tank keeper – s/he monitors the tank status, performs measurements, and adjusts tank conditions. s/he would like to control the tanks more accurately and with a better frequency, without the need to always be running between different tanks, and wants the peace of mind that the work is according to the needs of the business. s/he wants to monitor tank status, generate reports, notify critical conditions, secure the tank's return to normal conditions, notify the biologist about shrimp status, check environmental conditions that can affect the tank, and visualize tank and shrimp status.
biologist – s/he sets the conditions and is responsible for the production health. s/he would like to have past information to be able to perform more precise analyses and to minimize the error of estimates, besides being able to compare the evolution of the production and to obtain information about the shrimp in a more accurate and faster way. s/he wants to update the required production demand, update tank conditions to achieve the production demand, define and monitor shrimp health parameters, define and monitor the feeding schedule, visualize tank and shrimp status, and generate reports.
as described above, from this step with the framework structure, it is possible to contemplate the different goals for the same solution, thus enriching the initial characterization of the project. due to the full range of perspectives and goals, the team organizes and prioritizes the primary needs. from this initial part, we defined the primary needs of a system that (1) allows the clear visualization of information regarding the whole process in real-time; (2) supports the feeding of the shrimp; (3) assists in estimating production; and (4) monitors the tank status (figure 5).

figure 5. system's needs.

alongside the needs presented by the control perspectives, it is necessary to identify which information can indicate a match with the facets, which will support the analysis of the body of knowledge to identify relevant knowledge to engineer a solution for that context.
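anticipating the characterization artifact discussed next, the sketch below maps the four identified needs to the facets they could touch. the mapping values are our illustrative assumptions for this example, not project-validated data.

# hypothetical need -> facets mapping in the spirit of the characterization artifact
needs_to_facets = {
    "(1) real-time visualization of the whole process": ["interactivity", "connectivity"],
    "(2) support the feeding of the shrimp": ["things", "behavior", "smartness"],
    "(3) assist in estimating production": ["smartness", "problem domain"],
    "(4) monitor the tank status": ["things", "behavior", "connectivity", "environment"],
}

for need, facets in needs_to_facets.items():
    print(need, "->", ", ".join(facets))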
the problem characterization template (used in step 1) will be defined to map the identified system needs to each facet of the body of knowledge in a way that could support the identification of the relevant knowledge. the next activity of this research in this step is the design of this template. it will comprise the investigation of the concerns defining the facets. a preliminary example of the problem characterization artifact for this case study is presented in figure 6. the idea is to capture the needs using questions and perspectives; then we map them throughout the artifact, highlighting which concerns should be considered in a given context. it aims to bridge the problem to the facets.

figure 6. a preliminary view of the problem characterization for this project.

6.1.2 step 2 – analyzing the framework structure for decision making

after characterizing the problem domain with the characterization artifact, the next step is to analyze the body of knowledge. in the context of this proposal, the framework structure can be seen as a body of knowledge. this structure has been proposed at a conceptual level, but it has not yet been wholly populated for iot systems. hence, populating it is one of the next proposed activities, in which we plan to conduct studies to provide evidence-based findings to fill in the body of knowledge. once the body of knowledge is organized, it can be specialized to the problem context. for instance, as presented in figure 5, one of the needs refers to (4) monitor the tank status. this feature represents goals from the owner (executive perspective), the manager (business perspective), and the tank keeper and biologist (user perspective), and can be developed through different solutions (table 3). the body of knowledge specialization should assist in the decision-making to implement the desired solution considering this feature's properties.

table 3. possible solutions to monitor tank status.
manually – the manager defines the required shrimp production and requests a production report; he communicates verbally with the biologist. the biologist sets new parameters for the tank and goes to the tank keeper to inform him. the tank keeper manually adjusts the tank conditions to meet the demand. he also manually collects information for the production report and delivers the report to the manager. there is no technical support in the process.
communication support – the manager defines the required shrimp production and requests a production report; he uses a communication system to inform the biologist. the biologist defines new parameters for the tank and uses the communication system to inform the tank keeper. the tank keeper manually adjusts the tank conditions to meet the demand. he also manually collects information for the production report and delivers the report to the manager. there is technical support for communication in the process.
control support – the manager defines the required shrimp production and requests a production report; he uses a control system. the system notifies the biologist, who defines new parameters for the tank. the system notifies the tank keeper, who manually adjusts the tank conditions to meet the demand. he also manually collects information for the production report and makes the report available in the system. there is technical support for control in the process.
sensing support – the manager defines the required shrimp production and requests a production report using the system. the system notifies the biologist, who defines new parameters for the tank. the system notifies the tank keeper, who manually adjusts the tank conditions to meet the demand. he automatically collects information from the sensors for the production report and makes the report available in the system. there is technical support for sensing in the process.
actuation support – the manager defines the required shrimp production and requests a production report using the system. the system notifies the biologist, who defines new parameters for the tank. the system notifies the tank keeper, who uses the system actuators to adjust the tank conditions to meet the demand. he automatically collects information from the sensors for the production report and makes the report available in the system. there is technical support for actuation in the process.
the solutions presented are simplified and high-level, intended only to exemplify the variety of options that depend on technology to a greater or lesser degree. for example, if we choose the sensing support solution, exemplified in table 3, we can analyze which relevant knowledge from the body of knowledge should be taken into account, as shown in table 4. in order to support decision-making to guide the choice and development of the proposed solution, the body of knowledge aims to present the practices and technologies that allow engineers to develop the chosen solution.

table 4. some examples of possible practices and technologies from the body of knowledge (sensing support).
connectivity – bluetooth low energy, zigbee, z-wave, nfc (near field communication), rfid (radio-frequency identification), wi-fi as enabling technologies, low-power wide-area technologies, sigfox, ingenu-rpma (random phase multiple access), 2g, 3g, 4g, software-defined networking (sdn) and network function virtualization (nfv), and others.
things – temperature sensor ttc104, temperature sensor ds18b20, luminosity sensor ldr 5mm, rain sensor fc37, rain sensor grove, humidity and temperature sensor rht03, gravity ph sensor, and others.
behavior – collect water ph value, water level, water turbidity, water oxygenation, water salinity, and water temperature.

6.1.3 step 3 – generate decision-making strategy

the output from the previous step (table 4) should be presented as a set of software practices and technologies, with options from the body of knowledge specialization, and will compose the strategy to support the decision-making that drives the solution. from the problem domain established (the context), the team started the solution engineering. the project was conducted with the team working together; therefore, there was no formal division of work for construction roles such as the architect, engineer, and technician perspectives. they worked as a group to achieve the expected results. for this reason, in this exemplification, we cannot represent the different perspectives. it is for illustration purposes and does not yet allow demonstrating the full framework potential, which is a crucial issue in the continuity of this research. the solution implemented for need (4), monitor the tank status, is presented in figure 7 and was implemented as a floater. the floater collects data from the water tank where it was deployed, and it works at each determined interval of time.
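the sketch below illustrates this periodic collection cycle. the deployed floater ran on an arduino board, so this python version only mirrors the logic; the read_* helpers, values, and interval are hypothetical.

import random
import time

# hypothetical stand-ins for the floater's sensor readings
def read_water_level():
    return round(random.uniform(0.8, 1.2), 2)    # meters

def read_water_turbidity():
    return round(random.uniform(5.0, 50.0), 1)   # ntu

def read_water_temperature():
    return round(random.uniform(24.0, 30.0), 1)  # degrees celsius

INTERVAL_SECONDS = 60  # the determined interval between collection cycles

def collect_once():
    return {
        "level_m": read_water_level(),
        "turbidity_ntu": read_water_turbidity(),
        "temperature_c": read_water_temperature(),
    }

for _ in range(3):  # three cycles for demonstration; the floater runs continuously
    print(collect_once())
    time.sleep(0.1)  # would be INTERVAL_SECONDS in deployment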
an operator can adjust the frequency at which the dashboard updates the information received from the sensors implemented in the floater. a dashboard panel was implemented to enable the visualization of the data collected by the floater and addresses need (1), allowing the clear visualization of information regarding the whole process in real-time. in this context, it is a technological arrangement for data exhibition, where the data producers are the sensors in the floater, which, through wi-fi connectivity, can share data with the dashboard for exhibition. the overall floater solution encompasses (figure 7):
behaviors – sensing and data collection, to collect water level, water turbidity, and water temperature, as well as processing, to provide data for the dashboard.
things – the water level sensor, water turbidity sensor, water temperature sensor, and water salinity sensor were implemented on an arduino board that worked as the processing unit.
interactivity – interacts with the dashboard to provide data.
connectivity – supports the provision of data to the dashboard, implemented by a wi-fi module on the arduino.
environment – the water tank was the environment settled for the sensors to collect data, together with the network layer used for connectivity.

figure 7. floater solution implemented for need (4).

it is necessary to emphasize that the previously described context is only a simplified example of using the framework and does not represent its full use. it was used to illustrate, in a real case scenario, how the different facets overlap and impact each other. it requires a multidisciplinary view of the problem and an adequate development strategy to embrace the different disciplines and skills needed for the accomplishment of successful iot software systems. we understand that more research is needed to address the open points and to evolve the proposal in general. this represents future tasks to be conducted throughout the continuity of this research.

7 final remarks

the emergence of iot software systems brings new challenges to software engineering. to address these challenges, we should change our way of developing software systems from a monolithic structure to a broader multidisciplinary approach. this paper has presented the results obtained by analyzing data acquired through different strategies, which identified challenges in engineering iot software systems, and the initial results of a conceptual framework to support their development. first, we identified concerns from the technical literature, practitioners, and a government report. next, we presented the facets that compose iot software systems, derived from a qualitative study. these results can support practitioners in evaluating risks when constructing iot applications and highlight some research opportunities for researchers. then we presented a conceptual framework, a way of summarizing the results of the executed studies and structurally presenting the multidisciplinarity of iot. this structure shall be filled with the existing knowledge of software technologies; empty cells can identify current technology gaps in engineering iot software systems. the contribution of this work is to explain a set of concerns that need to be investigated, showing that it is necessary to distinguish this new kind of software system from traditional ones. also, the work evolves the zachman framework to allow the necessary multi-facet representation.
7.1 threats to validity

the literature review used only scopus as a search engine, so it may be missing some relevant studies. however, from our experience, it can give reasonable coverage when performed together with backward and forward snowballing procedures (matalonga et al. 2015; motta et al. 2016). data extraction and interpretation biases were mitigated with crosschecking between two researchers and by having a third researcher revise the results. all phases of this review were peer-revised; any doubt was discussed among the readers to reduce selection bias. we have not performed a quality assessment regarding the research methodology of the selected studies due to the lack of information in the secondary reports; therefore, it is a threat to this study's validity. however, the triangulation with data acquired from practitioners and information extracted from the government report strengthened the representativeness of the data and reduced the researchers' bias, strengthening the results. for both the data collected from practitioners and from the government, the interpretation of data was supported by the practices of gt, which allowed consistency among researchers and a shared understanding of the central concepts. however, other perspectives could be used for data interpretation, imposing a risk of changing the results. this represents a threat to any qualitative study and constitutes a menace that we cannot completely mitigate.

7.2 ongoing works

we foresee some scenarios of utilization for the proposed framework. as envisioned contributions of its use, initially, we expect the production of scientific research that considers knowledge essential to practitioners concerning the problem domain. such knowledge will compose the body of knowledge, which can be useful for both researchers and practitioners sharing and exchanging it. we consider that the evidence-based facets and perspectives have the potential to support the collection of various practices and technologies that can be used in iot. we expect that the more a facet is filled in a given perspective in response to a question (for example, a response to the how, in the engineer perspective, in the behavior facet), the more evidence of information will exist about it, which aids decision-making in practice. in turn, the lack of answers (for example, an empty cell in the body of knowledge) may represent a research opportunity for the academy. in this sense, opportunities and risks are opposites, since an opportunity for researchers is a risk for practitioners. once we have filled the cells, it is possible that some will remain empty, since current technologies do not meet iot needs, representing research opportunities; and both practice and research can get a better observation of what we know and do not know regarding development, since it will allow visualizing where the engineering stands regarding iot. our next steps include the filling of the facets in the manner proposed by zachman (sowa and zachman 1992). in particular, our primary research aims to fill the cells in the matrix. we conjecture that some of the slots will be empty or partially filled, which means the available software technologies do not support such activity in the way required for iot. therefore, they can represent research and development opportunities, which are necessary for the establishment of iot as a reality.
another conjecture is that some of the concerns can repeat themselves in different slots and different facets, which we call transversal challenges. these cross-sectional slots represent broader concerns that should cover the iot software system as a whole, for example, security and interoperability issues. we aim to investigate transversal challenges in the near future. after that, we plan to evaluate and refine this conceptual framework.

8 declarations

availability of data and materials: details of the protocol are available at https://goo.gl/czvvdc.
competing interests: the authors declare that they have no competing interests.
funding: we thank cnpq for the grant. professor travassos is a cnpq researcher. this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001.
consent for participation and publication: not applicable.
acknowledgments: not applicable.

references

agina a, matheus edward iy, shalannanda w (2016) enhanced information security management system framework design using iso 27001 and zachman framework: a study case of xyz company. in: 2016 2nd international conference on wireless and telematics (icwt). ieee, pp 62–66
alegre u, augusto jc, clark t (2016) engineering context-aware systems and applications: a survey. j syst softw 117:55–83. doi: 10.1016/j.jss.2016.02.010
almeida vaf, doneda d, moreira da costa e (2018) humane smart cities: the need for governance. ieee internet comput 22:91–95. doi: 10.1109/mic.2018.022021671
andrade rmc, carvalho rm, de araújo il, et al. (2017) what changes from ubiquitous computing to internet of things in interaction evaluation? in: international conference on distributed, ambient, and pervasive interactions. pp 3–21
aniculaesei a, grieser j, rausch a, et al. (2018) towards a holistic software systems engineering approach for dependable autonomous systems. in: proceedings of the 1st international workshop on software engineering for ai in autonomous systems (sefais '18). acm press, new york, pp 23–30
atabekov a, starosielsky m, lo dc-t, he js (2015) internet of things-based temperature tracking system. in: 2015 ieee 39th annual computer software and applications conference. ieee, pp 493–498
atzori l, iera a, morabito g (2010) the internet of things: a survey. comput netw 54:2787–2805. doi: 10.1016/j.comnet.2010.05.010
badreddin o (2013) thematic review and analysis of grounded theory application in software engineering. adv softw eng 2013:1–9. doi: 10.1155/2013/468021
basili vr, caldiera g, rombach hd (1994) goal question metric paradigm
bauer c, dey ak (2016) considering the context in the design of intelligent systems: current practices and suggestions for improvement. j syst softw 112:26–47. doi: 10.1016/j.jss.2015.10.041
bera s, misra s, vasilakos av (2017) software-defined networking for the internet of things: a survey. ieee internet things j 4:1994–2008. doi: 10.1109/jiot.2017.2746186
biolchini j, mian pg, candida a, natali c (2008) software and data technologies. springer berlin heidelberg, berlin, heidelberg
bondar s, hsu jc, pfouga a, stjepandić j (2017) agile digital transformation of system-of-systems architecture models using the zachman framework. j ind inf integr 7:33–43. doi: 10.1016/j.jii.2017.03.001
borgia e (2014) the internet of things vision: key features, application, and open issues. comput commun 54:1–31. doi: 10.1016/j.comcom.2014.09.008
brings j (2017) verifying cyber-physical system behavior in the context of cyber-physical system-networks. in: 2017 ieee 25th international requirements engineering conference (re). ieee, pp 556–561
caron x, bosua r, maynard sb, ahmad a (2016) the internet of things (iot) and its impact on individual privacy: an australian perspective. comput law secur rev 32:4–15. doi: 10.1016/j.clsr.2015.12.001
carver j (2007) the use of grounded theory in empirical software engineering. in: empirical software engineering issues. critical assessment and future directions. springer berlin heidelberg, berlin, heidelberg, pp 42–42
chen g, tang j, coon jp (2018) optimal routing for multihop social-based d2d communications in the internet of things. ieee internet things j 5:1880–1889. doi: 10.1109/jiot.2018.2817024
cni - confederação nacional da indústria (2016) indústria 4.0: novo desafio para a indústria brasileira. indicadores cni 17:13
conti jp (2006) itu internet reports 2005: the internet of things. commun eng 4:20. doi: 10.1049/ce:20060603
costa b, pires pf, delicato fc (2017) specifying functional requirements and qos parameters for iot systems. in: 2017 ieee 15th intl conf on dependable, autonomic and secure computing, 15th intl conf on pervasive intelligence and computing, 3rd intl conf on big data intelligence and computing and cyber science and technology congress (dasc/picom/datacom/cyberscitech). ieee, pp 407–414
dalli a, bri s (2016) acquisition devices in the internet of things: rfid and sensors. j theor appl inf technol 90:194–200
davoudpour m, sadeghian a, rahnama h (2015) synthesizing social context for making the internet of things environments more immersive. in: 2015 6th international conference on the network of the future (nof). ieee, pp 1–5
de farias cm, brito ic, pirmez l, et al. (2017) comfit: a development environment for the internet of things. future gener comput syst 75:128–144. doi: 10.1016/j.future.2016.06.031
de villiers d (2001) using the zachman framework to assess the rational unified process. ration edge
desolda g, ardito c, matera m (2017) empowering end users to customize their smart environments. acm trans comput-hum interact 24:1–52. doi: 10.1145/3057859
dutton wh (2014) putting things to work: social and policy challenges for the internet of things. info 16:1–21. doi: 10.1108/info-09-2013-0047
fitzgerald b, stol k-j (2017) continuous software engineering: a roadmap and agenda. j syst softw 123:176–189. doi: 10.1016/j.jss.2015.06.063
gil d, ferrández a, mora-mora h, peral j (2016) internet of things: a review of surveys based on context-aware intelligent services. sensors 16:1069. doi: 10.3390/s16071069
gluhak a, krco s, nati m, et al. (2011) a survey on facilities for experimental internet of things research. ieee commun mag 49:58–67. doi: 10.1109/mcom.2011.6069710
goethals fg, snoeck m, lemahieu w, vandenbulcke j (2006) management and enterprise architecture click: the fad(e)e framework. inf syst front 8:67–79. doi: 10.1007/s10796-006-7971-1
gubbi j, buyya r, marusic s, palaniswami m (2013) internet of things (iot): a vision, architectural elements, and future directions. future gener comput syst 29:1645–1660. doi: 10.1016/j.future.2013.01.010
guo b, yu z, zhou x, zhang d (2012) opportunistic iot: exploring the social side of the internet of things. in: proceedings of the 2012 ieee 16th international conference on computer supported cooperative work in design (cscwd). ieee, pp 925–929
huang j, duan q, xing c-c, wang h (2017) topology control for building a large-scale and energy-efficient internet of things. ieee wirel commun 24:67–73. doi: 10.1109/mwc.2017.1600193wc
ieee (2004) guide to the software engineering body of knowledge. ieee computer society press
jacobson i, spence i, ng p-w (2017) is there a single method for the internet of things? commun acm 60:46–53. doi: 10.1145/3106637
khaitan sk, mccalley jd (2015) design techniques and applications of cyberphysical systems: a survey. ieee syst j 9:350–365. doi: 10.1109/jsyst.2014.2322503
kusmin m, saar m, laanpere m, rodriguez-triana mj (2017) work in progress — smart schoolhouse as a data-driven inquiry learning space for the next generation of engineers. in: 2017 ieee global engineering education conference (educon). ieee, pp 1667–1670
larrucea x, combelles a, favaro j, taneja k (2017) software engineering for the internet of things. ieee softw 34:24–28. doi: 10.1109/ms.2017.28
li s, xu l da, zhao s (2015) the internet of things: a survey. inf syst front 17:243–259. doi: 10.1007/s10796-014-9492-7
li s, xu l da, zhao s (2018) 5g internet of things: a survey. j ind inf integr 10:1–9. doi: 10.1016/j.jii.2018.01.005
liao y, deschamps f, loures e de fr, ramos lfp (2017) past, present, and future of industry 4.0: a systematic literature review and research agenda proposal. int j prod res 55:3609–3629. doi: 10.1080/00207543.2017.1308576
lu y (2017) industry 4.0: a survey on technologies, applications, and open research issues. j ind inf integr 6:1–10. doi: 10.1016/j.jii.2017.04.005
madakam s, ramaswamy r, tripathi s (2015) internet of things (iot): a literature review. j comput commun 03:164–173. doi: 10.4236/jcc.2015.35021
matalonga s, rodrigues f, travassos g (2015) challenges in testing context-aware software systems. in: 9th workshop on systematic and automated software testing. belo horizonte, brazil, pp 51–60
matalonga s, rodrigues f, travassos gh (2017) characterizing testing methods for context-aware software systems: results from a quasi-systematic literature review. j syst softw 131:1–21. doi: 10.1016/j.jss.2017.05.048
mihovska a, sarkar m (2018) new advances in the internet of things. springer international publishing, cham
motta rc, de oliveira km, travassos gh (2018) on challenges in engineering iot software systems. in: proceedings of the xxxii brazilian symposium on software engineering (sbes '18). acm press, new york, pp 42–51
motta rc, de oliveira km, travassos gh (2019) a framework to support the engineering of internet of things software systems. in: proceedings of the acm sigchi symposium on engineering interactive computing systems (eics '19), valencia, spain, june 18–21. doi: 10.1145/3319499.3328239
motta rc, oliveira km de, travassos gh (2016) characterizing interoperability in context-aware software systems. in: 2016 vi brazilian symposium on computing systems engineering (sbesc). ieee, pp 203–208
nielsen cb, larsen pg, fitzgerald j, et al. (2015) systems of systems engineering: basic concepts, model-based techniques, and research directions. acm comput surv 48:1–41. doi: 10.1145/2794381
nogueira jm, romero d, espadas j, molina a (2013) leveraging the zachman framework implementation using action research methodology - a case study: aligning the enterprise architecture and the business goals. enterp inf syst 7:100–132. doi: 10.1080/17517575.2012.678387
panetto h, baïna s, morel g (2007) mapping the iec 62264 models onto the zachman framework for analyzing products information traceability: a case study. j intell manuf 18:679–698. doi: 10.1007/s10845-007-0040-x
patel p, cassou d (2015) enabling high-level application development for the internet of things. j syst softw 103:62–84. doi: 10.1016/j.jss.2015.01.027
pfleeger sl, atlee jm (1998) software engineering: theory and practice. pearson education india
roca d, milito r, nemirovsky m, valero m (2018) fog computing in the internet of things. springer international publishing, cham
rojas ra, rauch e, vidoni r, matt dt (2017) enabling connectivity of cyber-physical production systems: a conceptual framework. procedia manuf 11:822–829. doi: 10.1016/j.promfg.2017.07.184
sánchez guinea a, nain g, le traon y (2016) a systematic review on the engineering of software for ubiquitous systems. j syst softw 118:251–276. doi: 10.1016/j.jss.2016.05.024
santos i de s, andrade rm de c, rocha ls, et al. (2017) test case design for context-aware applications: are we there yet? inf softw technol 88:1–16. doi: 10.1016/j.infsof.2017.03.008
seaman cb (1999) qualitative methods in empirical studies of software engineering. ieee trans softw eng 25:557–572. doi: 10.1109/32.799955
shang x, zhang r, zhu x, zhou q (2016) design theory, modeling, and the application for the internet of things service. enterp inf syst 10:249–267. doi: 10.1080/17517575.2015.1075592
sousa p, pereira c, vendeirinho r, et al. (2007) applying the zachman framework dimensions to support business process modeling. in: digital enterprise technology. springer us, boston, ma, pp 359–366
sowa jf, zachman ja (1992) extending and formalizing the framework for information systems architecture. ibm syst j 31:590–616. doi: 10.1147/sj.313.0590
spínola ro, travassos gh (2012) towards a framework to characterize ubiquitous software projects. inf softw technol 54:759–785. doi: 10.1016/j.infsof.2012.01.009
strauss a, corbin j (1990) basics of qualitative research: techniques and procedures for developing grounded theory. sage publications, inc, newbury park
tang a, han j, chen p (2004) a comparative analysis of architecture frameworks. in: 11th asia-pacific software engineering conference. ieee, pp 640–647
technology i (2015) requirement formalization using owl ontology-based zachman framework
tian b, yu s, chu j, li w (2018) analysis of direction on product design in the era of the internet of things. matec web conf 176:01002. doi: 10.1051/matecconf/201817601002
trappey ajc, trappey cv, hareesh govindarajan u, et al. (2017) a review of essential standards and patent landscapes for the internet of things: a key enabler for industry 4.0. adv eng inform 33:208–229. doi: 10.1016/j.aei.2016.11.007
whitmore a, agarwal a, da xu l (2015) the internet of things—a survey of topics and trends. inf syst front 17:261–274. doi: 10.1007/s10796-014-9489-2
wohlin c (2014) guidelines for snowballing in systematic literature studies and a replication in software engineering. in: proceedings of the 18th international conference on evaluation and assessment in software engineering (ease '14), pp 1–10. doi: 10.1145/2601248.2601268
xu l da, he w, li s (2014) internet of things in industries: a survey. ieee trans ind inform 10:2233–2243. doi: 10.1109/tii.2014.2300753
yu y, wang j, zhou g (2010) the exploration in the education of professionals in applied internet of things engineering. in: 2010 4th international conference on distance learning and education. ieee, pp 74–77
zachman ja (1987) a framework for information systems architecture. ibm syst j 26:276–292. doi: 10.1147/sj.263.0276
zambonelli f (2016) towards a general software engineering methodology for the internet of things
zhang c, shi x, chen d (2014) safety analysis and optimization for networked avionics system. in: 2014 ieee/aiaa 33rd digital avionics systems conference (dasc). ieee, pp 4c1-1–4c1-12

journal of software engineering research and development, 2022, 10:4, doi: 10.5753/jserd.2021.1878 this work is licensed under a creative commons attribution 4.0 international license.

investigating knowledge management in human-computer interaction design

murillo v. h. b. castro [ federal university of espírito santo | murillo.castro@aluno.ufes.br ]
simone d. costa [ federal university of espírito santo | simone.costa@ufes.br ]
monalessa p. barcellos [ federal university of espírito santo | monalessa@inf.ufes.br ]
ricardo de a. falbo [ federal university of espírito santo | falbo@inf.ufes.br ]

abstract

developing interactive systems is a challenging task. it involves concerns related to human-computer interaction (hci), such as usability and user experience. therefore, hci design must be addressed when developing such systems. hci design often involves people with different backgrounds, which makes communication and knowledge transfer a challenging issue. in this scenario, knowledge management can support understanding concepts from different knowledge areas and help teams learn from previous experiences. aiming to investigate how knowledge management has supported hci design and contributed to the development of interactive systems, we performed a mapping study of the literature and analyzed 15 publications reporting the use of knowledge management in hci design. following that, we conducted a survey with 39 hci design professionals to find out how knowledge has been managed in their hci design practice. in this paper, we present the studies and discuss their main findings. in summary, the results indicate that knowledge management has been used in hci design mainly to improve product quality and reduce the effort and time spent on design activities. however, there is a need for simpler and more practical knowledge-based solutions to support hci design; such approaches would be capable of reaching more hci design practitioners who could benefit from them.

keywords: hci design, mapping study, survey, knowledge management, interactive system

1 introduction

the interest in interactive systems and their impact on people's lives has promoted the study and practice of usability (carroll, 2014). usability is a key aspect of a successful interactive system and is related to user efficiency and satisfaction when interacting with the system. for an interactive system to reach high usability levels, it is necessary to take human-computer interaction (hci) design aspects into account during its development process (carroll, 2014). hci is concerned with usability and other aspects related to the interaction between users and computer systems that are necessary to produce more usable software (carroll, 2014). it involves knowledge from multiple fields, such as ergonomics, cognitive science, user experience, and human factors, among others (sutcliffe, 2014).
due to the diverse body of knowledge involved when designing interactive systems, interactive system development teams are frequently multidisciplinary, joining people from different backgrounds, each with their own technical language, terms and knowledge. collaboration among team members is not straightforward, since hci designers and developers, for example, look at the same problem from different perspectives, which leads to difficulties that include a lack of a shared vocabulary and harsh epistemological conflicts (neto et al., 2020). even the conceptualization of the product may be conflicting among different stakeholders, which hampers communication and knowledge transfer (carroll, 2014; rogers et al., 2011).

developing software is a knowledge-intensive task. knowledge management (km) principles and practices have been successfully applied to support knowledge capture, storage, use and transfer in the software development context in general (rus & lindvall, 2002; valaski et al., 2012). km can also help address challenges in the design of interactive systems, since it can support capturing and representing knowledge in an accessible and reusable way and facilitate collaboration among team members. for example, design solutions developed by an organization can be stored and related to the requirements that motivated them, the components and patterns used to build them, and evaluation results. as a result, the team can learn from previous experiences and share a common understanding of the system, producing better products and performing processes more efficiently.

considering the challenges of designing interactive systems, mainly due to the diversity of knowledge and people involved, and the potential of km to help address those challenges, we decided to investigate the use of km in hci design. although km can be used in different domains, with some general motivations for using it (e.g., knowledge structuring) and benefits (e.g., improved knowledge reuse), km can also be applied to solve problems specific to each domain, using different techniques, and so on. thus, the main question that guided our investigation refers to how km has been used in the hci design domain. besides investigating the general motivations and benefits observed in the use of km in the hci design domain, we also intended to identify specificities of the use of km in that domain. first, we searched for secondary studies addressing the research topic. since we did not find any, we decided to perform a systematic mapping of the literature. we analyzed 12 different km approaches used in hci design, identified from 15 publications. in general, km has aided hci design mainly by enabling replicability of knowledge and solutions and improving product quality and communication. however, difficulty in generalizing knowledge, issues related to features of the adopted systems, and low engagement of the team have been pointed out as challenges to implementing km in the hci design context. after investigating the literature, we performed a survey with 39 brazilian hci design practitioners, who were asked about how knowledge has been managed in hci design practice. most participants are concerned with managing hci design knowledge and perceive that km helps them to improve product quality and reduce the effort and time spent on hci design activities.
they follow organizational or individual km practices and apply technologies such as brainstorming, mental models and electronic spreadsheets. this paper presents our studies (the mapping study and the survey) and their main results. it extends our previous work (castro et al., 2020), in which we presented the main results of our mapping study, by adding information about the survey and presenting a more comprehensive view of the mapping results, updating the search period and providing new information (e.g., new graphs and details about the identified km approaches). the mapping and survey results are further analyzed together, providing an overview of the research and practice of km in hci design and pointing out some gaps that can be addressed in future research.

the paper is organized as follows: section 2 provides the background for the paper, addressing hci design and km; section 3 concerns the mapping study; section 4 addresses the survey; section 5 provides a consolidated view of the mapping and survey results; and section 6 presents our final considerations.

2 background

2.1 hci design

hci design focuses on how to design a system that supports the user in achieving her goals through the interaction between her and the system (sutcliffe, 2014). it is concerned with usability and other important attributes such as user experience, accessibility and communicability. usability is the extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use (iso, 2019). it addresses the effort and ease of the user during the interaction, considering her cognitive, perceptive and motor skills. user experience relates to users' emotions and feelings and is essential for interaction design because it takes into account how a product behaves and is used by people in the real world (rogers et al., 2011). accessibility refers to the removal of barriers that prevent interface and interaction access. finally, communicability concerns the ability of the interface to communicate the design logic to the user (de souza, 2005).

hci design is user-centered, hence it is called user-centered design (ucd) (chammas et al., 2015). ucd is based on ergonomics, usability and human factors. it focuses on the use and development of interactive systems, with an emphasis on making products usable and understandable. it puts human needs, capabilities and behavior first, then designs the system to accommodate them. its main principles are user focus (the user's characteristics, needs and objectives), observable metrics (user performance and reactions) and iterative design (repeat as often as needed) (chammas et al., 2015; iso, 2019). the term human-centered design (hcd) has been adopted in place of ucd to emphasize the impact on all stakeholders and not just on those considered users (iso, 2019). in general, ucd involves four activities (iso, 2019):
- understand and specify the context of use, which aims to study the product's users and intended uses;
- specify requirements, which aims to identify user needs and specify functional and other requirements for the product;
- produce design solutions, which aims to achieve the best user experience and includes producing artifacts such as prototypes and mock-ups that will later serve as a basis for developing the system; and
- evaluation, in which the user evaluates the results produced in the previous activities.
hci design can be understood as a knowledge-intensive process, requiring effective mechanisms to collaboratively create and support a shared understanding about users, the system, its purposes, its context of use, and the design necessary for the user to achieve her goals. therefore, hci design could take advantage of km solutions.

2.2 knowledge management

according to schneider (2009), knowledge is a human specialty stored in people's minds, acquired through experience and interaction with their environment. historically, an organization's knowledge was undocumented, being represented through the skills, experience and knowledge of its professionals, typically tacit knowledge (rus & lindvall, 2002), which made its use and access limited and difficult (o'leary, 1998). knowledge management (km) aims to transform tacit and individual knowledge into explicit and shared knowledge. by raising individual knowledge to the organizational level, km promotes knowledge propagation and learning, making knowledge accessible and reusable across the entire organization (o'leary, 1998; rus & lindvall, 2002; schneider, 2009). knowledge helps software organizations react faster and better, supporting more accurate and precise responses, which contributes to increasing software quality and client satisfaction (schneider, 2009). when an organization implements km, its experiences and knowledge are recorded, evaluated, preserved, designed and systematically propagated to solve problems (schneider, 2009). thus, km addresses knowledge throughout its evolution cycle, which consists of creating, capturing, transforming, accessing and applying knowledge (rus & lindvall, 2002; schneider, 2009). in the software process context, km aims at explicitly and systematically managing knowledge, addressing knowledge acquisition, storage, organization, evolution, retrieval and usage. among other aspects, km has been applied in the software development context to support document management, competence management, expert identification, software reuse, learning, and product and project memory (rus & lindvall, 2002). by investigating empirical studies of km in software engineering, bjørnson & dingsøyr (2008) reported that the studies' major focus has been on explicit knowledge and that there is a need to focus also on tacit knowledge.

3 systematic mapping: km in hci design according to the literature

considering the challenges involving knowledge transfer and sharing in the hci design context and the benefits of using km in the software development context, we decided to investigate the use of km in hci design through a mapping study. a mapping study is a secondary study designed to give an overview of a research area through classifying and counting contributions against the categories of that classification. it broadly studies a topic within a specific theme and aims to identify the available evidence about that topic (petersen et al., 2015). moreover, the panorama provided by a mapping study allows identifying issues in the researched topic that could be addressed in future research. we followed the process defined in kitchenham & charters (2007), which comprises three phases:

(i) planning: in this phase, the topic of interest, study context and object of the analysis are established.
the research protocol to be used to perform the research is defined, containing all the information necessary for a researcher to perform the research: research questions, sources to be searched, publication selection criteria, procedures for data storage and analysis, and so on. the protocol must be evaluated by experts and tested to verify its feasibility, i.e., whether the results obtained are satisfactory and whether the protocol execution is viable in terms of time and effort. once the protocol is approved, it can be used to conduct the research.

(ii) conducting: in this phase, the research is performed according to the protocol. publications are selected and data are extracted, stored and quantitatively and qualitatively analyzed.

(iii) reporting: in this phase, the produced research results are recorded and made available to potentially interested parties.

next, section 3.1 presents the research protocol followed in our study, section 3.2 summarizes the mapping study results, section 3.3 discusses the results and section 3.4 regards threats to validity.

3.1 research protocol

this section presents the protocol used in the mapping study. it was defined gradually, being tested with an initial set of publications and then refined until we reached the final version, which was evaluated by another researcher, resulting in the protocol presented in this section. the study goal was to investigate the use of km in the hci design context. to achieve this goal, we defined the research questions presented in table 1.

table 1. systematic mapping: research questions and their rationale.

rq1. when and where have publications been published? — rationale: give an understanding of when and where (journal/conference/workshop) publications about km in the hci design context have been published.
rq2. which types of research have been done? — rationale: investigate which type of research is reported in each selected publication, considering the classification defined in wieringa et al. (2005). this question is useful to evaluate the maturity stage of the research topic.
rq3. why has km been used in the hci design context? — rationale: understand the purposes and reasons for using km in hci design and verify whether there have been predominant motivations.
rq4. which knowledge has been managed in the hci design context? — rationale: investigate which knowledge items have been managed in the hci design context, aiming to verify whether some of them have been managed more frequently and whether there has been more interest in certain hci aspects.
rq5. how is the managed knowledge related to the hci design process? — rationale: understand, in the context of the hci design process, where the managed knowledge has come from and where it has been used.
rq6. how has km been implemented in the hci design context? — rationale: investigate how km has been implemented in the hci context in terms of the adopted technologies.
rq7. which benefits and difficulties have been noticed when using km in the hci design context? — rationale: identify the benefits and difficulties of using km in the hci design context and analyze whether there is a relation between them.

rq1 and rq2 are common systematic mapping questions that provide a general panorama of the research topic. the other questions aim to investigate why (rq3 and rq7), how (rq4 and rq6) and when (rq5) km has been used in hci design, which are important questions to provide an understanding of the research topic.
the search string adopted in the study contains two groups of terms joined with the and operator. the first group includes terms related to hci design; the general term "human-computer interaction" was used to provide wider search results. the second group includes terms related to knowledge management. within the groups, we used the or operator to allow synonyms. the following search string was used:

("human-computer interaction" or "user interface design" or "user interaction design" or "user centered design" or "human-centered design" or "ui design" or "hci design") and ("knowledge management" or "knowledge reuse" or "knowledge sharing")

to establish the string, we performed tests using different terms, logical connectors and combinations among them, selecting the string that provided better results in terms of the number of publications and their relevance (i.e., the number of publications returned by the search string and, for a sample, the inclusion of the publications truly relevant to the study). if a new term added to the search string resulted in a much larger number of returned publications without adding new relevant ones, that term was not included in the search string. more restrictive strings excluded important publications identified during the informal literature review that preceded the study, while more comprehensive strings (e.g., those including "usability") returned too many publications out of the scope of interest.

the search was performed in four sources, namely scopus, science direct, engineering village and web of science. we selected scopus because it is one of the largest databases of peer-reviewed literature: it indexes papers from other important sources, such as ieee and acm, and provides useful tools to search, analyze and manage scientific research. complementarily, to increase coverage, we selected science direct, engineering village and web of science, which are also widely used in secondary studies recorded in the literature and in other experiences of our research group.

publication selection was performed in five steps. in preliminary selection and cataloging (s1), the search string was applied in the search mechanism of each digital library used as a source of publications (we limited the search scope to the title, abstract and keywords metadata fields). after that, in duplication removal (s2), publications indexed in more than one digital library were identified and duplications were removed. in selection of relevant publications - 1st filter (s3), the abstracts of the selected publications were analyzed considering the following inclusion (ic) and exclusion (ec) criteria:
- (ic1) the publication addresses km in the hci design context;
- (ec1) the publication does not have an abstract;
- (ec2) the paper was published only as an abstract;
- (ec3) the publication is not written in english;
- (ec4) the publication is a secondary study, a tertiary study, a summary, an editorial or a tutorial.
in selection of relevant publications - 2nd filter (s4), the full texts of the publications selected in s3 were read and analyzed considering the cited inclusion and exclusion criteria. in this step, to avoid study repetition, we considered another exclusion criterion: (ec5) the publication is an older version of an already selected publication.
when the full text of a publication was not available from the brazilian portal of journals, from other internet sources or by contacting its authors, the publication was also excluded (ec6). publications that met one of the six cited exclusion criteria or that did not meet inclusion criterion ic1 were excluded. finally, in snowballing (s5), as suggested in kitchenham & charters (2007), the references of the publications selected in s4 were analyzed by applying the first and second filters, and the ones presenting results related to the research topic were included in the study. we used the start tool (http://bit.ly/start-tool) to support publication selection.

to consolidate data, publications returned in the selection steps were cataloged and stored in spreadsheets. we defined an id for each publication and recorded the publication title, authors, year, and vehicle of publication. data from publications returned in s4 and s5 were extracted and organized into a data extraction table oriented to the research questions. the spreadsheets produced during the study can be found at http://bit.ly/mapping-km-in-hci-design. the first and second authors performed publication selection and data extraction; the third and fourth authors reviewed both. once the data had been validated, the first and second authors carried out data interpretation and analysis, and again the third and fourth authors reviewed the results. discordances were discussed and resolved. quantitative data were tabulated and used in graphs and statistical analysis. finally, the four authors performed qualitative analysis considering the findings, their relation to the research questions and the study purpose.

3.2 results

the study considered papers published until october 2020; searches were conducted for the last time in november 2020. figure 1 illustrates the process followed and the number of publications selected in each step.

figure 1. publication selection process.

in the 1st step, as a result of searching the selected sources, a total of 381 publications was returned. in the 2nd step, we eliminated duplicates, reaching 228 publications (a reduction of approximately 40%). in the 3rd step, we applied the selection criteria over the abstracts, resulting in 21 papers (a reduction of approximately 91%); at this step, we only excluded publications that were clearly unrelated to the subject of interest, and in case of doubt the paper was taken to the next step. in the 4th step, the selection criteria were applied considering the full text, resulting in 11 publications (a reduction of approximately 48%). finally, in the 5th step, we applied the snowballing technique by checking the references of the 11 selected publications and identified 4 more publications, adding up to a total of 15 publications.

when analyzing the publications to identify the km approaches applied in the hci design context, we noticed that some publications addressed complementary works from the same research group. hence, we considered complementary works as a single km approach when extracting data for rqs 3, 4, 5, 6 and 7. table 2 shows the list of identified km approaches, their descriptions and corresponding publications. two papers were grouped into one km approach and three other papers into another; thus, we considered a total of 12 different km approaches found in 15 publications. (a minimal sketch of the mechanical part of the selection procedure follows.)
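as referenced above, the sketch below illustrates the mechanical part of the selection procedure from section 3.1: deduplication (s2) and a rough screening aid for the first filter (s3). the record fields, the title normalization rule and the keyword check are simplified assumptions; in the study, inclusion and exclusion were decided by researchers reading each abstract, which no script replaces.

```python
# illustrative sketch of steps s2 (duplicate removal) and s3 (first filter);
# the criterion checks are simplified stand-ins for the researchers' judgment.
from dataclasses import dataclass


@dataclass
class Publication:
    title: str
    abstract: str
    language: str = "english"


def remove_duplicates(publications):
    """s2: drop records indexed by more than one digital library (same normalized title)."""
    seen, unique = set(), []
    for pub in publications:
        key = "".join(pub.title.lower().split())  # normalize case and spacing
        if key not in seen:
            seen.add(key)
            unique.append(pub)
    return unique


def first_filter(publications):
    """s3: apply abstract-level criteria; ic1 itself is a human judgment."""
    kept = []
    for pub in publications:
        if not pub.abstract:            # ec1: the publication does not have an abstract
            continue
        if pub.language != "english":   # ec3: the publication is not written in english
            continue
        # rough screening aid for ic1 (addresses km in the hci design context);
        # a keyword hit only flags a candidate for human review, never decides.
        if "knowledge" in pub.abstract.lower():
            kept.append(pub)
    return kept


candidates = [
    Publication("a km tool for ui design", "applies knowledge management to hci design."),
    Publication("A KM Tool for UI Design", "duplicate record from another library."),
]
print(len(first_filter(remove_duplicates(candidates))))  # -> 1
```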
throughout this and the next section, we refer to the approaches by the id listed in the table. after table 2, we present the data synthesis for each research question. further information about the selected publications, including detailed extracted data, can be found at http://bit.ly/mapping-km-in-hci-design.

table 2. selected publications.

#01 - trading off usability and security in user interface design through mental models: proposes the development of an organizational mental model through knowledge transfer and transformation, using collaborative brain power from various knowledge constellations to design. (mohamed et al., 2017)
#02 - knowledge management challenges in collaborative design of a virtual call centre: proposes a knowledge-based system with the following functionalities: (a) storing design primitives and formal knowledge in an online library; (b) preserving procedures and rules that proved successful in past design problems; (c) formal modeling of knowledge elements that might be applicable for usability improvements; (d) providing multiple mechanisms for knowledge acquisition, preservation, transfer and sharing. (sikorski et al., 2011)
#03 - applying knowledge management in ui design process: defines a process to automate the transformation of a task description into an interaction description. first, it identifies and uniformizes existing knowledge about the ui design process using knowledge classification techniques. then, captured knowledge is represented in the form of ontologies, deriving a task metamodel and an interaction metamodel. this extracted knowledge is integrated into design by defining a transformation of task description into interaction description, using an intermediate model between them and a two-step transformation. (suàrez et al., 2004)
#04 - a knowledge management tool for speech interfaces: proposes a knowledge-based system to help developers of speech-driven interfaces learn from previous design solutions. these solutions are collected, made accessible and divided into categories regarding their content type. solutions with corresponding structures are clustered and compared within their own category, providing designers with a suggestion mechanism based on their desired kind of solution. there is also a ranked suggestion mechanism of design elements based on available design material and design guidelines. (bouwmeester, 1999)
#05 - design knowledge reuse based on visualization of relationships between claims: presents a tool that aims to improve design and knowledge acquisition by exploring relationships between claims. it allows a better search and retrieval mechanism for a design knowledge repository, which is obtained by applying km strategies (generalize, classify, store, retrieve) to claims. (wahid, 2006; wahid et al., 2004)
#06 - design knowledge reuse and notification systems to support design in the development process: presents a system connected to a design knowledge repository based on claims. it allows teams to leverage knowledge from previous design efforts by searching for reusable claims relevant to their current project and to extend the repository by updating existing claims and creating new ones. (chewar et al., 2004; chewar & mccrickard, 2005; j. l. smith et al., 2005)
#07 - exploring knowledge processes in user-centered design process: proposes a conceptual framework that guides the design process based on five propositions: (1) designers and users should be actively included as actors in the process, since they both have the knowledge needed for a successful design; (2) the knowledge they possess is context-specific; (3) there is useful knowledge that has not been articulated by either users or designers; therefore, (4) the knowledge processes by which users and designers transform tacit knowledge into explicit knowledge are linked and should be combined; and finally, (5) the resulting knowledge obtained along the process is embedded into concepts, products or services. (still, 2006)
#08 - lessons learnt from an hci repository: describes the implementation of a knowledge repository using windows help files. it is maintained by a group within the organization that receives content updates from the team and properly inserts the new material into the repository. new versions are released from time to time and distributed as physical copies to be installed on each computer. (wilson & borras, 1998)
#09 - a pattern language approach to usability knowledge management: presents a km system that uses principles of use case writing and pattern languages to describe problems found in user testing sessions and the solutions that followed. patterns can be retrieved through forms with filters, text search and database queries. filters include goals and subgoals, useful respectively to show all problems related to a specific user goal and their possible solutions, and to provide insights into what interactions or devices have been problematic regardless of user goal. (hughes, 2006)
#10 - an expert system for usability evaluations of business-to-consumer e-commerce sites: proposes a knowledge-based system to help with e-commerce usability evaluations. a knowledge engineer is responsible for acquiring and representing knowledge, eliciting knowledge from textual, non-live sources of expertise about design guidelines that affect the usability of 11 e-commerce elements. the elicited knowledge is consolidated and presented in the form of rules in the expert system. (gabriel, 2007)
#11 - a framework for developing experience-based usability guidelines: presents a km system to manage design guidelines contextualized by usability examples. the system allows designers to describe their current problems and requirements and then search for cases with similar characteristics. they can also follow hyperlinks to more general guidelines, which in turn point to other cases, search from a hierarchically arranged list of guidelines, and follow other related guidelines and cases. the system is initially seeded with organization-wide usability guidelines and is updated as new projects are developed. (henninger et al., 1995)
#12 - prototype evaluation and redesign: structuring the design space through contextual techniques: proposes a method based on contextual inquiry and brainstorming to identify usability issues in interface evaluations and derive proper design solutions for them. first, interface evaluation sessions are conducted with users, who share their perceptions while interacting with a high-fidelity prototype of the system. those sessions are recorded and, later, relevant comments are transcribed into usability flaws.
afterwards, brainstorming meetings are held in which developers, designers and hci specialists propose design solutions to the previously identified usability flaws. (a. smith & dunckley, 2002)

publication year and type (rq1): figure 2 shows the distribution of the 15 selected publications over the years and by publication type. papers addressing km in the hci design context have been published since 1995 in journals and conferences (no workshop publications were found). conferences have been the main forum, encompassing 73.3% of the publications (11 out of 15); four papers (26.7%) were published in journals.

figure 2. publications over the years.

the venue of each selected publication was also analyzed to investigate whether it was more related to hci, km or software engineering (se). table 3 summarizes the venues of the selected publications and indicates their main focus. figure 3 presents the distribution of venue orientation across the publications: 53.3% of the publications (8 out of 15) were published in hci venues, and the remaining publications are divided into km (26.7%) and se (20.0%) venues.

table 3. venue orientation of the selected publications.

(mohamed et al., 2017) - behaviour & information technology - hci
(sikorski et al., 2011) - international conference on knowledge-based and intelligent information and engineering systems - ai
(wahid, 2006) - conference on designing interactive systems - hci
(suàrez et al., 2004) - conference on task models and diagrams - hci
(bouwmeester, 1999) - international acm sigir conference on research and development in information retrieval - information retrieval
(j. l. smith et al., 2005) - ieee international conference and workshops on engineering of computer-based systems - software engineering
(chewar et al., 2004) - international conference on computer-aided design - design
(wahid et al., 2004) - ieee international conference on information reuse and integration - data science
(chewar & mccrickard, 2005) - hawaii international conference on system sciences - information systems
(still, 2006) - european conference on knowledge management - km
(wilson & borras, 1998) - international journal of industrial ergonomics - hci
(hughes, 2006) - journal of usability studies - hci
(gabriel, 2007) - isoneworld conference - information systems
(henninger et al., 1995) - dis conference on designing interactive systems: processes, practices, methods, and techniques - hci
(a. smith & dunckley, 2002) - interacting with computers - hci

figure 3. venue orientation of the selected publications.

research type (rq2): figure 4 presents the classification of the research types (according to the classification proposed in wieringa et al. (2005)) reported in the 15 selected publications. 13 publications (86.7%) propose a solution to a problem and argue for its relevance; thus, they were classified as proposal of solution. five of them (33.3%) also present some kind of evaluation: one (6.7%) was evaluated in practice (i.e., also classified as evaluation research), and four (26.7%) investigate the characteristics of a proposed solution not yet implemented in practice (i.e., validation research). one publication (6.7%) refers exclusively to evaluation research, discussing the evaluation of km in an industrial setting, and another is a personal experience paper, reporting the experience of the authors in a particular project in industry.
figure 4. research type of the identified publications.

motivation for using km in hci design (rq3): we identified six reasons for using km in hci design, as shown in table 4. some approaches presented more than one motivation, so the total sum is greater than 12.

table 4. motivations for using km in hci design.

improve product quality - #01, #02, #04, #05, #06, #07, #10, #11, #12 - 9
reduce design effort - #02, #03, #08, #09, #10 - 5
reduce design time - #04, #05, #08 - 3
reduce design cost - #05, #10 - 2
improve design team performance - #06 - 1
improve hci design learning - #06 - 1

nine approaches (75%) use km to improve product quality, most of them concerning usability. these approaches aim to provide benefits related to the quality of the interactive system in terms of its interaction with users. for example, approach #11 is proposed to help developers design effective, useful and usable applications; approach #01, in turn, aims to improve the alignment between design features and users' requirements. seven approaches (58.3%) are motivated by improving one or more aspects related to the hci design process, namely effort, time and cost; among these, reducing effort stands out. five approaches (41.7%) use km to reduce design effort, mainly by not depending on internal usability experts to perform hci design activities. approach #02, for example, applied km to decrease the need for experts to support the design team with their knowledge and experience, given the lack of knowledge available for reuse. approaches #04, #05 and #08 were motivated by reducing hci design time through the reuse of previous solutions implemented for similar problems. reducing costs in the hci design process was the motivation for approaches #05 and #10, which focus on minimizing the involvement of external usability experts in the process and conducting usability evaluation more effectively. approach #06 aimed to improve design team performance by providing support for team coordination and collaboration; this approach also aimed to improve hci learning for the students involved in the project.

managed knowledge in hci design (rq4): analyzing the publications, we identified 24 different types of knowledge items managed by the km approaches, as shown in table 5 (some items are shown on the same line to save space). the most common knowledge items have been design guidelines and design solutions, each addressed by four approaches, followed by test results, addressed by three approaches. we noticed that, in the context of hci design, km approaches have dealt with only one (#10) or two (#01, #03, #05, #06, #09, #11 and #12) different knowledge items.

table 5. managed knowledge items.

design guidelines - #04, #08, #10, #11 - 4
design solutions - #02, #04, #07, #08 - 4
test results - #02, #04, #12 - 3
claims - #05, #06 - 2
design features - #01, #12 - 2
design patterns - #09, #11 - 2
lessons learned - #04, #08 - 2
usability measures - #02, #08 - 2
claims relationships - #05 - 1
design changes - #06 - 1
design feature checklists; design methods; design processes; design standards; design templates; interface objects - #08 - 1
interaction model; task model - #03 - 1
scenarios; test scenarios - #02 - 1
user knowledge; user needs - #07 - 1
user requirements - #01 - 1
user tasks - #09 - 1

we identified four different hci aspects addressed by the identified km approaches. the main aspect is usability, which is treated in all the identified approaches.
two approaches (#03 and #08) also address ergonomics. #03 and #04 focus on particular types of design or interfaces: the former focuses on task-based design, the latter on speech-driven interfaces. figure 5 shows the hci aspects addressed in the identified km approaches; the sum exceeds 12 because some approaches address more than one aspect.

figure 5. hci aspects addressed in km approaches.

when knowledge is captured and used (rq5): table 6 shows when hci design knowledge has been captured and when it has been used along the hci design process. three approaches capture and use knowledge throughout the whole process. eight approaches (66.7%) use knowledge when producing design solutions, while a smaller number (six, 50%) capture knowledge in this activity. the behavior is the opposite in design evaluation: more approaches capture (five, 41.7%) than use (three, 25%) knowledge in this activity. only one approach (8.3%) captures knowledge during requirements specification.

table 6. capture and use of knowledge along the hci design process (activities from iso, 2019).

specify requirements - knowledge capture: 1 (#01) - knowledge use: 0
produce design solutions - knowledge capture: 6 (#02, #03, #04, #07, #10, #11) - knowledge use: 8 (#01, #02, #03, #04, #07, #09, #11, #12)
design evaluation - knowledge capture: 5 (#02, #04, #09, #10, #12) - knowledge use: 3 (#02, #09, #10)
whole cycle - knowledge capture: 3 (#05, #06, #08) - knowledge use: 3 (#05, #06, #08)

technologies used in km approaches (rq6): table 7 shows the technologies (systems, methods, tools, theories, etc.) used in the analyzed km approaches. the most common technologies were knowledge-based systems and knowledge repositories, each used in three approaches. for example, #04 proposes a knowledge-based system to help developers of speech-driven interfaces learn from previous design solutions, while #08 proposes the implementation of a knowledge repository using windows help files. knowledge management systems and knowledge-based analysis were used in two approaches each. a knowledge management system is proposed in #09, to describe problems detected in user test sessions and the respective solutions, and in #11, to describe design problems and requirements and then search for usability examples with similar characteristics and hyperlinks to more general related guidelines. knowledge-based analysis, in turn, was used in #03 and #07 combined with other technologies, such as ontology and model transformation (#03) and a conceptual framework (#07). other technologies, such as brainstorming, contextual inquiry, heuristic evaluation and mental models, were used in only one km approach each.

table 7. technologies used in km approaches in the hci design context.

knowledge-based system - #02, #04, #10 - 3
knowledge repository - #05, #06, #08 - 3
knowledge management system - #09, #11 - 2
knowledge-based analysis - #03, #07 - 2
ontology; model transformation - #03 - 1
conceptual framework - #07 - 1
contextual inquiry; brainstorming-based technique - #12 - 1
mental model; internalization awareness; observation; behavioral interviews; absorptive capacity; heuristic evaluation - #01 - 1

benefits and challenges of using km in hci design (rq7): table 8 summarizes the benefits and difficulties reported in the publications. two approaches (#04 and #10) did not report any benefit or challenge in using km in hci design. considering the 10 other approaches, in general more benefits than difficulties were reported. the most reported benefit was enabling the replicability of domain or context knowledge.
for example, #07 reached wide applicability because of the common conceptualization proposed as a conceptual framework. on the other hand, the most reported difficulty was that knowledge is often too specific for a given context; for example, #11 states that the approach is best suited for contexts in which common customer needs are being addressed in similar application domains.

table 8. benefits and difficulties of using km in the hci design context.

benefits:
enable replicability of domain/context knowledge - #03, #06, #07, #09, #12 - 5
improve product quality - #02, #05, #06, #12 - 4
improve communication - #01, #03, #11 - 3
increase team engagement/empowerment - #02, #06 - 2
increase organizational integration - #03, #08 - 2
reduce design effort - #03, #12 - 2
improve design conceptualization - #03, #07 - 2
promote standardization - #02 - 1
increase productivity - #11 - 1
promote organizational competitive advantage - #02 - 1
decrease implementation and maintenance effort - #08 - 1
decrease implementation and maintenance costs - #08 - 1

difficulties:
knowledge is often context-specific - #02, #06, #09, #11 - 4
issues related to features of the km technologies - #05, #06, #09 - 3
low team engagement/empowerment - #01, #05, #08 - 3
user involvement - #07, #12 - 2
integration of the km approach into the organization - #06, #11 - 2
km implementation and maintenance effort - #08, #09 - 2
lack of consensus about hci design conceptualization - #01, #02 - 2

3.3 discussion

taking the period of publications into account (rq1), we can notice a long-term effort regarding the use of km in hci design, since this topic has been targeted by researchers for more than 20 years. however, the low average of publications per year (0.6 since 1995) shows that the topic has not been widely addressed, and most of the publications are from the 2000s. the low percentage of journal publications, which generally require more mature work, reinforces the impression that research on this topic is not yet mature. besides, the results about research type (rq2) show that only 40% of the works included some kind of evaluation, with only 13% evaluating solutions in practice. this can be a sign of difficulty in applying the proposed approaches in industry, which again suggests that research on this topic is not mature enough and that there seems to be a gap between theory and practice.

concerning rq3, we can notice that the use of km in hci design has been motivated mainly by delivering better products to users or optimizing the hci design process in terms of effort, time and cost. improving the performance of the hci design team was also mentioned, which is consistent with the other motivations related to the hci design process, since increasing performance can contribute to decreasing effort, time and cost. by analyzing the results of the approaches that applied some validation or evaluation, we noticed that only two (#03 and #12) provided results related to the initial motivation for using km in hci design (reducing design effort and improving product quality, respectively); the other publications focused more on validating or evaluating features or functionalities of the proposed solutions. a common concern in several publications was the need for hci design expert consultants, which can increase hci design cost and effort.
3.3 discussion

taking the period of publications into account (rq1), we can notice a long-term effort regarding the use of km in hci design, since this topic has been targeted by researchers for more than 20 years. however, the low average of publications per year (0.6 since 1995) shows that the topic has not been widely addressed. we can also notice that most of the publications are from the 2000s decade. the low percentage of journal publications, which generally require more mature work, reinforces that research on this topic is not mature enough yet. besides, results about the research type (rq2) show that only 40% of the works included some kind of evaluation, with only 13% evaluating solutions in practice. this can be a sign of difficulty in applying the proposed approaches in industry, which again suggests that research on this topic is not mature enough yet and that there seems to be a gap between theory and practice.

concerning rq3, we can notice that using km in hci design has been motivated mainly by delivering better products to users or optimizing the hci design process in terms of effort, time and cost. improving the performance of the hci design team was also mentioned, which is consistent with the other motivations related to the hci design process, since increasing performance can contribute to decreasing effort, time and cost. by analyzing the results of approaches that applied some validation or evaluation, we noticed that only two (#03 and #12) provided results related to the initial motivation for using km in hci design (reduce design effort and improve product quality, respectively). the other publications were more focused on validating or evaluating features or functionalities of the proposed solutions.

a common concern in several publications was the need for hci design expert consultants, which can increase hci design cost and effort. capturing and reusing knowledge contribute to retaining organizational knowledge and reducing dependence on external consultants. another concern refers to communication problems. a. smith & dunckley (2002) highlight that barriers to effective communication between designers, hci specialists and users, due to their differing perspectives, affect product quality. km solutions are helpful in this context.

usability has been the focus of the km initiatives in the hci context (rq4). in fact, this is not a surprise, because usability has been one of the most explored hci aspects in recent years. moreover, this property is quite comprehensive and includes other important aspects of hci design, such as learnability, memorability, efficiency, safety and satisfaction (iso, 2019). however, there are other important properties not addressed in the analyzed papers, such as user experience, communicability and accessibility.

the knowledge items managed by the km approaches are quite diverse. design solutions, guidelines, test results and design patterns are some knowledge items found in different publications. despite the variety of knowledge items, we noticed that most of the approaches (66.7%) manage up to two different knowledge items. by analyzing the coverage of the approaches in terms of single or multiple projects, we found that four approaches (#01, #03, #07 and #12) manage knowledge involved in a single project, while the other eight are more extensive, accumulating knowledge from multiple projects. in order to elevate knowledge reuse to the organizational level, a km approach must comprehend multiple projects in the organization.

concerning knowledge use and capture (rq5), at first we expected knowledge to be captured and used in the same activity of the hci design process. however, the results showed us that the same knowledge can be produced and consumed in different parts of the hci design process. for example, there are more approaches capturing knowledge in the design evaluation activity than using it. this reinforces the iterative character of hci design, where knowledge obtained in the evaluation activity of one cycle can be used to improve the design in the next cycle.

different technologies have been used to implement km in the hci design context (rq6). the most common are system-based approaches that use software to support the km process and store knowledge. we expected this result because km systems, knowledge-based systems and knowledge repositories are widely adopted technologies in the km area. on the other hand, only two approaches use specific hci techniques, namely contextual inquiry and heuristic evaluation. this may indicate that traditional km approaches are suitable for addressing km problems in hci design (which was indeed expected) and that hci techniques can be used to address specificities of the hci design domain. earlier steps of the development of km solutions, such as knowledge analysis and modeling, are also addressed in some publications. moreover, there is also concern with later steps, like the integration of the km system into the organization. some approaches combine different technologies, which can be a sign that using different techniques is a good strategy for building a more complete km approach in hci design.
as for the benefits and challenges of using km in the hci design context (rq7), when categorizing the findings, we noticed that several of them are benefits and challenges of using km in general. however, by analyzing the context of each km approach, we can better understand how the findings relate to hci design. for example, regarding the benefit improve communication, the works highlight the use of km to support communication among the different actors involved in the hci design process. in #10, communication between hci specialists, designers and users is mediated by prototypes, aiming at an agreement about the system design. in #01, km facilitates the elicitation of the user's knowledge so that the designer can apply it to the design. in #03, km reduces errors of interpretation and contextualization among the people involved in the system design.

some of the identified challenges and benefits are opposites of each other. for example, there is the challenge of low team engagement on one hand and the benefit of increasing team engagement on the other. we kept both because they were cited in different publications, and thus under different perspectives. moreover, we can see the challenge as a difficulty that, when overcome by the use of km, can be turned into a benefit.

by analyzing the most cited benefits and challenges, we noticed that the generality level of the knowledge is an important question in a km approach. the most cited benefit points to knowledge replicability in a specific context/domain. the most cited challenge points to the fact that it is difficult to generalize knowledge. looking at data from rq5, we noticed that approaches handling knowledge from multiple projects reported the knowledge generalization challenge, while approaches handling knowledge in a single project reported easy replication of knowledge. thus, the generality level of knowledge should be determined by the context where the km approach will be applied. when dealing with a high diversity of knowledge and contexts, it becomes harder to produce general knowledge that can be widely used to solve specific problems and be adopted in different contexts. one way of achieving improvements in replicability is using knowledge-based analysis methods, as reported by approaches #03 and #07.

based on the panorama provided by the mapping study results, in summary, we can say that km has not been much explored in the hci context; it has been used mainly to improve software quality and hci design process efficiency; it has focused on usability; and the km approaches have been based on systems and repositories. as for benefits, km has enabled knowledge replicability and improved product quality and communication. the main difficulties have been generalizing knowledge, dealing with issues related to features of the km technologies, and low team engagement.

3.4 threats to validity

as with any study, our mapping study has some limitations that must be considered together with the results. following the classification presented by petersen et al. (2015), we next discuss the main threats to the mapping study results.

descriptive validity is the extent to which observations are described accurately and objectively. to reduce descriptive validity threats, a data collection form was designed to support data extraction and recording. the form made the data collection procedure more objective and could always be revisited. however, data extraction and recording still involved some subjectivity and were dependent on the researcher's decisions.
an important limitation in this sense is related to the classifications we made. we defined classification schemas for categorizing data for some research questions. some categories were based on classifications previously proposed in the literature (e.g., type of research (wieringa et al., 2005)). others were established during data extraction, based on data provided by the analyzed publications (e.g., rq4). aiming to minimize this threat, data extraction, classification schemas and data categorization were done by the first and second authors and reviewed by the other two authors. discordances were discussed and resolved. however, determining the categories and how data fit them involves a lot of judgment. thus, different results could be obtained by other researchers.

theoretical validity is determined by the researcher's ability to capture what is intended to be captured. in this context, one threat refers to the sources. we used four digital libraries selected based on other secondary studies in software engineering. although this set of digital libraries represents a comprehensive source of publications, the exclusion of other sources may have left some valuable publications out of our analysis. acm was not included in the sources because scopus covers most of its publications. however, there are hci publications indexed by acm and not indexed by scopus, which may have jeopardized the mapping results. to minimize this risk, we performed snowballing. another threat refers to the fact that the study focused on scientific literature and did not include other alternatives, such as grey literature, that could enhance the systematic mapping coverage. hence, extending this study with a multivocal literature review through grey literature analysis could complement and enrich the obtained results.

there are also limitations related to the adopted search string. even though we used several terms, there are still synonyms that we did not use. for example, since km is a subjective area, many publications may have addressed km aspects using other words, such as "collaboration" and "organizational learning", which were not covered by our search string (a sketch below illustrates how such synonyms could be appended to a search string). moreover, we did not include the hci and km acronyms alone (hci was combined with "design"), which could be an additional threat. however, the string includes the full terms referring to hci and km, and we believe it is probable that publications including the acronyms also include the full terms in their title, abstract or keywords. hence, our search string might have covered them anyway.

researcher bias in publication selection, data extraction and classification is also a threat to theoretical validity. to minimize this threat, as previously said, the steps were initially performed by the first and second authors and, to reduce subjectivity, the other two authors performed these same steps. discordances and possible biases were discussed until reaching a consensus.

finally, interpretive validity is achieved when the drawn conclusions are reasonable given the data obtained. the main threat in this context is researcher bias in data interpretation. to minimize this threat, as in the other steps, interpretation was performed by the first and second authors and reviewed by the other two. discussions were carried out until a consensus was reached. however, qualitative interpretation and analysis still involve subjectivity.
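to make the search-string threat concrete, the following sketch shows how the missing synonyms discussed above could be appended to a title-abstract-keyword query. the term lists are illustrative assumptions, not the study's actual search string, and the boolean syntax is generic rather than tied to any particular digital library:

```python
# illustrative composition of an extended boolean search string; the km_terms
# list adds the synonyms ("collaboration", "organizational learning") that the
# original string did not cover.
km_terms = ['"knowledge management"', '"organizational learning"', '"collaboration"']
hci_terms = ['"human-computer interaction design"', '"human computer interaction design"',
             '"interaction design"', '"user interface design"']

query = f'({" OR ".join(km_terms)}) AND ({" OR ".join(hci_terms)})'
print(query)
# ("knowledge management" OR "organizational learning" OR ...) AND ("human-computer interaction design" OR ...)
```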
even though we have treated many of the identified threats, the adopted treatments involved human judgment; therefore, the threats cannot be eliminated and must be considered together with the study results.

4 survey: km in hci design practice

the systematic mapping provided information about km approaches to support hci design according to the literature. after conducting the mapping study, we performed a survey with 39 brazilian hci design practitioners to investigate km in hci design practice. a survey is an experimental investigation method usually applied after the use of some technique or tool has already taken place (pfleeger, 1994). surveys are retrospective, i.e., they capture an "instant snapshot" of a situation. questionnaires and interviews are the main instruments used to apply a survey, collecting data from a representative sample of the population. the resulting data are analyzed, aiming to draw conclusions that can be generalized to the whole population represented by that sample (mafra & travassos, 2006). in this work, we intended to reach many participants and analyze data objectively and quantitatively. thus, in our survey, we decided to use a questionnaire containing objective questions.

we followed the process defined in wohlin et al. (2012), which comprises five activities. scoping is the first step, where we scope the study problem and establish its goals. planning comes next, where the study design is determined, the instrumentation is considered and the threats to the study conduction are evaluated. operation follows from the design, consisting of collecting data, which are then analyzed and evaluated in analysis and interpretation. finally, in presentation and package, the results are communicated. next, in section 4.1, we present the survey planning and execution. section 4.2 concerns the survey results. section 4.3 discusses the results, and section 4.4 presents threats to validity.

4.1 survey planning and execution

the study goal was to investigate aspects related to km in hci design practice. aligned with this goal, we defined the research questions presented in table 9, which were based on the systematic mapping research questions and results.

table 9. survey: research questions and their rationale.
id | research question | rationale
rq1 | which stakeholders have been involved in hci design practice? | identify which stakeholders have been involved in hci design practice, which helps identify different perspectives and information needs in hci design.
rq2 | which knowledge has been involved in hci design practice? | investigate which knowledge has been involved in hci design practice, particularly knowledge items (e.g., design solutions, guidelines and lessons learned) and design artifacts (e.g., wireframes, mockups and prototypes) used as sources of knowledge or produced to record useful knowledge.
rq3 | which hci design activities have demanded better km support? | investigate which hci design activities have needed better km support (e.g., because there have not been enough knowledge resources to support their execution).
rq4 | how has km been applied in hci design practice? | investigate how km principles have been applied and identify technologies (e.g., tools, methods, etc.) that have been used to support knowledge access and storage in hci design practice.
rq5 | which benefits and difficulties have been noticed when using km in hci design practice? | identify benefits and difficulties that have been experienced by practitioners when applying km in hci design practice and verify whether practitioners have experienced more benefits or difficulties.
rq6 | which goals has the use of km in hci design practice contributed to achieving? | identify which goals the use of km in hci design has contributed to, aiming to figure out the predominant reasons for using km in hci design practice.

the participants were 39 brazilian professionals with experience in hci design of interactive software systems. the participants' profile was identified through questions regarding their current job positions, education level, knowledge of hci design and practical experience in hci design activities. most participants (79.5%) declared playing roles devoted to hci design activities (nine ux/ui designers, six ux designers, four product designers, two designers, two ux research designers, one art director, one it analyst & ux designer, one interaction designer, one lead designer, one lead ui designer, one staff product designer and one ui designer). the others (20.5%) play roles that perform some activities related to hci design (one programmer, one requirement analyst, one chief growth officer, one product owner, one it analyst, one it manager, one marketing manager and one project leader). although professionals in these roles cannot be considered hci design experts, we did not exclude these participants because they declared practical experience and knowledge in hci design (probably acquired in previous job and academic experiences). moreover, even playing roles not dedicated to hci design, they are often involved in hci design in some way.

eight participants (20.5%) had master's degrees, 26 (66.7%) had bachelor's degrees, and five (12.8%) had not yet finished bachelor's degree courses. all participants declared theoretical knowledge of hci design. four of them (10.3%) declared low knowledge (i.e., self-taught knowledge acquired through books, videos or other materials). 16 participants (41%) declared medium knowledge, acquired mainly during courses or undergraduate research. finally, 19 participants (48.7%) declared high knowledge (i.e., they are experts or have a certification, master's or ph.d. degree related to hci design). some areas of the courses cited by participants who declared medium or high knowledge are design (46.2%), computer science (38.5%), arts (28.2%), social communication (15.4%) and user experience (7.7%). the participants were allowed to choose more than one option, hence the sum of the values is over 100%. other areas, such as anthropology, neuroscience, information science and psychology, were also mentioned by one participant each. 26 participants (66.7%) declared more than three years of experience in hci design practice, 11 participants (28.2%) declared between one and three years, and two (5.1%) declared less than one year.

the instrument used in the study was a questionnaire composed of 10 objective questions. most answer options for each question were defined based on the mapping study results. for example, when asked about the goals achieved with the help of km in hci design (rq6), the options provided to the participants refer to the goals we found in the mapping study.
however, some options were rewritten in a way that could enhance participants' understanding (e.g., we changed "test results" to "previous design evaluation results" in rq2) and others were added based on the authors' knowledge and experience (e.g., we included forums, blogs and social networks in rq4). furthermore, most questions also allowed the participant to provide additional information in text boxes to complement their answers. for example, besides selecting goals from the list provided in the question related to rq6, the participants were also allowed to include new goals in their answers. the questionnaire is available at http://bit.ly/questionnaire-km-in-hci-design.

the procedure adopted in the study consisted of sending the invitation to participate in the study, receiving the answers, verifying them, and consolidating and analyzing the data. the invitation was posted in discussion groups on facebook, linkedin and the interaction design foundation's website (https://www.interaction-design.org). the authors also sent the invitation by email to potential participants. since the platforms did not inform how many people visualized the posts, we could not infer the percentage of invites that led to answers. before sending the invitation, we performed a pilot with three participants. considering the participants' feedback, we improved the questionnaire, aiming to ensure that the questions were clear and understandable. the invitation to participate in the study was posted on social media and sent by email on december 16th, 2020. we received answers until january 11th, 2021. we received 40 answers to the questionnaire; however, after analyzing the participants' profiles related to hci design knowledge and experience, we excluded one participant who reported low knowledge of and experience with hci design and did not answer some of the questions. after that, each provided answer was verified, and the data were consolidated and analyzed against the research questions.

4.2 results

in this section, we present the data synthesis for each research question.

stakeholders involved in hci design practice (rq1): aiming to identify stakeholders involved in hci design practice, we asked the participants to identify the stakeholders they directly interact with in their hci design practice. as can be seen in table 10, developer has been the most common stakeholder involved in hci design practice, being mentioned by 37 participants (94.9%). following that, project manager, designer, user and client were mentioned, respectively, by 34 (87.2%), 33 (84.6%), 27 (69.2%) and 26 (66.7%) participants. product owner was cited by three participants (7.7%) and others (business analyst, customer experience analyst, data analyst, hr people, product manager and scrum master) were mentioned only once.

table 10. stakeholders involved in hci design practice.
stakeholder | number of participants | %
developer | 37 | 94.9%
designer | 34 | 87.2%
project manager | 33 | 84.6%
client | 27 | 69.2%
user | 26 | 66.7%
product owner | 3 | 7.7%
business analyst | 1 | 2.6%
customer experience analyst | 1 | 2.6%
data analyst | 1 | 2.6%
hr people | 1 | 2.6%
product manager | 1 | 2.6%
scrum master | 1 | 2.6%
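percentages like the ones in table 10 come from multi-select questions, so they sum to more than 100%. a minimal sketch of the tabulation follows; the per-participant answer lists are hypothetical placeholders, not the survey's raw data:

```python
# hypothetical tabulation of multi-select survey answers, reproducing the kind
# of percentages reported for rq1 (e.g., 37 of 39 participants -> 94.9%).
from collections import Counter

N = 39  # questionnaire respondents after exclusion
answers = [  # one list of selected stakeholders per participant (illustrative)
    ["developer", "designer", "project manager"],
    ["developer", "user", "client"],
    # ... remaining 37 participants
]

counts = Counter(s for selected in answers for s in selected)
for stakeholder, n in counts.most_common():
    # with the full data this would print, e.g., "developer: 37 (94.9%)"
    print(f"{stakeholder}: {n} ({n / N:.1%})")
```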
knowledge involved in hci design practice (rq2): first, the participants were asked about the knowledge items they use or produce during hci design activities. we consider as knowledge items pieces of knowledge that can be useful in hci design, such as lessons learned, standards, guidelines and patterns. figure 6 presents the results of this question. some items have been used and produced by a high number of participants: organizational design standards (used by 34 participants, 87.2%, and produced by 26 participants, 66.7%), lessons learned (used by 34 participants, 87.2%, and produced by 24 participants, 61.5%), guidelines (used by 34 participants, 87.2%, and produced by 22 participants, 56.4%) and libraries of design components or elements (used by 32 participants, 82.1%, and produced by 23 participants, 59%). other knowledge items have also been used by many participants but produced by a smaller number, such as examples (used by 34 participants, 87.2%, and produced by 14 participants, 35.9%), design solutions from the organization (used by 35 participants, 89.7%, and produced by 18 participants, 46.2%) and design solutions from outside the organization (used by 35 participants, 89.7%, and produced by 11 participants, 28.2%). in general, hci design practitioners have used and produced different knowledge items (11.1 and 6.6 on average, respectively).

figure 6. knowledge items used and produced in hci design practice.

the participants were also asked about the design artifacts they use or produce during hci design activities. we use the term design artifact to refer to documents, models, prototypes and others that record information about the design solution. figure 7 shows the results. user requirements, scenarios and interaction models were the most cited artifacts used during hci design. on the other hand, wireframes, functional prototypes and mockups were the most cited artifacts produced during hci design.

figure 7. design artifacts used and produced in hci design practice.

we also asked the participants whether the artifacts they use and produce sufficiently provide all the information needed to describe the hci design solution (i.e., whether the knowledge recorded in the artifacts is enough for the implementation and evaluation of the solution). 26 participants (66.7%) answered "yes" and 13 (33.3%) answered "no". eight of the 13 participants pointed out that they missed information about personas, user research data and usability tests. these 13 participants were also asked about the ways the missing information is communicated. the results are presented in table 11. annotations and talks have been the most used ways (eight participants, 61.5%) to complement the information provided in design artifacts. seven participants (53.9%) reported the use of meetings, while one used documentation or specific tools. the participants indicated that annotations and talks have been used informally, while meetings, documentation and tools have been used systematically, following organizational practices.

table 11. ways to obtain missing information.
method | number of participants | %
annotations | 8 | 61.5%
talks | 8 | 61.5%
meetings | 7 | 53.9%
documentation or tool | 1 | 7.7%
none | 1 | 7.7%

hci design activities demanding better km support (rq3): taking the hci design activities established by iso 9241-210 (iso, 2019) as a reference, the participants were asked to judge whether the knowledge resources (e.g., knowledge items, artifacts) used by them have provided sufficient knowledge to support each activity. figure 8 presents the results.
in general, most participants consider that they have access to enough knowledge to perform hci design activities. produce design solutions has the highest number of participants (31 participants, 79.5%) reporting having had sufficient knowledge to perform it. on the other hand, evaluate design solutions has the highest number of participants (10 participants, 25.6%) declaring that the available knowledge has not been enough. sixteen participants (41%) declared not having had sufficient knowledge to support at least one hci design activity. they pointed out that, in order to address the lack of knowledge, they have performed user research, searched for successful use cases, talked to stakeholders, and looked at the literature.

figure 8. available knowledge to support hci design activities.

how km has been applied in hci design practice (rq4): figure 9 shows the approaches that have been used to support knowledge access or storage in hci design practice. brainstorming and blogs have been the most used ways to access knowledge (28 participants, 71.8%), followed by mental models and electronic documents and spreadsheets (26 participants, 66.7%). except for blogs, these have also been the most used ways to store knowledge: brainstorming has been used by 27 participants (69.2%); mental models and electronic documents and spreadsheets by 24 (61.5%). ontologies have been the least used way: only 7 participants (18%) have used ontologies to access knowledge and 5 participants (12.8%) have used them to store knowledge. concerning knowledge storage, social networks (6 participants, 15.4%) and forums (8 participants, 20.5%) have also not been much used. in general, the approaches shown in figure 9 have been used more to support knowledge access than to support knowledge storage.

figure 9. approaches to support knowledge access and storage in hci design.

benefits and difficulties of using km in hci design practice (rq5): 34 participants (87.2%) reported performing km practices to support hci design activities. 16 of them (41.0%) have followed institutionalized organizational practices, while 18 (46.2%) have performed km on their own initiative. these 34 participants were asked about the benefits and difficulties they have perceived in using km to support hci design. the results are summarized in table 12 and table 13.

table 12. benefits of using km in hci design practice.
benefit | number of participants | %
enable replicability of domain or context knowledge | 27 | 79.4%
promote standardization | 26 | 76.5%
improve communication | 25 | 73.5%
increase productivity | 24 | 70.6%
reduce design effort | 24 | 70.6%
improve product quality | 23 | 67.6%
improve design conceptualization | 20 | 58.8%
improve team learning | 18 | 52.9%
reduce dependency on specialists | 18 | 52.9%
increase team engagement or empowerment | 17 | 50.0%
increase organizational integration | 16 | 47.1%
reduce design cost | 16 | 47.1%
promote organizational competitive advantage | 11 | 32.4%

table 13. difficulties of using km in hci design practice.
difficulty | number of participants | %
low team engagement or empowerment | 16 | 47.1%
km implementation and maintenance effort | 15 | 44.1%
integration of the km approach into the organization | 15 | 44.1%
lack of consensus about hci design conceptualization | 14 | 41.2%
find relevant knowledge to a given context | 13 | 38.2%
low user involvement | 9 | 26.5%
issues related to features of the km technologies | 8 | 23.5%
unclear business model | 1 | 2.9%

goals to which the use of km in hci design practice has contributed (rq6): aiming to identify the predominant reasons for using km in hci design practice, the participants were asked how much km support to hci design contributes to achieving certain goals. the goals presented to them were identified in the systematic mapping as motivations to perform km in the hci design context. figure 10 shows the results.

figure 10. km contribution to goals achievement when supporting hci design.

according to the participants, the goals to which using km in hci design contributes the most are improve product quality (84.6% of the participants stated that km contributes a lot or contributes to it) and reduce effort spent on design activities (79.5% of the participants stated that km contributes a lot or contributes to it). on the other hand, the participants have seen less contribution of km in hci design to reduce the usage of financial resources in design and to reduce the dependency on specialists (43.6% of the participants stated that km contributes little or is indifferent to both of them).

4.3 discussion

in this section, we discuss the results presented in the previous section.

by analyzing the participants' profile, we noticed that several participants (20.5%) who had knowledge of and experience with hci design did not play a role devoted to hci design at the time of the survey. we believe this reinforces the multidisciplinary nature of hci design and corroborates a recent finding from neto et al. (2020) that some professionals may choose to pursue a double background involving the design and development areas.

concerning stakeholders (rq1), it can be noticed that a variety of them are involved in hci design. considering that the interactions usually occur in the context of projects, the results indicate that hci design project teams have included designers, developers and project managers, and have frequently also involved clients and users. these stakeholders have different roles in hci design and thus may have different hci design knowledge needs. for example, a developer may need to implement the design solution presented in a design artifact; for that, the artifact should present the technical decisions that affect the implementation. a project manager, in turn, may need a broader view of several design artifacts to verify whether the implemented solution satisfies the requirements agreed with the client. hence, km approaches must consider the needs of different stakeholders to properly support hci design. moreover, it may be necessary to integrate knowledge from different sources to provide a solution that meets the needs of different stakeholders. this can be done, for example, with a knowledge management system offering multiple views, one for each role, as sketched below.
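a minimal sketch of the multiple-views idea follows; the artifact fields and role names are invented for illustration and are not part of the survey data:

```python
# hypothetical "multiple views per role" mechanism: the same design artifact
# record is filtered differently for a developer and a project manager.
from dataclasses import dataclass, field

@dataclass
class DesignArtifact:
    name: str
    technical_decisions: list[str] = field(default_factory=list)    # what a developer needs
    satisfied_requirements: list[str] = field(default_factory=list)  # what a manager checks

ROLE_VIEWS = {
    "developer": lambda a: {"name": a.name, "technical decisions": a.technical_decisions},
    "project manager": lambda a: {"name": a.name, "requirements": a.satisfied_requirements},
}

def view(artifact: DesignArtifact, role: str) -> dict:
    """return only the artifact fields relevant to the given role."""
    return ROLE_VIEWS[role](artifact)

wireframe = DesignArtifact("checkout wireframe",
                           technical_decisions=["lazy-load the cart summary"],
                           satisfied_requirements=["one-page checkout (req-12)"])
print(view(wireframe, "developer"))
print(view(wireframe, "project manager"))
```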
regarding knowledge involved in hci design (rq2), by analyzing the knowledge items used and produced in hci design practice, we can notice which knowledge has been more useful to practitioners. most participants use knowledge items that provide design knowledge obtained from previous design experiences, such as design solutions from the organization, design solutions from outside the organization and examples. this can be a sign that new designs have been created based on previous experiences adapted to the new context. however, these knowledge items have not been much produced by the participants. this may be due to the effort required to record knowledge for future reuse. hence, it would be important to facilitate the capture, recording and retrieval of knowledge embedded in design solutions. on the other hand, two of the knowledge items produced by the highest number of participants (organizational design standards and guidelines) record general principles and practices to be followed when designing hci solutions. this may indicate that the participants have found it easier to produce knowledge independent of specific solutions.

considering the relation between the number of knowledge items used and produced by the participants, the higher number of used items shows that, in general, the participants have acted more as knowledge consumers than as knowledge producers. this may happen because either the participants do not have enough time to produce knowledge items or the knowledge production is done by someone else. consulting knowledge directly helps designers in the activities they are performing at that moment. in contrast, knowledge production does not seem immediately useful to them, although it is important at the organizational level. we believe that approaches that promote knowledge recording and storage with less effort could motivate designers to act as knowledge producers.

as for design artifacts, we noticed that the ones produced by more participants (wireframes, functional prototypes and mockups) represent abstractions of the design solution. hence, the creation of such artifacts is part of the design solution development. on the other hand, the artifacts used by more participants (user requirements, scenarios and interaction models) provide useful information for developing the design solution (i.e., they represent inputs to design development). one-third of the participants (33.3%) considered the artifacts they use or produce insufficient to meet the information needs about the design solution and reported using complementary ways to transfer the missing knowledge. when analyzing the three most cited ways, we observed that two of them (talks and meetings) are based on conversation between team members. this can be a sign that it may be difficult to articulate certain pieces of knowledge in artifacts. this is reinforced by the high usage of annotations, which are less formal and structured, and the low usage of documentation and tools. besides, considering that using more than one method of knowledge transfer is a common practice among the participants, it is likely that they prefer this communication redundancy as a way of reinforcing the understanding of all stakeholders about the design. therefore, we believe that the missing knowledge in hci design artifacts can be transferred, for example, by holding regular meetings and by providing means to easily attach additional annotations to design artifacts.

concerning hci design activities (rq3), 'produce design solutions' was the one for which most participants (79.5%) indicated having access to enough knowledge to perform it.
this can be a sign that participants have used knowledge mainly to support the creation of design solutions. on the other hand, a high number of participants indicated that they had not had sufficient knowledge to perform the activities 'understand and specify the context of use' (23%), 'specify user requirements' (23%) and 'evaluate the design solution' (25.6%). therefore, it is necessary to identify useful knowledge to support these activities (e.g., the missing knowledge related to personas and user research data, as reported in rq2) and provide means to represent and access it in an easy way.

as for the approaches to support knowledge access and storage in hci design (rq4), it can be observed that the most used approaches, such as brainstorming, mental models and electronic spreadsheets and documents, usually support both knowledge access and storage. this may suggest that they are easier and simpler to implement and use. brainstorming, for example, has the advantage that participants share and obtain knowledge at the same time. on the other hand, web-based resources, such as blogs, forums and social networks, are used more to support knowledge access than knowledge storage. probably, these resources have been used more as sources of inspiration to bring new ideas from outside the organization. in addition, the reason why these resources have been less used by practitioners to record knowledge may be a concern about exposing organizational design knowledge on the internet. hci design knowledge must be captured, recorded and propagated in order to be raised from the individual level to the organizational level. hence, we believe that km initiatives in hci design should consider approaches such as the ones most used by practitioners to support both knowledge access and storage.

concerning the benefits and difficulties of using km in hci design (rq5), most participants declared having experienced km practices in hci design: 41.0% followed institutionalized practices and 46.2% acted on their own initiative. this indicates that hci design professionals have been concerned with the need for practices that help manage knowledge and seek solutions by themselves when these are not provided by the organization. according to the participants, in general, using km to support hci design brings more benefits than difficulties. the most cited benefits were related to standardization, reuse, communication and productivity, while the most cited difficulties were related to the lack of consensus on hci design conceptualization and to the effort of implementing the km approach, engaging the team and integrating the approach into the organization. based on that, to effectively implement a km approach, it would be interesting to convince people and the organization that the additional effort at the beginning is worth the benefits obtained afterward.

finally, by analyzing the goals to which the use of km in hci design has contributed (rq6), 'reduce the usage of financial resources' and 'reduce the dependency on specialists' have been considered the least impacted by the use of km in hci design. this may be because reducing costs can be a side effect of reducing the time spent on design or of producing better designs with fewer errors. moreover, even if experts' knowledge is transferred and managed at the organizational level, user-centered design deals with people, hence there are subjective aspects that still need to be addressed by specialists.
another point to be considered is that the survey participants were mostly hci design experts, which could have biased their answers about the impact of using km to reduce the dependency on hci design experts. it is also important to note that 'reduce the effort spent on design activities' was the goal that participants believe to be most impacted by the use of km in hci design. by having proper knowledge resources at hand, the designer can learn from previous experiences, reuse solutions and explore more design alternatives, which can lead to designing better and more efficiently.

4.4 threats to validity

as discussed in the context of the systematic mapping, when carrying out a study, it is necessary to consider threats to the validity of its results. in this section, we discuss some threats involved in the survey, using the classification presented in wohlin et al. (2012).

internal validity: it is defined as the ability of a new study to repeat the behavior of the current study with the same participants and objects. the main threat to internal validity is communication and sharing of information among participants. to address this threat, the questionnaire was made available online, so that the participants could answer it at the time they considered most appropriate. this can minimize the communication threat, since participants were not physically close during the study and did not necessarily answer at the same time.

external validity: it is related to the ability to repeat the same behavior with different groups of participants. in this sense, the limited number of participants and the fact that all of them are brazilian professionals are threats to the results. moreover, some of the participants were invited based on the authors' relationship network, which may also have influenced the answers.

construct validity: it refers to the relationship between the study instruments, participants and the theory being tested. in this context, the main threat is the possibility that the participants misunderstood some questions. to address this threat, we performed a pilot that allowed us to improve and clarify the questions. moreover, we provided definitions for the terms used and examples of the information that should be included, so that the participants could better understand how to answer the survey.

conclusion validity: it measures the relationship between the treatments and the results and affects the ability of the study to generate conclusions. a threat to conclusion validity refers to the subjectivity in data analysis, which may reflect the authors' point of view. in addition, the results reflect the participants' personal experience, interpretation and beliefs. hence, the answers can embed subjectivity that could not be captured through the questionnaire. these and the other threats discussed above affect the representativeness of the survey results; thus, the results must be understood as preliminary evidence and should not be generalized.

5 consolidated view of findings

in this section, we present some discussions involving the systematic mapping and survey results, aiming to provide a consolidated view of the findings from both studies. the three most cited motivations for using km found in the systematic mapping (rq3) are the same as the three goals most impacted by the use of km in hci design practice, according to the survey participants (rq6).
this shows that, in general, the use of km in hci design is expected to contribute to improving product quality and to reducing the effort and time spent on design activities. considering the most reported benefits and difficulties of using km in hci design, the survey results provided some that were not observed in the literature. for example, most survey participants reported 'standardization' and 'productivity' as benefits and 'km implementation and maintenance effort' and 'lack of consensus about hci design conceptualization' as difficulties. this difference is not a surprise, since the mapping results showed that most proposed approaches had not been applied in industry. we believe that, to succeed in implementing knowledge management, it is important to consider hci design professionals' perspectives, pursuing the benefits and implementing strategies to overcome the difficulties.

there are other differences between the mapping and survey results. for example, traditional km technologies, such as knowledge management systems, knowledge repositories and knowledge-based systems, have been the most used approaches reported in the literature but have not been much used by hci design professionals. the reasons why they do not use those approaches may be quite diverse, including not being aware that they exist or considering them too complex. since 46.2% of the participants perform km practices on their own initiative, they have likely preferred simpler approaches that they can implement by themselves. this reinforces the gap between industry and academia perceived from the analysis of the systematic mapping results. in order to decrease this gap, km approaches to support hci design should be closer to approaches that professionals are already familiar with, which can contribute to simpler and easier implementation and use.

results from both studies show that design guidelines and design solutions have been reused in hci design. organizational design standards, lessons learned and design component libraries have also been useful for hci design professionals. therefore, km approaches to support hci design should be able to handle these knowledge items, supporting their capture, storage and retrieval. as indicated by the results of both studies, these knowledge items have probably been most used to support the activity 'produce design solutions'. this was the activity in which most approaches found in the literature use knowledge and for which most participants considered having sufficient knowledge support. km approaches should also support other activities, such as 'understand and specify context of use', 'specify user requirements' and 'evaluate design solutions', contributing to the hci design process as a whole.

6 conclusion

in this paper, we presented an investigation of the use of knowledge management in the hci design context. to investigate the state of the art, we performed a systematic mapping. after that, we carried out a survey with 39 brazilian professionals who work on hci design. as the main result of the studies, we provided a panorama of research related to the topic and identified gaps and opportunities for improvement for organizations interested in applying km initiatives in the hci design context. we noticed that, although hci design is a favorable area for applying knowledge management, there have been only a few publications exploring this research topic.
due to the increasing importance of interactive systems and the diversity of interfaces that have been made available for people's use, we believe that there are many challenges and questions to be addressed in future research. for example: (i) the lack of a common conceptualization of hci design (pointed out in #01 and #02 in the mapping study and also by 35.9% of the survey participants) leads to communication problems between the different actors involved in the hci design process. we believe that the use of ontologies to establish this common conceptualization could help in this matter. however, since ontologies are not very familiar to practitioners (survey rq4 results), ontology-based km approaches in hci design should abstract the ontology away from end users (e.g., using the ontology to derive the conceptual model of a knowledge-based system, as sketched at the end of this section). (ii) the gap between theory and practice (systematic mapping rq2 results) shows that it is necessary to take km solutions to practical hci design environments. the survey results show that hci design professionals are familiar with more robust km approaches (such as knowledge management systems) but prefer simpler ways to deal with knowledge, such as brainstorming sessions and electronic spreadsheets and documents. therefore, lightweight technologies and a divide-and-conquer strategy to reduce the complexity of conceiving, implementing and evaluating a km approach might be useful, allowing results to be provided to organizations in shorter periods of time and increasing the benefits as the approach evolves. (iii) other aspects besides usability (e.g., user experience, communicability and accessibility) should be explored in km initiatives to improve hci design. (iv) the benefits and difficulties identified in the mapping (rq7) and reported by the survey participants (rq5) indicate issues that can be investigated in future research. for example, case studies can be carried out in organizations to evaluate the use of km approaches in the hci design context.

concerning related works, we did not find any study investigating the use of km in the hci design context. a work that can be related to ours is stephanidis & akoumianakis (2001), a literature review about categories of computer-aided hci design tools and a proposal of a new category to address the knowledge complexity involved in hci design. however, that study focused on computational tools, not investigating how other kinds of km approaches can help in the hci design process.

as future work, concerning the systematic mapping, new studies can be conducted to better understand the state of the art of km in hci design and improve the use of km in this context. for example, the results obtained in our mapping study could be compared with results from other studies investigating km use in other domains (e.g., requirements engineering). moreover, km solutions proposed in other domains can inspire new proposals to support hci design using km. as for the survey, it can be extended to include more participants from different countries and to investigate other aspects. considering the studies' results, which showed us a gap between hci design professionals and the approaches proposed in the literature, we have worked on the development of a tool to support km in the context of hci design of interactive systems (castro et al., 2021). by making use of the information provided by this study, we aim to reduce the gap between academia and industry by proposing a tool able to meet the needs of hci design professionals.
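as a hedged illustration of item (i) above, the sketch below derives a conceptual-model class from a tiny ontology fragment while keeping the ontology itself hidden from the end user. concept and relation names are invented for the example and are not taken from any published hci ontology:

```python
# hypothetical ontology fragment: concepts and relations as they might be
# formalized; the practitioner never sees this structure directly.
from dataclasses import dataclass

ONTOLOGY = {
    "concepts": ["DesignProblem", "DesignSolution", "Guideline"],
    "relations": [("DesignSolution", "solves", "DesignProblem"),
                  ("DesignSolution", "follows", "Guideline")],
}

@dataclass
class DesignSolution:
    """conceptual-model class derived from the ontology concept of the same
    name; each ontology relation becomes a reference attribute."""
    name: str
    solves: str   # -> DesignProblem
    follows: str  # -> Guideline

# the end user works with plain records; the ontology stays behind the scenes
item = DesignSolution(name="speech-command confirmation dialog",
                      solves="misrecognized voice commands",
                      follows="provide feedback for every user action")
print(item)
```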
references

bjørnson, f. o., & dingsøyr, t. (2008). knowledge management in software engineering: a systematic review of studied concepts, findings and research methods used. information and software technology, 50(11), 1055–1068.
bouwmeester, n. (1999). a knowledge management tool for speech interfaces (poster abstract). proceedings of the 22nd annual international acm sigir conference on research and development in information retrieval, 293–294. https://doi.org/10.1145/312624.312721
carroll, j. m. (2014). human computer interaction (hci). in m. soegaard & r. f. dam (eds.), the encyclopedia of human-computer interaction (2nd ed., pp. 21–61). the interaction design foundation.
castro, m. v. h. b., barcellos, m. p., falbo, r. de a., & costa, s. d. (2021). using ontologies to aid knowledge sharing in hci design. xx brazilian symposium on human factors in computing systems (ihc'21). https://doi.org/10.1145/3472301.3484327
castro, m. v. h. b., costa, s. d., barcellos, m. p., & falbo, r. de a. (2020). knowledge management in human-computer interaction design: a mapping study. 23rd ibero-american conference on software engineering, cibse 2020.
chammas, a., quaresma, m., & mont'alvão, c. (2015). a closer look on the user centred design. procedia manufacturing, 3, 5397–5404. https://doi.org/10.1016/j.promfg.2015.07.656
chewar, c. m., bachetti, e., mccrickard, d. s., & booker, j. e. (2004). automating a design reuse facility with critical parameters. in r. j. k. jacob, q. limbourg, & j. vanderdonckt (eds.), computer-aided design of user interfaces iv (pp. 235–246). springer netherlands.
chewar, c. m., & mccrickard, d. s. (2005). links for a human-centered science of design: integrated design knowledge environments for a software development process. proceedings of the 38th annual hawaii international conference on system sciences, 256c. https://doi.org/10.1109/hicss.2005.390
de souza, c. s. (2005). the semiotic engineering of human-computer interaction. in b. a. nardi, v. kaptelinin, & k. a. foot (eds.), acting with technology. the mit press. https://doi.org/10.1017/cbo9781107415324.004
gabriel, i. j. (2007). an expert system for usability evaluations of business-to-consumer e-commerce sites. proceedings of the 6th annual isoneworld conference, las vegas, nv.
henninger, s., haynes, k., & reith, m. w. (1995). a framework for developing experience-based usability guidelines. proceedings of the 1st conference on designing interactive systems: processes, practices, methods, & techniques, 43–53. https://doi.org/10.1145/225434.225440
hughes, m. (2006). a pattern language approach to usability knowledge management. j. usability studies, 1(2), 76–90.
iso. (2019). iso 9241-210:2019(en) ergonomics of human-system interaction - part 210: human-centred design for interactive systems. int. organization for standardization.
kitchenham, b. a., & charters, s. (2007). guidelines for performing systematic literature reviews in software engineering (vol. 2).
mafra, s. n., & travassos, g. h. (2006). estudos primários e secundários apoiando a busca por evidência em engenharia de software. relatório técnico, rt-es, 687(06).
mohamed, m. a., chakraborty, j., & dehlinger, j. (2017). trading off usability and security in user interface design through mental models. behav. inf. technol., 36(5), 493–516. https://doi.org/10.1080/0144929x.2016.1262897
neto, e. h., van amstel, f. m. c., binder, f. v., reinehr, s. dos s., & malucelli, a. (2020). trajectory and traits of devigners: a qualitative study about transdisciplinarity in a software studio. 2020 ieee 32nd conference on software engineering education and training (csee&t), 1–9.
o'leary, d. e. (1998). enterprise knowledge management. computer, 31(3), 54–61. https://doi.org/10.1109/2.660190
petersen, k., vakkalanka, s., & kuzniarz, l. (2015). guidelines for conducting systematic mapping studies in software engineering: an update. information and software technology, 64, 1–18. https://doi.org/10.1016/j.infsof.2015.03.007
pfleeger, s. l. (1994). design and analysis in software engineering: the language of case studies and formal experiments. acm sigsoft software engineering notes, 19(4), 16–20.
rogers, y., sharp, h., & preece, j. (2011). interaction design: beyond human-computer interaction (3rd ed.). john wiley & sons.
rus, i., & lindvall, m. (2002). knowledge management in software engineering. ieee software, 19(3), 26–38. https://doi.org/10.1109/ms.2002.1003450
schneider, k. (2009). experience and knowledge management in software engineering (1st ed.). springer publishing company, incorporated.
sikorski, m., garnik, i., ludwiszewski, b., & wyrwiński, j. (2011). knowledge management challenges in collaborative design of a virtual call centre. knowledge-based and intelligent information and engineering systems, 657–666.
smith, a., & dunckley, l. (2002). prototype evaluation and redesign: structuring the design space through contextual techniques. interacting with computers, 14(6), 821–843. https://doi.org/10.1016/s0953-5438(02)00031-0
smith, j. l., bohner, s. a., & mccrickard, d. s. (2005). toward introducing notification technology into distributed project teams. 12th ieee international conference and workshops on the engineering of computer-based systems (ecbs'05), 349–356. https://doi.org/10.1109/ecbs.2005.69
stephanidis, c., & akoumianakis, d. (2001). knowledge management in hci design. in w. karwowski (ed.), international encyclopedia of ergonomics and human factors (vol. 1, pp. 705–710). taylor & francis.
still, k. (2006). exploring knowledge processes in user-centered design process. the 7th european conference on knowledge management, 533.
suàrez, p. r., jùnior, b. l., & de barros, m. a. (2004). applying knowledge management in ui design process. in p. slavik & p. palanque (eds.), proceedings of the 3rd annual conference on task models and diagrams tamodia '04 (pp. 113–120). acm press. https://doi.org/10.1145/1045446.1045468
sutcliffe, a. g. (2014). requirements engineering from an hci perspective. in m. soegaard & r. f. dam (eds.), the encyclopedia of human-computer interaction (2nd ed., pp. 707–760). the interaction design foundation.
valaski, j., malucelli, a., & reinehr, s. (2012). ontologies application in organizational learning: a literature review. expert systems with applications, 39(8), 7555–7561. https://doi.org/10.1016/j.eswa.2012.01.075
wahid, s. (2006). investigating design knowledge reuse for interface development. proceedings of the 6th conference on designing interactive systems, 354–356. https://doi.org/10.1145/1142405.1142462
wahid, s., smith, j. l., berry, b., chewar, c. m., & mccrickard, d. s. (2004). visualization of design knowledge component relationships to facilitate reuse. proceedings of the 2004 ieee international conference on information reuse and integration (iri 2004), 414–419. https://doi.org/10.1109/iri.2004.1431496
wieringa, r., maiden, n., mead, n., & rolland, c. (2005). requirements engineering paper classification and evaluation criteria: a proposal and a discussion. requir. eng., 11(1), 102–107. https://doi.org/10.1007/s00766-005-0021-6
wilson, p., & borras, j. (1998). lessons learnt from an hci repository. international journal of industrial ergonomics, 22(4), 389–396. https://doi.org/10.1016/s0169-8141(97)00093-0
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., & wesslén, a. (2012). experimentation in software engineering. springer.

journal of software engineering research and development, 2021, 9:12, doi: 10.5753/jserd.2021.1898 this work is licensed under a creative commons attribution 4.0 international license. development of an ontology-based approach for knowledge management in software testing: an experience report érica ferreira de souza [ federal university of technology paraná | ericasouza@utfpr.edu.br ] ricardo de almeida falbo [ federal university of espírito santo ] nandamudi l. vijaykumar [ national institute for space research | vijay.nl@inpe.br ] katia r. felizardo [ federal university of technology paraná | katiascannavino@utfpr.edu.br ] giovani v. meinerz [ federal university of technology paraná | giovanimeinerz@utfpr.edu.br ] marcos s. specimille [ federal university of espírito santo | marcosspecimille@gmail.com ] alexandre g. n. coelho [ federal university of espírito santo | alexandregncoelho@gmail.com ]

abstract

software development organizations are seeking to add quality to their products. testing processes are strategic elements for managing project and product quality. however, advances in technology and the emergence of increasingly critical applications make testing a complex task, and large volumes of information are generated. software testing is a knowledge-intensive process. because of this, these organizations have shown a growing interest in knowledge management (km) programs, which in turn support the improvement of testing procedures. km emerges to manage testing knowledge and, consequently, to improve software quality. however, there are only a few km solutions supporting software testing. this paper reports experiences from the development of an approach, called ontology-based testing knowledge management (ontot-km), that aims to assist in launching km initiatives in the software testing domain with the support of knowledge management systems (kmss). ontot-km provides a process guiding how to start applying km in software testing. ontot-km is based on the findings of a systematic mapping on km in software testing and the results of a survey with testing practitioners.
moreover, ontot-km considers the conceptualization established by a reference ontology on software testing (roost). as a proof of concept, ontot-km was applied to develop a kms called testing km portal (tkmp), which was evaluated in terms of usefulness, usability, and functional correctness. the results show that the kms developed from ontot-km is a promising system for managing knowledge in software testing, so the approach can guide km initiatives in software testing. keywords: knowledge management, knowledge management system, software testing, testing ontology 1 introduction with the emergence of new technologies during the last decades, more advanced techniques have been applied in software development, in order to achieve high-quality software products (thrane, 2011). thus, more efficient techniques to qualify a software product should be incorporated in its development life cycle, ensuring a well-managed process. testing activities play an important role in assessing and achieving the quality of a software product (souza, 2014). currently, software testing is considered a process consisting of activities, techniques, resources, and tools. advances in technology and the emergence of increasingly critical applications also make testing a complex task. during software testing, large volumes of information are generated. software testing is a knowledge-intensive process, and thus it is important to provide computerized support for the tasks of acquiring, processing, analyzing, and disseminating testing knowledge in an organization (andrade et al., 2013; souza, 2014). in this context, knowledge management (km) emerges to manage testing knowledge and, consequently, to improve software quality. km can be defined as a set of organizational activities that must be performed systematically to acquire, organize, and share the different knowledge types in the organization (o’leary and studer, 2001). the adoption of km principles in software testing can help testers to promote the reuse of knowledge, to support testing processes, and even to guide management decisions in organizations (souza et al., 2015a). software testing, in general, can benefit from reusing test cases, testing techniques, lessons learned, and personal experiences, among others (li and zhang, 2012; janjic and atkinson, 2013; souza et al., 2015a). to enable the reuse of testing knowledge, software organizations should be able to capture this knowledge and make it available to be shared with their teams. however, there are only a few km solutions in the context of software testing (souza et al., 2015a). the major problems in organizations regarding software testing knowledge are the low knowledge reuse rate and barriers to knowledge transfer. this occurs because most of the testing knowledge in organizations is not explicit, which makes it difficult to articulate (souza et al., 2015a). on the other hand, implementing km solutions, in general, is not an easy task. according to storey and barnett (2000), a large number of organizations are taking great interest in the idea of km, but these organizations are not familiar with how and where to start, since they lack the proper guidance to implement km.
so, an orientation on how to implement new km solutions in the organization, or even an existing solution that can be customized, becomes interesting for organizations, since it is an opportunity for continued cost reduction, quality improvement, and reduction in software delivery time (rokunuzzaman and choudhury, 2011). concerning technologies for km, ontologies have been widely recognized as a key technology (herrera and martin-b, 2015). ontologies can be used for establishing a common conceptualization to be used in the knowledge management system (kms) to facilitate communication, search, storage, and representation of knowledge (o’leary and studer, 2001). however, only a few initiatives have used an ontology-based approach for km in the software testing domain (souza et al., 2015a). this paper reports our experiences in developing an approach to assist in launching km initiatives in the software testing domain with the support of kmss. in this paper, we present ontot-km, an ontology-based testing knowledge management approach. ontot-km provides a process to apply km in software testing. ontot-km considers the conceptualization established by a software testing ontology. a striking feature of ontot-km is that it describes how a testing ontology can be used for guiding km initiatives in software testing. the software testing ontology used in ontot-km is a reference ontology on software testing, called roost (souza et al., 2017). roost was developed for establishing a common conceptualization of the software testing domain and can serve several km-related purposes, such as defining a common vocabulary for knowledge workers regarding the testing domain, structuring knowledge repositories, annotating knowledge items, and making searching easier (souza, 2014; souza et al., 2017). lessons learned and experiences acquired in conducting this study are presented on two main fronts. firstly, the ontot-km approach is presented to help software organizations implement an initial km solution in software testing. subsequently, a prototype of a kms, called testing km portal (tkmp), was developed, both as a proof of concept of the ontot-km approach and as a way of understanding the needs of software development professionals in having a kms in software testing ready and available for customization. this research is an extension of a preliminary study published in (souza et al., 2020). the extensions of this work are essentially threefold. first, we improved several sections to provide a better understanding of the research through the inclusion of new text, extra depth in some paragraphs, and the inclusion of new figures and tables. second, we analyzed the database created from roost using data mining techniques, to present the applicability of this type of research in the search for useful knowledge in knowledge repositories. third, we improved the analysis of tkmp by software engineering practitioners. we carried out an analysis separating the participants by professional position, such as professionals directly related to software development companies and professionals directly related to scientific research.
the main contributions of this research are the guidelines provided by ontot-km for guiding km initiatives in software testing. these guidelines are supported not only by roost, but also by the findings of the mapping study (souza et al., 2015a) and the results of a survey with 86 testing practitioners. ontot-km was applied to develop tkmp, which was evaluated by the test leaders of the real projects in which it was applied. tkmp was also evaluated by 43 practitioners in terms of usefulness, usability, and functional correctness. this evaluation was designed applying the goal, question, metric (gqm) paradigm (basili et al., 1994) and the technology acceptance model (tam) (davis, 1993). the remainder of this study is structured as follows. section 2 presents the main research concepts. section 3 presents ontot-km. section 4 presents the application of ontot-km and the evaluation results. section 5 discusses related works. finally, in section 6, we present our final considerations. 2 background in this section, the main concepts of this study are discussed. 2.1 software testing software testing consists of the dynamic verification & validation (v&v) of the behavior of a program on a finite set of test cases, against the expected behavior (abran et al., 2004). testing activities are supported by a well-defined and controlled testing process (abran et al., 2004). this process consists of several activities, namely (abran et al., 2004; myers, 2004; black and mitchell, 2011; mathur, 2012): test planning, test case design, test coding, test execution, and test result analysis. in the first activity, testing is planned: the test environment for the project is defined, the testing activities are scheduled, and possible undesirable outcomes are anticipated. test planning is documented in a test plan. then, in test case design, the test cases to be run are designed and documented, and then coded. during test execution, the test code is run, producing results, which are then analyzed to determine whether the test cases have passed or failed. the testing activities are performed at different levels. unit testing focuses on testing each program unit or component. integration testing takes place when such units are put together, aiming at ensuring that the interfaces among the components are defined and handled properly. finally, system testing regards the behavior of the entire system (abran et al., 2004; myers, 2004; black and mitchell, 2011; mathur, 2012). in addition, many testing techniques provide systematic guidelines for designing test cases, intending to make testing efforts more efficient and effective. testing techniques can be classified, among others, as (burnstein, 2003): white-box testing techniques, which are based on information about how the software has been designed and coded; black-box testing techniques, which generate test cases relying only on the input/output behavior, without the aid of the code that is under test; defect-based testing techniques, which aim at revealing categories of likely or predefined faults; and model-based testing techniques, which are based on models, such as statecharts, finite state machines, and others. one of the main characteristics of the software testing process is that it has a large intellectual capital component and can thus benefit from experiences gained in past projects (souza et al., 2015a). during software testing, large volumes of information are processed and generated.
so, it can be considered a knowledge-intensive process, making it necessary to provide automated support for acquiring, processing, analyzing, and disseminating testing knowledge for reuse. in this context, knowledge management (km) can be used (souza et al., 2015a). 2.2 knowledge management km can be viewed as the development and leveraging of organizational knowledge to increase an organization’s competitive advantage (zack and serino, 2000). in general, km formally manages the increase of knowledge in organizations in order to facilitate its access and reuse, typically by using information systems (iss) and kmss (herrera and martin-b, 2015). in particular, kmss aim at supporting organizations in knowledge management in an automated way. one issue in kmss is how to represent knowledge. one alternative is ontologies (o’leary and studer, 2001), as they are considered a key technology for km (herrera and martin-b, 2015): they define the shared vocabulary to be used in the kms, facilitating knowledge communication, integration, search, storage, and representation. in ontology-based kmss, ontologies are typically used to structure the content of knowledge items, to support knowledge search, retrieval, and personalization, to serve as a basis for knowledge gathering, integration, and organization, and to support knowledge visualization, among others. km has shown important benefits for software organizations. in souza et al. (2015a), we performed a systematic mapping (sm) looking for studies presenting km initiatives in software testing. an sm is a secondary study that provides an overview of a research area through the classification of the available evidence (kitchenham and charters, 2007). the main conclusions from this sm were: (i) there are few publications (only 15 studies were retrieved) addressing km initiatives in software testing; (ii) the major problems that have motivated applying km in software testing are the low knowledge reuse rate and barriers in knowledge transfer; (iii) as a consequence, knowledge reuse and organizational learning are the main purposes for managing software testing knowledge; (iv) there is a great concern with both explicit and tacit knowledge; (v) reuse of test cases is the perspective that has received more attention; (vi) kmss are used in almost all initiatives (11 of the 15 studies); and (vii) different technologies have been used to implement those kmss, such as conventional technologies (databases, intranets, and the internet), yellow pages (or knowledge maps), recommendation systems, data warehouses, and ontologies. in particular, one finding drew our attention: only two studies actually used ontologies in a km initiative applied to software testing (liu et al., 2009; li and zhang, 2012). this seems to be a contradiction since, as pointed out by staab et al. (2001), ontologies are the glue that binds km activities together, allowing a content-oriented view of km. one possible explanation for this low number of studies is the fact that developing an ontology is a hard task, especially in complex domains, as is the case of software testing (souza et al., 2015a). based on the findings of the sm, we decided to perform a systematic literature review (slr) looking for ontologies on the software testing domain in the literature (souza et al., 2013).
an slr is also a secondary study, one that uses a well-defined process to identify available evidence (kitchenham and charters, 2007). from this slr, 12 ontologies addressing this domain were identified. as the main findings, it is possible to highlight (souza et al., 2013): (i) most ontologies have limited coverage; (ii) the studies do not discuss how the ontologies were evaluated; (iii) none of the analyzed testing ontologies is truly a reference ontology, i.e., a domain ontology constructed with the main goal of describing the domain as realistically as possible; and, finally, (iv) although foundational ontologies have been recognized as an important instrument for improving the quality of conceptual models in general, and more specifically of domain ontologies, none of the analyzed ontologies is grounded in foundational ontologies. this motivated us to build roost, a reference ontology on software testing (souza et al., 2017). roost was developed for establishing a common conceptualization of the software testing domain. 2.3 roost roost is presented very briefly here, since presenting the entire ontology is out of the scope of this paper. details of the ontology can be found in (souza et al., 2017). since the testing domain is complex, roost was developed in a modular way, comprising four modules (sub-ontologies): (i) software testing process and activities, representing the testing process and the main activities that comprise it, namely test planning, test case design, test coding, test execution, and analysis of the test results; (ii) testing artifacts, focusing on the artifacts used and produced by the testing activities; (iii) techniques for test case design, looking at testing techniques, such as black-box, white-box, defect-based, and model-based testing techniques; and (iv) software testing environment, addressing the main components of the testing environment, including test hardware resources, test software resources, and human resources. in order to develop roost, the systematic approach for building ontologies (sabio) (falbo, 2014) was adopted. the sabio method incorporates best practices commonly adopted in software engineering and ontology engineering and addresses the design and coding of operational ontologies. furthermore, roost has been developed by reusing and extending ontology patterns from the software process ontology pattern language (sp-opl) (falbo et al., 2013) and the enterprise-ontology pattern language (eopl) (falbo et al., 2014). an ontology pattern language (opl) is a network of interconnected domain-related ontology patterns that provides holistic support for solving ontology development problems for a specific domain (falbo et al., 2013). more recently, roost has been integrated into the software engineering ontology network (seon) (ruy et al., 2016). the full model of roost is available at http://dev.nemo.inf.ufes.br/seon/. given the size of roost, figure 1 presents only its testing process and activities sub-ontology. concepts reused from the software process ontology are shown in gray; specific concepts are shown in white. some of the main concepts of this sub-ontology are also presented below. more specific details of roost’s testing process and activities sub-ontology can be found in (souza et al., 2017) and in the seon network. figure 1. roost’s testing process and activities sub-ontology
in this sub-ontology, the process and activity execution (pae) pattern was reused. pae concepts were extended to the testing domain, as shown in figure 1. testing process is a subtype of specific performed process, since a testing process occurs in the context of the entire software process (general performed process) of a project. a testing process, in turn, is composed of testing activities, and thus testing activity is considered a subtype of performed activity. similarly to performed activity, testing activity can be further divided into composite and simple testing activity. in the pae pattern, specific performed processes are composed of two or more performed activities. a performed activity, analogously, can be simple or composite. a composite performed activity is composed of other performed activities; a simple performed activity cannot be decomposed into smaller activities (falbo et al., 2013). besides specializing concepts, relationships were also specialized from pae. for instance, in pae, there is a whole-part relationship between specific performed process and performed activity. the whole-part relationship between testing process and testing activity is a subtype of the former. whenever a roost relationship is a subtype of another relationship defined in sp-opl, the same name is used for both. regarding the testing process activities, test planning and level-based testing are composite performed activities. although not shown in figure 1, test planning involves several sub-activities, such as defining the testing process, allocating people and resources for performing its activities, analyzing risks, and so on. level-based testing comprises test case design, test coding, test execution, and test result analysis, which are considered simple performed testing activities. considering the test levels, level-based testing groups simple testing activities according to the test level to which they are related. thus, level-based testing is specialized according to the instances of test level, a second-order type, whose instances partition level-based testing into more specific types of testing activities. in figure 1, the three most cited testing levels in the literature are made explicit: unit testing, integration testing, and system testing. however, there may be others, such as regression testing. regarding testing stakeholders, the test manager is responsible for performing test planning activities and also participates in test result analysis activities. the test case designer participates in test planning activities, and she is in charge of performing test case design and test result analysis activities. finally, the tester is responsible for performing test coding and test execution. with respect to testing artifacts, test planning produces a test plan, which is used by level-based testing activities. test case design uses several artifacts as test case design inputs and applies testing techniques for developing test cases. during test coding, test code is produced, implementing a test case. during test execution, test cases are executed by running the code to be tested and the test code, producing test results. finally, in a test result analysis activity, test results are analyzed and the findings are reported in a test analysis report.
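to make this specialization structure concrete, below is a minimal python sketch of how the pae pattern and its roost extensions could be encoded. this is our own illustrative reading of figure 1, not an official serialization of roost (which is published as a conceptual model, not as code); in particular, the flat test_level string simplifies roost’s second-order test level type.

```python
from dataclasses import dataclass, field
from typing import List

# --- pae pattern, reused from sp-opl ---
@dataclass
class PerformedActivity:
    name: str

@dataclass
class SimplePerformedActivity(PerformedActivity):
    """cannot be decomposed into smaller activities."""

@dataclass
class CompositePerformedActivity(PerformedActivity):
    """composed of other performed activities (whole-part relationship)."""
    parts: List[PerformedActivity] = field(default_factory=list)

@dataclass
class SpecificPerformedProcess:
    name: str
    activities: List[PerformedActivity] = field(default_factory=list)

# --- roost extensions of the pattern ---
@dataclass
class TestingProcess(SpecificPerformedProcess):
    """subtype of specific performed process, composed of testing activities."""

@dataclass
class TestPlanning(CompositePerformedActivity):
    """composite testing activity whose outcome is a test plan."""

@dataclass
class LevelBasedTesting(CompositePerformedActivity):
    """groups simple testing activities by test level; roost models the level
    as a second-order type, simplified here to a plain string."""
    test_level: str = "unit"

class TestCaseDesign(SimplePerformedActivity): pass
class TestCoding(SimplePerformedActivity): pass
class TestExecution(SimplePerformedActivity): pass
class TestResultAnalysis(SimplePerformedActivity): pass

# example: a unit-testing round inside a project's testing process
unit_testing = LevelBasedTesting(
    name="unit testing",
    parts=[TestCaseDesign("test case design"), TestCoding("test coding"),
           TestExecution("test execution"), TestResultAnalysis("test result analysis")],
    test_level="unit",
)
testing_process = TestingProcess(
    name="project testing process",
    activities=[TestPlanning("test planning"), unit_testing],
)
```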
3 ontot-km given the applicability of km to improve software testing processes, we developed ontot-km for assisting companies that want to create their own solutions for km initiatives in dynamic software testing, supported by a kms. ontot-km consists of a process and a set of guidelines for implementing a kms in software testing organizations. ontot-km is supported by roost, in particular to structure the kms knowledge repository. moreover, the ontot-km guidelines are based on the findings of the sm presented in (souza et al., 2015a) and the results of a survey performed with testing practitioners presented in (souza et al., 2015b). the ontot-km process comprises the following steps: (i) diagnose the organization’s testing process; (ii) establish the testing km scope; (iii) develop a testing kms; (iv) load existing knowledge items; and (v) evaluate the testing kms. figure 2 presents the ontot-km process as a uml activity diagram. as this figure shows, roost is used to support steps (i) and (iii). steps (i) and (ii), shown in another color in figure 2, are considered optional. figure 2. ontot-km process. in the following, each process step is presented, describing the main guidelines that apply. step 1: diagnose the current state of the organization’s testing process. the first step of the ontot-km process is to make a diagnosis of the current state of the organization’s testing process. it refers to investigating the existing knowledge within the testing process, in order to identify knowledge assets and understand how and where testing knowledge is developed and used in the organization. this step may be optional depending on the organization’s maturity level; it is especially important for organizations with low maturity. once the knowledge items are identified, organizations can then proceed to manage them. this step may be accomplished by surveys with questionnaires and/or interviews. this activity should consider the entire current state of software testing in the organization concerning, at least, the following aspects: the adopted testing process, the activities of the candidate testing processes to be targeted by the km initiative, the artifacts produced during this process, the testing techniques applied, the test levels contemplated by the process, and the test environments adopted by the organization’s software projects. aspects related to km should also be investigated, such as the current km practices applied in the testing process, the organization’s purpose in applying km to software testing, and problems related to testing knowledge in the organization, among others. roost can be used in this step as the common vocabulary for supporting the analysis of the current status, as well as to formulate the questions to be used in the questionnaires and/or interviews with the organization’s testing practitioners. the results of these questionnaires/interviews should be used as guidelines for the next step (establish scope). for this step, we suggest asking the testing practitioners participating in the diagnosis at least the following questions: • what are the testing activities that comprise the organization’s testing process?
evaluate the answers considering the consensual activities captured in roost (test planning, test case design, test coding, test execution, and test result analysis), and consider the possibility of improving the organization’s testing process by aligning it to the testing process captured by roost. • in which activities of the testing process is km more useful? the activities pointed out in the previous question should be the ones considered here as possible answers. • what are the testing levels at which the organization performs tests? testing activities can be performed at different levels. taking roost as a basis, consider at least the following levels: unit testing, integration testing, and system testing. however, if the organization tests software at other levels, these should be considered. • at which testing level is km more useful? the testing levels pointed out in the previous question should be the ones considered here as possible answers. • for which resources do you consider it more important to have knowledge available when defining the testing environment? according to roost, the possible answers considered for this question are the following types of resources: hardware, software, and human. • concerning tacit and explicit knowledge, which types of knowledge do you consider more important to manage during the software testing process? testing practitioners tend to consider both useful, but we need to evaluate which one is more important and which is easier to implement. in general, for organizations starting a km initiative in software testing, explicit knowledge is easier to handle. in particular, test cases stand out as the most important artifacts to be managed as knowledge items, as pointed out in the sm (souza et al., 2015a) and the survey (souza et al., 2015b). • what is the purpose of applying km in software testing? what benefits can km bring to software testing? this question captures the feeling of testing practitioners regarding why and how an organization can benefit from applying km to software testing. step 2: establish the scope of the testing km initiative. once the diagnosis of the status of the testing process has been carried out, the next step is to establish the km scope. as in the case of step 1, this step may also be considered optional if the organization already knows its needs. for the km scope, it is necessary to be familiar with the organization’s needs. the organization must define the testing process activities that are to be supported and the knowledge types to be managed. a major challenge for organizations is to know which knowledge is useful, and thus to identify potential knowledge items among the several knowledge assets generated in the testing process. results from step 1 should be used here. in addition, it is suggested that organizations start with small km initiatives. as a general guideline, we recommend considering the results of the survey we performed (souza et al., 2015b). in this survey, test case design and test planning were considered the most important testing activities to be supported by km practices, capturing their main outcomes, namely test cases and test plans, as knowledge items. when considering test cases as knowledge items, it is necessary to build an appropriate infrastructure that allows for the analysis, storage, and retrieval of existing test cases. this structure can be achieved with the ontot-km approach.
in the reuse of this knowledge item, for example, the test reuse system may be able to cover a variety of search scenarios in order to assist its users in different situations. the search engine enables searching for test cases by informed parameters, for example, test levels or testing techniques, and the returned test cases can be reused in similar scenarios (a minimal sketch of such a parameterized search is given at the end of this step). according to werner (2014), reusable tests that have been written for a similar scenario are likely to help to better understand how a previously created similar system works. in addition, by reusing the knowledge contained in existing tests, developers can benefit from the knowledge that others have invested in developing them. these tests can help to gain better insights into how a particular kind of component should behave. regarding test planning, the survey led to the selection of testing techniques and the definition of the testing environment as the most important tasks to be supported by a km initiative. concerning knowledge about the testing environment, managing knowledge about human resources and software resources is pointed out as the most promising approach. regarding knowledge about human resources, this impression is corroborated by the sm (souza et al., 2015a), where yellow pages and knowledge maps appear in various initiatives. other knowledge items related to making tacit knowledge explicit can also be considered in the km scope, namely: • lessons learned (ll): ll can be understood as knowledge acquired through experience in a particular situation. ll can be classified as best practices, errors/critiques, and success factors. lls are informal knowledge items that can be understood as ideas, facts, questions, points of view, decisions, among others. in addition, ll can also be classified as informative, success, or failure lessons. informative lls explain how to proceed in a given situation; success lessons provide examples of problems that were solved positively; and failure lessons provide examples of negative responses to attempts to solve a problem and potential ways to cope with the situation (o’leary, 1998). • knowledge regarding discussions: discussions among the organization’s members may be submitted as knowledge items. tools to support discussion among the organization’s members, such as discussion forums, have been fundamental in km environments (fischer and ostwald, 2001). discussion forums become important tools for knowledge management for the following reasons: (i) very useful knowledge can be generated and captured during discussions (falbo et al., 2004), and (ii) a major challenge of km is to convert tacit knowledge into explicit knowledge (nonaka and krogh, 2009; davenport and prusak, 2000).
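as referenced above, the following is a minimal sketch of the parameterized test-case search scenario. the record fields and the repository shape are illustrative assumptions of ours, not tkmp’s actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TestCaseItem:
    identifier: str
    test_level: str   # e.g. "unit", "integration", "system"
    technique: str    # e.g. "black-box", "white-box", "model-based"
    description: str

def search_test_cases(repository: List[TestCaseItem],
                      test_level: Optional[str] = None,
                      technique: Optional[str] = None) -> List[TestCaseItem]:
    """return the stored test cases matching every informed parameter;
    parameters left as None are simply not used to filter."""
    matches = []
    for item in repository:
        if test_level is not None and item.test_level != test_level:
            continue
        if technique is not None and item.technique != technique:
            continue
        matches.append(item)
    return matches

# usage: find reusable system-level test cases designed with black-box techniques
repo = [TestCaseItem("tc-01", "system", "black-box", "login with invalid password"),
        TestCaseItem("tc-02", "unit", "white-box", "boundary check on parser")]
candidates = search_test_cases(repo, test_level="system", technique="black-box")
```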
step 3: develop a testing kms. this phase concerns developing a kms to support the km initiative and comprises the main activities for developing systems in general: requirements specification, design, implementation, and testing. requirements must be elicited and specified. functional requirements may be captured through use case models, class diagrams, and state diagrams modeling the behavior of knowledge items throughout their existence in the kms. non-functional requirements should also be addressed, such as security, usability, accessibility, etc. roost is very useful in this step. roost can serve as the initial conceptual model for the kms, and thus as the basis for structuring the testing knowledge repository. specific information (attributes of the classes in the conceptual model) should be identified, taking into account the characteristics of the organization’s testing artifacts and, most importantly, the information that is available in the tools used for supporting the testing process. furthermore, interoperability issues should also be analyzed. ideally, software tools that are part of the test environment should be integrated with the testing kms to act together, interacting and exchanging data to obtain the expected results. in this context, possible knowledge items identified in these tools can be automatically converted/imported to the testing kms. another key point is to define the km process activities that are to be supported by the testing kms. we recommend providing support to the following typical activities of a km process: creating knowledge items, evaluating knowledge items before making them available, searching knowledge items, assessing the usefulness of available knowledge items, and maintaining the knowledge repository. during the design of the testing kms, developers should consider the platform on which the system is to be built, and non-functional requirements should be addressed. several technologies can be used, including those commonly considered in km solutions, like content management systems, document management systems, and wikis, as well as those considered intelligent km solutions, such as knowledge-based and expert systems, reasoners, and semantic wikis. once designed, the kms should be coded and tested. step 4: load existing knowledge items. for initially populating the knowledge repository of the testing kms, the organization should look for existing knowledge items. for instance, if the system must manage test cases, existing test cases can be imported to the testing kms. the existing knowledge items should be reengineered to ensure conformance with the knowledge repository structure. knowledge items can be registered manually in the testing kms, or mechanisms for loading and reengineering these knowledge items can be built to automate the loading process. once a knowledge repository is created and populated, data mining can be explored. the knowledge repository can contain useful hidden information (knowledge) of major relevance to the business, so mining can be performed on these data. data mining is the application of specific algorithms for extracting patterns from data and is part of the knowledge discovery in databases (kdd) process (fayyad et al., 1996). data mining methods, such as classification, regression, clustering, summarization, association rules, and dependency modeling, among others, are used in the identification of relevant information in large volumes of data (fayyad et al., 1996). mining data stored in large databases to discover potential information and knowledge has been a popular topic in database research; data mining is a technology to obtain valuable information and knowledge (yun et al., 2003). according to basili and rombach (1991), the quality of software development can be improved by reusing acquired experiences, rather than starting from scratch.
therefore, as a result of applying kdd, another knowledge item type can be considered: mined items. mined items can be provided by a mining process on a km database and can identify relationships that are not apparent, facilitating decision-making. furthermore, identifying behavior patterns in the data stored in knowledge bases can help the organization to reuse and share the knowledge acquired in previous projects. step 5: evaluate the testing kms. an evaluation should be done to determine if the testing kms meets the expectations. improvements can be carried out, implying a return to the previous steps. a suggestion for evaluating the testing kms is to analyze some quality characteristics, such as usefulness, usability, and functional correctness. to do that, two models can be considered: gqm (basili et al., 1994) and tam (davis, 1993). gqm is a measurement model organized into three levels. in the first level (conceptual level), the study goals should be defined. the second level (operational level) refers to a set of questions that should be defined to characterize the evaluation or the accomplishment of a specific goal. finally, in the last level (quantitative level), a set of metrics should be associated with the questions, to answer them measurably. the result of applying the gqm approach is the specification of a measurement system targeting a particular set of issues and a set of rules for interpreting the measurement data (basili et al., 1994). gqm is useful because it facilitates identifying not only the precise measures required but also the reasons why the data are being collected (park et al., 1997). tam determines the acceptance of a given technology by users, considering a two-factor analysis: usefulness and usability. when evaluating these two factors, it is possible to map the users’ acceptance of a new technology. usefulness refers to how much users perceive that a certain technology is useful to them in terms of productivity increase. according to iso/iec 25010 (iso/iec, 2011), usefulness is the “degree to which a user is satisfied with their perceived achievement of pragmatic goals, including the results of use and the consequences of use”. in this standard, usefulness is part of the quality in use model. the perception of usability refers to the effort reduction that the user achieves when using a given technology instead of other alternatives (davis, 1993). in iso/iec 25010 (iso/iec, 2011), usability refers to the “degree to which a product or system can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use”. it is a quality characteristic of the product quality model, but for consistency with its established meaning, it is also defined as a subset of the quality in use model. in this work, we also decided to evaluate another quality characteristic: functional correctness, a sub-characteristic of functional suitability in the iso/iec 25010 product quality model. according to iso/iec (2011), functional correctness is the “degree to which a product or system provides the correct results with the needed degree of precision”. 4 applying ontot-km our experience in developing the ontot-km approach has two fronts. first, we introduce the approach, and then we create a prototype of a kms based on ontot-km that allows us to evaluate the approach, as well as to obtain the opinion of software professionals about having a kms in software testing ready and available for customization.
in the following, the kms development based on ontot-km and all the evaluation processes are presented. to evaluate the ontot-km approach, we applied it to build a prototype of a kms for managing software testing knowledge, called testing knowledge management portal (tkmp). the resulting system was populated with data from two real projects, and different evaluations were conducted. the projects were (souza, 2014): (i) the amazon integration and cooperation for modernization of hydrological monitoring (icammh) project; and (ii) the on-board data handling (obdh) software inside the inertial systems for aerospace application (sia) project. the icammh project was a collaboration involving the brazilian aeronautics institute of technology and the brazilian national water agency, supported by the brazilian financial foundation for projects, finep. the project developed a pilot system for the modernization and integration of telemetry points collected from hydrological data, as a basis for managing water resources in the amazon region. the second project is devoted to developing software for the on-board computer of the sia project, which is a computational system for obdh and attitude and orbit control (aoc) of satellites that can be adapted for future space applications at the national institute for space research (inpe). the first version of the obdh software was in the testing phase when this work was being done. the final version of this software aims at adding all the functionalities of the obdh of a satellite. its main functionalities are: (i) receiving and analyzing ground station telecommands; (ii) formatting and transmission of telemetry; (iii) data acquisition from on-board subsystems (real time and stored); (iv) housekeeping; and (v) fault detection, isolation, and recovery (fdir). at the time we were carrying out this research, the icammh project had already been finalized and the sia project was in its early stage. 4.1 diagnose the current state of the organization’s testing process as the icammh project had already been finalized and the testing activities of the sia project were only in their very initial phase, it was not possible to run the diagnostic step. this step was replaced by the findings from the survey with 86 testing practitioners that we performed (souza et al., 2015b). out of these 86 participants, some are also team members and leaders of the icammh and sia projects. the survey’s purpose was to identify which is the most appropriate scenario in the software testing domain, from the point of view of testing stakeholders, for starting a km initiative. the survey presents questions that addressed aspects considered both in the conceptualization of roost and in the sm presented in (souza et al., 2015a), as shown in table 1. furthermore, managing testing knowledge is not an easy task, and thus it is better to start with a small-scale initiative. therefore, firstly, it is necessary to identify essential knowledge items of a sub-topic of software testing to be dealt with in the kms. from the survey results, the following conclusions are considered: (i) the participants identified test case design and test planning as the activities in which km would be most useful; therefore, test cases and test plans are considered the most useful artifacts to be reused; (ii) explicit knowledge was considered more important than tacit knowledge.
explicit knowledge represents the objective and rational knowledge that can be documented, and thus it can be accessed by many (nonaka and takeuchi, 1997). on the other hand, tacit knowledge is the subjective and experience-based knowledge that typically remains only in people’s minds (nonaka and takeuchi, 1997); (iii) among the most targeted artifacts for reuse, test cases stood out with 90.7%; and (iv) the purposes for which experts are most interested in applying km in software testing are related to improving the quality of results in software testing and reducing the cost, time, and effort spent in a software project. 4.2 establish the scope of the testing km initiative considering the main findings of the survey, test case design was considered the software testing activity to be supported, and the test case the main artifact to be managed. all relevant information for designing test cases had to be considered in the scope of the tkmp development. thus, concepts related to test cases in roost were also considered in the scope of the initiative, namely: test case input, expected result, test result, test code, test case designer, and testing technique. besides the test cases as the main artifacts to be managed, ll and knowledge regarding discussions were also considered in the scope of tkmp. these two types of knowledge items were considered in the scope of tkmp because survey participants pointed out individual experiences and communication between test team members as the types of tacit knowledge with more significant importance for generating explicit knowledge items. in addition, meetings with the project leaders of the icammh and sia projects also helped to reach this scope. still concerning tacit knowledge, we decided that tkmp should also include a yellow pages system, since survey participants pointed out human resources as the most useful resource to be managed, and test case designers are in the scope of this km initiative. finally, we also decided to apply kdd for discovering useful knowledge from existing data and identifying the mined items. as presented in section 3, step 4, different mining methods can be used in the identification of relevant information in large volumes of data. in this project, for creating the mined items, the association rule method was used. the association rule method identifies patterns of behavior in the data set that often occur jointly in the database and models rules from these sets. association rules, when applied to a data set, allow finding rules of the type x → y, i.e., transactions of the database that contain x tend to also contain y. the association rule method was used along with the apriori algorithm (agrawal and srikant, 1994; witten et al., 2005), the best known among rule discovery methods (agrawal and srikant, 1994). 4.3 develop a testing kms considering the scope defined in the previous activity, tkmp was developed. the specification of the main requirements was developed, including the use case diagram and the class diagram (conceptual model). figure 3 shows a partial use case diagram describing the main functionalities of tkmp and its actors. the use cases in gray are general, in the sense that they apply to managing software engineering knowledge items of different natures. use cases in white represent testing-specific features. the developer is the main actor, representing all types of professionals involved in the software development process.
the knowledge manager represents a user with specific permissions, guaranteeing access to features inherent only to a knowledge manager. next, the use cases shown in figure 3 are briefly described. • create knowledge item: this use case allows developers to create a knowledge item. • create discussion-related knowledge item: this use case allows developers to register a discussion-related knowledge item. • create lesson learned: this use case allows developers to register a lesson learned. • create mined item: this use case allows the developer to register a mined item. • create test case: this use case allows developers to register a test case. • include test result: this use case allows the developer to include a test result relative to a test case. • include incident: this use case allows the developer to report an incident related to a test result. • include issue: this use case allows the developer to register an issue related to an incident. • change knowledge item: this use case allows the knowledge manager to change a knowledge item. • delete knowledge item: this use case allows the knowledge manager to delete a knowledge item. • pre-evaluate knowledge item: this use case allows the knowledge manager to pre-evaluate a knowledge item, making it available, rejecting it, or selecting experts to evaluate it. • evaluate knowledge item: this use case allows a developer to make a detailed evaluation of a knowledge item, to support the knowledge manager in making decisions about whether the item should be approved or rejected. • visualize knowledge item: this use case allows developers to visualize the details of a knowledge item. • visualize test case: this use case allows developers to visualize the details of a test case. • search knowledge item: this use case allows the developer to search for available knowledge items per informed parameters.
• search test case: this use case allows the developer to search for test cases per informed parameters. • value knowledge item: this use case allows the developer to rate the utility of a consulted knowledge item. • find experts: this use case allows the developer to find and select experts with a desired profile, as well as to view the profiles of the experts found. it works as a yellow pages system. figure 3. functionalities of tkmp

table 1. relationships between the survey questions (sq) and the research questions (rq) from the mapping study and roost

survey question | based on
sq1. in which activities of a testing process is km more useful? | roost: testing process and activities sub-ontology
sq2. in which activities of testing planning is km more useful? | roost: testing process and activities sub-ontology
sq3. a test environment consists of, among others, human resources, hardware, and software. about which of these resources is it more important to have available knowledge when defining the test environment? | roost: testing environment sub-ontology
sq4. in which testing level is km more useful? | roost: testing process and activities sub-ontology
sq5. what is the type of knowledge you consider to be more important during the software testing process? | mapping study: rq7. what are the types of knowledge items typically managed in software testing?
sq6. regarding the types of knowledge items listed below, indicate the importance of generating explicit knowledge from tacit knowledge. | mapping study: rq7
sq7. regarding testing artifacts, which are the ones you judge to be more appropriate for reuse? | mapping study: rq7; roost: testing artifacts sub-ontology
sq8. what is the purpose of applying km in software testing? | mapping study: rq6. what are the purposes of employing km in software testing?
sq9. what benefits can km bring to software testing? | mapping study: rq9. what are the main benefits and problems reported regarding applying km in software testing?

figure 4 shows a partial conceptual model of tkmp. this model focuses on knowledge items, in particular on test cases. classes in gray are derived from roost, i.e., the roost conceptual model was used as the starting point for specifying tkmp, mainly to structure its knowledge repository regarding the software testing notions. information from the software tools that compose the testing environment of the icammh project was used as the basis for identifying attributes and enumerated types, to specify tkmp in detail. these tools are testlink (http://testlink.org/), a web-based test management system, and mantisbt (http://www.mantisbt.org/), a bug tracking system. figure 4. conceptual model of tkmp. testlink offers support for test cases, test suites, test plans, test projects, user management, and reports. mantisbt is a bug (or defect) tracking system; however, it is often configured by users to serve as a more generic issue tracking system and project management tool. in the case of the icammh project, mantisbt was customized to deal with two categories of requests: activity-related requests and defect-related requests. in the context of the icammh project, an integration scheme between testlink and mantisbt was used. testlink can integrate with mantisbt, allowing a test case to be associated with a defect-related request. thus, all incidents that were registered in mantisbt as defect-related requests were conditioned to the existence of a test case in testlink. the tkmp project and requirements specifications are currently available at https://cutt.ly/kybolun. 4.4 load existing knowledge items once tkmp was developed, previously existing knowledge items from the two projects were loaded into the knowledge repository. initially, tkmp’s knowledge repository was populated with 1568 test cases extracted from the icammh project. next, other test cases from the sia project were also inserted in tkmp’s knowledge repository using tkmp’s functionalities. in the context of the icammh project, test case related information was stored both in testlink and in mantisbt. each of these tools has its own data repository, implemented in different ways, demanding an analysis of the structure of each one to load the data. moreover, each tool has its own terminology to represent the manipulated data, i.e., different terms are used to represent the same concept. thus, to load existing test cases, a feature was developed to connect to the repositories of mantisbt and testlink, get the data, and then convert them into objects (instances) of the data schema of tkmp. roost was used for mapping concepts between the involved tools. this procedure is illustrated in figure 5. figure 5. loading existing knowledge items
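the loading and reengineering step can be pictured as a small extract-and-map routine. the sketch below is illustrative only: the exported record layout and the term map are hypothetical stand-ins for the testlink and mantisbt schemas, which are not detailed here, and the tkmp class is reduced to three fields.

```python
from dataclasses import dataclass

@dataclass
class TkmpTestCase:
    """a test case knowledge item in a (simplified) tkmp-like schema."""
    identifier: str
    importance: str   # vocabulary aligned with the roost-based repository
    result: str

# different tools use different terms for the same concept; a term map
# normalizes them to the vocabulary adopted in the knowledge repository.
TERM_MAP = {
    "low": "low", "medium": "medium", "high": "high",
    "pass": "passed", "p": "passed",
    "fail": "failed", "f": "failed",
}

def convert_row(row: dict) -> TkmpTestCase:
    """reengineer one record exported from a testing tool into a tkmp item."""
    return TkmpTestCase(
        identifier=str(row["id"]),
        importance=TERM_MAP.get(str(row["importance"]).lower(), "medium"),
        result=TERM_MAP.get(str(row["result"]).lower(), "unknown"),
    )

# usage with a hypothetical export from a test management tool
rows = [{"id": 101, "importance": "High", "result": "f"}]
items = [convert_row(r) for r in rows]
print(items[0])  # TkmpTestCase(identifier='101', importance='high', result='failed')
```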
in this step, we decided to mine the stored knowledge items, since mined items were also considered in the scope of km initiatives in software testing (see section 4.2). data mining was performed on the icammh data. to create the mined items, the association rule method was used, together with the apriori algorithm, one of the best-known association rule algorithms; it can work with a large number of attributes, generating various combinations among them. for the generation of the associations with the apriori algorithm, the waikato environment for knowledge analysis (weka) tool was used (witten et al., 2005). weka is a collection of machine learning algorithms for data mining tasks. a brief explanation of how this kind of item can be generated is given below. considering only the test cases that failed, 415 records were returned from a query on the knowledge repository. table 2 presents the first 20 records returned and the 8 attributes considered in this data mining, corresponding to the classes human resource, test case, incident, and issue.

table 2. attributes analyzed (first 20 records)

test case author | execution author | importance | severity | reproducibility | issue status | priority | resolution status
7 | 7 | high | minor bug | always | closed | normal | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 7 | high | minor bug | always | closed | normal | fixed
7 | 7 | high | crashes the application or os | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | minor bug | always | resolved | normal | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 7 | high | major bug | always | resolved | high | fixed
7 | 2 | high | minor bug | have not tried | resolved | normal | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | resolved | normal | fixed
7 | 7 | high | minor bug | always | closed | normal | not a bug
7 | 2 | high | major bug | have not tried | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | resolved | normal | fixed

after loading the data set, the apriori algorithm was executed using the weka tool. weka returns the 10 most important associations; this number can be changed in the algorithm settings. the listing in table 3 shows the associations that were found.

table 3. results of the associations

rule | association
1 | issuestatus=resolved 236 ==> resolutionstatus=fixed 235 conf:(1)
2 | importance=medium issuestatus=resolved 226 ==> resolutionstatus=fixed 225 conf:(1)
3 | issuestatus=resolved priority=normal 219 ==> resolutionstatus=fixed 218 conf:(1)
4 | importance=medium issuestatus=resolved priority=normal 210 ==> resolutionstatus=fixed 209 conf:(1)
5 | issuestatus=resolved priority=normal 219 ==> importance=medium 210 conf:(0.96)
6 | issuestatus=resolved priority=normal resolutionstatus=fixed 218 ==> importance=medium 209 conf:(0.96)
7 | issuestatus=resolved 236 ==> importance=medium 226 conf:(0.96)
8 | issuestatus=resolved resolutionstatus=fixed 235 ==> importance=medium 225 conf:(0.96)
9 | issuestatus=resolved priority=normal 219 ==> importance=medium resolutionstatus=fixed 209 conf:(0.95)
10 | issuestatus=resolved 236 ==> importance=medium resolutionstatus=fixed 225 conf:(0.95)

analyzing the rules, some conclusions can be inferred. the fifth rule, for example, shows that out of 219 recorded incidents with status resolved and resolution priority normal, the importance of the test cases is medium in 210 of them. this is quite reasonable: when the importance of the test case is considered medium, an incident generated by this test case also tends to have a normal correction priority. as with all the other rules, one can see that the presented associations are consistent. regarding the associations returned from the tkmp data, no irregularities were detected. in this case, it is concluded that the classes used to generate the associations follow the registration patterns adopted by the project members. however, more classes could be incorporated into the associations to allow further analyses of the data. furthermore, other mining algorithms could be used. by using association rules combined with other mining methods, one could detect behaviors not visible to the naked eye, for example, noticing that a certain type of defect tends to appear when a certain software component is changed, or that the severity of test cases related to a particular module is always major. behaviors like these could help the responsible expert in project decisions related to the tests being conducted. for the registration of a knowledge item of the mined item type in tkmp, generic information about the item was considered, given the diversity of methods and algorithms that exist in data mining. in the conceptual model of tkmp (figure 4), the mineditem class shows the attributes available for the registration of a mined item: description, algorithm, result, and analysis.
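for readers who want to reproduce the flavor of table 3 without weka, the sketch below derives association rules from a toy set of attribute=value records. it follows the level-wise idea behind apriori but enumerates candidate itemsets by brute force, without apriori’s frequent-subset pruning, and the records are placeholders in the spirit of table 2, not icammh data.

```python
from itertools import combinations

# toy records: each record is the set of attribute=value pairs of one incident
records = [
    {"importance=medium", "issuestatus=resolved", "priority=normal", "resolutionstatus=fixed"},
    {"importance=medium", "issuestatus=resolved", "priority=normal", "resolutionstatus=fixed"},
    {"importance=high", "issuestatus=closed", "priority=high", "resolutionstatus=fixed"},
    {"importance=medium", "issuestatus=resolved", "priority=high", "resolutionstatus=fixed"},
]

def support(itemset):
    """fraction of records containing every item of the itemset."""
    return sum(itemset <= r for r in records) / len(records)

min_support, min_conf = 0.5, 0.9
items = sorted({i for r in records for i in r})

# collect frequent itemsets of growing size (the level-wise idea)
frequent = []
for size in range(1, 4):
    for candidate in combinations(items, size):
        s = support(set(candidate))
        if s >= min_support:
            frequent.append((set(candidate), s))

# derive rules x ==> y by splitting each frequent itemset
for itemset, s in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = set(lhs)
            conf = s / support(lhs)  # conf(x ==> y) = supp(x and y) / supp(x)
            if conf >= min_conf:
                print(sorted(lhs), "==>", sorted(itemset - lhs), f"conf:({conf:.2f})")
```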
4.5 evaluate the testing kms although tkmp is still considered a prototype, built as a proof of concept for the ontot-km approach, we decided to conduct different evaluations of this kms in order to capture the perception of software professionals about having a kms available for customization. tkmp went through a preliminary evaluation in two steps. firstly, tkmp was evaluated by the leaders of the two projects, icammh and sia. secondly, tkmp was made available on the web, and software engineering practitioners were invited to use it and then answer a questionnaire giving feedback in terms of usefulness, usability, and functional correctness. 4.5.1 evaluation with the project leaders once tkmp’s knowledge repository was populated with data from the two real projects (icammh and sia), demonstrations with the data obtained from the projects were made, and the leaders were asked to use and analyze the portal. then, we interviewed them in order to collect their opinions and impressions about tkmp. the interview was conducted in an unstructured manner and anonymously; this configuration allowed information to emerge more freely. we began the interview by considering three open questions to serve as a guide: “what is the perceived usefulness of tkmp?”, “do you think it’s easy to learn to use tkmp?”, and “do you notice inconsistencies when using tkmp?”. open questions allowed respondents a wide range of answers and diverse discussions about the tool. some of the leaders’ comments on tkmp are presented below. the leaders of both projects stressed the importance of such a system to better support the software testing process. positive responses were given by the leaders regarding tkmp in terms of usefulness, usability, and inconsistencies. with respect to the icammh project, the leader observed that there was always a great loss of knowledge due to the turnover rate of the team members. in her words, “a kms such as tkmp would be indeed beneficial for finding similar test cases to be reused in the design of new ones for other similar situations in different modules and future projects”. with respect to the sia project, the leader’s evaluation was that tkmp would be very important for dealing with critical systems. however, he pointed out that a challenge would be to change the team members’ culture, because many times the team is not ready for, or does not accept, new concepts, tools, and ideas.
table 3. results of the associations
1. issuestatus=resolved 236 ==> resolutionstatus=fixed 235, conf:(1)
2. importance=medium issuestatus=resolved 226 ==> resolutionstatus=fixed 225, conf:(1)
3. issuestatus=resolved priority=normal 219 ==> resolutionstatus=fixed 218, conf:(1)
4. importance=medium issuestatus=resolved priority=normal 210 ==> resolutionstatus=fixed 209, conf:(1)
5. issuestatus=resolved priority=normal 219 ==> importance=medium 210, conf:(0.96)
6. issuestatus=resolved priority=normal resolutionstatus=fixed 218 ==> importance=medium 209, conf:(0.96)
7. issuestatus=resolved 236 ==> importance=medium 226, conf:(0.96)
8. issuestatus=resolved resolutionstatus=fixed 235 ==> importance=medium 225, conf:(0.96)
9. issuestatus=resolved priority=normal 219 ==> importance=medium resolutionstatus=fixed 209, conf:(0.95)
10. issuestatus=resolved 236 ==> importance=medium resolutionstatus=fixed 225, conf:(0.95)

4.5.2 evaluation by software engineering practitioners

tkmp was also evaluated by 43 software engineering practitioners, based on gqm, tam, and functional correctness. the evaluation based on the gqm paradigm involved four steps: planning, definition, data collection, and interpretation.

(i) planning and definition. at gqm's conceptual level, measurement goals should be defined. we identified three goals for this evaluation and, from these goals, at the operational level, we defined seven questions, as table 4 shows. finally, at the quantitative level, we defined metrics associated with the questions in order to answer them measurably. for each question, as table 5 shows, we defined five metrics, each one aiming at computing the number of participants that strongly disagree (mg.q.1), disagree (mg.q.2), neither agree nor disagree (mg.q.3), agree (mg.q.4), or strongly agree (mg.q.5) with a statement corresponding to the question. figure 6 summarizes the gqm approach we followed. table 6 presents the statements used to represent the questions in the questionnaire that participants answered. questions q1.1–q1.4 were used to characterize the portal's usefulness, questions q2.1–q2.2 were used to collect data on the level of usability, and question q3.1 was used to evaluate tkmp's functional correctness. table 7 shows how to interpret the results; its lines should be read as "if <condition> then <conclusion>". for example, the interpretation of question 3.1 (q3.1) is "if m1+m2 < m4+m5, then the users did not notice inconsistencies when using tkmp", where m1, m2, m4, and m5 are the responses given by the participants (metrics). it is important to notice that m1 and m2 (see table 5) are answers that totally or partially disagree with the statement, while m4 and m5 are answers that totally or partially agree with it.

table 4. defined goals and questions
g1: evaluate tkmp usefulness
q1.1. what is the perceived usefulness of tkmp regarding creating software testing knowledge items?
q1.2. what is the perceived usefulness of tkmp regarding searching for software testing knowledge items?
q1.3. what is the perceived usefulness of tkmp regarding reusing software testing knowledge items?
q1.4. what is the perceived global usefulness of tkmp?
g2: evaluate tkmp usability
q2.1. to what extent do users recognize that it is easy to learn to use tkmp? (learnability)
q2.2. to what extent do users recognize that tkmp is appropriate for their needs? (appropriateness recognizability)
g3: evaluate tkmp functional correctness
q3.1. do users notice inconsistencies when using tkmp?

table 5. metrics used in the gqm
mg.q.1 | number of participants who strongly disagree
mg.q.2 | number of participants who disagree
mg.q.3 | number of participants who neither agree nor disagree
mg.q.4 | number of participants who agree
mg.q.5 | number of participants who strongly agree

table 6. statements used to refer to the questions
q1.1 | tkmp is useful to create software testing knowledge items.
q1.2 | tkmp is useful to search for software testing knowledge items.
q1.3 | tkmp is useful to reuse software testing knowledge items.
q1.4 | i would use or recommend the tkmp.
q2.1 | i learned to use the tkmp quickly.
q2.2 | i recognize tkmp as being suited to my tester needs.
q3.1 | i did not notice inconsistencies when using the tkmp.

in addition to the questions created at gqm's conceptual level, at the end of the questionnaire we presented three open questions to allow the participants to externalize their opinion about tkmp in terms of good points, bad points, and general comments.

(ii) data collection. the data used to evaluate tkmp were based on the metrics presented above. to collect the data, we asked experts in software organizations to use tkmp to perform activities to create, validate, and search for knowledge items. after using the tool, 43 participants answered a questionnaire containing the questions previously presented. considering the participants' profile, out of these 43, 8 hold doctoral degrees, 13 hold master's degrees, and 22 finished undergraduate programs. all of them are from the software engineering area, with an average of six years of experience in the area. in relation to software testing knowledge, 42.9% of participants reported having basic knowledge, 37.2% reported having intermediate knowledge, and 23.3% considered themselves to have advanced knowledge of software testing. a summary of the responses given by the participants is shown in table 8, which presents the number of responses according to the goals, questions, and metrics used.

figure 6. gqm approach to evaluate the tkmp

(iii) interpretation. figures 7, 9, and 10 present charts that show the answers per question used in our gqm model. these answers were interpreted according to table 7.

goal 1: evaluate tkmp usefulness. figure 7 presents the chart generated from the answers related to tkmp usefulness. applying the interpretation expressions shown in table 7 to this goal, the results show that the participants considered tkmp a useful tool for managing software testing knowledge items. regarding tkmp usefulness, we also carried out an analysis separating the 43 participants by professional position: professionals directly related to software development companies (23 professionals) and professionals directly related to scientific research (22 participants). this separation by position allowed us to infer how the software industry and the academic environment view the usefulness of the investigated topic. figure 8 presents the chart generated from the answers related to tkmp usefulness by position. in general, the analysis of the metrics in this chart shows that, both for industry professionals and for researchers, tkmp is the type of tool they would use or recommend, especially the research-related professionals (14 strongly agree).
despite the interest, industry professionals presented a lower perception of the usefulness of tkmp than academic researchers. the sm conducted by souza et al. (2015a) investigated the main problems reported in the implementation of km initiatives in software testing in organizations. the main problems mentioned were that km systems are not yet appropriate, that employees are normally reluctant to share their knowledge, and the increased workload. we believe that these problems may be related to the participants' responses.

table 7. results interpretation
01 | for q1.i, i=1 to 4: m1+m2 > m4+m5 | tkmp is not useful for managing software testing knowledge items.
02 | for q1.i, i=1 to 4: m1+m2 < m4+m5 | tkmp is useful for managing software testing knowledge items.
03 | for q1.i, i=1 to 4: m3 > m4+m5 or m1+m2 = m4+m5 | we cannot say that tkmp is useful to manage software testing knowledge items.
04 | for q2.1 and q2.2: m1+m2 > m4+m5 | tkmp cannot be easily used to manage software testing knowledge items.
05 | for q2.1 and q2.2: m1+m2 < m4+m5 | tkmp can be easily used to manage software testing knowledge items.
06 | for q2.1 and q2.2: m3 > m4+m5 or m1+m2 = m4+m5 | we cannot say that tkmp can be easily used to manage software testing knowledge items.
07 | for q3.1: m1+m2 > m4+m5 | tkmp cannot be considered functionally correct.
08 | for q3.1: m1+m2 < m4+m5 | tkmp can be considered functionally correct.
09 | for q3.1: m3 > m4+m5 or m1+m2 = m4+m5 | we cannot say whether tkmp is functionally correct or not.

table 8. results summary
goal | question | m1 | m2 | m3 | m4 | m5 | total
g1 | q1.1 | 0 | 0 | 0 | 19 | 24 | 43
g1 | q1.2 | 2 | 2 | 8 | 17 | 14 | 43
g1 | q1.3 | 0 | 1 | 7 | 16 | 19 | 43
g1 | q1.4 | 0 | 4 | 9 | 8 | 22 | 43
g2 | q2.1 | 0 | 2 | 11 | 17 | 13 | 43
g2 | q2.2 | 1 | 0 | 11 | 17 | 14 | 43
g3 | q3.1 | 3 | 5 | 14 | 14 | 7 | 43

on the other hand, in the academic area there is considerable growth in research on km and software engineering. in 2008, bjørnson and dingsøyr (2008) already pointed out the growing interest in research on km in software engineering, and this interest continues to the present day (menolli et al., 2015; vasanthapriyan et al., 2015; pinto et al., 2018; napoleão et al., 2021).

figure 7. questions and answers related to usefulness of tkmp
figure 8. questions and answers related to usefulness by position

goal 2: evaluate the usability of tkmp. figure 9 presents the chart generated from the answers related to usability. the results show that the participants considered that tkmp can be easily used to manage software testing knowledge items.

figure 9. questions and answers related to usability of tkmp

goal 3: evaluate the functional correctness of tkmp. figure 10 presents the chart related to functional correctness. the results show that tkmp can be considered functionally correct. however, even though the metrics point out that most participants did not find inconsistencies serious enough to keep them from using tkmp, figure 10 shows that some participants did find inconsistencies. we consider this a normal result, since tkmp is still a prototype.

figure 10. question and answers related to functional correctness of tkmp
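all the rules in table 7 reduce to comparing the disagreement mass (m1+m2) with the agreement mass (m4+m5), with m3 signalling an inconclusive result; a minimal sketch (function and labels are ours for illustration, not part of the tkmp implementation) applying that logic to the table 8 counts:

```python
# minimal sketch of the table 7 decision rules; names are ours, not tkmp's.
def interpret(m1, m2, m3, m4, m5):
    disagree, agree = m1 + m2, m4 + m5
    if m3 > agree or disagree == agree:   # table 7, lines 03, 06, 09
        return "inconclusive"
    return "positive" if disagree < agree else "negative"

# table 8 counts (m1..m5) for each question
table8 = {
    "q1.1": (0, 0, 0, 19, 24), "q1.2": (2, 2, 8, 17, 14),
    "q1.3": (0, 1, 7, 16, 19), "q1.4": (0, 4, 9, 8, 22),
    "q2.1": (0, 2, 11, 17, 13), "q2.2": (1, 0, 11, 17, 14),
    "q3.1": (3, 5, 14, 14, 7),
}
for question, counts in table8.items():
    print(question, interpret(*counts))
# every question comes out "positive": useful (g1), easy to use (g2),
# and, for q3.1, most users agreeing they noticed no inconsistencies (g3)
```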
as mentioned in the questionnaire planning, we presented three open questions so that professionals could externalize good points, bad points, and general comments about tkmp. as can be seen in figure 7 (usefulness), figure 9 (ease of use), and figure 10 (functional correctness), some professionals chose the options "strongly disagree" or "disagree" in the tkmp evaluation. we analyzed the open-ended responses that these practitioners wrote to identify improvements to the tool and, consequently, to the approach. when analyzing the responses of these 10 participants, most of the comments were related to the functional correctness analysis (figure 10). in general, we noticed that many of the observations concerned the small inconsistencies identified in the tool. two of the professionals, for example, mentioned that the search for knowledge items could be made faster. one of the professionals mentioned that, when there is a lot of data to be returned from the database, more advanced implementation strategies could be used to optimize this process. other comments for improvement were: the system would work better with images; allow access to instructional help in any part of the tool; keep all fields in the tool case sensitive; and allow sharing of information via email, as well as sending evaluations of knowledge items by email. it is worth noting that tkmp is a prototype intended as a proof of concept. despite this, all suggestions for improvement will be considered in the evolution of this research.

it is also possible to notice, especially in the charts of figures 7 and 9, that a considerable number of participants chose the option "neither agree nor disagree" for tkmp usefulness and usability. when analyzing the responses of these participants separately (15 participants), we did not find any pattern that justified this choice. we only noted that, concerning the level of knowledge in software testing, 10 of these participants mentioned having basic or intermediate knowledge. it is not possible to be certain, but we believe that limited experience with software testing may have some influence on the answers about tkmp utility and usability.

4.6 other partial applications of ontot-km

we also started to apply ontot-km in software organizations. three companies evaluated ontot-km and tkmp. first, we conducted the diagnosis and scope definition activities in these three companies by applying a questionnaire based on the survey presented in souza et al. (2015b). the respondents were the software testers responsible for the software testing activities within the companies. for privacy reasons, we do not mention the companies' names. however, some of their characteristics are: located in brazil; medium-sized software organizations; and their main products are systems for the fiscal area, such as electronic fiscal receipts and metrology, as well as customized systems to meet the needs of customers from diverse segments.
the main diagnosis results for the three companies are: (i) "test case design" was the activity in which km was considered most useful; (ii) "test environment structuring" was the testing planning activity in which km is most useful; (iii) "human resource" and "software resource" are considered the resources about which it is most important to have knowledge available when setting up the test environment; (iv) explicit knowledge was considered more important than tacit knowledge; (v) "test plan" and "test case" were considered the most reusable artifacts; (vi) there is no formal instrument for km within the three companies; and (vii) "increasing the testing process efficiency" and "best test case selection" are the main expected benefits of applying km in software testing. these results were very close to the results obtained in the general survey applied to the 86 participants in souza et al. (2015b).

from the diagnosis results in the companies, it was possible to establish the scope for the software testing km initiatives. test plan definition and test case design were considered the software testing activities to be supported first, and test cases the main knowledge item to be managed. until now, we have not conducted the remaining activities of the ontot-km approach (develop a testing kms, load existing knowledge items, and evaluate the testing kms), although the companies have shown interest in developing their own kms solutions. we intended that the companies would also use tkmp. since the diagnosis results were similar, we proposed to the three companies the use of the tkmp already developed within the research project's scope. the companies were free to upload their organization's data into tkmp or to register new data if they wished. the purpose was to analyze whether an already existing km tool, such as tkmp, could be customized by the organization to meet its current needs. some suggested customizations were: (i) implementing a traceability matrix between test cases and lessons learned in order to assist test coverage; (ii) developing a repository of artifacts (a historical basis); and (iii) turning tkmp into a plug-in that integrates with project management tools (e.g., jira, redmine).

the two fronts analyzed in this study were well accepted by the three organizations that participated in the survey. the participants mentioned that it is interesting to have their own kms solution built using ontot-km; however, this would only be possible if the company had a team to develop the system. on the other hand, it is also attractive to have a more general, open source kms available to be customized by the company. we believe that the experience gained from the evaluations of both ontot-km and tkmp gives us motivation and directions for future work; for example, we intend to consider some of the customization suggestions and enhancements to be implemented in tkmp, since there is an intention to create a robust version of the portal to be made available to the software testing community.

4.7 study limitations

there were limitations in the study. the first limitation refers to the low representativeness of the companies participating in the study (3 companies). validating an approach such as ontot-km in a real environment requires the authorization and trust of the organization to use its data and information and to allocate employees for system development.
however, we noticed an enormous barrier to this. several other companies were invited to apply ontot-km; while they recognized the benefits of km in software testing, many refused to participate. the invited organizations mentioned that the idea of implementing km in the organization, even with an existing tool, could generate an increased workload. this position is in line with the results detected through the sm conducted in souza et al. (2015a): shortage of time is a potential risk to incorporating km principles in software testing, because knowledge sharing can imply increases in employee workload and costs. we intend to continue inviting software companies to participate in the research and to look for strategies that allow companies to feel safe in using or building a kms, for example, allowing the company to install the system on-premises, on its own database server.

a second limitation of this research concerns the sample size of the software engineering practitioners who answered the questionnaire: 43 practitioners, of whom 23 are professionals directly related to software development companies. the results cannot be generalized. therefore, we intend to replicate this survey with as many software practitioners as possible, in real projects in the industry. in addition, we also intend to conduct interviews with these professionals. the purpose of the interviews is to better understand the responses the professionals gave about tkmp, for example, the reasons that led them to choose the option "neither agree nor disagree", as can be seen in figures 7, 9, and 10.

another limitation is related to step 1 of ontot-km, which was not employed in either project (icammh and sia). for this step, we used the results of a survey; the diagnosis step was not made exclusively for the projects in question. however, some survey participants were team members and leaders of the icammh and sia projects, and we believe that their participation may have helped to achieve a diagnosis specific to the projects under study.

5 related work

different approaches to the development of kmss can be found in the literature. dehghani and ramsin (2015) provided a review of seven methods for kms development. in general, these methods provide activities, principles, and techniques for applying km in organizations (r-montano et al., 2001; calabrese and orlando, 2006; chalmeta and grangel, 2008; iglesias and garijo, 2008; sarnikar and deokar, 2010; moteleb et al., 2009; amine and ahmed-nacer, 2011). some of these kms methodologies are presented below.

chalmeta and grangel (2008) presented a methodology called km-iris, defined at a general level so that it can be used as a guide for managing knowledge in any kind of organization. the methodology is divided into five phases: (i) analysis and identification of the target knowledge; (ii) extraction of the target knowledge; (iii) classification and representation; (iv) processing and storage, in which an operational kms is implemented; and (v) utilization and continuous improvement through use of the kms. chalmeta and grangel (2008) mention that ontologies can be used in the first phase of the methodology: after the knowledge is identified, it can be detailed based on an ontological classification so that it can be represented, processed, and used in later phases.
ontologies are also suggested by chalmeta and grangel (2008) for the second phase of the methodology. ontot-km also has guidelines to identify target knowledge, called knowledge items, and these items should be ranked. ontot-km is based on a testing ontology, and both the diagnosis of the testing environment and the development of the kms are strongly related to this ontology.

r-montano et al. (2001) presented a methodology to develop a kms whose phases are as follows: (i) strategic planning; (ii) modeling of logical and physical aspects, specifying the strengths and weaknesses of the organizational km process; (iii) development of a kms prototype; (iv) verification and validation of the kms through practical usage of the system; and (v) deployment and maintenance of the kms. similar to the methodology of r-montano et al. (2001), ontot-km also proposes a planning stage, called diagnosis, as well as the generation of models to support the construction of a kms and its validation. in ontot-km, however, these main activities are supported by a software testing ontology.

calabrese and orlando (2006) presented a methodology that consists of 18 phases: (i) km principles and governance; (ii) organizational structure and sponsorship; (iii) requirements analysis; (iv) measurement; (v) knowledge audit; (vi) initiative scoping; (vii) prioritization; (viii) technology solution assessment; (ix) planning the development of the kms; (x) knowledge elicitation; (xi) building the kms; (xii) verification and validation of the kms; (xiii) review and update of the kms; (xiv) knowledge maintenance processes; (xv) communication and change management; (xvi) training and publishing the kms; (xvii) maintenance and support; and (xviii) measurement and reporting. in general, the methodology presented by calabrese and orlando (2006) is a detailed account of the process for constructing a kms. for example, in the ontot-km process (figure 2) it is possible to notice that, after the activity of evaluating the testing kms, improvements can be made to the system by returning to previous process activities; however, we do not treat this action as an explicit activity but as a relationship arrow. in the case presented by calabrese and orlando (2006), this action is covered by phases (xiii), (xiv), (xv), (xvii), and (xviii).

in sarnikar and deokar (2010), a methodology is presented to direct the development process based on the workflows within the organization. the methodology consists of 7 design steps: (i) development of the organization's business process model; (ii) knowledge intensity identification; (iii) requirements identification; (iv) knowledge sources identification; (v) knowledge reuse assessment; (vi) task-user knowledge profile development; and (vii) design of the system components to support the tasks investigated in the previous phases. differently from the ontot-km process and from the other methodologies presented in dehghani and ramsin (2015), the sarnikar and deokar (2010) methodology addresses the design and construction of the kms only in its last phase.

iglesias and garijo (2008) presented a methodology that is not specifically targeted at developing a kms but can be effectively used for this purpose.
iglesias and garijo (2008) proposed the mas-commonkads methodology, which extends object-oriented and knowledge engineering techniques for the conceptualization of multi-agent systems. the phases of the methodology are as follows: (i) obtaining the initial view of the problem domain; (ii) discovering system requirements; (iii) designing the system; (iv) developing and testing; and (v) operating and maintaining the system. in the design phase, an initial set of agents is determined and a model is developed; the communication between the agents is expressed in an ontology.

in amine and ahmed-nacer (2011), an ontology-based agile methodology is presented to develop a kms that reduces the risks of component-based development by managing the knowledge needed for component selection, update, and maintenance. the phases are as follows (the last four are iterative): (i) initialization, whose main objective is to gain the deepest possible understanding of the organization, and in which an initial ontology of the organization's domain can be created; (ii) domain mapping, which continuously refines the problem domain ontologies created in the initialization phase; (iii) profiles and policies identification, which specifies the authentication mechanisms and the level of system access allowed for each user; (iv) implementation and personalization of the kms; and (v) verification and validation of the kms. the phases of the methodology proposed by amine and ahmed-nacer (2011) are very similar to ontot-km's. as with amine and ahmed-nacer (2011), we also use the resources of an ontology. unlike amine and ahmed-nacer (2011), however, the ontology we use is not created based on the organization; it is an already validated domain ontology that aims at establishing a common conceptualization of the software testing domain.

finally, the methodology presented by moteleb et al. (2009) aims at using practical experiences for developing kmss in small organizations. it is divided into five phases: (i) sense-making, which investigates whether kms development is a conceivable solution for the organizational problems; (ii) categorization of the conceivable solutions through communication with the stakeholders; (iii) design of the system based on the solutions presented in the previous phase; (iv) specification of the appropriate technologies based on the technical, social, and organizational features of the kms; and (v) monitoring and maintenance of the kms. ontot-km also analyzes the solution for the organization in the diagnosis phase, as well as the design to construct the kms; however, as mentioned, ontot-km is supported by a software testing ontology, since this domain is the goal of ontot-km.

table 9 presents a brief comparison of the related work discussed above, which presented approaches to the development of systems for supporting km. to the best of our knowledge, there is no method devoted to developing a kms for supporting km in software testing. for this reason, we compared the system developed using ontot-km (tkmp) with other works addressing km in software testing. these works are among the ones selected in the mapping study on initiatives applying km in software testing presented in souza et al. (2015a); the studies retrieved in this mapping were used here as a baseline for comparison with our work. most of the studies provide automated support for managing testing knowledge by employing a kms.
in addition, the mapping results point out that test case reuse has been the major focus of these initiatives. these results are in line with the findings of the survey that guided us in the development of tkmp, namely that test cases are the main knowledge item to be managed.

in janjic and atkinson (2013), an automated test recommendation approach was presented that proactively makes test case suggestions while a tester is designing test cases. the authors developed a prototype of an automated, non-intrusive test recommendation system called test tenderer. a search engine, called sentre, uses the current test case to search for reusable, semantically matching components. analogously to janjic and atkinson (2013), test case design was considered the software testing activity to be supported by tkmp. however, test tenderer addresses unit testing, while tkmp is more general. moreover, although janjic and atkinson say that sentre searches for reusable, semantically matching components, the heuristics applied are name-based searches. in tkmp, in turn, the knowledge repository is structured based on roost, which is also used as the basis for the search functionality. finally, test tenderer works non-intrusively in the background and integrates smoothly into normal working environments: the developer's normal working practices are not disturbed, and they only need to break away from the task of writing new test cases to consider already existing tests suggested by the recommendation engine. tkmp, on the other hand, does not proactively suggest test cases; testers must issue a query to retrieve similar test cases.

the technologies used to support km in software testing were another important question investigated by the mapping, which showed that knowledge maps/yellow pages seem to deliver good results. a knowledge map contains information about the experiences that employees possess. in liu et al. (2009), for instance, a km model was created whose main components include a knowledge map repository; the system identifies, by means of statistics, the staff with relevant knowledge, improving the culture of knowledge sharing in the enterprise. analogously, tkmp also provides a yellow pages feature.

table 9. characteristics for different approaches to the development of kmss
approach | objective | number of phases | ontology | evaluation
chalmeta and grangel (2008) | methodology for directing the process of developing and implementing a kms in any type of organisation | 5 | ontologies are suggested for the steps (i) analysis and identification of the target knowledge and (ii) extraction of the target knowledge | the methodology was applied to a large textile enterprise
r-montano et al. (2001) | recommendations to develop a kms | 5 | - | -
calabrese and orlando (2006) | process for a comprehensive kms | 12 | - | a sensitivity/realism assessment using an actual configuration management application to demonstrate the utility of the process was conducted
sarnikar and deokar (2010) | a design process for kms | 7 | - | the design process was validated by demonstrating the feasibility of the proposed design process and comparing the approach with other modeling approaches
iglesias and garijo (2008) | methodology mas-commonkads that extends object-oriented and knowledge engineering techniques for the conceptualization of multi-agent systems | 5 | an ontology can be used in the communication between the agents | a case study was conducted in a travel agency context
amine and ahmed-nacer (2011) | implementation of kms using component-based software engineering (cbse) | 5 | an ontology-based agile methodology was used | a case study of the application of the methodology was conducted in a software organization
moteleb et al. (2009) | use of practical experiences for developing kmss in small organizations | 5 | - | the approach was validated in practice by an inquiry into a number of problems experienced by particular organizations
ontot-km | development of an ontology-based approach for km in software testing | 5 | a reference ontology on software testing was used | a kms was developed as a proof of concept and evaluated in terms of usefulness, usability, and functional correctness

li and zhang (2012) present a knowledge management model, one of whose elements is also a knowledge map. this model is based on an ontology of reusable test cases; however, that ontology has limited coverage when compared with roost.

6 conclusions

this work presented our experiences in developing an approach to assist in launching km initiatives in software testing. ontot-km provides guidelines to apply km through the development of kmss, based on a software testing ontology. although there are approaches for developing kmss (dehghani and ramsin, 2015), to the best of our knowledge there is no approach devoted to developing a kms for supporting km in software testing; in this respect, ontot-km is an original contribution. the results show that the kms developed from ontot-km is a potential system for managing knowledge in software testing, so the approach can guide km initiatives in software testing.

an approach like ontot-km can support different scenarios in software development companies. organizations that develop different products or product lines, for example, have a large turnover of knowledge when compared to organizations that build specific software for each client/project (matturro and silva, 2005). hence, the reuse of testing knowledge becomes more frequent in the later phases of software development. thus, a km system such as tkmp would allow searching for solutions to similar problems registered in the tool. reuse is related not only to similar test cases, but also to lessons learned, best practices, and patterns of behavior in the project that can be identified through mined items and that can be reused or at least assist in project decisions. in relation to the ontot-km evaluation, in this work we intended to evaluate both the approach and the generated kms.
now we intend to apply the diagnosis to as many software development companies as possible to reach a common scope to be implemented in a general kms. this kms will be part of an environment already maintained by this research project, called software engineering knowledge management diagnosis (seknow) (santos et al., 2019). currently, seknow supports only the analysis of km in software development organizations (the diagnosis step); however, given the evolution of the research, seknow has been undergoing adaptations to cover more activities related to km and software organizations. as future work, we also intend to extend tkmp considering other conceptualizations established by roost, and to conduct more experimental studies to confirm the results of the evaluations discussed in this paper.

as mentioned earlier, we will apply the km diagnosis in software development companies that maintain projects in different domains, with agile or traditional development. the objective of the km diagnosis is to measure the organization's current state of km. a km diagnosis can help the company to understand its real needs before devoting costly efforts to km implementation and thus better target km initiatives at strategic points (bukowitz and williams, 2000). conducting km diagnostics in different domains of software development can show how km activities are present in environments with agile or traditional practices. for this reason, we have been conducting a synthesis on km and agile software development (asd) (napoleão et al., 2021), which will certainly be considered in the next stages of this project. just like asd, development and operations (devops) practices are also strongly related to km. devops is a methodology that combines flexibility with rigorous testing and communication routines, aiming to deliver software efficiently and quickly (mishra and otaiwi, 2020). the adoption of devops provides an organization with many benefits, including quality, but it also brings challenges, for example, knowledge reuse. it is in our interest to apply the study conducted in this work in an organization that adopts devops and to measure to what extent it is possible to manage knowledge in software testing in this context.

acknowledgements

the first author would like to thank professor ricardo de almeida falbo (in memoriam) for successfully leading this work and sharing his valuable advice. the authors would like to thank the brazilian aeronautics institute of technology (ita) and the brazilian agency of research and projects financing (finep), project 5206/06 icammh; the sia project for providing the data; and the brazilian funding agency cnpq, project 432247/2018-1. all participants who used tkmp and answered the evaluation questionnaire are also duly acknowledged.

references

abran, a., bourque, p., dupuis, j., and moore, w. (2004). guide to the software engineering body of knowledge swebok. technical report, a project of the ieee computer society professional practices committee.
agrawal, r. and srikant, r. (1994). fast algorithms for mining association rules in large databases. in 20th international conference on very large data bases, pages 487–499.
amine, m. and ahmed-nacer, m. (2011). an agile methodology for implementing knowledge management systems: a case study in component-based software engineering. software engineering applications, 5:159–170.
andrade, j., ares, j., martinez, m., pazos, j., rodriguez, s., romera, j., and suarez, s. (2013). an architectural model for software testing lesson learned systems. information and software technology, 55:18–34.
basili, v. and rombach, h. d. (1991). support for comprehensive reuse. software engineering journal, 6:303–316.
basili, v. r., caldiera, g., and rombach, h. d. (1994). the goal question metric paradigm. in encyclopedia of software engineering. john wiley & sons, new york.
bjørnson, f. o. and dingsøyr, t. (2008). knowledge management in software engineering: a systematic review of studied concepts, findings and research methods used. information and software technology, 50:1055–1068.
black, r. and mitchell, j. l. (2011). advanced software testing. rocky nook, usa, 3 edition.
bukowitz, w. and williams, r. l. (2000). the knowledge management fieldbook. financial times prentice hall, great britain.
burnstein, i. (2003). practical software testing: a process-oriented approach. springer professional computing, new york, 3 edition.
calabrese, f. and orlando, c. (2006). deriving a 12-step process to create and implement a comprehensive knowledge management system. journal of information and knowledge management systems, 3(36):238–254.
chalmeta, r. and grangel, r. (2008). methodology for the implementation of knowledge management systems. journal of the american society for information science and technology, 5(59):742–755.
davenport, t. h. and prusak, l. (2000). working knowledge. harvard business school press, boston, usa, 2 edition.
davis, f. d. (1993). user acceptance of information technology: system characteristics, user perceptions and behavioral impacts. international journal of man-machine studies, 38:475–487.
dehghani, r. and ramsin, r. (2015). methodologies for developing knowledge management systems: an evaluation framework. journal of knowledge management, 19(4):682–710.
falbo, r. a. (2014). sabio: systematic approach for building ontologies. in 8th international conference on formal ontology in information systems.
falbo, r. a., arantes, d. o., and natali, a. c. c. (2004). integrating knowledge management and groupware in a software development environment. in international conference on practical aspects of knowledge management, pages 94–105.
falbo, r. a., barcellos, m., nardi, j., and guizzardi, g. (2013). organizing ontology design patterns as ontology pattern languages. in extended semantic web conference, montpellier.
falbo, r. a., ruy, f. b., guizzardi, g., barcellos, m. p., and almeida, j. p. a. (2014). towards an enterprise ontology pattern language. in symposium on applied computing, gyeongju.
fayyad, u., piatetsky-shapiro, g., and smyth, p. (1996). from data mining to knowledge discovery in databases. ai magazine, 17(3):37–54.
fischer, g. and ostwald, j. (2001). knowledge management: problems, promises, realities, and challenges. ieee intelligent systems, 16:60–72.
herrera, r. j. g. and martin-b, m. j. (2015). a novel process-based kms success framework empowered by ontology learning technology. engineering applications of artificial intelligence, 45:295–312.
iglesias, c. and garijo, m. (2008). the agent-oriented methodology mas-commonkads. in intelligent information technologies: concepts, methodologies, tools, and applications, information science, pages 445–468.
iso/iec (2011). iso/iec 25010: systems and software engineering - systems and software quality requirements and evaluation (square) - system and software quality models.
janjic, w. and atkinson, c. (2013). utilizing software reuse experience for automated test recommendation. in international workshop on automation of software test, pages 100–106, san francisco.
kitchenham, b. and charters, s. (2007). guidelines for performing systematic literature reviews in software engineering. technical report ebse 2007-001, keele university and durham university, uk.
li, x. and zhang, w. (2012). ontology-based testing platform for reusing. in international conference on internet platform for reusing, pages 86–89, henan, china.
liu, y., wu, j., liu, x., and gu, g. (2009). investigation of knowledge management methods in software testing process. in international conference on information technology and computer science, pages 90–94, kiev.
mathur, a. p. (2012). foundations of software testing. pearson education in south asia, india, 5 edition.
matturro, g. and silva, a. (2005). a knowledge-based perspective for preparing the transition to a software product line approach. in international conference on software product lines, pages 96–101, berlin, heidelberg.
menolli, a., cunha, m. a., reinehr, s., and malucelli, a. (2015). “old” theories, “new” technologies: understanding knowledge sharing and learning in brazilian software development companies. information and software technology, 58:289–303.
mishra, a. and otaiwi, z. (2020). devops and software quality: a systematic mapping. computer science review, 38:100308.
moteleb, a., woodman, m., and critten, p. (2009). towards a practical guide for developing knowledge management systems in small organizations. in european conference on knowledge management, pages 559–570.
myers, g. j. (2004). the art of software testing. john wiley and sons, canada, 2 edition.
napoleão, b. m., souza, e. f., ruiz, g. a., felizardo, k. r., meinerz, g. v., and vijaykumar, n. l. (2021). synthesizing researches on knowledge management and agile software development using the meta-ethnography method. journal of systems and software, 178:110973.
nonaka, i. and krogh, g. (2009). tacit knowledge and knowledge conversion: controversy and advancement in organizational knowledge creation theory. organization science, 30:635–652.
nonaka, i. and takeuchi, h. (1997). the knowledge-creating company. oxford university press, oxford, usa.
o’leary, d. and studer, r. (2001). knowledge management: an interdisciplinary approach. ieee intelligent systems, 16(1).
o’leary, d. e. (1998). enterprise knowledge management. ieee computer magazine, pages 54–61.
park, r., goethert, w., and florac, w. (1997). goal-driven software measurement. handbook cmu/sei-96-hb-002.
pinto, d., oliveira, m., bortolozzi, f., matta, n., and tenório, n. (2018). investigating knowledge management in the software industry: the proof of concept’s findings of a questionnaire addressed to small and medium-sized companies. in 10th international joint conference on knowledge discovery, knowledge engineering and knowledge management kmis, pages 73–82.
r-montano, b., liebowitz, j., buchwalter, j., mccaw, d., newman, b., and rebeck, k. (2001). a systems thinking framework for knowledge management. decision support systems, 31:5–16.
rokunuzzaman, m. and choudhury, k. p. (2011). economics of software reuse and market positioning for customized software solutions. journal of software, 6:31–1029.
ruy, f. b., falbo, r., barcellos, m., costa, s. d., and guizzardi, g. (2016). seon: a software engineering ontology network. in 20th international conference on knowledge engineering and knowledge management (ekaw), pages 527–542.
santos, v., salgado, j. g., souza, e. f., felizardo, k. r., and vijaykumar, n. l. (2019). a tool for automation of knowledge management diagnostics in software development companies. in brazilian conference on software: theory and practice (cbsoft), tools session.
sarnikar, s. and deokar, a. (2010). knowledge management systems for knowledge-intensive processes: design approach and an illustrative example. in international conference on system sciences, pages 1–10.
souza, e. f. (2014). knowledge management applied to software testing: an ontology based framework. thesis in computer science, national institute for space research (inpe), brazil.
souza, e. f., falbo, r. a., specimille, m. s., coelho, a. g. n., vijaykumar, n. l., felizardo, k. r., and meinerz, g. v. (2020). experience report on developing an ontology-based approach for knowledge management in software testing. in 19th brazilian symposium on software quality experience reports (sbqs ’20), pages 1–10.
souza, e. f., falbo, r. a., and vijaykumar, n. (2017). roost: reference ontology on software testing. applied ontology, 12:59–90.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2013). ontology in software testing: a systematic literature review. in research seminar ontology of brazil (ontobras), pages 71–82, belo horizonte.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2015a). knowledge management initiatives in software testing: a mapping study. information and software technology, 57:378–391.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2015b). using lessons learned from mapping study to conduct a research project on knowledge management in software testing. in 41st euromicro conference on software engineering and advanced applications (seaa), pages 208–215, madeira, portugal.
staab, s., studer, r., schurr, h. p., and sure, y. (2001). knowledge processes and ontologies. intelligent systems, 16:26–34.
storey, j. and barnett, e. (2000). knowledge management initiatives: learning from failure. journal of knowledge management, 4:145–156.
thrane, c. (2011). quantitative models and analysis for reactive systems. thesis in applied computing, department of computer science, aalborg university, denmark.
vasanthapriyan, s., tian, j., and xiang, j. (2015). a survey on knowledge management in software engineering. in international conference on software quality, reliability and security companion (qrs-c), pages 237–244, vancouver, bc, canada.
werner, j. (2014). reuse-based test recommendation in software engineering. phd thesis, universität mannheim, mannheim. also published in print by verlag dr. hut, münchen.
witten, i. h., frank, e., and hall, m. a. (2005). data mining: practical machine learning tools and techniques. morgan kaufmann, san francisco, 3 edition.
yun, h., ha, d., hwang, b., and ryu, k. (2003). mining association rules on significant rare data using relative support. journal of systems and software, 67:181–191.
zack, m. and serino, m. (2000). knowledge management and collaboration technologies. in knowledge, groupware and the internet, pages 303–315, butterworth.
journal of software engineering research and development, 2022, 10:1, doi: 10.5753/jserd.2021.1973 this work is licensed under a creative commons attribution 4.0 international license.

tact: an instrument to assess the organizational climate of agile teams - a preliminary study

eliezer dutra [ unirio and cefet/rj | eliezer.goncalves@cefet-rj.br ]
patrícia lima [ unirio | patricia.lima@edu.unirio.br ]
cristina cerdeiral [ univeris | cerdeiral@gmail.com ]
bruna diirr [ unirio | bruna.diirr@uniriotec.br ]
gleison santos [ unirio | gleison.santos@uniriotec.br ]

abstract
background: measuring the organizational climate of agile teams is a challenge for organizations, mainly because of the shortage of instruments specific to agile methodologies. on the other hand, finding companies willing to participate in the preliminary validation of an instrument is a challenge for organizational climate researchers. preliminary validation allows identifying problems and improvements in the instrument. objective: we present the preliminary evaluation of tact, an instrument to assess the organizational climate of agile teams. its initial version comprises the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions. method: we planned and executed a case study considering three development teams. we evaluated tact using open-ended questions, quantitative methods, and the tam dimensions of intention to use, perceived usefulness, and output quality. results: tact made it possible to classify the organizational climate of the teams for the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions. some items were assessed negatively or neutrally, which represents points of attention. tact captured the lack of agile ceremonies, the difficulty of the product owner in planning iterations, and the distance of the leadership. in addition, the tact dimensions presented high levels of reliability. conclusions: tact captured the organizational climate of the teams adequately, and the team leaders reported an intention of future use. the items that compose tact can be used by researchers investigating the influence of human factors in agile teams and by practitioners who need to design organizational climate assessments of agile teams. by using an instrument adapted to assess the organizational climate of agile teams, an organization can better identify issues and improvement actions aligned with agile values, principles, and practices.
keywords: organizational climate, agile software development, human factor influence

1 introduction

several factors can influence the organizational climate of agile software development teams, such as trust, openness, respect, team engagement, a culture of action and change, innovation, leadership, communication, personality, software quality, performance, support from management, and the availability of resources for the project (acuña et al., 2008; soomro et al., 2016; grobelna and stefan, 2019; serrador et al., 2018; vishnubhotla et al., 2020). curtis et al.
(2009) propose that organizations should periodically identify each person's opinion on their working conditions. the authors recommend the organizational climate survey as a means to learn and understand the factors influencing teams, their activities, and, consequently, the software's quality (curtis et al., 2009). the instrument used in the assessment of the organizational climate must consider the most critical factors in the domain, as the organizational climate is evaluated through the behaviors, attitudes, feelings, policies, practices, and procedures that characterize life in the organization (lenberg et al., 2015; schneider et al., 2014). vishnubhotla et al. (2020) point out the need for further studies to investigate the influence of human factors on the organizational climate of agile teams. both academia and industry suggest that collaboration, communication, autonomy, decision-making, client involvement, and leadership are critical human factors that influence agile software development projects (chagas et al., 2015; dybå and dingsøyr, 2008).

to assess the organizational climate of agile teams, organizations should select organizational climate instruments that measure the desired factors. many organizations may find it difficult to select instruments for copyright reasons. hiring a specialized consulting company can aid this process; however, dutra et al. (2012) report that many consulting companies do not disclose details of how the instrument was designed, its reliability, or the statistical procedures adopted for its validation. several studies have investigated the impact of human factors in agile projects (chagas et al., 2015; vishnubhotla et al., 2018), including surveys with members of agile teams (grobelna and stefan, 2019). however, the literature review we conducted did not identify studies that report the design of scales, models, or questionnaires specifically intended to assess the organizational climate of agile teams. some studies use generic scales/questionnaires that can be used in different business domains (acuña et al., 2008; vishnubhotla et al., 2020). other studies only present factors that exert some influence on the organizational climate of agile teams (serrador et al., 2018; soomro et al., 2016).

in previous work, dutra et al. (2020) presented the initial version of tact, "an instrument to assess the organizational climate of agile teams". tact was devised and validated preliminarily for the communication, collaboration, and leadership dimensions, and the instrument's dimensions showed high reliability. in the current work, we extended the initial study by adding the client involvement, autonomy, and decision-making dimensions, creating new items to measure the organizational climate of the teams considered in the previous study, and expanding the users of tact to include a third team.
moreover, we increased the literature background to show the constructs (delgado-rico et al., 2012) considered to guide the creation of the tact items, and we used factor analysis to identify the most influential items for each dimension considered in the case study. this study aims to evaluate tact preliminarily for the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions. tact was built considering the main human factors that influence agile teams, and two specialists confirmed the validity of the tact items with respect to agility. the data collection procedures used in the case study showed that tact evaluated the organizational climate correctly for the three teams. the quantitative analysis indicated the most influential items for each dimension in the case study, and the tact items showed high factor loadings. tact showed excellent psychometric indices, for example, high inter-item spearman correlations (ρ) and a high cronbach's alpha value (> 0.8). practitioners can use the tact items in their organizational climate assessments, and researchers can explore new evidence of the reliability and validity of the tact dimensions.

the paper is organized as follows: section 2 discusses the organizational climate in agile teams; section 3 presents the design of tact; section 4 deals with the study planning; section 5 presents the results; in section 6, we discuss the results; section 7 addresses the study limitations and threats to validity; finally, section 8 presents our final considerations.

2 background

2.1 specific characteristics for the formation of the organizational climate of agile teams

the organizational climate is the meaning that employees attribute to the policies, practices, and procedures they experience, as well as to the behaviors they observe being rewarded, supported, and expected (schneider et al., 2014). as such, members of agile teams expect the values, practices, and adopted procedures, and even the behavior of those involved, to reflect the values, principles, and practices of the "agile philosophy" (hohl et al., 2018; beck et al., 2001). agile methods differ from traditional development methods in several aspects (dybå and dingsøyr, 2008; pmi and agile alliance, 2017). leadership, collaboration, communication, autonomy, decision-making, and client involvement are examples of factors that demand different behaviors from those involved, as they impact the adoption and use of agile methods (dybå and dingsøyr, 2008; chagas et al., 2015; noll et al., 2017; jia et al., 2016).

schneider et al. (2014) claim that leadership is a crucial point in the formation of the climate in organizations. in agile development, leadership is based on the role of the servant leader (pmi and agile alliance, 2017). pmi and agile alliance (2017) argue that servant leadership is the practice of leading by service, focusing on comprehending and developing the team members as well as meeting their needs, to enable them to perform at their best. dybå and dingsøyr (2008) argue that, in traditional methodologies, the management style is based on command and control, with highly bureaucratic and formalized organizational structures, while in agile methodologies the management style must be collaborative and the structure of the organization is organic. chagas (2015) reports that collaboration in agile methodologies takes place between team members and the customer.
in agile methodologies, the project is divided into small cycles, called iterations, which are planned and specified according to the client and based on the team's development capacity (pmi and agile alliance, 2017). this negotiation is based on the communication and collaboration the team has while executing the development tasks. a process of communication and collaboration between members of the agile team in the iteration planning and the execution of development tasks positively impacts the project's success (chagas et al., 2015). unlike traditional approaches, in agile methodologies, the team has the autonomy to assign and change the responsibility for performing the tasks (karhatsu et al., 2010; chagas, 2015; pmi and agile alliance, 2017; noll et al., 2017). jia et al. (2016) argue that the decision-making behavior of each individual will influence the behaviors of other teammates and the project outcome. for example, each member makes a decision about effort estimation and gives user story points under these conditions; different individual decision-making behaviors will generate different results, which are pertinent to the success or failure of the project (jia et al., 2016).

dutra and santos (2020) investigated difficulties associated with organizational climate assessments. the authors identified pitfalls in (i) the non-assessment of behaviors and factors specific to the development of an organizational climate in agile teams, and (ii) not explicitly considering agile roles and other organizational structure management functions. the authors argue that the items of assessment instruments should be detailed enough to allow respondents to think about the organizational culture and better characterize the agile behaviors depicted (dutra and santos, 2020).

2.2 organizational climate in agile teams

there are several studies on the organizational climate in software development teams (soomro et al., 2016). however, many of these studies do not report characteristics of the software development process considered in the evaluated teams. in addition, the studies measured the climate using generic instruments applied in different business domains, without considering values, principles, or specific practices of development teams. our literature review identified three studies (acuña et al., 2008; grobelna and stefan, 2019; vishnubhotla et al., 2020) that investigated the organizational climate of agile teams using climate survey instruments.

acuña et al. (2008) investigated whether the climate of software development teams has any relationship with the quality of the software product. the authors used the tci© (team climate inventory) instrument (anderson and west, 1998) to assess the climate. the experimental study was carried out with 105 students allocated in 35 teams. all teams used an adaptation of the extreme programming (xp) method to develop the same software. the authors found that the climatic preferences of the team's vision and their perception of participatory security were significantly correlated with better software. according to the authors, it is important to track the organizational climate of teams as one of many indicators of the quality of the software to be delivered.
grobelna and stefan (2019) investigated how organizational climate factors (e.g., leadership style, autonomy, rewarding, and communication) in agile software development teams affected the regularity of work speed and the teams' efficiency. the authors prepared a questionnaire to measure the organizational climate, but the items created were not disclosed. the results confirmed that the desired organizational climate was based primarily on a positive relationship with the leader and other coworkers, commitment to work, and challenges at work. the authors argue that there are elements indicating that the more the team's organizational climate matches the team's preferences, the greater the regularity of the team's work speed is, and thus the more efficient the team is (grobelna and stefan, 2019).

vishnubhotla et al. (2020) investigated the association between personality traits and the climate in agile software development teams. the study was conducted with 43 members in eight agile teams. the authors used the tci© instrument (anderson and west, 1998) to assess the climate for each dimension (vision, participatory security, support for innovation, and task orientation). the study identified a statistically significant positive correlation between personality (considering the trait openness to experience) and the climate dimension support for innovation. they concluded that the results of the regression analysis suggest that more data may be needed, and that there are other human factors, in addition to personality traits, that should also be investigated in relation to the climate of agile teams.

in summary, the tci© instrument is grounded in a theoretical model to measure the vision, participatory security, support for innovation, and task orientation dimensions (anderson and west, 1998). tci© was used in acuña et al. (2008) and vishnubhotla et al. (2020) to measure factors that influence the innovation capability of software development teams. the tci© dimensions do not measure the dimensions proposed in tact. the questionnaire items elaborated by grobelna and stefan (2019) were not published. regarding the use of generic questionnaires or scales to assess the organizational climate in agile teams, dutra and santos (2020) claim that the use of assessment instruments that do not consider agile values, principles, practices, and roles in a proper context may create difficulties for the analysis of possible causes of problems and the execution of corrective actions within organizational climate management. therefore, there is a need for specific instruments to measure the organizational climate of agile teams in the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions.

figure 1. main steps used to build tact and to execute the case study

3 tact overview

in this section, we present the conception of the instrument to assess the organizational climate of agile teams (tact). instruments for organizational climate assessments measure behaviors, attitudes, or preferences (anderson and west, 1998; patterson et al., 2005). as such, the conception and evaluation of tact are based on psychometric concepts (dima, 2018; patterson et al., 2005; graziotin et al., 2020). the tact design followed specific procedures suggested for elaborating and validating climate scales and other questionnaires in general (graziotin et al., 2020; anderson and west, 1998; bandura, 2006; dybå, 2000; gonzález-romá et al., 2009; recker, 2013; shull et al., 2008).
figure 1 shows the steps followed to define tact and to execute the case study used to evaluate it. the steps involving the definition of constructs, the design of items, the evaluation by specialists, and the pretesting are described in the next subsections. the activities used for data collection in the case study, such as the interview with the process coordinator, the documentation analysis, the survey using tact, the interviews with leaders, and the tam evaluation, are described in section 4.3. the quantitative analysis of the case study is shown in section 5.3.

3.1 conceptual definition of the construct

the first step to define the construct is a literature review (spector, 1992). the researchers should carefully read the literature about the construct, paying attention to the specific details of exactly how the construct has been described (spector, 1992). in the delineation of a construct, it is helpful to base the conceptual and scale development effort on work that already exists. for each tact dimension, we identified (i) conceptual definitions to show a general description of the measured construct, and (ii) operational definitions to understand how the construct can be assessed (delgado-rico et al., 2012; spector, 1992). an operational definition is a description of something in terms of the operations (procedures, actions, or processes) by which it could be observed and measured (vandenbos, 2017). the constructs are presented in appendix a.1.

to start step 1, we identified systematic literature reviews and other relevant sources to provide (i) theoretical and operational definitions for the investigated constructs (i.e., communication, collaboration, leadership, autonomy, decision-making, and client involvement), (ii) human factors and their influences on agile teams, and (iii) factors, models, scales, questionnaires, and items for assessing the climate of software development teams. we identified some systematic literature reviews about human factors that impact agile software development (dybå and dingsøyr, 2008; franca et al., 2011; chagas et al., 2015; vishnubhotla et al., 2018; dutra et al., 2021). the soomro et al. (2016) paper was considered because it identified studies, instruments, and factors used to assess the organizational climate of development teams. pmi and agile alliance (2017) and miller (2020) were used to standardize the names of roles, practices, and artifacts considered in agile development. we used the most influential human factors related to agile software development teams (chagas et al., 2015) to select the tact dimensions investigated in this study. the agile manifesto (beck et al., 2001) was also used in this step.

the identified literature was used (i) to make the conceptual and operational definitions of the constructs (delgado-rico et al., 2012; spector, 1992) and (ii) to capture examples of behaviors, attitudes, climate instruments, and practices and their influences. for example, a) dybå and dingsøyr (2008) showed that "the planning game activity was found to have a positive effect on collaboration within the company", b) karhatsu et al. (2010) reported that "communication and collaboration are at the heart of agile software development. as the agile manifesto states, individuals and interactions over processes and tools and customer collaboration over contract negotiation. one aspect in communication and collaboration is customer cooperation", and c) through soomro et al.
(2016), we identified some items (açıkgöz et al., 2014) that could be adapted to measure collaboration.

3.2 design/adaptation/selection of items

step 2 aims to propose the items that will be used to assess each dimension, adapted to the population's culture. thus, the constructs (appendix a.1) identified in step 1, the identified systematic reviews, and other relevant literature were considered. some items or questionnaires and examples of behaviors identified in the previous step had to be adapted to agile roles, practices, or artifacts. pmi and agile alliance (2017) and miller (2020) were considered as references to identify the main roles and essential activities in agile software development projects.

after reading the selected works, we started creating tact. for each considered dimension, namely communication, collaboration, leadership, autonomy, decision-making, and client involvement, evaluation items were selected, adapted, or created. some items from scales without any copyright were selected and translated to portuguese, e.g., "it13. team members work together as a whole", used in açikgöz (2017) to assess collaboration between software development team members. in other cases, only the role of the person exercising the action was altered. for example, the original item "my direct supervisor listened to my ideas and concerns", proposed in sharma and gupta (2012), was changed to item "it20. the team facilitator listens to my ideas and concerns". new items were also proposed to assess the organizational climate specific to agile teams. for this purpose, critical factors and/or items were selected, and the descriptions were adapted to the roles and activities performed by agile teams. for example, to assess the communication dimension, we defined the item "the team and the product owner always reach consensus on the priority of the user stories by negotiating which bug to fix or functionality to add". this item was based on the team climate factor described in nianfang ji and jie wang (2012), "supervisors and staff communication and agreement their tasks, including what to do, to what degree, and how to do?", and the description presented by chagas (2015) for the communication factor, "frequent communication can be used to prioritize features, set focus on bug-fixing or include more functionality".

on completion of step 2, 49 items had been established: 9 items to measure communication, 8 items for collaboration, 10 items for leadership, 7 items for autonomy, 8 items for decision-making, and 7 items for the client involvement dimension. the items included in the initial version of tact are shown in appendix a.2. tact also comprises a dashboard, which is shown in section 5.

3.3 evaluation by specialists

at the beginning of step 3, the tact items were analyzed by two specialists in agile software development. for each item, two questions were considered: "can it be inferred that the presented item represents a behavior related to agile software development teams?" and "do you suggest any adaptation to the item description?". the first specialist has 10 years of experience in using such methods and 5 years as a consultant focused on the agile transformation of organizations and teams. the second specialist is a process coordinator at a large company. she has 14 years of experience in software process improvement and 4 years as the person responsible for defining and monitoring changes in agile processes. every tact item was considered related to agile software development teams.
two researchers, co-authors of this work, discussed all comments and suggestions made by the specialists. after that, some adaptations in the item descriptions were made. for example, in it08, the proposed description "the team and the product owner always agree (...)" was altered to "the team and product owner always reach consensus (...)".

3.4 pretesting

google sheets was used as a tool to develop tact. it mainly contains the form for conducting the climate survey and a dashboard with the frequency results by item and dimension (figure 2). the items proposed in appendix a.2 are measured using a 5-point likert scale (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, and 5 = strongly agree). in tact, the organizational climate of the team is classified as positive (values 5 and 4), neutral (value 3), or negative (values 2 and 1).

to begin step 4, a pretesting was performed with 3 developers to identify possible problems of interpretation of the tact items and layout. in the end, the developers reported no difficulties in answering the survey. the authors implemented a layout suggestion presented in this step. to continue the preliminary assessment of tact, a case study (yin, 2013) was performed and is described in the next section.

4 case study planning and execution

runeson and höst (2009) claim that case studies in software engineering aim to investigate a contemporary phenomenon in a real context for understanding how and why software engineering activities should be carried out. they also argue that improving the software process and the resulting products with the acquired knowledge is possible. the authors also highlight the main characteristics of a case study, namely: 1) their conclusions must reflect a clear chain of evidence, whether qualitative or quantitative, collected from various sources in a planned and consistent manner; and 2) they must add to the existing body of knowledge, based on established theory, if any, or build such theory. thus, the case study described below is proposed as a method of evaluating both the case addressed and the tact instrument (yin, 2013).

4.1 research questions

the study aims to evaluate tact preliminarily. to achieve this aim, the research questions (rq) are defined as follows:

• rq1. how is the organizational climate in the examined agile teams?
– rq1.1. how did working from home affect the organizational climate of the teams for the analyzed dimensions?
• rq2. how do leaders perceive tact?
• rq3. which are the most influential items in each dimension for the analyzed case?

during the planning and execution of the study, teams previously allocated in the same physical environment were working from home due to the covid-19 pandemic (davis et al., 2020). to investigate whether this fact could have impacted the organizational climate of the studied teams, we defined rq1.1.

4.2 description of the organization and teams

the organization analyzed in the study is a large brazilian bank with millions of customers. it has dozens of development teams, composed of employees and outsourced collaborators. each team defines its software development process and can choose traditional (structured and rup) or agile (scrum, kanban, xp) methods, among others defined by the organization. each team has the freedom to define the scenario and artifacts to be developed as long as this is officially reported to the process sector.
regarding leadership, some teams use the role of scrum master, but in others, this role is played by the hierarchical leader of the team. when present, the role of the coach facilitates the understanding and dissemination of good agile practices by the teams. during this time of working from home, the monitoring of the teams by the agile leader occurs through the ceremonies that continue to be performed, the monitoring of task execution, and meetings and interactions using microsoft teams and corporate skype resources. even with the change in the work routine, it was reported that tasks continue to be delivered within the established deadlines and with the required quality. three teams, named a, b, and c, were selected by convenience to participate in the case study. the teams have employees from the organization as well as outsourced members.

4.3 data collection

for the data collection, we used interviews, document analysis, and the application of tact. data collection took place between january 2020 and march 2021. the first data collection procedure was an interview with a process coordinator of the organization. the objective was to understand how the company assessed the organizational climate, which difficulties were faced when assessing the organizational climate of agile software development teams, what the development process was like, and what the composition of agile teams was like. the second procedure was to analyze the executive reports with the results of the last two organizational climate assessments. it is noteworthy that the assessment performed by the organization is biennial and does not consider the team in which employees are allocated, only the employees and their superintendence department. for this reason, it is not possible to understand the climate of individual teams.

the third procedure was the assessment of the organizational climate in the teams through tact. all team members were invited to participate voluntarily and anonymously in the study. the organizational climate survey was applied in three cycles called pulses. table 1 shows the dimensions applied to each team by pulse and the number of participants of each team in each pulse. pulse 1 was executed in june 2020 for teams a and b. pulse 2 was executed in february 2021 for team c. lastly, pulse 3 was executed in march 2021, and all teams participated. the numbers in the columns team a, team b, and team c represent the size of each team at the moment each pulse was executed. in the period between pulse 1 and pulse 2, some members from teams a and b were allocated to other teams due to the conclusion of the product module.

table 1. measurement cycles

pulse  dimensions                                     date    team a  team b  team c
1      communication, collaboration, leadership       jun/20  13      10      -
2      communication, collaboration, leadership       feb/21  -       -       4
3      autonomy, decision-making, client involvement  mar/21  9       5       4

in addition to the items present in appendix a.2, open-ended questions were introduced: "regarding the examined dimensions (communication, collaboration, leadership, autonomy, decision-making, and client involvement), what are the main challenges for your team at this time working from home?" and "do you have anything to add about your team's organizational climate?". in addition, at the beginning of the instrument, we included a description with the definition of the organizational climate and the objective of the assessment.
next, we presented a consent form to comply with ethical principles, in which we informed that participation would be anonymous and voluntary and that the participant could abandon the assessment at any time without penalties. the fourth procedure was the execution of semi-structured interviews with the leaders of the respective teams. these interviews were designed to present the results of the climate assessment and to capture the leaders' perception of tact and of the team's organizational climate. to do this, they were asked questions such as: "how do you evaluate the results, by dimension, of the organizational climate assessment carried out by the team? do the results by dimension represent your perception of the team's daily life? in your opinion, was there any result that surprised you? do you believe that the items used represent expected behaviors in agility (mindset, values, principles, and practices)? otherwise, explain why the item does not represent expected behavior". at the end of the interview, we sent a link to the leader to evaluate tact through tam (technology acceptance model) (venkatesh and davis, 2000; venkatesh and bala, 2008). the tam dimensions of intention to use, perceived usefulness, and output quality were used (venkatesh and davis, 2000; venkatesh and bala, 2008). in the interviews, we used a consent form to present and assure ethical aspects.

5 case study results

this section presents the results of the organizational climate assessment, thus answering the research questions.

5.1 how is the organizational climate in the examined agile teams? (rq1)

teams were allowed to answer the survey for 8 days in each pulse. we checked the data and calculations performed by tact. in total, 22 team members participated in pulse 1: 12 out of 13 (i.e., 92.31% of members) from team a and 10 (100%) from team b. in pulse 2, 3 out of 4 (75%) members from team c answered the survey. in the last pulse, 4 out of 9 (44%) members from team a, 5 (100%) members from team b, and 4 (100%) members from team c participated in the study.

table 2 shows the frequency for each investigated dimension. the "dimension" column contains the description of the dimension. for each team, the absolute frequencies (count for each value assigned by the members) and the relative frequencies (percentage in parentheses) were calculated according to the aforementioned likert scale. in table 2, we chose to count the values "strongly agree" and "agree" in the column "positive", and "strongly disagree" and "disagree" in the column "negative". finally, we consider the frequency of "neutral" to categorize the organizational climate as neutral. figure 2 shows the tact dashboard, which is used to present the results of the climate assessment. the climate is classified as positive, neutral, or negative to facilitate the analysis of results by team members, leaders, and others involved.

when analyzing the results in table 2, higher frequencies can be observed in the "positive" column for teams b and c in all dimensions. considering that the 49 items represent good behaviors expected from the main roles existing in an agile team, it is possible to classify the organizational climate of teams b and c as positive or favorable for all dimensions. in team a, the organizational climate can be classified as i) positive for the communication, collaboration, and leadership dimensions, and ii) negative for the autonomy, decision-making, and client involvement dimensions.
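as an illustration, the following r sketch shows one way the likert answers can be mapped onto the three climate categories and tabulated as in table 2. it assumes the form answers were exported to a data frame with one column per item (values 1 to 5); the function, object, and item names are ours, not the authors' actual spreadsheet logic.

    # likert values 1-5 are mapped onto the three climate categories used by tact:
    # 1-2 negative, 3 neutral, 4-5 positive
    classify <- function(x) {
      cut(x, breaks = c(0, 2, 3, 5),
          labels = c("negative", "neutral", "positive"))
    }

    # pools the answers given to the items of one dimension and tabulates
    # counts and percentages, as reported in table 2
    climate_frequencies <- function(responses, items) {
      answers <- unlist(responses[, items])
      counts <- table(classify(answers))
      data.frame(category = names(counts),
                 count = as.vector(counts),
                 percent = round(100 * as.vector(counts) / sum(counts)))
    }

    # hypothetical usage: communication items it01-it09 for one team's pulse
    # climate_frequencies(team_a_pulse1, sprintf("it%02d", 1:9))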
table 2 shows that teams b and c presented a positive climate superior to that of team a in all dimensions. for example, the positive frequency of the communication dimension was 82 (91%) for team b, 20 (74%) for team c, and 62 (58%) for team a. neutral and negative results represent points of attention for an analysis of possible causes and impacts on the involved roles, elements of the process, the development project, or the team's culture in general.

table 2. results of the organizational climate assessment for teams a, b and c

                     team a                           team b                           team c
dimension            negative  neutral   positive    negative  neutral   positive    negative  neutral   positive
communication        23 (21%)  23 (21%)  62 (58%)    6 (7%)    2 (2%)    82 (91%)    1 (4%)    6 (22%)   20 (74%)
collaboration        10 (10%)  20 (20%)  66 (70%)    0 (0%)    2 (3%)    78 (97%)    0 (0%)    2 (7%)    22 (93%)
leadership           27 (23%)  29 (24%)  64 (54%)    0 (0%)    13 (13%)  87 (87%)    0 (0%)    5 (17%)   25 (83%)
autonomy             12 (43%)  9 (32%)   7 (25%)     2 (6%)    2 (6%)    31 (88%)    0 (0%)    2 (7%)    26 (93%)
decision-making      9 (28%)   16 (50%)  7 (22%)     0 (0%)    4 (10%)   36 (90%)    0 (0%)    3 (9%)    29 (91%)
client involvement   6 (22%)   11 (39%)  11 (39%)    0 (0%)    2 (6%)    33 (94%)    0 (0%)    0 (0%)    28 (100%)

figure 2. part of tact's dashboard (pulse 1: team a results)

5.1.1 analysis of organizational climate from team a

among the assessed teams, team a showed more items evaluated as negative and neutral (see table 2). thereby, the organizational climate can be considered negative for the autonomy, decision-making, and client involvement dimensions. however, we observed (i) positive evaluations in the items referring to the interaction between the team members, and (ii) negative and neutral evaluations in the interactions that involve the product owner and the leader. some points of attention were clarified in the open-ended questions and in the interview with the leader.

in response to the question about the challenges for the communication dimension at this time of working from home, a member of team a said that "virtual rooms, when poorly managed, end up providing a space for inopportune conversations". this statement was also corroborated by the leader of team b, "they think they talk too much, lose focus a little bit", mentioning the feedback obtained from the team at the previous day's daily meeting. these comments can be associated with item "it04. team members frequently talk about club, entertainment, gym, parties, sports, and films". for item "it07. in the current project, the daily meeting allows to know project problems and team difficulties", the leader of team a admitted the negative result: "the team decided not to hold the daily meeting during the period of working from home anymore, the difficulties are addressed by whatsapp and the virtual room at microsoft teams". in addition, the leader of team a agreed with the team, noting the negative result for item "it02. the team keeps the list of impediments, risks, and control actions updated": "many times i have to register the impediments myself, they don't do it". in pulse 3, the item "it39. my team has open and effective communication" had all neutral assessments (4; 100%), reflecting a change in the team's climate for the communication dimension. team a showed a greater positive climate in relation to the collaboration between the members themselves, for example, in items "it10. team members consider sharing know-how with each other" and "it12.
my team works efficiently together in the face of difficulties". however, when collaboration involves the product owner and the leader, the points of attention in item "it17. in the current project, the team, the product owner, and the team facilitator work excellently together to plan the iteration" deserve to be stressed. with the analysis of the open question "do you have anything to add about your team's organizational climate?", it was possible to identify potential causes for the negative assessment of item it17. the members reported that "after the coordination change occurred, there was some distancing between the po and the team" and "the team leader does not play her role". this assessment of a negative climate was repeated in pulse 3. during the interview, in the analysis of it17, the leader of team a stated that "the team often wants to impose on the po what they think should be implemented in the product, they feel like they own the product". the leader also pointed out that "the employee designated as po cannot develop stories at team speed. often, the product owner cannot approve a sprint with the business customer, as customers have other priorities, which compromises the next sprint planning".

regarding the autonomy and decision-making dimensions for iteration planning, the leader reported: "sometimes there are demands that override all planning. we lived this recently, every time an unplanned demand arrived that passed over all other demands. this hinders the planning team's autonomy". the leader also declared: "these past few months have been hard, a little stressful. most of the demands were out of planning". these comments were made by the leader in the analysis of items "it34. my team has the decision authority and responsibility to plan the iteration" and "it35. my team has time to plan the changes without excessive stress or pressure".

the climate can be characterized as negative for the decision-making dimension, considering 9 (28%) items assessed as negative and 16 (50%) items as neutral. the item "it41. the dependencies between the tasks do not hinder the fluidity of the project and do not cause major restrictions" obtained 75% negative evaluations. about decision-making, a member of team a reported: "the decision-making process is still not very participatory". in the analysis of item it41, the leader of team a stated: "the dependencies between the tasks are getting in the way. demands have a number of tasks that impact. if the po does not approve the changes, this creates a configuration and change problem. in the company, if you put a demand into production and do not validate it with the po, the infrastructure team rolls back the demand". with respect to client involvement, a member of team a described: "business representatives fail to fulfill their role during homologation, impacting the delivery in production of not only that specific demand but also many others, as they depend on the implementation of the first demand".

5.1.2 analysis of organizational climate from team b

in team b, more than 90% of the items were positively evaluated. however, some items were evaluated as negative or neutral, thus representing points of attention. about the communication dimension, a team member reported: "communication continues to flow very well, keeping productivity high and positive".
another report showed the good climate for collaboration between team members: "when there is some difficulty in identifying an error in the tests, we share the screen, we make audio calls, we include other team members, whom we know have some more specific experience at that point, in the conversations". the team b leader did not obtain any negative evaluations, only 3 neutral ones in the item "it25. the team facilitator gives the team helpful feedback on how to be more agile". several praises for the performance of the agile leader during the period of working from home were registered in response to the open questions. these reports include "... his work remained close and very positive", "... considering different points of view", and "... moving together, even at a time of working from home".

regarding the autonomy dimension, the team leader said: "the team autonomy is very good. the members are participatory. in the team, there is no expression 'this is my task, or this is not my problem'". a member of team b wrote: "team members have always been autonomous about their tasks within each user story developed". concerning the decision-making dimension, a member of team b reported dissatisfaction: "the main challenges are when the team's decisions come up against approval from other areas". analyzing the item "it35. my team has time to plan the changes without excessive stress or pressure", the leader reported: "in the last few months, we had several po changes in the projects. before, the po was from it; now, by determination of the company, the po is from the business. the new po does not 'walk' with the team. she does not feel part of the team. she did not want to be a po. as the po was not planning, the team had to plan it".

the problems reported by the team b leader may have influenced the two neutral evaluations (2; 6%, see table 2) recorded in the client involvement dimension. the items "it44. in the current project, there are frequent meetings with business representatives and the team" and "it47. the current project does not have frequent requirement changes due to bad user stories definition" had neutral evaluations. analyzing items it44 and it47, the team b leader reported: "many times the team had to prioritize and refine the user stories without the participation of the po. after planning, she made several changes to the user stories and the acceptance criteria".

5.1.3 analysis of organizational climate from team c

regarding the communication and collaboration dimensions, the leader of team c said: "the team is new. they have only 3 months in this project. they already knew each other. we have an excellent interaction. i do not know the team personally. what gets in the way are limitations of the tool (microsoft teams) because they do not have full access. but the collaboration between them is excellent". regarding the all-neutral (3; 100%) assessments of the item "it05. during the retrospectives, the team finds the best way to do things", the leader reported: "we still have not managed to do the retrospective meetings formally, the team is new. the team started by resolving only incidents. we talked, but not formally at a ceremony". regarding the 3 neutral assessments involving the iteration planning items "it34. my team has decision authority and responsibility to plan the iteration" and "it35.
my team has time to plan the changes without excessive stress or pressure", the leader said: "they have autonomy. in the current project, they managed to negotiate changes in user stories. they had the autonomy to adjust the planning". about the pressure on team c, the leader commented: "it should also be considered that the product under development has a fixed date (which cannot be changed) to be launched. the product impacts millions of bank customers".

analyzing the decision-making dimension, a member of team c wrote: "decision-making is shared between the outsourced members, the members of the company, and the business representatives. we can all contribute with equal weight. working from home facilitated the engagement and collaboration between these 3 roles". on the autonomy dimension, another member wrote: "the autonomy limits are agreed with the client". despite the 100% positive evaluations of the client involvement dimension, one member reported that the product owner was not allocated in the same virtual environment: "in some moments, communication with the management area is not so synchronous, as we do not have access to the same communication tool (microsoft teams), but the continuous meetings in this same tool make it easier to exchange information and questions".

the 100% positive assessment of the team in the client involvement dimension did not surprise the leader. the leader declared: "the managers praise the team a lot. in these last weeks, the managers have stayed together for up to 4 hours doing the backlog refining. i have never seen such engagement. in this project, there are many stakeholders involved. at this time of working from home, they are available to answer questions over the phone. now, we are currently holding 1-hour refinement meetings. the report used at the demonstration meeting containing the evidence was highly praised by the po. the po said: 'i never got a homologation script with evidence that did not have errors'".

5.1.4 how did working from home affect the organizational climate of the teams for the analyzed dimensions? (rq1.1)

team members reported some challenges that could have impacted the organizational climate as they adapted to the period of working from home. the challenges mentioned were: difficulty with communication tools; infrastructure problems; difficulties in reaching the support team; managing inopportune conversations in virtual rooms; absence of the facilitator at the ceremonies; a customer contract that hinders the action of the facilitator; and other challenges already present before the period of working from home.

regarding the communication dimension, members of team a reported that "working from home actually facilitated team communication" and that there has been "improved contact while working from home, we communicate more". in team b, the statement "our team is managing to maintain a good dialogue to clarify project issues" stands out. the challenges identified for this dimension mention the network infrastructure and supporting software. in relation to the collaboration dimension, the challenges captured in the open-ended questions point to team b's collaboration difficulties with the external support team: "there have been challenges, some of which required the involvement of the support team". one member reported a preference for working in person with the team: "...
but i believe that being in the same physical space, help, and assistance would sometimes flow better". another member stated that "the challenges are the same as they were before working from home". regarding the leadership dimension, no issues were noticed in the performance of the leaders of teams b and c. on the other hand, members of team a reported the absence of the team's leader in ceremonies and a certain distance from the team's activities. in general, the members' responses did not indicate changes in the organizational climate due to working from home for the dimensions investigated with tact.

5.2 how do leaders perceive tact? (rq2)

during the interviews, for each analyzed dimension, the following question was asked to the leaders: "do you believe that the items used represent expected behaviors in agility (mindset, values, principles, and practices)? otherwise, explain why the item does not represent expected behavior". regarding this question, no item was assessed as not being consistent with agility. in the final stage of the interview, the following questions were asked: "in your opinion, what are the benefits of using this instrument?" and "how can the organizational climate assessment tool be improved?". in relation to the first question, the leader of team a answered: "i found it interesting, you can map out what needs attention... i can notice other things, interesting... it exposes, gives you a view of what is happening. very practical, because we can focus on the point that needs attention". the leader of team b agreed, saying: "i was able to see the positive things and the neutral points in order to try to improve... the visual formatting (graphics) was very clear. i managed to understand the results effortlessly". the leaders did not report any suggestions to improve the instrument.

after the interview, tam (venkatesh and davis, 2000; venkatesh and bala, 2008) was used, through its dimensions, for the leaders to evaluate tact. some items taken into consideration in the assessment were, for example, "assuming i have access to the instrument, i intend to use it", "using the instrument improves my performance in my job", and "the quality of the output i get from the system is high". considering a 7-point likert scale, most of the leaders' responses were the options "somewhat agree" and "strongly agree" for all items of the intention to use, perceived usefulness, and output quality dimensions.

5.3 which are the most influential items in each dimension for the analyzed case? (rq3)

due to the large number of items defined in tact, it is relevant to identify the most important items for this case study, i.e., the most influential items in each dimension. for this purpose, we performed factor analysis (fa). fa is commonly used in software engineering to analyze items that use the likert scale (sharma and gupta, 2012; klünder et al., 2020; graziotin et al., 2020). graziotin et al. (2020) assert that fa allows reducing the dimensionality of the problem space (i.e., reducing factors and/or associated items) and explaining the variance in the observed variables. in the case of analyses intended to assess a single construct, factor analysis helps identify those items that (best) represent the construct we are interested in, so that we can exclude the other items (graziotin et al., 2020). the quantitative results were processed using the r tool (v. 4.0.2), primarily with the psych library (revelle, 2018).
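as a sketch of how such an analysis session might be set up in r, assuming the pulse answers were exported from the google sheets form to a csv file with one column per tact item (the file name is hypothetical, and the grouping of items by dimension is inferred from appendix a.2 and table 3):

    library(psych)  # provides KMO, polychoric, fa.parallel, fa, and alpha

    responses <- read.csv("tact_pulses.csv")  # hypothetical export of the survey answers

    # items grouped by dimension, since all procedures below are run per dimension
    dimensions <- list(
      communication      = sprintf("it%02d", 1:9),
      collaboration      = sprintf("it%02d", 10:17),
      leadership         = sprintf("it%02d", 18:27),
      autonomy           = sprintf("it%02d", 28:34),
      decision_making    = sprintf("it%02d", 35:42),
      client_involvement = sprintf("it%02d", 43:49)
    )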
it should be stressed that these procedures have an initial exploratory purpose and are not conclusive, as the small sample size (n = 25 in pulses 1 and 2; n = 13 in pulse 3), non-randomness, and the data distribution may have interfered with the results (dima, 2018; kyriazos, 2018). the adopted procedures were i) analysis of the frequency of variation of the items and of the correlation matrix, and ii) factor analysis.

in step one, the response frequencies for all items are checked to verify whether the items have enough variation to differentiate respondents. if insufficient variation is identified (i.e., 95% of responses in a single category for an ordinal item), the item needs to be excluded from further analysis (dima, 2018). in this case study, no items needed to be excluded. to continue the analysis of step one, the item correlations (see figure 3) were plotted for an initial visual diagnosis of the items and of the structure of the tact dimensions (dima, 2018). a higher degree of association between items of the same dimension may already be visible in the correlation matrix (figure 3). negative associations between items may indicate the need for reverse item coding, while items with weak associations with the other items may prove to be non-scalable in later stages (dima, 2018). analyzing the spearman correlations (ρ) for the test case (figure 3), we can observe: i) the absence of negative correlations; ii) it04 and it32 with insignificant positive correlations, thus it04 and it32 will be excluded from the next analyses; and iii) in general, high and moderate positive correlations between items in the dimensions. the critical values of ρ are: 0.9 to 1, very high; 0.70 to 0.90, high; 0.51 to 0.70, moderate; 0.31 to 0.5, low; and 0 to 0.3, insignificant (hinkle et al., 2003).

figure 3. correlation matrix of the dimensions

to start the second step, we calculated the kaiser-meyer-olkin (kmo) index. the kmo index is a statistical test that suggests the proportion of variance of the items that may be explained by a latent variable. the kmo values (see table 3) were considered appropriate for the fa in each dimension. the reference values of kmo are: < 0.5, unacceptable; 0.5 to 0.7, mediocre; 0.7 to 0.8, good; and 0.8 to 0.9, excellent (field et al., 2012). as indicated by field et al. (2012), the next analysis was conducted on the polychoric correlation matrix. we used the parallel analysis graph (horn, 1965) to investigate the plausibility of the initial model proposed in tact, i.e., the association of the items with their dimension. figure 4 shows the parallel analysis graph (the x-axis displays the factor number and the y-axis represents the eigenvalue).

figure 4. parallel analysis graph
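the two steps above can be sketched in r for a single dimension as follows; this is a minimal illustration built around the psych functions named in the text, reusing the hypothetical data layout of the previous listing, and not the authors' exact script:

    items <- responses[, dimensions$communication]

    # step one: exclude items without enough variation to differentiate
    # respondents (>= 95% of answers in a single category; dima, 2018)
    low_variation <- sapply(items, function(x) max(table(x)) / length(x) >= 0.95)
    items <- items[, !low_variation]

    # spearman correlations for the initial visual diagnosis (cf. figure 3)
    rho <- cor(items, method = "spearman", use = "pairwise.complete.obs")

    # step two: kaiser-meyer-olkin index of sampling adequacy
    KMO(rho)

    # polychoric correlations, recommended for ordinal likert items (field et al., 2012)
    pc <- polychoric(items)$rho

    # parallel analysis (horn, 1965) to decide how many factors to retain (cf. figure 4)
    fa.parallel(pc, n.obs = nrow(items), fa = "fa")

    # single-factor model; the loadings correspond to the lambda column of table 3
    fa(pc, nfactors = 1, fm = "minres", n.obs = nrow(items))

    # reliability of the dimension: cronbach's alpha (cf. table 3)
    alpha(items)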
table 3. quantitative analysis results

dimension                                  item  λ
communication (kmo = 0.75; α = 0.9)        it03  0.823
                                           it05  0.760
                                           it09  0.703
                                           it01  0.686
                                           it07  0.682
                                           it08  0.606
                                           it06  0.603
                                           it02  0.543
collaboration (kmo = 0.67; α = 0.9)        it12  0.883
                                           it15  0.842
                                           it10  0.800
                                           it13  0.694
                                           it11  0.632
                                           it14  0.622
                                           it17  0.618
                                           it16  0.579
leadership (kmo = 0.85; α = 0.97)          it21  0.858
                                           it19  0.813
                                           it20  0.745
                                           it23  0.739
                                           it27  0.728
                                           it24  0.721
                                           it22  0.698
                                           it26  0.668
                                           it18  0.665
                                           it25  0.585
autonomy (kmo = 0.63; α = 0.95)            it29  0.827
                                           it28  0.702
                                           it33  0.680
                                           it30  0.610
                                           it31  0.553
                                           it34  0.514
decision-making (kmo = 0.7; α = 0.94)      it39  0.827
                                           it36  0.672
                                           it42  0.651
                                           it37  0.625
                                           it40  0.563
                                           it38  0.547
                                           it41  0.491
                                           it35  0.387
client involvement (kmo = 0.74; α = 0.94)  it45  0.792
                                           it46  0.746
                                           it43  0.731
                                           it44  0.731
                                           it49  0.720
                                           it48  0.452
                                           it47  0.337

as per the kaiser criterion, only factors with eigenvalues greater than 1 can be retained (kaiser, 1960). the data simulated by the parallel analysis confirmed the hypothesis of retaining one factor per dimension. as shown in figure 4, all dimensions can be explained by a single factor. the fa was performed separately for each dimension to verify the most significant items. table 3 shows the quantitative results. analyzing the column "dimension" (table 3) and the first line, "communication", it is possible to verify that the items are ordered by significance. the factor loading (λ) (third column of table 3) indicates the correlation of the item with the associated dimension. regarding the small sample size, field et al. (2012) argue that if a factor has four or more loadings greater than 0.6, then it is reliable regardless of sample size.

analyzing the communication dimension (table 3), the items it03 (λ = 0.823) and it05 (λ = 0.760) have the highest factor loadings, and they can be considered the most significant ones. therefore, for effective communication, the team should consider empathic listening (it03) and ensure the necessary discussions on possible decisions agreed during the retrospectives (it05). for the collaboration dimension, items it12 (λ = 0.883) and it15 (λ = 0.842) have the highest factor loadings, and they can be considered the most relevant: it12 represents that the team should work efficiently together to solve problems, and it15 the collaboration for innovation. in the leadership dimension, item it21 stands out: it21 (λ = 0.858) measures the activities of the team leader in discussing the problems and impediments of the team. the facilitator's behavior in protecting the team's autonomy from external interference, it29 (λ = 0.827), has a high correlation with the other items of the autonomy dimension. for effective decision-making, the teams should have open and effective communication, it39 (λ = 0.827). lastly, for the client involvement dimension, item it45 (λ = 0.792) represents the opportunity for stakeholders to suggest changes or improvements to the software.

we calculated the reliability of the tact dimensions (see table 3) using the cronbach's α coefficient (landis and koch, 1977). the α indexes for each dimension are greater than 0.8, which implies that the reliability of tact for this case study is high (landis and koch, 1977).

6 discussion

6.1 case study

we executed a case study to assess tact preliminarily. tact has 49 items to assess the climate dimensions of communication, collaboration, leadership, autonomy, decision-making, and client involvement in agile development teams. the case study was carried out with three teams working at a bank.
the climate assessment took place during a period in which teams that were previously physically allocated together were instead working from home. in addition to the items established in tact, open-ended questions were used to understand the challenges faced by members working from home. in the end, we conducted interviews with the leaders to understand the possible causes or impacts of the evaluated items.

analyzing the frequency of responses attributed to the items by the members, the answers to the open questions, and the data from the interviews, there are signs of a positive organizational climate in teams b and c. on the other hand, there are signs of a negative organizational climate in team a. thereby, negative and neutral frequencies were observed in some items, which can represent points of attention.

communication, collaboration, autonomy, and decision-making are critical human factors in agile software development teams because members use them to plan and execute iterations, besides periodically adjusting the process or the team's behavior (chagas et al., 2015; pmi and agile alliance, 2017). regarding the communication, collaboration, autonomy, and decision-making dimensions, there were positive frequencies for the relationship between the members of each team (e.g., it03, it11, it12, it28, it31, and it38). however, negative and neutral frequencies point to possible difficulties in team a when collaboration, communication, autonomy, and decision-making involve the roles of the product owner (it33 and it49) and the facilitator (it09, it17, and it29), and also agile ceremonies (it08 and it34) and artifacts (it02 and it08). team a abandoned or mischaracterized some agile practices while working from home, for instance, the daily meeting (it07). regarding the artifacts, the team was not communicating some impediments to the leader (it02), and both the product owner and the team were not adapting the requirements to the user story format (it08 and it16), due to the contract with the software factory, which established the requirements in another format for payment estimates. another critical factor identified in team a was the inability of the product owner to establish requirements according to the team's speed and capacity (it17). thus, although the collaboration between team members was classified as positive in team a, the relationship with the product owner and the team facilitator reflected points of attention, which can be observed in the statement from one member: "the agile methodology is being abolished in our team".

as previously stated, leadership is one of the central elements for forming the organizational climate (schneider et al., 2014). the main activities of the servant leader can be summarized as (i) removing team impediments and (ii) facilitating, disseminating, and ensuring the use of agile values, practices, and rules (noll et al., 2017; pmi and agile alliance, 2017). concerning leadership, during the interview, the leader of team a clarified the negative assessment for item it19: "i follow it closely when i am called, when i am needed". in team b, tact captured a closer relationship between the leader and the team. however, when the leader of team b analyzed the neutral points of it19, she made the following statement: "i have not been able to dedicate myself, to be the scrum master that i was [before working from home].
the agility factor has been the greatest challenge, solving impediments faster. i need to do things that i still have not managed to".

the challenges captured in several reports did not point out new insights about working from home for the dimensions investigated. concerning the challenges, the team members reported: "the challenges are the same as those that existed before working from home", "there are no new problems in working from home. they [the challenges] existed before", and "the current moment of working from home has not brought any new challenges so far (...)". it is worth noting that, according to the report by the process coordinator, the quality and performance indicators are the same as before working from home began. supporting the report of the process coordinator, serrador et al. (2018) claim that it is often argued that teams allocated in the same physical space have better performance and greater success in the project. however, the authors also did not identify a significant difference between local and remote teams in their study on the climate for the success of development projects (serrador et al., 2018).

6.2 preliminary evaluation of tact

the literature recommends implementing a pilot study for the initial assessment of instruments that measure behaviors, attitudes, or feelings (dybå, 2000; patterson et al., 2005; shahzad et al., 2017; recker, 2013). the pilot must utilize a sample with the same characteristics as the target population (anderson and west, 1998; dybå, 2000; shahzad et al., 2017; patterson et al., 2005; recker, 2013). for tact, we decided to carry out the preliminary assessment through a case study because we wanted to capture the perception of the teams' climate from different data sources. the results and analysis presented in the previous sections established a chain of evidence that allows us to infer that tact can capture the context of the organizational climate experienced in the teams.

in the evaluation by specialists (see section 3.3), every tact item was considered to be related to agile software development teams. this assessment is already evidence of the content validity of tact. in the qualitative analysis (see section 5.2), the leaders confirmed that the items represent agile values, principles, and practices. through the dimensions of tam (venkatesh and davis, 2000; venkatesh and bala, 2008), leaders rated tact positively for intention to use, perceived usefulness, and output quality.

in the quantitative analysis, the correlation matrix (figure 3) revealed high and moderate positive correlations between most of the items of each dimension. only the items it04 and it32 showed an insignificant correlation with the other items of their dimension. thus, we excluded it04 and it32 from the factor analysis. development teams that talk about the subject of it04 reported a positive emotion, contributing to the group's optimism (licorish and macdonell, 2014). however, when team a analyzed the results, its members understood that talking about these issues would be a negative behavior. this misinterpretation may have been caused by the description of the item "it04. team members frequently talk about club (...)". regarding it04, the leader of team c said: "perhaps the word 'frequently' caused the misunderstanding". considering the quantitative analysis and the reports, we excluded it04 from tact. we have not captured reports of misinterpretation of item it32. its low correlation may be related to the sample and not to the construct. thus, we opted to keep it32 in tact.
factor analysis allowed us, based on the response patterns, to verify the proposed structure of tact, i.e., the items associated with each dimension. the parallel analysis graph (figure 4) indicated that a single factor can explain each dimension. furthermore, most tact items have high (> 0.6) factor loadings (see table 3). therefore, there is initial empirical evidence that the structure proposed in tact is acceptable. the quantitative analysis revealed high reliability of the tact dimensions (see table 3). the α-cronbach indexes for each dimension are α > 0.8, which implies that the reliability of tact for this case study is high (landis and koch, 1977).

6.3 tact use recommendations

wagner et al. (2020) recommend that software engineering research should either adopt or develop psychometrically validated questionnaires. we extend that recommendation to companies that carry out organizational climate assessments. validating a climate instrument without selling intent is challenging because it is necessary to find companies or persons willing to invest their time answering a questionnaire without anything in return. we highlight that all evidence of validity and reliability is conditioned on the date this research was conducted, i.e., the more investigations are executed using tact, the more evidence of validity and reliability there will be. thereby, the tact items can be used by researchers who want to measure the proposed constructs or investigate other possible factor structures.

the organizational climate is measured through manifested behaviors or feelings perceived by the employees. climate instruments are self-reports: only the team member knows how he or she is feeling. although many factors can skew a team member's views, when several individuals point in the same direction, a point of investigation is revealed. for example, an item with too many negative ratings might indicate a lack of practice, a specific problem, or a misunderstanding about the agile mindset. therefore, climate instruments only allow for a pre-diagnosis of what must be investigated and dealt with in the later stages of the organizational climate management process.

organizational climate instruments measure latent variables (those that are not directly observable). the tact items represent examples of good behaviors or practices widely used in agile software development teams. therefore, team leaders, managers, or those responsible for preparing and conducting organizational climate assessments can use the tact items for a more accurate diagnosis. if a specific item has many neutral or negative evaluations, an investigation point is revealed. for example, the assessment of item "it35. my team has time to plan the changes without excessive stress or pressure" shows how the team member feels (stressed/pressured) and suggests what project activities or situations (such as iteration planning, task estimation, or abusive or unrealistic deadlines given by the po or a manager) might be the cause of that feeling. notice that the terms in the item description (for example, plan, stress, and pressure in it35) allow team members to reflect on how they are feeling about day-to-day events.

to create every tact item description, we used generic nomenclature for the roles and practices used in hybrid and agile processes. scrum is the most used agile methodology (digital.ai, 2020).
6.3 tact use recommendations
wagner et al. (2020) recommend that software engineering research should either adopt or develop psychometrically validated questionnaires. we extend that recommendation to companies that carry out organizational climate assessments. validating a climate instrument without commercial intent is challenging because it is necessary to find companies or people willing to invest their time in answering a questionnaire without a counterpart. we highlight that all evidence of validity and reliability is conditioned to the date this research was conducted, i.e., the more investigations are executed using tact, the more evidence of validity and reliability there will be. thereby, the tact items can be used by researchers who want to measure the proposed constructs or investigate other possible factor structures. the organizational climate is measured through behaviors manifested or feelings perceived by the employees. climate instruments are self-reports: only the team member knows how they are feeling. although many factors can skew a team member's view, when several individuals point in the same direction, a point of investigation is revealed. for example, an item with too many negative ratings might indicate a lack of practice, a specific problem, or a misunderstanding about the agile mindset. therefore, climate instruments only allow for a pre-diagnosis of what must be investigated and dealt with in the later stages of the organizational climate management process. organizational climate instruments measure latent variables (those that are not directly observable). the tact items represent examples of good behaviors or practices widely used in agile software development teams. therefore, team leaders, managers, or whoever is responsible for preparing and conducting organizational climate assessments can use the tact items for a more accurate diagnosis. if a specific item has many neutral or negative evaluations, an investigation point is revealed. for example, the assessment of item "it35. my team has time to plan the changes without excessive stress or pressure" shows how the team member feels (stressed/pressured) and suggests which project activities or situations (such as iteration planning, task estimation, or abusive or unrealistic deadlines given by the po or manager) might be the cause of that feeling. notice that the terms in the item description (for example, plan, stress, and pressure in it35) allow team members to reflect on how they are feeling about day-to-day events. to create the description of every tact item, we used generic nomenclatures for the roles and practices used in hybrid and agile processes. scrum is the most used agile methodology (digital.ai, 2020); however, we do not use the names of the roles or ceremonies from scrum, e.g., we use team facilitator, iteration, and review meeting instead of scrum master, sprint, and sprint retrospective. by doing that, we expect to reach more teams using different process configurations. still, if tact is used by a team whose software development process has other names for roles or ceremonies, or does not have a specific role, the team members can misunderstand the items. to address this limitation and threat, at the beginning of the climate survey we show the vocabulary of terms used in tact compared with the scrum terms. regarding the number of items and the time interval between applications of tact, based on a previous study (dutra and santos, 2020) we claim that using many items and having a long time interval in the organizational climate survey of agile teams can hinder the assessment, the diagnosis, and the establishment of climate management actions. having too many items in climate surveys and lacking control activities can also demotivate team members from participating in new climate surveys. in that regard, we recommend that practitioners adopt one or two dimensions per cycle, performing several cycles per year. however, more critical than measuring the organizational climate is involving the team in discussions of possible actions that allow a climate change. a simple open-ended question that can help team engagement in climate management is "how to improve your team's organizational climate?".

7 limitations and threats to validity
the research procedures used in this study are adequate to build an organizational climate instrument, but we faced some limitations. the main one concerns the small sample size. as mentioned in section 5.3, the quantitative procedures have an initial exploratory purpose. due to the small sample size, the use of factor analysis (fa) is not possible without segregating the data; therefore, we conducted fa for each tact dimension. in future studies (see section 8.1), we will perform exploratory and confirmatory fa. in pulse 3, only 4 out of 9 team a members answered the survey. the number of participants can hinder the assessment of team a's organizational climate because the four respondents may share the same perspective of the team's organizational climate, while the members of team a who did not participate in pulse 3 might have another perspective. the interview with team a's leader helped us confirm the results and deal with this limitation. recker (2013) proposes some principles for evaluating qualitative and quantitative studies. concerning reliability, a contextual description of the organization was presented, as well as direct quotes from team members and leaders, which were considered to support the analysis. thus, it is possible to guarantee that individuals other than the researchers, when considering the same observations or data, will reach the same or similar conclusions (recker, 2013). from a quantitative point of view, an investigation was carried out to assess the study's reliability using descriptive statistics, correlations, and cronbach's α coefficient. thus, the reliability of the tact dimensions for the case study sample is high. to address possible threats to internal validity, we decided to use multiple sources of evidence. the team members assessed the organizational climate through the tact items and open-ended questions. in addition, the leaders' perceptions were captured through interviews.
in this way, a chain of evidence was established, and the review of the evaluation results was assured (which also relates to measurement validity). regarding tact, two auditors experienced in agile methods and the leaders in the study assessed whether the item descriptions represented elements of agile values, principles, and practices. external validity concerns how much and when the results of a study can be generalized to other cases or domains (recker, 2013). to mitigate this threat, we provide detailed descriptions of the study context. however, schneider et al. (2014) claim that everything that happens in the organization changes its climate. thus, it is not possible to guarantee similar results in another cycle with the same examined teams or even with other teams of the same organization.

8 final considerations
we presented the initial version of tact (an instrument to assess the organizational climate of agile teams), designed to measure the dimensions of communication, collaboration, leadership, autonomy, decision making, and client involvement. we also presented a case study to evaluate tact and measure the organizational climate of three agile teams from the same organization. data collection included tact results, interviews with team leaders, and answers to open-ended questions by the participants. the sample data revealed a positive organizational climate for all dimensions in teams b and c, and a negative one for the autonomy, decision making, and client involvement dimensions in team a. thereby, some items assessed as negative or neutral indicated points of attention. through the open-ended questions and the interviews with leaders, the evaluation carried out through tact was confirmed and the points of attention were better explored. we identified the abandonment of some agile ceremonies, difficulties in planning the iteration, the inability of the product owner to keep up with the speed and capacity of the team, and even the absence of leadership. based on the statistical analysis of the data from assessing the organizational climate, there is initial evidence that the validity and reliability of the tact dimensions are high.

8.1 future works
besides the tact dimensions proposed in the present study, we are investigating new constructs: motivation, trust, learning, and knowledge. other case studies are being executed to assess the climate of the same three teams mentioned in this study and of four other teams from another organization. after finishing the case studies cycle, we will execute a survey to investigate and validate the factorial structure of all tact dimensions. we will use exploratory and confirmatory factor analysis to investigate and confirm the measured dimensions. as a result, the tact dimensions and items will likely be reduced. after conducting the survey, we will have the means to create guidelines for using tact and interpreting its results. we also intend to investigate the influence of gender, team size, and team members' experience with agile methodologies on the organizational climate. moreover, in the future, there might be some value in digging deeper into an investigation of whether the organizational climate of employees and outsourced team members differs.

acknowledgements
we thank unirio (ppq-unirio 01/2019 and 04/2020; ppinst-unirio 05/2020) for their financial support.

references
açikgöz, a. (2017). the mediating role of team collaboration between procedural justice climate and new product development performance. international journal of innovation management, 21(04):1750039.
acuña, s. t., gómez, m., and juristo, n. (2008). towards understanding the relationship between team climate and software quality—a quasi-experimental study. empirical software engineering, 13(4):401–434.
açıkgöz, a. and gunsel, a. (2016). individual creativity and team climate in software development projects: the mediating role of team decision processes. creativity and innovation management, 25(4):445–463.
açıkgöz, a., günsel, a., bayyurt, n., and kuzey, c. (2014). team climate, team cognition, team intuition, and software quality: the moderating role of project complexity. group decision and negotiation, 23(5):1145–1176.
açıkgöz, a. and i̇lhan, ö. (2015). climate and problem solving in software development teams. procedia - social and behavioral sciences, 207:502–511.
ahmed, s., ahmed, s., naseem, a., and razzaq, a. (2017). motivators and demotivators of agile software development: elicitation and analysis. international journal of advanced computer science and applications, 8(12):1–11.
ancona, d. g. and caldwell, d. f. (1992). bridging the boundary: external activity and performance in organizational teams. administrative science quarterly, 37(4):634.
anderson, n. r. and west, m. a. (1998). measuring climate for work group innovation: development and validation of the team climate inventory. journal of organizational behavior, 19(3):235–258.
annosi, m. c., martini, a., brunetta, f., and marchegiani, l. (2020). learning in an agile setting: a multilevel research study on the evolution of organizational routines. journal of business research, 110:554–566.
askarinejadamiri, z. (2016). personality requirements in requirement engineering of web development: a systematic literature review. in 2016 second international conference on web research (icwr), pages 183–188, tehran, iran. ieee.
bandura, a. (2006). summary for policymakers. in intergovernmental panel on climate change, editor, climate change 2013 - the physical science basis, pages 1–30. cambridge university press, cambridge.
beck, k., beedle, m., van bennekum, a., et al. (2001). manifesto for agile software development.
chagas, a. (2015). o impacto dos fatores humanos nos métodos ágeis (in portuguese).
chagas, a., santos, m., santana, c., and vasconcelos, a. (2015). the impact of human factors on agile projects. in 2015 agile conference, pages 87–91, national harbor, md, usa. ieee.
curtis, b., hefley, w. e., and miller, s. a. (2009). people capability maturity model (p-cmm) version 2.0, second edition. technical report, carnegie mellon university.
davis, k. g., kotowski, s. e., daniel, d., gerding, t., naylor, j., and syck, m. (2020). the home office: ergonomic lessons from the "new normal". ergonomics in design: the quarterly of human factors applications, 28(4):4–10.
delgado-rico, e., carretero-dios, h., and ruch, w. (2012). content validity evidences in test development: an applied perspective. international journal of clinical and health psychology.
digital.ai (2020). 14th annual state of agile report. technical report, digital.ai.
dima, a. l. (2018). scale validation in applied health research: tutorial for a 6-step r-based psychometrics protocol. health psychology and behavioral medicine, 6(1):136–161.
dutra, e., diirr, b., and santos, g. (2021). human factors and their influence on software development teams - a tertiary study. in brazilian symposium on software engineering, sbes '21, pages 442–451, new york, ny, usa. association for computing machinery.
dutra, e., lima, p., and santos, g. (2020). an instrument to assess the organizational climate of agile teams - a preliminary study. in 19th brazilian symposium on software quality, pages 1–10, são luis, brazil. acm.
dutra, e. and santos, g. (2020). organisational climate assessments of agile teams - a qualitative multiple case study. iet software, 14(7):861–870.
dutra, j. s., fischer, a. l., nakata, l. e., pereira, j. c. r., and veloso, e. f. r. (2012). the use categories as indicator of organizational climate in brazilian companies. revista de carreiras e pessoas, 2:145–176.
dybå, t. (2000). an instrument for measuring the key factors of success in software process improvement. empirical software engineering, 5:357–390.
dybå, t. and dingsøyr, t. (2008). empirical studies of agile software development: a systematic review. information and software technology, 50(9-10):833–859.
field, a., miles, j., and field, z. (2012). discovering statistics using r. sage publications, london, 1 edition.
frança, a., gouveia, t., santos, p., santana, c., and da silva, f. (2011). motivation in software engineering: a systematic review update. in 15th annual conference on evaluation and assessment in software engineering (ease 2011), pages 154–163, durham, uk. iet.
ganesh, m. p. (2013). climate in software development teams: role of task interdependence and procedural justice. asian academy of management journal.
ganesh, m. p. and gupta, m. (2006). study of virtualness, task interdependence, extra-role performance and team climate in indian software development teams. in proceedings of the 20th australian new zealand academy of management (anzam) conference on management: pragmatism, philosophy, priorities, central queensland university, rockhampton, 20:1–19.
gonzález-romá, v., fortes-ferreira, l., and peiró, j. m. (2009). team climate, climate strength and team performance. a longitudinal study. journal of occupational and organizational psychology, 82(3):511–536.
graziotin, d., lenberg, p., feldt, r., and wagner, s. (2020). psychometrics in behavioral software engineering: a methodological introduction with guidelines. acm trans. softw. eng. methodol., article 111, 49 pages.
grobelna, k. and stefan, t. (2019). the impact of organizational climate on the regularity of work speed of agile software development teams. entrepreneurship and management, 12(1):229–241.
hinkle, d., wiersma, w., and jurs, s. (2003). applied statistics for the behavioural sciences. houghton mifflin, boston, 5 edition.
hoda, r., kruchten, p., noble, j., and marshall, s. (2010). agility in context. acm sigplan notices, 45(10):74–88.
hohl, p., klünder, j., van bennekum, a., lockard, r., gifford, j., münch, j., stupperich, m., and schneider, k. (2018). back to the future: origins and directions of the "agile manifesto" - views of the originators. journal of software engineering research and development, 6(1):15.
horn, j. l. (1965). a rationale and test for the number of factors in factor analysis. psychometrika, 30(2):179–185.
jia, j., zhang, p., and capretz, l. f. (2016). environmental factors influencing individual decision-making behavior in software projects. in proceedings of the 9th international workshop on cooperative and human aspects of software engineering, pages 86–92, new york, ny, usa. acm.
kaiser, h. f. (1960). the application of electronic computers to factor analysis. educational and psychological measurement, 20(1):141–151.
karhatsu, h., ikonen, m., kettunen, p., fagerholm, f., and abrahamsson, p. (2010). building blocks for self-organizing software development teams: a framework model and empirical pilot study. in 2010 2nd international conference on software technology and engineering (icste 2010), proceedings.
kettunen, p. (2014). directing high-performing software teams: proposal of a capability-based assessment instrument approach. in bergsmann, j., editor, lecture notes in business information processing, pages 229–243. springer, cham.
klünder, j., karajic, d., tell, p., karras, o., münkel, c., münch, j., macdonell, s. g., hebig, r., and kuhrmann, m. (2020). determining context factors for hybrid development methods with trained models. in proceedings of the international conference on software and system processes, pages 61–70, new york, ny, usa. acm.
kyriazos, t. a. (2018). applied psychometrics: sample size and sample power considerations in factor analysis (efa, cfa) and sem in general. psychology.
landis, j. r. and koch, g. g. (1977). the measurement of observer agreement for categorical data. biometrics, 33(1):159.
lee, j.-n. (2001). the impact of knowledge sharing, organizational capability and partnership quality on is outsourcing success. information and management, 38(5):323–335.
lenberg, p., feldt, r., and wallgren, l. g. (2015). behavioral software engineering: a definition and systematic literature review. journal of systems and software, 107:15–37.
licorish, s. a. and macdonell, s. g. (2014). understanding the attitudes, knowledge sharing behaviors and task performance of core developers: a longitudinal study. information and software technology, 56(12):1578–1596.
mcavoy, j. and butler, t. (2007). the impact of the abilene paradox on double-loop learning in an agile team. information and software technology, 49(6):552–563.
miller, g. j. (2020). framework for project management in agile projects: a quantitative study.
misra, s. c., kumar, v., and kumar, u. (2009). identifying some important success factors in adopting agile software development practices. journal of systems and software.
moe, n. b., dingsøyr, t., and dybå, t. (2008). understanding self-organizing teams in agile software development. in 19th australian conference on software engineering (aswec 2008), pages 76–85. ieee.
moe, n. b. and dingsøyr, t. (2008). scrum and team effectiveness: theory and practice. in lecture notes in business information processing.
moe, n. b., dingsøyr, t., and øyvind, k. (2009). understanding shared leadership in agile development: a case study. in 2009 42nd hawaii international conference on system sciences, pages 1–10. ieee.
nianfang ji and jie wang (2012). a software project management simulation model based on team climate factors analysis. in 2012 international conference on information management, innovation management and industrial engineering, pages 304–308, sanya, china. ieee.
noll, j., razzak, m. a., bass, j. m., and beecham, s. (2017). a study of the scrum master's role. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), pages 307–323. springer, innsbruck, austria.
patterson, m. g., west, m. a., shackleton, v. j., dawson, j. f., lawthom, r., maitlis, s., robinson, d. l., and wallace, a. m. (2005). validating the organizational climate measure: links to managerial practices, productivity and innovation. journal of organizational behavior, 26(4):379–408.
pmi, p. m. i. and agile alliance, a. a. (2017). agile practice guide. pmi, pennsylvania, 1st edition.
recker, j. (2013). scientific research in information systems. springer berlin heidelberg, berlin, heidelberg.
revelle, w. (2018). how to: use the psych package for factor analysis and data reduction. technical report, northwestern university.
runeson, p. and höst, m. (2009). guidelines for conducting and reporting case study research in software engineering. empirical software engineering, 14(2):131–164.
schneider, b. and barbera, k. m. (2014). summary and conclusion. in schneider, b. and barbera, k. m., editors, the oxford handbook of organizational climate and culture, pages 1–14. oxford university press, new york, ny, usa.
senapathi, m. and srinivasan, a. (2013). sustained agile usage. in proceedings of the 17th international conference on evaluation and assessment in software engineering - ease '13, page 119, new york, new york, usa. acm press.
serrador, p., gemino, a., and horner, b. (2018). creating a climate for project success. journal of modern project management, may/august:38–47.
shahzad, f., xiu, g., and shahbaz, m. (2017). organizational culture and innovation performance in pakistan's software industry. technology in society, 51:66–73.
sharma, a. and gupta, a. (2012). impact of organisational climate and demographics on project specific risks in context to indian software industry. international journal of project management, 30(2):176–187.
shull, f., singer, j., and sjøberg, d. i. (2008). guide to advanced empirical software engineering. springer london, london.
soomro, a. b., salleh, n., mendes, e., grundy, j., burch, g., and nordin, a. (2016). the effect of software engineers' personality traits on team climate and performance: a systematic literature review. information and software technology, 73:52–65.
spector, p. (1992). summated rating scale construction: an introduction. sage publications, inc.
stewart, k. j. and gosain, s. (2006). the moderating role of development stage in free/open source software project performance. software process: improvement and practice, 11(2):177–191.
stone, r. w. and bailey, j. j. (2007). team conflict self-efficacy and outcome expectancy of business students. journal of education for business, 82(5):258–266.
vandenbos, g. r., editor (2017). apa dictionary of psychology.
venkatesh, v. and bala, h. (2008). technology acceptance model 3 and a research agenda on interventions. decision sciences, 39(2):273–315.
venkatesh, v. and davis, f. d. (2000). a theoretical extension of the technology acceptance model: four longitudinal field studies. management science, 46(2):186–204.
vishnubhotla, s. d., mendes, e., and lundberg, l. (2018). an insight into the capabilities of professionals and teams in agile software development. in proceedings of the 2018 7th international conference on software and computer applications - icsca 2018, pages 10–19, new york, new york, usa. acm press.
vishnubhotla, s. d., mendes, e., and lundberg, l. (2020). investigating the relationship between personalities and agile team climate of software professionals in a telecom company. information and software technology, 126:106335.
wagner, s., mendez, d., felderer, m., graziotin, d., and kalinowski, m. (2020). contemporary empirical methods in software engineering. springer international publishing, cham.
yin, r. k. (2013). case study research: design and methods. sage publications, los angeles, 5 edition.
zaineb, g., shaikh, b., and ahsan, a. (2012). recommended cultural and business practices for project based software organization of pakistan for supporting restructuring of functional organization for implementing agile based development framework in software projects. in 2012 international conference on information management, innovation management and industrial engineering, pages 16–20. ieee.

a appendix

a.1 constructs

construct communication
• conceptual definition
– frequent communication between project stakeholders is core to agile software development (chagas et al., 2015; chagas, 2015).
– "the perception of participatory safety could encourage team members to be open in communicating their ideas with the team, which could otherwise be risky" (ganesh, 2013).
– vishnubhotla et al. (2018) reported "the 'insider' voices of scrum practitioners about the soft skills they consider most valued to have by product owner and scrum master. communication skills and teamwork were most valued for both roles. besides them, customer orientation was expressed as important for program managers, whereas commitment, responsibility, interpersonal and planning skills were considered valuable for scrum masters".
– "gap in communication between developer and customer can guarantee the success of the project while in contrast lack of communication skill causes project problems" (askarinejadamiri, 2016).
• operational definition
– communication is a capability for the team member (vishnubhotla et al., 2018).
– communication is an attribute for the team (vishnubhotla et al., 2018).
– the team has formal and informal communication (dybå and dingsøyr, 2008).
– the team discusses the project and impediments (moe and dingsøyr, 2008; pmi and agile alliance, 2017).
– the team discusses how to improve the process and the project (moe and dingsøyr, 2008; pmi and agile alliance, 2017).

construct collaboration
• conceptual definition
– "team collaboration is a set of functions and activities carried out before, during, and after teamwork to achieve team objectives" (açikgöz, 2017).
– "customer collaboration over contract negotiation" (beck et al., 2001).
– "communication and collaboration (c&c) are at the heart of agile software development. as the agile manifesto states, 'individuals and interactions over processes and tools' and 'customer collaboration over contract negotiation'. one aspect in c&c is customer cooperation" (karhatsu et al., 2010).
• operational definition
– team collaboration involves communication and coordination (karhatsu et al., 2010).
– collaboration involves working as a team with i) the client (or their representative), ii) the team, and iii) other stakeholders (açıkgöz et al., 2014; chagas et al., 2015; vishnubhotla et al., 2018).

construct leadership
• conceptual definition
– the leadership (in agile projects) is based on the role of the servant leader (pmi and agile alliance, 2017).
– "team leadership plays a significant role in improving interpersonal and group processes within the team. team leaders who play the role of 'communication integrators' are very crucial for the success of the team. the team leader should also ensure periodically whether the members are clear with the team objectives and understand their level of agreement with those objectives" (ganesh and gupta, 2006).
– "agile software engineering adopts a leadership style that empowers the people involved in the development process" (chagas, 2015).
• operational definition
– leadership is played by a formal role (pmi and agile alliance, 2017; noll et al., 2017).
– the leader facilitates ceremonies, removes impediments, and shields the team from outside interference (pmi and agile alliance, 2017; noll et al., 2017).
– the leader is a "communication integrator" (ganesh and gupta, 2006).

construct autonomy
• conceptual definition
– "the autonomy of a team is defined as the ability to continue to operate in its own way without external interference. the role of formal authority is redesigned, so that governance and coordination appear to be the outcome of actions of networks, operating without any formal sanction" (annosi et al., 2020).
– "autonomy refers to the authority and responsibility that a team has in their work. it is a significant factor for team effectiveness. a team must have a real possibility to influence relevant matters; otherwise self-organization is more symbolic than real. on the other hand, a team should not be left completely alone. instead, while management should give a team substantial freedom, it should maintain subtle control and have regular checkpoints. three levels of autonomy are external, internal, and individual. the external refers to the degree that the people outside of a team influence the team's decisions. moreover, it sets the decision-making boundaries for the team. meanwhile, internal autonomy defines how the work is organized inside the team. the team may have substantial power to make decisions while some individuals have none. great care should be taken to make sure that there really is internal autonomy instead of, for example, team leader autonomy. finally, individual autonomy, on its part, tells how much an individual has freedom to decide about his or her own work processes" (karhatsu et al., 2010).
• operational definition
– individual, internal, and external autonomy (karhatsu et al., 2010).
– the team plans the tasks (karhatsu et al., 2010).
– the leader protects the team (noll et al., 2017; pmi and agile alliance, 2017).
– the team has good communication with the client (moe et al., 2008).

construct decision-making
• conceptual definition
– "responding to change over following a plan" (beck et al., 2001).
– "at regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly" (beck et al., 2001).
– "software development involves interdependent individuals working together to achieve favorable outcomes, so the decision-making behavior of each individual will influence behaviors of other teammates and the project outcome. individuals have many chances to make a decision in a development process. for example, individuals may choose a resolution to deal with a conflict. in agile development, each one makes a decision about effort estimation and gives user story points.
individuals may often independently make 'work' or 'shirk' choices in teamwork. under these conditions, different individual decision-making behaviors will generate different results, which are pertinent to the success or failure of the project" (jia et al., 2016).
– "product development teams quite often experience problems, barriers and setbacks during the new product development project, which require an immediate and effective decision process to generate sufficient courses of action. decision processes refer to team members' collective efforts to process knowledge about key task-related components, emerging issues and problems. individual creativity represents a possible contribution to the teams to deal with these difficulties. moreover, creativity-based decision processes likely allow the teams to become more proactive when dealing with emerging issues. indeed, product development teams have to think outside the box when making decisions, as well as offer practical solutions for problems that can be implemented beyond organizational constraints. such a process is characterized by the ability to understand complexity, to break through prevailing cognitive patterns, and to try new paths when old sets do not work" (açıkgöz and gunsel, 2016).
• operational definition
– task identity and significance (jia et al., 2016).
– the member perceives recognition from management and leadership (jia et al., 2016).
– the team has fast and effective communication (chagas, 2015; chagas et al., 2015).
– the team plans the project without stress or pressure (jia et al., 2016).
– the team shares decision-making (chagas, 2015; chagas et al., 2015).
– the team autonomy influences decision-making (chagas, 2015; chagas et al., 2015).

construct client involvement
• conceptual definition
– having a client focus is one of the main aims of an agile team (karhatsu et al., 2010).
– "customer collaboration over contract negotiation" (beck et al., 2001).
– "agile processes promote sustainable development. the sponsors, developers, and users should be able to maintain a constant pace indefinitely" (beck et al., 2001).
– "lack of client involvement is 'the biggest problem' because agile [requires] fairly strong client involvement" (karhatsu et al., 2010).
– "welcome changing requirements, even late in development. agile processes harness change for the customer's competitive advantage" (beck et al., 2001).
• operational definition
– client satisfaction, collaboration, and commitment are features of client involvement (jia et al., 2016).
– a good relationship with users/clients is a motivating aspect for the team (frança et al., 2011).
– the client (or their representative) provides and elucidates requirements (dybå and dingsøyr, 2008).
– the client (or their representative) validates the software (dybå and dingsøyr, 2008).

a.2 the items of tact by dimension

table 4. items used to measure the communication dimension (item | source)
it01. in this team, we can freely talk to each other about difficulties we are having | stewart and gosain (2006)
it02. the team keeps the list of impediments, risks and control actions updated # | anderson and west (1998); miller (2020); pmi and agile alliance (2017)
it03. my opinion is always listened to by my team | anderson and west (1998)
it04. team members frequently talk about club, entertainment, gym, parties, sports, and films # * | anderson and west (1998); licorish and macdonell (2014); shahzad et al. (2017)
it05. during the retrospectives, the team finds the best way to do things # | chagas et al. (2015); chagas (2015); gonzález-romá et al. (2009)
it06. the team knows the skills and technical expertise of team members, and they use the skills and technical expertise appropriately and adequately # | nianfang ji and jie wang (2012)
it07. in the current project, the daily meeting allows knowing project problems and team difficulties # | chagas et al. (2015); dybå and dingsøyr (2008)
it08. the team and the product owner always reach consensus on the priority of the user stories by negotiating which bug to fix or functionality to add # | chagas (2015); nianfang ji and jie wang (2012)
it09. in the current project, the team and the product owner always solve the disagreements about the iteration scope # | miller (2020); noll et al. (2017)
# represents original items

table 5. items used to measure the collaboration dimension (item | source)
it10. team members consider sharing know-how with each other | lee (2001)
it11. team members always help each other when there is a need | shahzad et al. (2017)
it12. my team works efficiently together in the face of difficulties | açikgöz (2017); shahzad et al. (2017)
it13. team members work together as a whole | anderson and west (1998)
it14. all project-related decisions are applied consistently to affected team members | anderson and west (1998)
it15. the team collaborates to look for new ways to analyze the problems # | patterson et al. (2005); vishnubhotla et al. (2018)
it16. the team has excellent ability to design the software based on user stories # | açıkgöz et al. (2014); pmi and agile alliance (2017)
it17. in the current project, the team, the product owner, and the team facilitator work excellently together to plan the iteration # | dybå and dingsøyr (2008); noll et al. (2017)
# represents original items

table 6. items used to measure the leadership dimension (item | source)
it18. the team facilitator gives me helpful feedback on how to be more effective | sharma and gupta (2012)
it19. the team facilitator eliminates barriers, encourages, and facilitates the use of agile methods # | noll et al. (2017); senapathi and srinivasan (2013)
it20. the team facilitator listens to my ideas and concerns | sharma and gupta (2012)
it21. the team facilitator discusses the problems of the team | açıkgöz and i̇lhan (2015)
it22. the team facilitator protects the team from outside interference | ancona and caldwell (1992)
it23. the team facilitator helps my team to acknowledge and solve our disagreements | stone and bailey (2007)
it24. the team facilitator helps the team understand whether the iteration objectives are clear and whether the team agrees with these objectives # | ganesh and gupta (2006); pmi and agile alliance (2017)
it25. the team facilitator gives the team helpful feedback on how to be more agile # | pmi and agile alliance (2017); sharma and gupta (2012)
it26. the team facilitator is always free to support the team when business requirements conflict with the technical reality # | noll et al. (2017); pmi and agile alliance (2017)
it27. the team facilitator investigates and helps the team to be more effective, taking into account the team velocity and the team capacity # | chagas et al. (2015); miller (2020); noll et al. (2017)
# represents original items

table 7. items used to measure the autonomy dimension (item | source)
it28. in the current project, i am free to choose the tasks i want to execute in the iteration # | karhatsu et al. (2010)
it29. in the current project, the team facilitator protects the team autonomy from external interferences # | karhatsu et al. (2010); moe and dingsøyr (2008)
it30. in this organization, we have the autonomy to suggest changes to the team's software development process # | patterson et al. (2005)
it31. in this team, we switch assignments in tasks to avoid specialization and individualism # | moe and dingsøyr (2008); chagas (2015)
it32. the team has autonomy to adopt technical solutions without consulting the product owner or the management # | patterson et al. (2005)
it33. my team has autonomy to communicate with the product owner and other relevant stakeholders # | moe and dingsøyr (2008); chagas (2015)
it34. my team has decision authority and responsibility to plan the iteration # | karhatsu et al. (2010); pmi and agile alliance (2017)
# represents original items

table 8. items used to measure the decision-making dimension (item | source)
it35. my team has time to plan the changes without excessive stress or pressure # | jia et al. (2016); kettunen (2014)
it36. in my team, members do not need to think equally # | chagas (2015); mcavoy and butler (2007)
it37. in the iteration planning, the team analyzes the technical alternatives and chooses the most appropriate one # | chagas (2015); moe et al. (2009); pmi and agile alliance (2017)
it38. in the retrospective, the team identifies, analyzes and selects improvement items # | jia et al. (2016); pmi and agile alliance (2017)
it39. my team has open and effective communication # | misra et al. (2009)
it40. this organization allows the team to make their own technical decisions about the best way to develop the project # | patterson et al. (2005); chagas (2015)
it41. the dependencies between the tasks do not hinder the fluidity of the project and do not cause major restrictions # | jia et al. (2016); pmi and agile alliance (2017)
it42. in the current project, my work is recognized by management # | jia et al. (2016)
# represents original items

table 9. items used to measure the client involvement dimension (item | source)
it43. during the demo review, the team shows and validates the new functionalities with the right people # | ancona and caldwell (1992); pmi and agile alliance (2017)
it44. in the current project, there are frequent meetings with business representatives and the team | serrador et al. (2018); zaineb et al. (2012)
it45. stakeholders always have the opportunity to suggest changes or improvements to the software # | pmi and agile alliance (2017)
it46. in the demo review, project problems and improvements are identified with stakeholders' participation # | serrador et al. (2018); pmi and agile alliance (2017)
it47. the current project does not have frequent requirement changes due to bad user story definitions # | sharma and gupta (2012); ahmed et al. (2017)
it48. the current project has met or exceeded the client expectations # | misra et al. (2009); ahmed et al. (2017)
it49. the product owner is always available to explain the user stories' details # | hoda et al. (2010); pmi and agile alliance (2017)
# represents original items

journal of software engineering research and development, 2019, 8:3, doi: 10.5753/jserd.2020.473 this work is licensed under a creative commons attribution 4.0 international license.

towards a new template for the specification of requirements in semi-structured natural language

raúl mazo [ lab-sticc, ensta bretagne, brest, france. giditic, universidad eafit, medellín, colombia | raul.mazo@ensta-bretagne.fr ]
carlos andrés jaramillo [ universidad eafit, medellín, colombia | cajaramilg@eafit.edu.co ]
paola vallejo [ giditic, universidad eafit, medellín, colombia | pvallej3@eafit.edu.co ]
jhon harvey medina [ universidad eafit, medellín, colombia | jhmedinaa@eafit.edu.co ]

abstract
requirements engineering is a systematic and disciplined approach for the specification and management of software requirements; one of its objectives is to transform the requirements of the stakeholders into formal specifications in order to analyze and implement a system. these requirements are usually expressed and articulated in natural language because of the universality and ease that natural language offers for communicating them. to facilitate the transformation processes and to improve the quality of the resulting requirements, several authors have proposed templates for writing requirements in structured natural language. however, these templates do not allow writing certain functional requirements, non-functional requirements, and constraints, and they do not adapt correctly to certain types of systems such as self-adaptive, product line-based, and embedded systems. this paper (i) presents evidence of the weaknesses of the template recommended by the ireb® (international requirements engineering board), and (ii) lays the foundations, through certain improvements to the template proposed by the ireb®, for facilitating the work of requirements engineers and therefore improving the quality of the products specified with the new template. this new template was built and evaluated through two action research cycles. in each cycle, we identified the problems of specifying the requirements of the corresponding industrial case with the corresponding baseline template, proposed some improvements to address those problems, and analyzed the results of using the new template to specify the requirements of each case.
thus, the resulting template was able to correctly express all the requirements of both industrial cases. despite the promising results of this new template, it is still preliminary work regarding its coverage and the quality level of the requirements that can be written with it.

keywords: requirement, requirements engineering, natural language, template, application requirement, domain requirement, self-adaptive requirement

1 introduction
requirements are perhaps the most important basis for the construction of software products because, through them, the stakeholders of the system to be implemented can achieve a common understanding of it. according to wiegers and beatty (wiegers and beatty 2013), the two most important objectives when specifying a requirement are that (i) when several people read the requirement, they reach the same interpretation; and (ii) the interpretation of each reader coincides with what the author of the requirement was trying to communicate. in this sense, pohl (pohl 2010) states that nl (natural language) is the most common way to communicate and document the requirements of a system, since nl is universal and available to any individual in any field; besides, it does not require any kind of special training in the interpretation of notations or symbols, as occurs when using an engineering language such as uml (unified modeling language). however, these advantages are overshadowed by the disadvantages of natural language (rupp 2007). according to mavin et al. (mavin et al. 2009), some of the problems susceptible to appear in requirements specifications in nl are (i) ambiguity: a word or phrase has two or more different meanings; (ii) vagueness: lack of precision, structure or detail; (iii) complexity: composite requirements that contain complex sub-clauses or several interrelated statements; (iv) omission: missing requirements, particularly requirements to handle unwanted behavior; (v) duplication: repetition of requirements defining the same need; (vi) verbosity: use of an unnecessary number of words; (vii) implementation: statements of how the system should be built, rather than what the system should do; and (viii) untestability: requirements that cannot be proven (true or false) when the system is implemented. to reduce these problems in the specification of the requirements of a system, several authors have defined what is known as a template, mold, pattern or boilerplate (rupp 2007). a template defines the structure that requirements written in nl should have; that structure is flexible so that the resulting requirements have the advantage of being in nl and the advantage of having a well-defined structure. this nl, bounded by the possibilities and restrictions of the template, is known as semi-structured natural language. notations in semi-structured language make it possible to build requirements by following a template and assigning a similar structure to each requirement. this approach helps to avoid errors in the early stages of the development process by specifying high-quality requirements efficiently in time and cost (sophist 2014). the template proposed by rupp (rupp 2007), also known as master (mustergültige anforderungen - die sophist templates für requirements) (sophist 2014), has been accepted as a standard for the syntactic specification of system requirements. this template has been recognized as a valuable aid so that requirements are more precise and
have a standard syntactic structure that facilitates their understanding (rupp 2007). however, anyone who has used the rupp template in real projects has realized that some requirements cannot be expressed with that structure without some degree of ambiguity or inconsistency. that is the reason this article focuses on investigating the following research question: what are the gaps that requirements engineers find when writing requirements in natural language, and how to fill those gaps? to find an answer to this research question, we have designed an experiment inspired by the action science (or action research) research method (o'brien 2001). two cycles of this method were conducted to analyze the requirements of two independent industrial projects. the first cycle of this action research method was reported in (mazo and jaramillo 2019), and the resulting template was used as input for the second cycle, which was oriented to requirements specifications for self-adaptive systems and produced an improved version of the mazo and jaramillo template, using the relax language (whittle et al. 2009) as a reference in this cycle. thus, with this research we aim to analyze the rupp template in order to (i) evaluate its ability to represent industrial product requirements in a semi-structured way, and (ii) propose possible improvements to the template; from the point of view of two academics and two experienced requirements engineers in the context of two technology-based companies. this paper is an extension of our previous work that appeared at cibse'19 (mazo and jaramillo 2019). in this paper, we significantly extended and improved the conference paper. first, we significantly extended the empirical study by evaluating our approach with one more real industrial project. second, we introduce the implementation of the resulting template in the variamos tool (mazo et al. 2015). finally, we enriched the related work in this version. the work resulting from this research is an adaptable and extensible template for specifying requirements of different domains (application systems, software product lines, cyber-physical systems, self-adaptive systems). in the future, the template will be adapted and improved to address more domains. this article is structured as follows: section 2 explains the rupp template; section 3 describes the research method used for the experiment; section 4 presents, using some examples, the most evident problems identified when using the rupp template; section 5 presents the proposed improved template; section 6 presents the preliminary evaluation of the new template; section 7 presents the threats to validity of our study; section 8 presents other specification templates for individual requirements and some related works; and section 9 finally describes the conclusions and future work related to this research.

2 the syntactic structure of the rupp template
as shown in figure 1, the rupp template consists of six spaces (denoted with the letters a, b, c, d, e and f) to compose the syntax of a requirement. this section briefly describes each space of the template.

figure 1. rupp template.

(a) conditions: the first space is a condition or a set of conditions, usually optional, at the beginning of the requirement. a condition can be logical, composed with the conjunction "if"; or temporal, composed with the conjunction "as soon as" or "after that".
(b) the system: the second space is the name of the system, the subsystem or the component of the system that the requirement specifies.

(c) degree of obligation: the third space establishes the degree of obligation that the requirement can acquire. the template establishes four levels of obligation:
● mandatory requirements, using the verb "shall"
● recommended requirements, using the verb "should"
● future requirements, using the modal verb "will"
● desirable requirements, using the verb "may"

(d) functional activity: the fourth space characterizes the functional activity that the system can assume, which includes the process verb object of the requirement. there are three types of activities:
● autonomous requirement of the system: indicates a functionality that the system performs independently, without the need for interaction with users.
● user interaction: indicates a functionality that the system provides to users.
● interface requirement: indicates a functionality that the system performs to react to events from other systems.

(e) object: the fifth space is the object for which the behavior specified in the requirement is performed.

(f) object details: the sixth and last space corresponds to additional details (optional) about the object, the adjectives that qualify it, or the characteristics that the object can possess.

some examples proposed by rupp (2007) for the specification of requirements with this template are the following: the system should check whether the guest is registered. after the guest has selected the function "place order", the system shall display the menu to the guest. the system shall provide the guest with the ability to place his order. if the chef has rejected the guest's order, the system should ask the guest whether the guest would like to choose another dish. the requirements engineering magazine (https://re-magazine.ireb.org/) presents some industrial cases in which the rupp template was used.
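to make the interplay of the six spaces concrete, the sketch below encodes them in a small data structure and renders one of the examples above. this formalization is ours, added purely for illustration; the class, field, and function names are assumptions, and only the obligation keywords and the sentence order come from the template as described here.

```python
# illustrative sketch of the rupp template spaces (a)-(f); our own
# formalization, not an official implementation of the template.
from dataclasses import dataclass
from typing import Optional

OBLIGATION = {"mandatory": "shall", "recommended": "should",
              "future": "will", "desirable": "may"}  # space (c)

@dataclass
class RuppRequirement:
    system: str                       # (b) system, subsystem or component
    obligation: str                   # (c) key into OBLIGATION
    process_verb: str                 # (d) functional activity
    obj: str                          # (e) object
    condition: Optional[str] = None   # (a) optional logical/temporal condition
    details: Optional[str] = None     # (f) optional object details
    actor: Optional[str] = None       # set only for user-interaction activities

    def render(self) -> str:
        parts = []
        if self.condition:
            parts.append(f"{self.condition},")
        parts.append(f"the {self.system} {OBLIGATION[self.obligation]}")
        if self.actor:  # user interaction: "provide <actor> with the ability to"
            parts.append(f"provide the {self.actor} with the ability to")
        parts.append(f"{self.process_verb} {self.obj}")
        if self.details:
            parts.append(self.details)
        return " ".join(parts) + "."

print(RuppRequirement(
    system="system", obligation="mandatory", process_verb="display",
    obj="the menu", details="to the guest",
    condition='after the guest has selected the function "place order"',
).render())
# -> after the guest has selected the function "place order",
#    the system shall display the menu to the guest.
```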
3 research method
the investigation reported in this paper was carried out through the research method called action research (o'brien 2001). action research is defined as "the intervention in a social situation in order to improve this situation and learn from it" (wieringa and morali 2012) (susman and evered 1978). the action research method aims to improve the practice by solving real problems and is conducted in order to investigate current phenomena in their natural context (koshy et al. 2010). we have chosen this method because it allows us to answer the research question and achieve the objective of this research through an empirical experiment in an industrial context. besides, (i) this research method can be executed at low cost, since researchers play an active role in it; and (ii) the rigor of the action research method allows reducing the threats to the validity of the experiment. susman (susman 1983) developed a detailed model of the action research method with the five stages that must be carried out in each cycle of the process: diagnosing, action planning, taking action, evaluating, and specifying learning. in the diagnosing stage, researchers identify the problem and collect the data required to carry out a detailed diagnosis. the action planning stage aims to define the different possible solutions that address the problem defined in the first step. during the taking action stage, a solution should be chosen and implemented. in the evaluating stage, researchers should analyze the data corresponding to the results of the chosen action plan. finally, during the specifying learning stage, researchers should interpret the results of the execution of the action plan and learn from the success or failure of the solution. the problem is then reevaluated, and a new cycle begins until the problem is solved and the stakeholders are satisfied with the obtained result. to answer the research question, we carried out two cycles of the action research method, as presented in figure 2. in this experiment, each cycle corresponds to the analysis of a form of specification of the requirements of two industrial projects. the experiment was carried out as follows. in the first cycle, we analyzed the requirements specification of the peopleqa system of the sqa s.a. company. peopleqa is a system for human resource management, which facilitates the self-management of employees in different corporate activities such as permissions, vacations, performance measurement, and internal relations. through the peopleqa system, we proposed the first version of the new template to specify requirements in semi-structured nl. in this cycle, three possible solutions were analyzed: prose-style requirements specification (as the stakeholders expressed them), specification using the rupp template, and requirements specification using an improved version of the rupp template that we call the mazo & jaramillo template. in the second cycle, we analyzed the requirements specification of the yuke-greenhouse system of the koral company; yuke-greenhouse is a self-adaptive system for controlling irrigation, temperature, and environment in greenhouses and coffee crops in colombia. in the second cycle, three possible solutions were analyzed: prose-style requirements specification, specification using the mazo & jaramillo template, and requirements specification using the new improved template presented in this paper.

figure 2. research process

in each cycle, the following stages were executed:
1. diagnosing: some problems were identified when using prose style and the rupp template to write the requirements of the first case, and when using prose style and the mazo & jaramillo template to write the requirements of the second industrial case. this stage was conducted through several mini-cycles of requirements specification in order to identify the problems associated with this activity and to collect the information needed to create the new template proposed in each cycle, and to achieve a systematic response to the research question.
2. action planning: requirements templates proposed by other authors were considered. in each cycle, we evaluated whether the improved template (resulting from each cycle) remained consistent with those templates: besides the rupp template, we considered templates such as ears (mavin et al. 2009), adv-ears (majumdar et al. 2011a) (majumdar et al. 2011b) and iso/iec/ieee 29148-2011 (iso/iec/ieee 2011). to ensure that the improved template produced in the second cycle remained consistent with the considered templates, we planned and executed the following strategy: at the beginning of each cycle of requirements writing, the templates found in the literature (not all of them were found from the first cycle) were used as inspiration artifacts so as to incorporate their relevant elements in the new template produced in each cycle.
thanks to this strategy, it was possible to improve our baseline templates (i.e., the rupp template in the first cycle and the mazo & jaramillo template in the second cycle) in the situations where these templates were not adequate.
3. taking action: in this stage, we first considered the requirements that could not be fully specified using the reference templates of each cycle. for these requirements, we evaluated to what extent they could be syntactically specified using the templates found during stage 2. we performed this evaluation in order to find reproducible requirements specification patterns. every time a reproducible pattern was identified in at least three requirements with similar conditions, this pattern was added to the new template proposed in each cycle in order to enrich it.
4. evaluating: at the end of each cycle, we evaluated whether the proposed template allowed specifying at least 98% of the requirements of the industrial case corresponding to the current cycle. the main criterion to evaluate the representation of requirements is that they do not present problems of ambiguity, vagueness, complexity, omission, duplication, verbosity, implementation, and untestability. mavin et al. (mavin et al. 2009) and rupp (rupp 2007) give a more detailed description of these criteria, which are considered a de facto standard in requirements engineering.
5. specifying learning: at the end of each cycle, the authors interpreted the results obtained. then, based on these results, they determined the strengths and limitations of the improved template produced in each cycle.
the various phases and the succession of cycles are collaborative, since the research process and objective have been carried out in collaboration between the authors. this is another characteristic that led us to choose action research as the research method for this work. the research process consists of two cycles, one for each industrial case we had at our disposal. although two cases are not enough to propose a generic set of extensions to the rupp template, the second case provides supplementary evidence that allowed us to re-evaluate and improve the template we reported in the previous version of this article. the use of new real cases to evaluate an engineering artifact in its early stages is welcome and usual in empirical research processes such as the one reported in this article. we therefore hope that this new template will be evaluated in many more cycles, with new and varied industrial cases that help to collectively build the re template that the industry requires.

4 problems identified in the baseline templates

4.1 first cycle
the prose-style requirements specification corresponding to the peopleqa system of the sqa s.a. company was rewritten with five requirements specification templates, as presented in figure 2. the use of each template corresponds to a micro-cycle within the first cycle of the action research process. at the end of these micro-cycles, we produced the first version of the mazo & jaramillo template, which was then evaluated and improved in the subsequent two stages of the first cycle. the problems and gaps detected when working with the templates considered in these micro-cycles are described below. these problems and gaps were recorded in a document, available online (http://shorturl.at/cpdeo), which contains each of the requirements of the industrial case and each of the problems encountered during the investigation.
the various phases and the succession of cycles are collaborative, since the research process and its objective have been carried out in collaboration between the authors. this is another characteristic that led us to choose action research as the research method for this work. the research process consists of two cycles, one for each industrial case we had at our disposal. although two cases are not enough to propose a generic set of extensions for the rupp template, the second case provides supplementary evidence that allowed us to re-evaluate and improve the template we reported in the previous version of the article. the use of new real cases to evaluate an engineering artifact in its early stages is welcome and usual in empirical research processes such as the one reported in this article. we therefore hope that this new template will be evaluated in many more cycles, with new and varied industrial cases that help to collectively build the re template that the industry requires.

4 problems identified in the baseline templates

4.1 first cycle

the prose style requirements specification corresponding to the peopleqa system of the sqa s.a. company was rewritten with five requirements specification templates, as presented in figure 2. the use of each template corresponds to a micro-cycle within the first cycle of the action research process. at the end of these micro-cycles, we produced the first version of the mazo & jaramillo template, which was then evaluated and improved in the subsequent two stages of the first cycle. the problems and gaps detected when working with the templates considered in these micro-cycles are described below. these problems and gaps were recorded in a document available online (requirements specification, 1st cycle: http://shorturl.at/cpdeo), which contains each of the requirements of the industrial case and each of the problems encountered during the investigation. in particular, the first sheet presents the requirements in prose style; the second sheet presents the requirements using the rupp template; the third sheet summarizes the problems identified when using the rupp template; and the fourth and last sheet presents the requirements specified with the constructs borrowed from other templates found in the literature. for each of these types of problems, we have defined a descriptive name, a brief description and an example to better understand the problem.

missing reasons

sometimes it is necessary to express the reason for a requirement. for example, in agile development frameworks, one of the most important aspects in the specification of requirements through user stories is to specify the "why" or the "for what" of the requirement (cohn 2004; beck 1999). this gives a better context to whoever implements the functionality or behavior described by the requirement, and allows them to better understand the level of importance or priority of the requirement. for example, the requirements "the vms (vital monitoring system) must have the ability to interact with other devices of nearby people to know their vital activity" and "if any sensor exceeds the defined tolerable limits, the home automation system must light a siren to warn the homeowner" have a "for what" of vital importance, because both requirements belong to critical systems. if a requirements specification template allows defining the reason for the requirements, developers can easily understand it, because it is explicitly stated how important it is to implement those requirements with high quality levels.

omission of quantities and ranges

sometimes the requirements refer not only to a specific object but to several objects or a range of objects of the same nature. some of the analyzed templates (e.g., the rupp template) do not explicitly allow the possibility of specifying ranges or quantities of objects in the requirements. as presented in the following example, the omission of an amount would have led to ambiguities or inaccuracies: "the point of sale subsystem must provide the pos administrator with the ability to link between one and a maximum of 10 warehouses at a point of sale."

omission of biconditionals

some requirements describe behaviors that are performed only if certain conditions are met; otherwise, the behavior cannot be performed. we call this a biconditional, to express that behavior a is performed "if and only if" condition b is fulfilled, and vice versa. for example, in the requirement "the point of sale subsystem must show the boxes if and only if they are in the active state", the "show the boxes" behavior will be performed only for objects that are in a certain state, and not for all objects within the domain. here there is an explicit condition on the behavior that the requirement expresses through the process verb "show", using the conditional "if and only if". consider another example of a requirement: "after the vacuum has been turned on, the ivaccum system should start the cleaning cycle if and only if the vacuum's battery charge is 90% or more." in this case, the behavior of the object depends on a condition on the charge of the battery. as can be seen, these types of conditions are common when specifying requirements in industrial cases; however, some of the analyzed templates (e.g., the rupp template) do not explicitly provide a way to express this kind of specification.
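as a rough illustration of how such gaps can be detected mechanically, the sketch below flags two of the constructs just described (biconditionals and quantity ranges) in a requirement written in prose. the regular expressions are simplified assumptions based on the examples above, not a grammar taken from the paper.

import re

# illustrative lint rules: each rule names a construct that a template
# must support in order to specify the requirement without loss.
RULES = {
    "biconditional": re.compile(r"\bif and only if\b", re.IGNORECASE),
    "quantity range": re.compile(r"\bbetween\s+\w+\s+and\b", re.IGNORECASE),
}

def flag_constructs(requirement: str):
    """report which problematic constructs appear in the requirement."""
    return [name for name, rx in RULES.items() if rx.search(requirement)]

req = ("the point of sale subsystem must provide the pos administrator with "
       "the ability to link between one and a maximum of 10 warehouses at a "
       "point of sale.")
print(flag_constructs(req))  # ['quantity range']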
gap in conditionals

requirements behaviors are conditioned by different factors, which imply different interpretations depending on these conditions. for example, a requirement that specifies "while the temperature control is on, the system must balance the ambient temperature" can have a different interpretation from the requirement "if the temperature control is on, the system must balance the ambient temperature", and both can be differentiated from a requirement that specifies "as soon as the temperature control is turned on, the system must balance the ambient temperature". in all three cases, although a similar condition is used, the interpretation is different. in the rupp template, only two types of conditionals are used: logical conditionals and temporal conditionals (rupp 2007). however, we found other types of conditions in the other templates, for example, for behaviors that are triggered by events and for behaviors that take place while the system is in a certain state.

lack of verifiability of non-functional requirements

some of the templates analyzed in the first cycle were created to specify functional requirements. thus, the lack of an explicit structure for writing the measurable and finite factors that define the satisfaction level of non-functional requirements and restrictions was a recurrent weakness of the templates analyzed during the first cycle. for example, the two quality requirements "the system should be available 7x24x364 for users" and "the performance of the system must be optimal, trying to respond to users in less than two seconds" have a measurable and finite factor that determines whether the requirement will or will not satisfy the need of the interested parties.

lack of reference to external systems or devices

in case the type of system activity is an interface requirement, the syntactic structure of some of the analyzed templates does not explicitly refer to external systems or devices. for example, the requirements "the point of sale subsystem must be able to read bar codes on item labels" and "the system should be able to obtain the information of a client" follow the syntactic structure proposed by the rupp template; however, neither of these requirements mentions the name of the system or device with which information is exchanged, nor is it established whether the information goes to or from that device or system.

lack of concepts to write domain requirements

in some cases, the requirements do not refer to one product but to several products of the same family (mazo 2018a). product lines are based on the concept of variability management to specify, design and intensively develop the products of the same family in a prescribed manner. although some of the analyzed templates can be used to specify requirements with different priority levels, they cannot be used to specify their variability. for example, in the requirements "the product line of virtual stores must calculate the vat value of each purchase" and "the product line of virtual stores could calculate the vat value of each purchase" two levels of priority are specified, but the variability of the requirements is not considered: it is not stated whether the requirement holds for all products of the product line (mandatory for all the products) or only for some of them (optional).
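to illustrate the distinction between priority and variability, here is a small python sketch (our illustration, not part of the template) in which the two dimensions are modeled independently, so that, for example, an essential requirement can still be optional across the products of the line:

from dataclasses import dataclass
from enum import Enum

class Priority(Enum):      # degree of priority (modal verb)
    ESSENTIAL = "shall"
    RECOMMENDED = "should"
    DESIRABLE = "could"

class Variability(Enum):   # product-line scope, orthogonal to priority
    MANDATORY = "all systems of the"   # holds for every product
    OPTIONAL = "some systems of the"   # holds only for some products

@dataclass
class DomainRequirement:
    product_line: str
    priority: Priority
    variability: Variability
    behavior: str

    def render(self) -> str:
        return (f"{self.variability.value} {self.product_line} "
                f"{self.priority.value} {self.behavior}")

r = DomainRequirement("virtual stores product line", Priority.ESSENTIAL,
                      Variability.OPTIONAL,
                      "calculate the vat value of each purchase")
print(r.render())
# some systems of the virtual stores product line shall calculate the vat
# value of each purchase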
4.2 second cycle

the requirements specification of the yuke-greenhouse case written in prose style was rewritten with two templates. the first template used in this second cycle is the one produced in the first cycle, and the second one corresponds to the relax language (whittle et al. 2009). each rewriting of the requirements of the yuke-greenhouse case with those two artifacts corresponds to a micro-cycle within the second cycle of the action research process, as presented in figure 2. at the end of these two micro-cycles, we produced the new template, which was then evaluated and improved in the subsequent two stages of the second cycle. the problems and gaps detected when working with the artifacts considered in these micro-cycles are described below and are available online (requirements specification, 2nd cycle: http://shorturl.at/cpdeo). for each of these types of problems, we have defined a descriptive name, a brief description and an example to better understand the problem.

lack of concepts to write requirements for self-adaptive systems

self-adaptive systems have the ability to autonomously modify their behavior at runtime in response to environmental and changing system conditions. self-adaptation is particularly necessary for applications that must be executed continuously, even in adverse conditions and with changing requirements (whittle et al. 2009). in general, self-adaptive systems include automotive systems, telecommunication systems, environmental monitoring, and smart home systems. the main problem faced by requirements engineers is that the typical behaviors of this type of system can vary due to environmental uncertainty, caused by multiple factors such as weather, sensor failures, unexpected conditions, and the variability of data, among others.

inability to manage uncertainty

uncertainty is one of the characteristics of self-adaptive systems; therefore, this type of requirement must ensure that the system meets the needs of the stakeholders while at the same time adapting to the conditions of the environment. thus, the satisfaction of these requirements should be defined at some level on a continuous scale given by a fuzzy function (jureta et al. 2015). the mazo & jaramillo template does not consider uncertainty for self-adaptive requirements. for example, a requirement that specifies "if the ambient temperature rises above 25 degrees, then the self-adaptive system oktupus must raise the temperature level to 30°" establishes an invariant restriction (whittle et al. 2009) that makes it difficult to adapt the system to certain environment variables.

lack of specificity in temporality

self-adaptive systems use timing functions and frequencies to adapt themselves to the environment. handling these aspects is also a weakness of the mazo & jaramillo template. let us consider the following requirement: "the oktupus self-adaptive system must measure the temperature of the room every hour." in this case, it would be desirable to be able to relax the requirement so that the measurement period can better adapt to changing conditions. this would imply that the system would be able to measure the temperature not only every hour but also every time there is a major change in the system.
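to give an intuition of what a fuzzy satisfaction function for this requirement could look like, the sketch below replaces the crisp "every hour" with a degree of satisfaction that decays as the measurement period drifts from the one-hour target. the triangular shape and the 30-minute tolerance are our assumptions for illustration; they are not operators of the relax language.

# a triangular membership function over the measurement period, in minutes.
def satisfaction(period_min: float, target: float = 60.0,
                 tolerance: float = 30.0) -> float:
    """degree in [0, 1] to which a period satisfies the relaxed
    requirement: 1.0 exactly at the one-hour target, linearly decaying
    to 0.0 once the deviation reaches the tolerance."""
    deviation = abs(period_min - target)
    return max(0.0, 1.0 - deviation / tolerance)

# crisp reading: only 60 minutes is acceptable. fuzzy reading: nearby
# periods still satisfy the requirement to some degree.
for period in (60, 70, 80, 90):
    print(period, round(satisfaction(period), 2))
# 60 1.0 / 70 0.67 / 80 0.33 / 90 0.0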
5 proposing a new requirements specification template

considering each of the problems encountered during the execution of the action research method and exemplified in section 4, we improved the rupp template (rupp 2007) and subsequently the mazo & jaramillo template (mazo and jaramillo 2019). the mazo & jaramillo template (cf. figure 3) was created as a result of the first action research cycle and is composed of eight spaces. each space was structured with a simple and robust syntactic specification in mind, so as to cover most types of requirements in several types of systems. the yellow rectangles represent conditionals; the gray rectangles are used to represent the family of systems, the system or a part of it; the orange rectangles represent the degree of obligation; the green rectangles are the activities characterizing the system; the blue rectangles represent the objects (nouns), with their respective quantities and complements; and the purple rectangle describes the measurable criterion of verification of the requirement. the latter is optional, and for that reason it is represented with a dotted line.

figure 3. mazo & jaramillo template.

the improvement made to the mazo & jaramillo template is inspired by concepts from other related works found in the literature, e.g., the ears template (mavin et al. 2009), which establishes a set of syntactic rules for the specification of requirements through the use of conditional clauses that trigger functional behaviors, and the relax requirements language (whittle et al. 2009), which incorporates various types of operators to address the uncertainty in the behavior of a self-adaptive system. thus, the new requirements specification template proposed in this paper is presented in figure 4; it is the result of the second action research cycle, which follows the first research cycle reported at the cibse conference (mazo and jaramillo 2019). templates for user requirements specifications, such as connextra for writing user stories (davies 2001), were not considered in this article because our template is oriented to the specification of system and software requirements, while user stories are oriented to the stakeholders (wiegers and beatty 2013). templates oriented to user requirements specification are beyond the scope of this article. in the remainder of this section, we describe each of the components of the resulting template at the end of the two action research cycles.

5.1 conditions under which a behavior occurs

some requirements do not describe continuous behaviors, but behaviors that are performed or provided only under certain conditions, for example logical or temporal, as shown below.

a. requirements with logical conditions. they are used for describing behaviors that are triggered only when a logical condition is met (rupp 2007) or when an unexpected event occurs (mavin et al. 2009). the form is:

if <condition>, then (all|some systems of the <product line>)|(the <system>) shall|should|could <system activity>

for example: "if the number of products in a warehouse reaches the defined minimum limit, then the inventory subsystem should generate a product replacement alert for that warehouse."

b. requirements guided by the state. they are used for describing behavior that must be performed while the system is in a specific state. this condition was proposed by mavin et al. (2009). the form of this specification is:

while|during <state>, (all|some systems of the <product line>)|(the <system>) shall|should|could <system activity>

for example: "while the payment of an invoice from a customer has not been confirmed, the subsystem must send a daily text message to the cell phone number registered by the customer."

c. requirements with optional elements. they are used for describing behavior that must be performed only if a particular characteristic is included (mavin et al. 2009).
the form of this specification is:

in case <feature> is included, (all|some systems of the <product line>)|(the <system>) shall|should|could <system activity>

this condition is especially useful in domain requirements, when you want to incorporate certain requirements depending on the characteristics provided by the product line. for example: "in case the text entry action is included, all systems of the test automation framework product line shall provide the tester with the ability to enter a specific text in a form field."

d. requirements with temporal conditions. they are used for describing behavior that must occur after another behavior occurs; that is, the behaviors occur sequentially: behavior a is done after behavior b. this condition was proposed by rupp (2007). the form is:

after|before|as soon as <event>, (all|some systems of the <product line>)|(the <system>) shall|should|could <system activity>

after means that the system must have completed a running behavior before initiating another behavior. before means that the system must initiate a behavior before another behavior takes place. as soon as means that the system does not necessarily have to have finished a running behavior before initiating another behavior. for example: "after reading the products for a particular location, the inventory subsystem should provide the warehouse owner with the ability to close the product count for that location."

e. requirements with complex conditions. for requirements with more complex conditional clauses, it can be necessary to add keywords such as when, while and where. these keywords can be integrated into more complex expressions to specify richer behaviors of the system (mavin et al. 2009), as expressed in the following example: "when a cash settlement operation is performed on a cash register, while the box is temporarily closed, the point of sale subsystem should show the amount of cash that is in the box." conditional clauses can also be structured using the boolean operators and and or, combined with not (rupp 2014). for example: "if a location contains products and the option to delete a location has been selected, then the inventory subsystem should display an alert message indicating that the selected location cannot be deleted."

figure 4. new template for the specification of requirements in semi-structured natural language.

the requirements guided by the state and the requirements with optional characteristics were taken from ears (mavin et al. 2009), and the requirements with logical and temporal conditions were taken from the rupp template (rupp 2007).
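the four conditional openings above are regular enough that they can be recognized mechanically. the following python sketch encodes them as simplified regular expressions (our approximation, not the authors' grammar) and classifies a requirement by the type of its condition:

import re

# one pattern per conditional opening of section 5.1 (logical, state,
# optional element, temporal); the bodies of the clauses are not parsed.
CONDITIONAL_FORMS = [
    ("logical",  re.compile(r"^if\b.+?\bthen\b", re.IGNORECASE)),
    ("state",    re.compile(r"^(while|during)\b", re.IGNORECASE)),
    ("optional", re.compile(r"^in case\b.+?\bis included\b", re.IGNORECASE)),
    ("temporal", re.compile(r"^(after|before|as soon as)\b", re.IGNORECASE)),
]

def classify_condition(requirement: str) -> str:
    """return the type of the conditional clause opening the requirement."""
    for name, rx in CONDITIONAL_FORMS:
        if rx.search(requirement):
            return name
    return "unconditional"

print(classify_condition("while the payment of an invoice from a customer "
                         "has not been confirmed, the subsystem must send "
                         "a daily text message."))   # state
print(classify_condition("after reading the products for a particular "
                         "location, the inventory subsystem should close "
                         "the product count."))      # temporal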
5.2 family of systems, systems or parts of a system

this space in the template is reserved for the name of the product line, system, subsystem or system component. in the case of a product line requirement, it must be specified whether the requirement is valid for all or only for some systems. we completed the second space of the rupp template (cf. b space in figure 1) with the possibility of specifying product line requirements, since that template was not well adapted to writing them in semi-structured nl. the structure of the second space of the new template is as follows:

all|some systems of the <product line>

in some cases, we must consider certain behaviors that some systems of the product line must incorporate only if certain conditions or restrictions are met. when this happens, we use the expression:

those systems of the <product line> that <condition>

some examples of product line requirements, using the improved template, are: "in case the action of comparing text is included, those systems of the automation framework product line that only include the option to enter text shall provide the tester with the ability to configure a text for comparison with another element." "if the automation framework is web-based, all systems of the test automation product line shall provide the tester with the ability to select the type of browser where the test will be run (be it chrome, firefox or safari)."

5.3 the degree of priority

in the rupp template, this space (cf. c space in figure 1) is traditionally reserved for specifying the degree of obligation of the requirement; however, we changed the "obligation" concept to the "priority" concept in order not to confuse it with the "mandatory" concept of product lines. to define the priority of the requirements we have used the moscow technique (clegg and barker 1994), in which three degrees of priority are established: essential, recommended and desirable.

a. essential requirements. these requirements must be implemented to achieve the success of the product or the product line. the word shall is used.

b. recommended requirements. these requirements are important, but not necessary to achieve the success of the product or the product line. the word should is used.

c. desirable requirements. these requirements are desirable, but not necessary; they could improve the user experience and customer satisfaction. the word could is used.

some examples of requirements with differentiation of the degree of priority, using the improved template, are: "all systems of the test automation product line shall incorporate a click action." "if a motion sensor is activated, then the oktupus system should send an instant image to the home owner's email."
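as a small usage illustration, the sketch below composes a requirement from the spaces described so far (an optional condition, the system space, the priority degree and the activity), mapping each moscow degree to its modal verb. the function and parameter names are ours, chosen only for this example.

# illustrative composition of the template spaces seen so far.
MODAL_BY_PRIORITY = {
    "essential": "shall",      # must be implemented for product success
    "recommended": "should",   # important, but not necessary
    "desirable": "could",      # improves experience and satisfaction
}

def build_requirement(system: str, priority: str, activity: str,
                      condition: str = "") -> str:
    """compose spaces: [condition] + system + modal verb + activity."""
    modal = MODAL_BY_PRIORITY[priority]
    head = f"{condition} " if condition else ""
    return f"{head}{system} {modal} {activity}."

print(build_requirement(
    system="the oktupus system",
    priority="recommended",
    activity="send an instant image to the home owner's email",
    condition="if a motion sensor is activated, then"))
# if a motion sensor is activated, then the oktupus system should send an
# instant image to the home owner's email.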
5.4 the activity

the fourth space, as in the rupp template (d space in figure 1), specifies the characterization of the activity that is carried out by the system or by the systems of the corresponding line. there are three types of activities that can be performed:

a. autonomous activity. in this kind of activity there is no user involved, which means that the (sub)system or systems initiate and execute the behavior autonomously. the form of this type of activity is:

(all|some systems of the <product line>)|(the <system>) shall|should|could <system activity>

b. user interaction. in this activity, the (sub)system or systems provide a user with the ability to use certain behavior that is initiated or stimulated by a user (actor) that interacts with the system(s). the form of this part is:

(all|some systems of the <product line>)|(the <system>) shall|should|could provide <whom> with the ability to <system activity>

where <whom> is the actor or user that should have the ability to use the functionality. the user must be correctly characterized so as not to incur the undue use of nouns without a reference index (rupp 2007); that is, writing simply "the user" would be an error that would lead to an ambiguity in the specification.

c. interface requirement. in this activity, the system performs a behavior dependent on another entity (which can be another system or a physical device). this space was improved in the new template by explicitly adding the name of the external entity with which the system interacts and the direction of the relationship. the form of this type of activity is:

(all|some systems of the <product line>)|(the <system>) shall|should|could be able to <system activity>

in addition, this structure is completed with the entity with which the system interacts:

● if the behavior is executed by an external system that transmits data to the receiving system interface, then the specification is complemented by adding: from <external system or device>

● if the behavior is performed by the system and interacts with or affects another system or external device, then the specification is complemented by adding: towards <external system or device>

an example of an interface requirement is: "the point of sale subsystem shall have the ability to read a valid credit card from a branch's dataphone."

5.5 the object or objects

this space is reserved for the object or objects that make up the system. in the new template, we have incorporated the concept of range, since the objects can be affected in different ranges. the ranges in the new template are specified as follows:

a. single object: one