Journal of Software Engineering Research and Development, 2021, 9:9, doi: 10.5753/jserd.2021.1802. This work is licensed under a Creative Commons Attribution 4.0 International License.

How are test smells treated in the wild? A tale of two empirical studies

Nildo Silva Junior [ Federal University of Bahia | nildo.silva@ufba.br ]
Luana Martins [ Federal University of Bahia | martins.luana@ufba.br ]
Larissa Rocha [ Federal University of Bahia / State Univ. of Feira de Santana | lrsoares@uefs.br ]
Heitor Costa [ Federal University of Lavras | heitor@ufla.br ]
Ivan Machado [ Federal University of Bahia | ivan.machado@ufba.br ]

Abstract

Developing test code may be a time-consuming process that requires much effort and cost, especially when done manually. In addition, during this process, developers and testers are likely to adopt bad design choices, which may lead to introducing the so-called test smells in the test code. As the amount of test code affected by test smells increases, tests might become more complex and, as a consequence, much more challenging to understand and evolve correctly. Therefore, test smells may harm test code quality and maintenance and hinder the software testing activities as a whole. In this context, this study aims to understand whether software testing practitioners unintentionally insert test smells when they implement test code. We first carried out an expert survey to analyze the usage frequency of a set of test smells and then interviews to reach a deeper understanding of how practitioners deal with test smells. Sixty professionals participated in the survey, and fifty professionals participated in the interviews. The yielded results indicate that experienced professionals introduce test smells during their daily programming tasks, even when using their companies' standardized practices. Additionally, tools support test development and quality improvement, but most interviewees are not aware of the concept of test smells.

Keywords: Test Smells, Survey Study, Interview Study, Mixed-Method Research.

1 Introduction

Software projects, both commercial and open-source ones, commonly include a set of automated test suites as one crucial support to verify software quality (Garousi and Felderer, 2016). However, creating test code may require high effort and cost (Wiederseiner et al., 2010; Yusifoğlu et al., 2015; Garousi and Felderer, 2016). Automated test generation tools, such as Randoop (https://randoop.github.io/randoop/), JWalk (http://staffwww.dcs.shef.ac.uk/people/A.Simons/jwalk/), and Evosuite (http://www.evosuite.org/), emerge as alternatives to facilitate and streamline this activity. If designed with high quality, automated testing offers benefits over manual testing, such as repeatability, predictability, and efficient test runs, requiring less effort and cost (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). Therefore, tests should be concise, repeatable, robust, sufficient, necessary, clear, efficient, specific, independent, maintainable, and traceable (Meszaros et al., 2003).

However, the development of well-designed test code is neither straightforward nor a simple task. Developers are usually under time pressure and must deal with constrained budgets, which can stimulate anti-patterns in test code, leading to the occurrence of the so-called test smells. Test smells are indicators of poor implementation solutions and problems in test code design (Greiler et al., 2013).
The presence of test smells may reduce test code quality and, consequently, prevent the test code from reaching its expected capability of finding bugs while remaining understandable, maintainable, and so on (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). The literature reports 196 test smell types classified in the following groups (Garousi and Küçük, 2018): behavior, logic, design-related, issue in test steps, mock and stub-related, association in production code, code-related, and dependencies.

The literature presents studies aimed at identifying and analyzing the effect of test smells on software projects in several aspects (Greiler et al., 2013; Garousi and Felderer, 2016; Van Rompaey et al., 2006). In those studies, the authors introduce test smells as non-functional quality attributes within the Software Test Code Engineering process. In addition, they discussed existing test smell types and their consequences in terms of test code maintenance (Garousi and Felderer, 2016). Some authors attempted to correlate metrics and the presence of test smells (Greiler et al., 2013). However, few discussions about daily practices and programming styles that may contribute to inserting test smells exist in the literature. Understanding the relationship between development practices and the introduction of test smells may support improving the activity of test creation.

This study extends our previous investigation (Silva Junior et al., 2020), which aimed to understand whether software testing practitioners (for simplicity, "practitioners") unintentionally insert test smells. We used an expert survey with sixty practitioners from Brazilian companies to analyze which and how often they adopt practices that might introduce test smells during test creation and execution. In this extension, we sought to understand (i) how much the practitioners know about test smells and (ii) how the practitioners deal with test code quality regarding test smells. For identifying whether and to what extent the practitioners know about test smells and how they deal with them, we interviewed fifty practitioners. The results from both studies are complementary. We found that most of the interviewees did not know anything about the concept of test smells. They commonly used practices that introduced test smells, but they hardly removed them from the test code.

We mapped which daily programming practices would be associated with each test smell for both test creation and execution. Then, we asked the practitioners if they used those practices, without the need to name the test smells. We used the interviews to complement the survey and analyze the practitioners' unit test creation, maintenance, and quality verification activities.
In addition, we investigated the practition­ ers’ knowledge about test smells and how they treat those smells during unit test creation and maintenance. Our study may provide insights to understand how and which practices may introduce test smells in test code. In ad­ dition, we presented the practitioners’ point of view about activities related to unit test code and their beliefs about test smells’ treatment. Thus, we investigated the following re­ search questions: RQ1: Do practitioners use test case design prac­ tices that might lead to the introduction of test smells? We investigated whether bad design choices may be related to test smells. RQ2: Which practices are present in practitioners’ daily activities that lead to introducing test smells? We investigated which test smells are as­ sociated with the most frequent practitioners’ prac­ tices. RQ3: Does the practitioners’ experience interfere with the introduction of test smells? We investi­ gated whether, over time, practitioners improve the activity of test creation. RQ4: How aware of test smells are the practitioners? We investigated the practitioners’ knowledge of test smells. RQ5: What practices have practitioners employed to treat test smells? We investigated how the practi­ tioners deal with test smells in their daily activities. The remainder of this article is structured as follows: Sec­ tion 2 introduces the concept of test smells; Section 3 details the research method applied in this study; Section 4 presents the survey’s design and results; Section 5 presents the inter­ view’s design and results; Section 6 discusses the main find­ ings of this investigation; Section 7 presents the threats to va­ lidity; Section 8 discusses related work, and Section 9 draws concluding remarks. 2 Test Smells Automated tests may generate more efficient results when compared to manually executed ones. Due to their repeata­ bility and non­human interference, automated tests might lead to time and execution effort reductions (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). However, developing test code is not a trivial task, and the automated tools may not en­ sure the system quality because they can generate one poor design (Palomba et al., 2016; Virgínio et al., 2019). In real­ world practice, developers are likely to use anti­patterns dur­ ing test creation and evolution, leading to errors in imple­ menting test code (Van Deursen et al., 2001; Bavota et al., 2012). These anti­patterns may negatively impact test code maintenance (Van Rompaey et al., 2006). Several studies investigated different types of test smells. Initially, Van Deursen et al. (2001) defined a catalog of 11 test smells and refactorings (to remove test smells from the test code). After that, other authors extended this catalog and analyzed the effects of the smells on the production and test code (Van Deursen et al., 2001; Meszaros et al., 2003; Van Rompaey et al., 2006; Bavota et al., 2012; Greiler et al., 2013; Bavota et al., 2015; Garousi and Felderer, 2016; Palomba et al., 2016; Peruma, 2018; Virgínio et al., 2019; Virgínio et al., 2020). For example, Garousi and Küçük (2018) identified more than 190 test smells in a literature re­ view of 166 studies. In this study, we selected 14 types of test smells frequently studied and implemented in cutting­edge test smell detection tools (Van Deursen et al., 2001; Meszaros et al., 2003; Pe­ ruma, 2018). These are described next: • Assertion Roulette (AR). A test method that contains assertions without explanation. 
If one of those assertions fails, it is not possible to identify which one caused the problem (Van Deursen et al., 2001);
• Conditional Test Logic (CTL). A test method with conditional logic (if-else or repeat instructions). Tests with this structure do not guarantee that the same flow is verified, as they might not test a specific code piece (Meszaros et al., 2003);
• Constructor Initialization (CI). A test class that presents a constructor method instead of a setUp method to initialize fields (Peruma, 2018);
• Eager Test (ET). A test method checks many object methods at the same time. This test may be hard to understand and execute (Van Deursen et al., 2001);
• Empty Test (EpT). A test method does not contain executable assertions (Peruma, 2018);
• For Testers Only (FTO). A production class has methods only used by test methods (Van Deursen et al., 2001);
• General Fixture (GF). The fields instantiated in the setUp method are not used by all test methods of a test class. It may be hard to read and understand and may slow down the test execution (Van Deursen et al., 2001);
• Indirect Testing (IT). A test class has methods that perform tests on different objects because there are references to those objects in the test class (Van Deursen et al., 2001);
• Magic Numbers (MN). A test method contains assertions with literal numbers as a test parameter (Meszaros et al., 2003);
• Mystery Guest (MG). A test method uses an external resource, such as a file with test data. If the external file is removed, the tests may fail (Van Deursen et al., 2001);
• Redundant Print (RP). A test method contains irrelevant print statements (Peruma, 2018);
• Resource Optimism (RO). A test method contains optimistic assumptions about the presence or absence of external resources. The test may return a positive result once, but it may fail at other times (Van Deursen et al., 2001);
• Test Code Duplication (TCD). A test method has undesired duplication (Van Deursen et al., 2001);
• Test Run War (TRW). A test method fails when several tests run simultaneously and access the same fixtures (Van Deursen et al., 2001).

3 Research Method

We carried out two empirical studies in this investigation: a survey and an interview study (Miles et al., 2014). Figure 1 shows the methodological steps employed in this study.

Figure 1. Research method overview.

Initially, we designed our study by defining the research questions and the suitable research methods to investigate them (Fig. 1 - Design). We used the survey research method to identify which programming practices respondents (practitioners who participated in the survey) adopt that might insert test smells in the test code (Fig. 1 - Survey). We next applied the interview study method to identify how the interviewees (practitioners who participated in the interview) deal with test smells during test creation and execution (Fig. 1 - Interview). We then compared the results obtained from the survey and the interviews to contrast the adoption of practices that might lead to introducing test smells with the practitioners' knowledge about test smells from different perspectives (Fig. 1 - Data Comparison).

For the survey, we adopted the design of observation by case-control. Case-control is a descriptive design used to investigate previous situations to support understanding a current phenomenon (Pfleeger and Kitchenham, 2001). It encompasses activities for the design, application, and analysis of a survey questionnaire.
We designed the questionnaire not to require specific knowledge about test smells. We correlated each test smell to a set of programming practices, which the participants should read and analyze. Section 4 details the survey study.

To complement the findings of the survey questionnaire, we carried out a semi-structured interview (Singer et al., 2008; Gubrium et al., 2012). The interview's structure aims to capture the interviewees' perception of test smells. As we needed the interviewees to know the definition of test smells for elaborating on how they deal with them, we first introduced them to the concept of test smells. Section 5 details the interview study. The survey and interview instruments were written and applied in the Portuguese language with Brazilian practitioners. Finally, the data comparison summarizes the results of the survey and interview methods to answer the research questions (Creswell and Clark, 2018). Section 6 presents the results.

4 Survey Study

We applied the survey research method to investigate how the respondents commonly insert test smells in the test code when designing or implementing their software projects (Melegati and Wang, 2020). Throughout this section, we provide readers with detailed information about the research design and data analysis. All material used in the survey study, including the dataset, is publicly available at (Junior et al., 2021).

4.1 Design

We structured the questionnaire so that the respondents were not required to be aware of test smells beforehand. Thus, we covered a larger number of potential practitioners. We correlated the concepts of test smells to commonly applied test creation and execution practices. Table 1 shows examples of those practices. For instance, the practices associated with Conditional Test Logic (CTL) use loops or conditions in the test code. In this case, the respondents should analyze the practices to determine whether and how often they adopt them. In CTL, the respondents should indicate how often they create tests with those structures or face them during test execution.

Table 1. Examples of practices related to test smells.
Test Smell | Test Creation Practices | Test Execution Practices
Mystery Guest | I often create test cases using some configuration file (or supplementary) as support. | A test case fails due to the unavailability of access to any configuration file.
Eager Test | I often create tests with a high number of parameters (number of files, database records, etc.). | I run some tests without understanding what their purpose is.
Assertion Roulette | I pack different test cases into one (i.e., put together tests that could be run separately). | Some tests fail, and it is not possible to identify the failure cause.
For Testers Only | I have already created a test to validate some feature that will not be used in the production environment. | I run some tests to validate features that will not be used in the production environment.
Conditional Test Logic | I have already created conditional or repeating tests. | I run tests with conditional or repeating structures.
Empty Test | I have already created an empty test with no executable statement. | I find empty tests, with no executable statement.

Questionnaire Instrument

The questionnaire comprises three blocks of questions. The first block characterizes the respondents (profile) and has thirteen questions to identify their age, gender, education degree, and software testing/programming skills.
The second block has fourteen statements and six comple­ mentary questions (four objective and two open­ended ques­ tions). The statements describe creation practices related to test smells. We structured those statements in a five­point Likert scale, where the respondents could choose one of the following answers: always, frequently, rarely, never, or not applicable. In this scale, always indicates the adoption of bad practices for test creation. For example, the “I have already created a test to validate a feature that would not be used in the production environment” statement corresponds to the For Testers Only test smell. Therefore, the answer “Always” means that the respondent usually uses that practice in her daily tasks. As a consequence, it is likely that she uninten­ tionally inserts that test smells in the test code. We designed the six complementary questions to understand how the prac­ titioners deal with the test creation activity. The third block has fourteen statements and one additional question. Those statements describe execution practices re­ lated to test smells. Like the former block, we structured those statements on a five­point Likert scale. The respon­ dents could choose one of the following answers: always, fre­ quently, rarely, never, or not applicable, where always indi­ cates that the respondent comes across with test smells. We designed the complementary question to understand which problems the respondents deal with when executing the tests. The survey was available from April 3rd, 2019, to June 3rd, 2019. Appendix A includes all the questionnaire statements and questions used in this study. Pilot Application We ran a pilot survey with four practitioners to identify improvement opportunities. Based on the responses, we im­ proved the questionnaire before running the survey. It is worth mentioning that we did not include data gathered in the pilot application in the research results. Participants We sent invitations and one questionnaire copy (C1 ­ C8) to practitioners from eight Brazilian companies on a conve­ nience sampling basis. The questionnaire’s different versions served to control the number of respondents from the compa­ nies. Those companies have 4 to 66 practitioners who per­ form manual and automated tests (Table 2). In addition, we also sent the questionnaire through direct message (D1) and posted it on a Facebook group dedicated to discussing soft­ ware testing (G1). In total, we contacted 305 practitioners, and 60 practitioners participated in the survey (#S1 ­ #S60). Analysis Procedure To answer RQ1, we analyzed the objective questions (statements) on test creation (second block) and execution (third block). To answer RQ2, we grouped the practices by frequency to identify the most commonly used ones. The practices may be associated with test smells according to their characteristics, such as external file usage, conditional structure, and programming style. To answer RQ3, we com­ pared the professional experience with the frequency of use of test smells. We also used the same answer format of RQ1 but only considered test creation (second block). During the test execution, respondents identify test smells instead of cre­ ating them. We analyzed the three open­ended questions through cod­ ing and continuous comparison (Kitchenham et al., 2015). The objective was to understand why the respondents use practices that may insert test smells. 
In addition, we also intended to understand which difficulties they encounter when creating and executing tests. Two researchers performed the coding task and validated it by consensus. We also associated some practices with the test code characteristics defined by Meszaros et al. (2003). We employed open coding on the collected data to identify additional reasons why the respondents may use bad practices in their software testing activities. The obtained codes were peer-reviewed and changed upon agreement with the paper authors. We used coding to complement our results on the open-ended questions because they were optional.

Table 2. Respondents
Source | Professionals | Answers
C1 | 66 | 14
C2 | 30 | 1
C3 | 10 | 0
C4 | 6 | 0
C5 | 5 | 0
C6 | 4 | 4
C7 | 4 | 4
C8 | 4 | 0
D1 | 52 | 35
G1 | 124 | 2
Total | 305 | 60

4.2 Results

We received 60 answers (out of 305 potential respondents) from three Brazilian states: 40 respondents from Bahia (66.7%), 19 respondents from São Paulo (31.7%), and one respondent from Paraná (1.6%). The respondents ranged from 22 to 41 years old, and their experience with quality assurance ranged from 0 to 13 years (5.16 on average). Experience as software developers also ranged from 0 to 13 years (average 1.67). Regarding gender, 35 respondents were male (65%), 19 respondents were female (32%), and two respondents were non-binary (3%).

Most of the respondents hold a degree in Computer Science-related courses (50 respondents - 83.3%), six respondents (10%) hold a degree in other STEM (Science, Technology, Engineering, and Mathematics) courses, and four respondents (6.7%) hold a degree in other areas. Most of the respondents (54 respondents - 90%) pursued higher education degrees, as follows: 40 respondents hold a bachelor's degree (66.7%), 13 respondents hold a graduate degree (21.7%), and one respondent holds a postdoc (1.6%).

Regarding the software testing tasks they commonly perform, (i) 26 respondents reported they create and run tests at the same rate (43.3%); (ii) 13 respondents execute tests more frequently than they create them (21.7%); and (iii) 8 respondents create tests more frequently than they execute them (13.3%). Moreover, 12 respondents only execute test cases (20%), and one respondent only creates test cases (1.7%). They perform tests over many different platforms; 35 respondents (58%) work with two or more platforms (Web - 39 respondents (65%), Android - 35 respondents (58%), Desktop - 29 respondents (48%), and Apple - 17 respondents (28%)). They also cited other platforms, such as back-end, microservices, API, mainframe, and cable TV - one respondent each (1.67%).

In terms of domain, 39 respondents claimed they test mobile applications (65%), and 36 respondents test web applications (60%). We also identified the following domains: 14 respondents work with embedded systems (23.33%), 11 respondents work with cloud computing (18.33%), seven respondents test information security (11.67%), and four respondents test Internet of Things systems (6.67%). They also mentioned other domains: big data, retail, artificial intelligence, cable TV, bioinformatics, commercial information, desktop systems, and payment solutions - one respondent each (1.67% each).

Figure 2. Test Smells frequency in test creation.

4.2.1 Test creation and execution practices

We asked whether the respondents search for test duplication and whether it was either a personal or a company practice. Twenty-nine respondents (48.3%) answered that it was only an individual activity.
Eleven (18.3%) responded that it was only a company's practice, and three respondents (5%) claimed that it was both a personal and a company activity. However, seventeen respondents (28.3%) do not apply this activity. Checking tests with the same objective reduces the Test Code Duplication (TCD) test smell.

In addition, we established a relationship between the test creation and execution practices and the occurrence of test smells using the collected data. Figures 2 and 3 show the usage frequency of test smells during the test creation and execution activities, respectively.

During test creation, the Conditional Test Logic (CTL) and General Fixture (GF) test smells were the most reported ones. The former obtained 28 (47%) Always and Frequently responses, and the latter 27 (45%) in both responses (Figure 2). The high rate of those responses may indicate a common everyday use of practices related to CTL and GF. We also analyzed why developers create tests with bad practices (one open-ended non-mandatory question answered by 27 respondents - 45%). The main reasons were related to company or personal standards, limited time, and the attempt to reach better coverage and efficiency.

We also asked whether they modified existing test sets when they came across tests containing any of the problematic test patterns illustrated in the survey. We found that seven respondents (11.7%) always perform test code changes, twenty-three respondents (38.3%) frequently change, sixteen respondents (26.6%) rarely change, seven respondents (11.7%) never edit test code, and seven respondents (11.7%) answered as not applicable. Among the reasons to modify the test, eighteen respondents reported ambiguity reduction (30%), sixteen respondents claimed execution speed improvement (26.7%), fourteen respondents stated adequacy to company standards (23.3%), eight respondents did not understand the test objective (13.3%), and four respondents stated corresponding production class evolution (6.7%).

In addition, the respondents pointed out that they used to face test structure problems. Thirty-one respondents indicated that some tests depended on third-party resources (52%), 29 respondents reported that tests were hard to understand (48%), 24 respondents claimed they contained unnecessary information (40%), 24 respondents mentioned ambiguous information (40%), 20 respondents reported dependence on external files (33%), and six respondents pointed to the use of an external configuration file (10%). One respondent mentioned resource limitations (2%).

Regarding difficulties in creating test cases (one open-ended non-mandatory question answered by 23 respondents - 38%), requirement issues were the most frequent ones, reported by twelve respondents (52%). Other problems were related to difficulties in test code reuse, lack of knowledge, production code issues, code coverage, test environment problems, and time and resource limitations.

The test execution questions also presented a sequence of statements about ordinary situations the developers usually face, in which respondents should answer according to the frequency. The CTL (52%) and GF (47%) test smells were also the most cited during test execution (Figure 3). Those test smells obtained 31 and 28 answers of the Always and Frequently frequencies, respectively.

Figure 3. Test Smells frequency in test execution.
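To make the two most reported smells concrete, the sketch below shows what such practices typically look like in a unit test. It is a minimal, hypothetical JUnit 5 example written for illustration only (the class and test names are ours, not taken from the survey material): the setUp method builds objects that not every test needs (General Fixture), and one test routes its checks through a loop and an if statement (Conditional Test Logic).

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class ShoppingCartListTest {

    private List<String> cart;
    private List<String> wishList; // part of a shared, "general" fixture

    @BeforeEach
    void setUp() {
        // General Fixture: both lists are created for every test,
        // although wishList is never used by testCartStartsEmpty.
        cart = new ArrayList<>();
        wishList = new ArrayList<>();
        wishList.add("book");
    }

    @Test
    void testCartStartsEmpty() {
        // Uses only 'cart'; the wishList built above is dead weight here.
        assertTrue(cart.isEmpty());
    }

    @Test
    void testItemsAreStored() {
        // Conditional Test Logic: the verified flow depends on the loop
        // and the if statement, so it is unclear which input is actually
        // exercised when the test passes or fails.
        for (int i = 0; i < 3; i++) {
            if (i % 2 == 0) {
                cart.add("item-" + i);
            }
        }
        assertEquals(2, cart.size());
        assertTrue(wishList.contains("book"));
    }
}
```

In a smell-free version, each input would get its own straight-line test method, and each test would create only the objects it actually uses.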
Regarding difficulties in running test cases (one open-ended non-mandatory question answered by 29 respondents - 48%), ten respondents reported the test environment as a problem related to test execution (34%), such as test environment unavailability, demand for third-party features, and low-performance environments. The second most common problem was understanding the test purpose (28%), where eight respondents reported that tests were poorly written and without a standard, allowing multiple interpretations. The lack of test maintenance was the third problem (24%), which involves outdated and incomplete tests due to the evolution of the system code (7 respondents).

4.2.2 Professional Experience

Although most respondents from the survey reported they create and execute tests simultaneously, our investigation presented a different scenario as the tester gets more experienced. Figure 4 shows the daily activities according to professional experience, with the following highlights: 10 respondents (16.7%) with experience ranging from 4 to 6 years, and 5 respondents (8.3%) with 8 to 10 years of experience, create and execute tests in the same proportion. Eight respondents (13.4%) with less than two years of experience, six respondents (10%) with 2 to 4 years of experience, and four respondents (6.7%) with 6 to 8 years of experience only run tests or run tests more frequently than they create them. Three respondents (5%) with more than 12 years of experience mostly create rather than run tests. Therefore, less experienced respondents run more tests than they create, and respondents with more experience create more tests than they run.

Figure 4. Testing tasks according to professional experience.

We also analyzed whether the use of good practices to create tests increases as respondents become more experienced. We provided the respondents with thirteen statements, with illustrative scenarios of problems with test cases. Each scenario relates to a given test smell. The respondents had to answer how often they experienced each scenario. Table 3 shows the number of respondents grouped by experience time (in years) and the number of valid responses.

Table 3. Answers grouped by experience range
Experience (in years) | Number of respondents | Total
0 - 2 | 11 | 143
> 2 - 4 | 12 | 156
> 4 - 6 | 15 | 195
> 6 - 8 | 5 | 65
> 8 - 10 | 9 | 117
> 10 - 12 | 4 | 52
> 12 - 14 | 4 | 52

Figure 5 presents the frequency of test smells grouped by professional experience. When we analyzed the first experience range (0-2), 71 answers (50%) from the respondents could not identify the adoption of practices related to test smells (Not applicable), 9 answers (6%) pointed out that respondents Always adopted some practice related to test smells, 16 answers (11%) related to Frequently, 29 answers (20%) to Rarely, and 18 answers (13%) to Never adopted practices related to test smells. When we extended that analysis through the next experience ranges, we could not observe any increase in the Never and Rarely responses with professional experience, indicating that experience might not influence the adoption of practices that lead to the introduction of test smells.

Figure 5. Test Smells frequency in test creation according to professional experience.

5 Interview Study

After carrying out the survey study, we interviewed software engineers to gather further evidence on how the practitioners develop unit test code and deal with test smells in test creation and maintenance. The interview dataset, including the interview transcriptions, interviewees' profiles, and coding summary, is publicly available at (Junior et al., 2021).

5.1 Design

We employed a semi-structured interview approach, guided by a set of sixteen questions, as Table 4 shows.

Table 4. Interview questions
# | Question
1 | How did you start working with software test?
2 | What were your learning sources about test code?
3 | Which programming languages do you create tests for?
4 | Which programming languages do you use in your current software project?
5 | How is your test creation process?
6 | Is there any flowchart or template document that standardizes this process?
7 | Which support tools are used for test creation and execution?
8 | How do you verify the quality of unit tests?
9 | Moving to the test code maintenance process, tell me how is this process inside the company?
10 | What do you know about the test smell?
11 | How did you learn about that?
12 | Do you have any doubt about test smell?
13 | How are the test smells handled in the unit testing creation process?
14 | How are the test smells handled in the unit testing maintenance process?
15 | How would it be possible to avoid the introduction of test smells during test creation?
16 | Do you have any question, additional information or suggestion to improve this interview?

Interview Organization

We organized the interview into three blocks:
• Warm-up block (#1-3). Questions about the professional background, such as the learning resources on software test code the interviewees commonly use, as well as the programming language they often use to implement test code, if any;
• Technical block (#4-9). Questions about how they create, maintain, and assess the quality of developed unit tests;
• Test Smell block (#10-15). Questions about the interviewees' awareness of test smells and how they handle them in test case creation and maintenance.

The interviewees could also ask for more information or give additional information and suggestions to increase the interview quality (question #16). Unlike the survey, in the interview we employed the actual test smell term in the questions related to that concept, instead of considering a transitive approach through statements containing practices embedded with test smells. When the participants were not aware of the term or asked for more information on test smells, we presented the concept and two test smell samples, e.g., CTL and EpT (Virgínio et al., 2020). Those test smells were related to the most and the least frequently used programming practices in the survey results, respectively. There were no questions about challenges or problems involved in creating and maintaining test code. The interviewees answered the questions in Table 4 according to their experiences, concepts, and the information shared during the meeting. The interviewer and interviewees did not access any test code from the interviewees to analyze the presence of test smells.

At the beginning of the interview, the practitioners answered a professional profile form with academic background and professional experience. They also provided an email address to solve eventual doubts or collect more data during data analysis. We conducted the interviews between June 3rd and June 30th. Due to the pandemic period, online meeting tools, such as Skype and Google Meet, were used upon the participants' request. We recorded the interviews with either the Skype conversation recording tool or the Google Meet screen capture feature.
Additionally, we used an external voice recorder for every interview.

Participants

Initially, we contacted practitioners from the survey who agreed to keep contributing to the research. Unlike the survey, we opted only for test code developers whose focus was creating and maintaining unit tests, including the treatment of test smells. Some interviewees participated in the survey study because we applied the snowballing technique (Kitchenham et al., 2015). Next, we used LinkedIn to invite other potential participants, using the "unit testing" expression in the profile ability search LinkedIn provides to users. A total of 50 practitioners accepted the invitation (#I1 - #I50).

Pilot Study

We performed two pilot interviews with practitioners to measure the interview length and analyze whether it would be necessary to modify any part of the predefined instrument. As a result, there was no need to perform any changes in the instrument. The average interview length was around 30 minutes.

Analysis Procedure

The first author was responsible for transcribing the interviews. From the transcriptions, we performed open coding (Corbin and Strauss, 2014) to answer the research questions. The remaining co-authors analyzed the transcriptions to understand how the practitioners develop tests and deal with test smells. First, we analyzed and validated the coding until we reached a consensus. In the following step, two authors individually reviewed the proposed coding. In the end, one expert researcher reviewed the final coding.

5.2 Results

The interviewees could answer open-ended questions in different ways, according to their reality. Therefore, when presenting the results, the percentages for some questions may sum to more than 100% in the quantitative analysis.

The respondents' age ranged from 20 to 48 years old, most of them ranging from 25 to 34 years old (60%). Regarding their education, six respondents had completed high school (12%), 31 respondents had completed an undergraduate degree (62%), and 13 respondents hold a graduate degree (26%). Additionally, 48 respondents either have a degree or were studying in a Computer Science-related course (96%), one respondent holds a degree in Applied Business (2%), and one respondent holds a degree in Psychology (2%).

The respondents worked in companies of different sizes, as follows: (i) 10 respondents worked in small companies (less than 50 employees - 20%); (ii) 5 respondents worked in medium-sized companies (50 to 99 employees - 10%); and (iii) 35 respondents worked in large-sized companies (more than 99 employees - 70%). Additionally, the interviewees were responsible for different tasks within the companies related to their current roles (Table 5). They created unit tests for mobile, desktop, and web platforms using different programming languages (Table 6).

Table 5. Respondents' roles
Role | Respondents | %
Developer | 22 | 44%
Software Engineer | 7 | 14%
Systems Analyst | 7 | 14%
Software Architect | 5 | 10%
Team Leader | 3 | 6%
Automation Engineer | 2 | 4%
Consultant | 2 | 4%
Project Manager | 2 | 4%
Quality Specialist | 2 | 4%
Quality Engineer | 1 | 2%
Test Developer | 1 | 2%
Test Analyst | 1 | 2%

Table 6. Programming languages
Language | Respondents | %
Java | 25 | 30%
JavaScript | 14 | 17%
C# | 11 | 13%
TypeScript | 8 | 10%
Python | 7 | 9%
Kotlin | 5 | 6%
PHP | 4 | 5%
Swift | 3 | 4%
Ruby | 2 | 2%
C | 1 | 1%
C++ | 1 | 1%
Elixir | 1 | 1%
Go | 1 | 1%

Their experience in software development tasks varied from 1 to 20 years, of which more than 50% were in the 1-6 years of experience range.
Two of them were not working with unit test creation when we interviewed them. In such cases, we asked them to consider their previous experience.

For the open coding analysis, we compared and analyzed the information and grouped it into codes, using sentences, paragraphs, or the entire document. For example, when we asked about their unit test creation process, the interviewee #I47 answered: "When I worked only with Java [...] if I know the context well, if I have deep knowledge of the context that I will develop, I like to do a little TDD [...], but unfortunately this is not something that can be 100% reality in the business, because you have N situations, N circumstances. So I cannot do TDD; at least I develop the specific feature, [...] the features, methods, etc., and then I will test it, for example, for each method that I know has a logic within that method, I do the test cases for N possibilities". From this answer, we identified the following codes: CodeA - TDD; CodeB - TLD; CodeC - depends on personal skill. In total, we found 159 codes.

We did not consider the warm-up block answers (#1-3), as we used them to stimulate the interviewees to provide as much information as possible. We used the technical block answers (#4-9) to analyze how the interviewees created, maintained, and verified test quality, and to complement and compare the survey's supplementary questions. We used the answers to question #10 to analyze which information the interviewees presented about test smells. Therefore, we could answer RQ4. Questions #11 and #12 complemented question #10. We used the answers to questions #13 and #14 to analyze the strategies for dealing with test smells and answer RQ5. Then, we analyzed the answers given to question #15 to understand how the interviewees believe it would be possible to avoid introducing test smells. Those questions let us better understand how they create, maintain, and verify unit test code and how they deal with and possibly avoid test smells.

5.2.1 Unit test code creation and maintenance

We found that the developers usually create unit test code using Test-Driven Development (TDD) (48%), Test Last Development (TLD) (42%), or Behavior Driven Development (BDD) (16%). The usage of those strategies was motivated by the project task or the developer's knowledge of the project's programming language or architecture. For example, the interviewee #I16 stated that he used TDD when he mastered the programming language; otherwise, the functional software code was created first and then tested (TLD). The interviewee #I25 claimed that she created unit tests according to the stories from the BDD scenario. When there was no scenario, she used TDD. The method adoption could also depend on whether the software was new or legacy. The interviewee #I32 pointed out that TDD was used on new projects when possible, and that he used a BDD variation before the software code creation.

During the description of test code creation, four interviewees (8%) mentioned using Mocks to simulate components, and two interviewees (4%) used to adopt clean code practices. For instance, the interviewee #I22 claimed he creates easy-to-read and understand, fast, and independent test code. The interviewee #I36 uses code patterns and creates less verbose tests. Additionally, the focus of four interviewees (8%) is on test coverage. The interviewee #I12 claimed that he identifies "interesting features" to test.
According to the inter­ viewee #I43, the test code should cover 80% of the software code. Moreover, the interviewee #I10 mentioned the SOLID principles, and the interviewee #I15 adopts the Model­View­ ViewModel (MVVM) project pattern (#I15) as practices dur­ ing the test creation. When we asked whether there was any document that stan­ dardized unit test creation, nine interviewees (18%) indicated the use of templates or some other documentation. The in­ terviewees #I5 and #I9 mentioned a test template in their projects that the team members could adopt. The interviewee #I29 claimed his team followed the Microsoft’s official doc­ umentation, but there was not any internal document. The interviewee #I39 mentioned using a Domain Specific Lan­ guage (DSL) to share project information, as follows: “On project day 0, we create and standardize an official DSL for the code. You have prerogatives, you have the test, and you have the result”. In addition, some interviewees answered that there was no documented standard, but they adopted the Given­When­Then (GWT) pattern and the Arrange­Act­ Assert (AAA) programming practices. Furthermore, the interviewees mentioned 90 different tools to create and run tests. Those tools are related to (i) code development (JUnit ­ 42%, Jest ­ 14%, and Visual Stu­ dio ­ 20%); (ii) metrics analysis (Sonar tools ­ 18%), and (iii) Continuous integration (Jenkins ­ 10%, Azure ­ 2%, and Cir­ cle CI ­ 2%). After creating unit test code, the test quality assessment was performed through code review (78%) by one or more developers inside the project team. This activity usually was supported by tools, such as Pull Panda. For example, the in­ terviewee #I2 claimed: “Pull Panda5 is a tool used to ran­ domly assign one or more developers to perform the code review. [...]”. Furthermore, two other interviewees (inter­ viewee #I4) and (interviewee #I16) reported that they per­ formed peer review (4%), and four interviewees claimed they commonly verify test code quality through pair program­ ming (8%). Other practices identified were: test coverage (30%), metric analysis tool (24%) (e.g., SonarQube tool), re­ viewing by continuous integration tool (16%), test execution (10%), application of programming practices (10%) (reuse, clean code, and libraries), running mutant test tool (6%), test validation by external Quality Assurance team (2%), and static validation (2%). Three interviewees reported that there were “no test quality assurance” activities because there were not enough tests to perform this activity or because the company does not support it. The interviewees adopted various test maintenance types distributed by corrective (62%), adaptive (36%), preventive (4%), and perfective (4%) maintenance. Four interviewees claimed there was no test code maintenance due to: (i) there was no defined maintenance process (interviewee #I22); (ii) the participation in one new project and no maintenance task was required (interviewee #I24); absence of maintenance ac­ tivity because of shortage of time (interviewees #I24 and #I36); and (iv) project environment (interviewee #I45). 5.2.2 Test smells treatment We asked the interviewees about their knowledge of test smells to understand whether they comprehended the study subject. Figure 6 summarizes the results. Seven interviewees (14%) demonstrated some knowledge of test smells. For ex­ ample, the interviewee #I2 answered: “I know a few things. 
I consider these as bad practices, bad choices that you make in your test code that make its maintenance and evolution difficult.". Twenty-three interviewees (46%) related test smells to code smells but claimed they had never heard of test smells. The interviewee #I16 mentioned: "Test smell, I do not know the concept. The code smell is a problem that the static test analysis tool found in the program. Would test smell be that same analysis on top of the test code?". Finally, twenty interviewees (40%) did not know test smells and did not relate them to any smell type.

5 https://pullpanda.com/

We presented the definition and examples of two test smells (CTL and EpT) to the interviewees who did not know about test smells or asked for more information. Table 7 shows how they prevent test smells during test code creation and how they treat test smells during test code creation and maintenance. For example, during test code creation, the Code review practice was the most recommended (38%), followed by Tool usage (26%) and Programming practices (24%). When developing the test code, the developer should follow the programming practices to prevent test smells. Tools and code reviews help to check for the insertion of test smells at an early stage of development. Two interviewees believed there were no test smells in their repository. For example, the interviewee #I39 said: "I think we do not have this problem (test smells) in the recent project because of its difficulty level, we follow a coding standard. We educate people on how we code it [...]". The interviewee #I11 also said: "As I am the only one working on the project, I coded, understood, and never had this vision of test smells. I do not think I have any problem with that.".

Regarding maintenance, we asked how the interviewees treated test smells during test code maintenance. The answers were similar to the previous question (Table 7). For test code maintenance, Code review was also the most recommended practice (28%), followed by Refactoring (20%) and Tool usage (18%). As the test code was already developed and might have test smells, they suggested using tools to help detect test smells and refactoring techniques to remove them from the test code. The code review practice can double-check the test code to treat test smells during maintenance.

We also asked the interviewees how to prevent test smells during test code creation (Table 7). For test smell prevention, Tool usage was the most recommended practice (44%), followed by Developers' skills (28%) and Code review (20%). The developers' skills are related to the know-how of developing tests by following good practices, guidelines, and coding patterns. It should help the developers identify and prevent flaws in designing and implementing test code. Tool usage can support the developers when developing test code by identifying possible test smells. Code review is a manual analysis of the test code to double-check it for test smell prevention.

At the end of the interviews, the interviewees could either provide or ask for further information about test smells and test code quality assurance. For instance, the interviewee #I29 claimed: "For me, it is a quality guarantee in terms of dependence exemption, in terms of development, cohesion, coupling, and fundamental architecture. From the moment you have unit testing or even TDD, it helps you improve the code and architecture.".
The interviewee #I35 demonstrated interest in our study: "I would like to know more about the study, we can talk about it later if you want, [...] I thought the term 'test smell' is complicated, at least it does not seem to be a common industry expression.".

Figure 6. Prior knowledge about test smell.

Table 7. Practices to prevent test smells or to treat them during the test code creation and maintenance
# | Practice | Creation | Maintenance | Prevention
1 | Code analysis | – | – | 2 (4%)
2 | Code removal | – | 1 (2%) | –
3 | Code reuse | – | – | 1 (2%)
4 | Code review | 19 (38%) | 14 (28%) | 10 (20%)
5 | Coding patterns | 4 (8%) | 5 (10%) | 8 (16%)
6 | Company support | – | – | 1 (2%)
7 | Culture's development | – | – | 3 (6%)
8 | Developer skills | 2 (4%) | 2 (4%) | 14 (28%)
9 | Documentation | – | – | 1 (2%)
10 | Guidelines | – | – | 3 (6%)
11 | Individual analysis | 2 (4%) | 6 (12%) | –
12 | Mutant testing | 1 (2%) | 1 (2%) | –
13 | No treatment | 13 (26%) | 13 (26%) | –
14 | Pair programming | 2 (4%) | 1 (2%) | 4 (8%)
15 | Peer review | 1 (2%) | 1 (2%) | 1 (2%)
16 | Professional experience | – | – | 6 (12%)
17 | Programming practices | 12 (24%) | 8 (16%) | 11 (22%)
18 | Refactoring | 5 (10%) | 10 (20%) | –
19 | TDD | – | – | 3 (6%)
20 | Technical debt | 1 (2%) | 5 (10%) | –
21 | Technical meeting | 1 (2%) | – | –
22 | Tool usage | 13 (26%) | 9 (18%) | 21 (44%)
23 | Traceability | 1 (2%) | 1 (2%) | 1 (2%)
24 | Training | – | – | 8 (16%)
25 | Take breaks | – | – | 1 (2%)
26 | Software code improvement | – | 1 (2%) | –
27 | Test Smell Catalog | – | – | 1 (2%)

6 Discussion

This section discusses the results obtained after conducting the survey and the interviews to answer the research questions. RQ1, RQ2, and RQ3 are related to the survey, and RQ4 and RQ5 are related to the interviews.

6.1 RQ1: Do practitioners use test case design practices that might lead to the introduction of test smells?

From the results, we observed that each of the 14 practices related to test smells was pointed out by at least one respondent. We analyzed those practices when creating and maintaining tests to identify which types of test smells the participants frequently insert in the test code.

Regarding test creation, we observed that every test smell presented at least three out of four possible answers (Always, Frequently, Rarely, and Never). We classified the data into two groups: the Commonly-used practices group (CPG) and the Unused practices group (UPG). CPG contains test smells that mostly present Always and Frequently as answers, and UPG contains those that mostly present Rarely and Never as answers. We considered a test smell as belonging to one group when the difference between the Always and Frequently rates and the Rarely and Never rates is greater than 10%. For example, the Empty Test, For Testers Only, Test Run War, Constructor Initialization, Resource Optimism, Redundant Print, Magic Number, and Indirect Test test smells belong to UPG, which means practitioners rarely insert those smells in their testing activities.

On the other hand, the respondents frequently adopt practices related to the General Fixture test smell, the only member of CPG, indicating that they usually create tests with that smell. Still, four test smells presented a similar pertinence frequency to both groups (less than 10% of difference). For them, there was no clear pattern among respondents. For instance, the Eager Test test smell obtained 38% for CPG and 40% for UPG.
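As an illustration of why such practices can be read either way, consider the hedged JUnit 5 sketch below (hypothetical code, not taken from any respondent's project). The single test method exercises several behaviors of the object under test at once (Eager Test) and stacks unexplained assertions, so a failure does not identify which check broke (Assertion Roulette).

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.ArrayDeque;
import java.util.Deque;
import org.junit.jupiter.api.Test;

class DequeBehaviorTest {

    @Test
    void testDeque() {
        Deque<String> deque = new ArrayDeque<>();

        // Eager Test: one method exercises insertion at both ends,
        // removal, and emptiness checks instead of a single behavior.
        deque.addFirst("a");
        deque.addLast("b");
        assertEquals(2, deque.size());          // Assertion Roulette: none of
        assertEquals("a", deque.peekFirst());   // these assertions carries a
        assertEquals("b", deque.peekLast());    // message, so a failure does
        assertEquals("a", deque.removeFirst()); // not say which check broke.
        assertFalse(deque.isEmpty());
        assertEquals("b", deque.removeLast());
        assertTrue(deque.isEmpty());
    }
}
```

A refactoring would split each behavior into its own test method or attach a message to every assertion so that a failing check identifies itself.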
In test execution, UPG contains the Empty Test, Eager Test, Assertion Roulette, Redundant Print, Duplicated Test, Test Run War, For Testers Only, Mystery Guest, Constructor Initialization, and Resource Optimism test smells, which means that the respondents rarely face those smells during test execution. In contrast, the respondents frequently find practices related to two test smells, General Fixture and Conditional Test Logic, which compose the CPG group. In addition, we did not perceive a significant difference among respondents for two other test smells, Indirect Test and Magic Number, which presented a similar pertinence frequency in both groups.

We also investigated the reasons that lead the respondents to adopt the practices presented in the survey. Thus, we analyzed the open-ended questions and identified 16 different tags. The most common ones were company standard, personal standard, project politics, professional experience, saving time, and improving coverage. For example, the respondent #S26 reported applying company standards when creating tests that may insert smells, commonly using bad practices "to match company development standards." In another situation, respondent #S54 reported using personal standards when saying: "I group tests by modules to execute them sequentially without compromising effectiveness." This behavior suggests that participants may have misunderstood the test smell definition: when grouping tests, it is possible to insert the Assertion Roulette test smell and compromise test independence. A similar situation occurred with the respondents #S14, #S16, #S27, #S50, and #S59.

In general, our study identified that all test smells appeared in testing activities. They all were cited by respondents, even if rarely.

Practitioners adopt practices for test case design which introduce test smells. Usually, those practices come from improper personal and company standards.

6.2 RQ2: Which practices are present in practitioners' daily activities that lead to introducing test smells?

Although there are specific tools to support test automation (Fraser and Arcuri, 2011; Smeets and Simons, 2011), 62% of the respondents perform more manual than automated tests. Besides, although 55% have little experience with software development (less than two years of experience), this lack of experience does not appear to influence the adoption of bad practices in the test code.

According to the practices explored in the survey, we identified that the respondents usually come across: (i) the use of generic configuration data, which produces the General Fixture test smell (most frequent in the activities of test creation and execution - CPG); and (ii) the use of conditional or repetition structures, directly associated with the Conditional Test Logic test smell (second most detected in the activity of test execution - CPG).

The respondents indicated they usually face several problems with tests, such as poorly written tests and outdated and incomplete test procedures. According to them, when the tests are associated with generic configuration data, test cases are hard to understand and may cause incorrect results. Moreover, the test coverage of the production code is unclear due to the presence of conditional logic in the tests. Understanding which practices are most prevalent in the professionals' activities supports improving test quality. Other identified problems are related to incompleteness, outdatedness, or lack of documentation.
These may hinder traceability, evolution, and maintenance of the testing tasks.

The practices most present in the practitioners' daily activities that lead to the insertion of test smells were conditional or repetition structures and generic configuration data.

6.3 RQ3: Does the practitioners' experience interfere with the introduction of test smells?

In the survey study, we analyzed the respondents' experience and its influence on adopting practices that might lead to inserting test smells in their projects. As a result, we did not identify any clear cause-effect correlation. For example, the Always option indicates that they always use harmful practices. When we analyzed the answer frequencies for this option, the usage rate did not decrease over time. Instead, we may observe from Figure 5 that respondents with 8 to 10 years of experience achieved a higher usage rate for this frequency. We also identified that behavior when we analyzed the other usage frequencies. However, we could not infer that inexperienced practitioners introduce more test smells than experienced ones regarding the activity of test creation. On the one hand, when testers are inexperienced programmers, they may write lower-quality tests. On the other hand, when they are more experienced, they may carry programming biases that contain bad practices. Thus, the absence of a tendency indicates that there is no behavioral change between less and more experienced practitioners.

Experienced practitioners may not produce fewer test smells than inexperienced ones.

6.4 RQ4: How aware of test smells are the practitioners?

The survey results indicate that the lack of information on test smells is one reason that leads practitioners to adopt programming practices that may introduce test smells. Although the test smell concept appeared in 2001 (Van Deursen et al., 2001), when we asked in the interview what they knew about test smells, only 14% of the interviewees demonstrated having some knowledge. For example, two interviewees mentioned: "I know a little bit about test smells. If I am not mistaken, there are smells like Test Assertion and Duplicated [...]" (#I5) and "Test smell? From smells? I know the basics" (#I19). We believe that the industry should explore this topic more through the initiatives proposed in academia (Santana et al., 2020).

Some interviewees (46%) associated the test smell term with code smells and related test smell detection to tool usage or personal practices. For example, interviewee #I04 mentioned: "Although I had never heard the term, it makes sense, because I saw everything as a code smell, but there are some strategies, some guidelines that I follow for unit tests.". This behavior may generate disagreement about how the tools work, as the interviewee #I10 said: "One of the outputs of those software that I mentioned, SonarQube and Code Climate, are these test smells. They can find some of them, [...] because we can not publish a project with these types of test structures, tests with commented content, such as empty test, the test with a complexity greater than 1". Conversely, in the SonarQube documentation there is no information about test smell analysis. Thus, we considered that those analyses are related to code smells in test code, which is different from test smell detection.

Test practitioners do not know what a test smell is. They may associate the test smell concept with code smells, but they have no information about test smell types and refactorings.
6.5 RQ5: What practices have practitioners employed to treat test smells?

In general, the interviewees did not know what test smells are. After we explained the concepts during the interview, they could understand and describe how they deal with test smells in their daily activities. They reported adopting a set of project activities (e.g., code review, pair programming, and registering technical debt) and programming practices during test creation and maintenance (e.g., the clean code approach and the Given, When, Then (GWT) and Arrange, Act, Assert (AAA) patterns; a minimal AAA sketch appears later in this subsection) to either prevent or treat test smells.

The interviewees tended to develop unit tests according to their skills, and professional ability also determines the outcome of code review: interviewees who have not learned about test smells or sound programming practices may approve a submitted package containing these issues. Code review was the most reported activity to treat test smells during test creation (38%) and the most common activity performed by the interviewees (78%) during test quality verification. In this activity, one or more practitioners analyze the submitted code, and the reviewer's knowledge determines whether the code is good enough to be merged into the repository. Each team adopts different code review strategies based on the number of reviewers, the number of approvals, and professional experience. Although some interviewees reported that only experienced members review software and test code, the review may not keep test smells out of the project repository, mainly because both experienced and inexperienced practitioners adopt practices that introduce them.

When we asked about test smell treatment during test maintenance, some interviewees reported registering a technical debt to refactor the test smell later (interviewees #I08, #I09, #I22, #I25, and #I50). This behavior may indicate that test smell correction is not a priority, and the accumulated technical debt may also explain why test smells remain in the repository. For example, interviewee #I09 said: "There is nearly no treatment for test smells. [...] when removing a feature from the software or its business rule is changed, the test code is commented and left there. [...] Hardly the developers handle with commented test codes. [...]". The interviewees rarely addressed technical debt and failing tests because they needed to prioritize other tasks, such as developing the software code itself. With less time for testing, test smells are introduced during test creation and maintenance and remain in the repository as maintenance activities are postponed.

We did not know in advance whether the practitioners had learned about test smells, so we adopted the test smell concepts from the literature; validating those concepts was out of scope. Although we did not ask specifically whether the interviewees considered a test smell to be a problem or agreed that the given examples were smells, when answering about test smell treatment, part of them described how they treat at least one of the given examples. For example, interviewee #I07 said: "Despite not having worked exactly with this type of concept, Sonar itself warned us about these two problems, both when the logic was very complex, with a lot of "if," it warned us to break it in different methods, things like type. Moreover, I remember that it identified comments, commented code, and sends a warning".
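Returning to the programming practices mentioned at the beginning of this subsection, the minimal JUnit 4 sketch below shows one way to organize a test following the Arrange, Act, Assert (AAA) structure; it is our own illustration and not code discussed in the interviews.

```java
import static org.junit.Assert.assertEquals;

import java.util.ArrayDeque;
import java.util.Deque;

import org.junit.Test;

// Minimal illustration of the Arrange, Act, Assert (AAA) structure.
public class StackBehaviorTest {

    @Test
    public void popReturnsLastPushedElement() {
        // Arrange: build only the objects this test needs.
        Deque<String> stack = new ArrayDeque<>();
        stack.push("first");
        stack.push("second");

        // Act: perform exactly one action under test.
        String popped = stack.pop();

        // Assert: one focused expectation, with an explanation message.
        assertEquals("the last pushed element should be popped first",
                "second", popped);
    }
}
```

The Given, When, Then (GWT) pattern expresses a similar structure in behavior-driven vocabulary; both keep the fixture local to the test and favor one explicit, well-described expectation.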
Regarding the Conditional Test Logic smell example, #I37 said: "This specific code enters into a specific clean code case. This test may be doing more than it should". According to these comments, the interviewees consider test smells, including the given examples, as structures to be fixed.

Practitioners adopt a set of project activities and programming practices to treat test smells. As they do not know the test smell concepts well, it is impossible to guarantee that those strategies treat test smells appropriately.

7 Threats to validity

Internal validity. Although there are more than 100 test smells, this study considered only 14 of them. However, we selected the test smells most frequently discussed in the literature. In addition, the test smells were presented in the survey as practices. To mitigate ambiguities and improve text comprehension, we ran a pilot with four testers from different companies. We used professional social networks to reach as many respondents as possible, from demographically distributed Brazilian companies, for the survey and interview execution.

External validity. Our survey and interview respondents may not adequately represent the practices adopted by practitioners in the wider software engineering industry. Although our results may not generalize, they provide an initial view of the practices adopted by testers. There is agreement among the practitioners' responses, indicating that additional data might not reveal new insights.

Construct validity. The survey did not state that the questions referred to test smells, so that we could investigate whether the practitioners unintentionally insert them; this prevented bias when the respondents identified the practices they adopt. Complementarily, to investigate how practitioners deal with test smells, we presented the concept to the interviewees who did not know the subject. After learning about test smells, the respondents were interested in finding solutions for this "problem" (test smells). We collected the answers to the open-ended questions and performed a peer-reviewed coding process to avoid biases. The survey and interview instruments were written in Portuguese and translated into English by one author and reviewed by the others.

Conclusion validity. The data analysis was an exhaustive process that depends on the researchers' interpretation of the answers to the open-ended questions. To prevent biases, we performed the data analysis in three steps: i) two researchers analyzed the data in pairs to discuss the identification of codes, ii) two researchers analyzed the data individually, checking whether new codes could emerge, and iii) all researchers discussed and compiled the results from steps i and ii. Additionally, to increase transparency, the raw survey and interview data are available online for other researchers to validate and replicate our study.

8 Related work

Bavota et al. (2015) presented a case study to investigate the impact of test smells on maintenance activities. In that study, developers and students analyzed test code to compare whether their experience would make a difference in test smell identification. As a result, they found that the intensity of the impact of test smells differs across levels of experience; the number of impacting test smells is higher for students than for industry professionals. Additionally, they found that test smells have a significantly negative impact on maintenance activities.
Conversely, our survey found that the practitioners' experience does not interfere with test smell introduction during test creation and execution activities. Moreover, the interviews revealed that the practitioners are not aware of test smells, reinforcing that experience does not influence test smell insertion in the test code.

Tufano et al. (2016) conducted an interview study with 19 participants to investigate developers' perception of test smells. They performed an empirical investigation to analyze where test smells occur in the source code. The results showed that developers generally do not recognize test smells and that test smells are present since the first code commit in the repository. Similarly, our interviews indicated a lack of awareness among developers about the underlying concept of test smells. Additionally, we did not find any study investigating how professional practices affect test smell introduction, and therefore we investigated it through a survey.

Spadini et al. (2020) surveyed developers to evaluate severity thresholds for detecting test smells and to investigate the perceived impact of test smells on test suite maintainability. The developers had to classify whether a test smell instance was valid and rate its importance to maintainability. Evaluating test smell instances requires knowledge about the topic. Therefore, our survey presented practices that might lead to test smell insertion, and our interviews provided information about test smells to level the respondents' knowledge of the topic.

In our previous work (Silva Junior et al., 2020), we conducted an expert survey to understand whether practitioners unintentionally insert test smells. We surveyed sixty Brazilian practitioners regarding fourteen bad practices that might lead to test smell insertion during test code creation and execution. The results indicated that the practitioners' experience might not influence test smell insertion; usually, the practices that lead to it come from improper personal and company standards. The current study complements the previous one by investigating the practitioners' knowledge about test smells and how they deal with test code quality regarding the presence of test smells. We conducted interviews with fifty Brazilian practitioners to ask them about the test code creation and maintenance processes. As a result, the interviewees indicated a set of practices that might be useful to treat test smells. However, as they do not know the test smell concepts, those practices need further investigation regarding test smell treatment.

9 Conclusion

Test smells may decrease test code quality and hinder its maintenance. Our study aimed to identify whether practitioners unintentionally insert test smells in the test code and how they treat them. Therefore, we applied two complementary research methods: a survey and an interview study.

We surveyed sixty respondents to investigate the unintentional insertion of test smells in the test code. They evaluated a set of practices related to test smell insertion. The results indicated that the respondents adopt bad practices that might lead to inserting test smells, and that the adoption of these bad practices is more related to improper company standards than to the respondents' experience with test code development. To investigate how the practitioners treat test smells, we interviewed fifty respondents.
They answered questions on how they prevent and treat test smells during test code development. The results indicated an overall lack of knowledge about test smells; for most of the interviewees, it was their first contact with the subject. However, after we explained one test smell to the respondents, they recognized it in their test code and identified practices they adopt to deal with it. Among the recommended practices, we highlight the adoption of tools, coding patterns, programming practices, code review, and training to improve the developers' skills and expertise.

After analyzing the answers to the survey and the interviews, we identified that practitioners did not know about test smells. Thus, they insert different types of test smells, even the experienced ones. They have tried to treat test smells through some strategies, but as they have not learned about the subject, they keep inserting test smells in their test code, and those strategies may not be enough to avoid it. Both studies are starting points for research that considers practitioners as agents in test smell treatment.

As future work, we aim to follow the Grounded Theory methodology (Corbin and Strauss, 1990) to build a common understanding of how receptive the software industry is to improving test code quality by taking test smells into consideration. We intend to validate the respondents' practices to prevent and treat test smells and to elaborate a checklist for test code quality development and assurance through an in-depth study.

Acknowledgements

We would like to thank the participants in our survey and pilot study. This research was partially funded by INES 2.0, CNPq grants 465614/2014-0 and 408356/2018-9, and FAPESB grants JCB0060/2016 and BOL0188/2020.

References

Bavota, G., Qusef, A., Oliveto, R., Lucia, A., and Binkley, D. (2012). An empirical analysis of the distribution of unit test smells and their impact on software maintenance. In 28th IEEE International Conference on Software Maintenance (ICSM).
Bavota, G., Qusef, A., Oliveto, R., Lucia, A., and Binkley, D. (2015). Are test smells really harmful? An empirical study. Empirical Software Engineering, 20(4).
Corbin, J. and Strauss, A. (2014). Basics of qualitative research: Techniques and procedures for developing grounded theory. Sage Publications.
Corbin, J. M. and Strauss, A. (1990). Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13(1):3–21.
Creswell, J. W. and Clark, V. L. P. (2018). Designing and Conducting Mixed Methods Research. SAGE Publications, third edition.
Fraser, G. and Arcuri, A. (2011). Evosuite: Automatic test suite generation for object-oriented software. In 13th European Conference on Foundations of Software Engineering, ESEC/FSE, New York, NY, USA. ACM.
Garousi, V. and Felderer, M. (2016). Developing, verifying, and maintaining high-quality automated test scripts. IEEE Software, 33(3).
Garousi, V. and Küçük, B. (2018). Smells in software test code: A survey of knowledge in industry and academia. Journal of Systems and Software, 138.
Greiler, M., van Deursen, A., and Storey, M. (2013). Automated detection of test fixture strategies and smells. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation.
Gubrium, J. F., Holstein, J. A., Marvasti, A. B., and McKinney, K. D. (2012). The SAGE Handbook of Interview Research: The Complexity of the Craft. SAGE Publications, 2nd edition.
Junior, N. S., Martins, L., Rocha, L., Costa, H., and Machado, I. (2021). How are test smells treated in the wild? A tale of two empirical studies [Dataset]. Available at: https://doi.org/10.5281/zenodo.4548406.
Kitchenham, B. A., Budgen, D., and Brereton, P. (2015). Evidence-based software engineering and systematic reviews, volume 4. CRC Press.
Melegati, J. and Wang, X. (2020). Case survey studies in software engineering research. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), ESEM '20, New York, NY, USA. ACM.
Meszaros, G., Smith, S. M., and Andrea, J. (2003). The test automation manifesto. In Maurer, F. and Wells, D., editors, Extreme Programming and Agile Methods - XP/Agile Universe 2003. Springer Berlin Heidelberg.
Miles, M. B., Huberman, A. M., and Saldaña, J. (2014). Qualitative Data Analysis. SAGE Publications, fourth edition.
Palomba, F., Di Nucci, D., Panichella, A., Oliveto, R., and De Lucia, A. (2016). On the diffusion of test smells in automatically generated test code: An empirical study. In 9th International Workshop on Search-Based Software Testing. ACM.
Peruma, A. S. A. (2018). What the Smell? An Empirical Investigation on the Distribution and Severity of Test Smells in Open Source Android Applications. PhD Thesis, Rochester Institute of Technology.
Pfleeger, S. L. and Kitchenham, B. A. (2001). Principles of survey research: part 1: turning lemons into lemonade. ACM SIGSOFT Software Engineering Notes, 26(6):16–18.
Santana, R., Martins, L., Rocha, L., Virgínio, T., Cruz, A., Costa, H., and Machado, I. (2020). Raide: A tool for assertion roulette and duplicate assert identification and refactoring. In Proceedings of the 34th Brazilian Symposium on Software Engineering, SBES '20, pages 374–379, New York, NY, USA. Association for Computing Machinery.
Silva Junior, N., Rocha, L., Martins, L. A., and Machado, I. (2020). A survey on test practitioners' awareness of test smells. In Proceedings of the XXIII Iberoamerican Conference on Software Engineering, CIbSE 2020, pages 462–475. Curran Associates.
Singer, J., Sim, S. E., and Lethbridge, T. C. (2008). Software engineering data collection for field studies. In Shull, F., Singer, J., and Sjøberg, D. I. K., editors, Guide to Advanced Empirical Software Engineering, pages 9–34, London. Springer London.
Smeets, N. and Simons, A. J. (2011). Automated unit testing with Randoop, JWalk and µJava versus manual JUnit testing. Research report, Department of Computer Science, University of Sheffield/University of Antwerp, Sheffield, Antwerp.
Spadini, D., Schvarcbacher, M., Oprescu, A.-M., Bruntink, M., and Bacchelli, A. (2020). Investigating severity thresholds for test smells. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR.
Tufano, M., Palomba, F., Bavota, G., Di Penta, M., Oliveto, R., De Lucia, A., and Poshyvanyk, D. (2016). An empirical investigation into the nature of test smells. In 31st International Conference on Automated Software Engineering. IEEE.
Van Deursen, A., Moonen, L., Van Den Bergh, A., and Kok, G. (2001). Refactoring test code. In Proceedings of the 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP).
Van Rompaey, B., Du Bois, B., and Demeyer, S. (2006). Characterizing the relative significance of a test smell.
In 22nd International Conference on Software Maintenance, ICSM'06. IEEE Computer Society.
Virgínio, T., Martins, L., Rocha, L., Santana, R., Cruz, A., Costa, H., and Machado, I. (2020). Jnose: Java test smell detector. In Proceedings of the 34th Brazilian Symposium on Software Engineering, SBES '20, pages 564–569, New York, NY, USA. Association for Computing Machinery.
Virgínio, T., Martins, L. A., Soares, L. R., Santana, R., Costa, H., and Machado, I. (2020). An empirical study of automatically-generated tests from the perspective of test smells. In SBES '20: 34th Brazilian Symposium on Software Engineering, pages 92–96. ACM.
Virgínio, T., Santana, R., Martins, L. A., Soares, L. R., Costa, H., and Machado, I. (2019). On the influence of test smells on test coverage. In Proceedings of the XXXIII Brazilian Symposium on Software Engineering. ACM.
Wiederseiner, C., Jolly, S. A., Garousi, V., and Eskandar, M. M. (2010). An open-source tool for automated generation of black-box xUnit test code and its industrial evaluation. In Bottaci, L. and Fraser, G., editors, Testing – Practice and Research Techniques. Springer Berlin Heidelberg.
Yusifoğlu, V. G., Amannejad, Y., and Can, A. B. (2015). Software test-code engineering: A systematic mapping. Information and Software Technology, 58.

A Appendix A

Block 1: Respondents' Profile

Q1. What is your gender?
Q2. What is your age?
Q3. In which course do you have an academic background?
Q4. What is the highest degree or level of education you have completed?
Q5. In which Brazilian state do you currently work?
Q6. How long have you been working with software testing?
Q7. How long have you been working with software development?
Q8. Which activities do you perform daily?
Q9. What are the platforms of the projects that you have worked on?
Q10. What is the application domain of the last project that you worked on?
Q11. Which test techniques do you execute?
Q12. Are the tests executed more often manually or automatically?
Q13. How do you describe your expertise with coding?

Block 2: Test Creation

Q14. What is the source for creating the test cases for the projects in which you work?
Q15. Is there verification to detect duplicate tests (with the same wording, or with different wording and the same objective)? More than one option could be selected.

Evaluate the following statements according to your daily activities:

Q16. "I usually create test cases using some configuration file (or complementary file) as a backup."
Q17. "When creating a test, I analyze whether it can be executed at the same time as others or whether it should be executed in isolation, due to the availability of external resources."
Q18. "I analyze the possibility of a test failing because it uses a resource that is being used at the same time by another test."
Q19. "I have a habit of creating tests with a high number of parameters (number of files, database records, etc.)."
Q20. "I group different test cases into one (that is, I combine tests that could be run separately)."
Q21. "I create tests that depend on resources that may not have their own tests for validation (e.g., a test that involves retrieving information from the database, but there is no test to validate the database search)."
Q22. "I have already created a test to validate some feature that will not be used in the production environment."
Q23. "I have already created a test with a high value for a specific parameter (e.g., number of records in the database, number of files in a folder) even if that makes it difficult to repeat."
Q24. "I have already created a test with a conditional or repetitive structure."
Q25. "I have already created an empty test, with no executable instructions."
Q26. "I usually create tests using some data from a configuration file."
Q27. "I usually create tests that print or display results redundantly or unnecessarily."
Q28. "I have already created a test considering the existence of a resource, without checking its existence or availability."
Q29. "I have already changed a test after identifying one of the previous points."
Q30. If you answered "always", "frequently" or "rarely" in the previous questions, why were the tests created with these standards?
Q31. If you changed any tests according to the design standards above, why were they edited?
Q32. What problems in the test structure have you encountered?
Q33. What difficulties do you often encounter when creating test cases?

Block 3: Test Execution

Evaluate the following statements according to the frequency found in your daily activities:

Q34. "A test case fails due to unavailability of access to a configuration file."
Q35. "Repeat a test case because it previously failed due to competition with some other test case that was running at the same time."
Q36. "Execute tests that could be executed more quickly when modifying the contents of the configuration file."
Q37. "Run a test without understanding its purpose."
Q38. "Some test fails and it is not possible to identify the cause of the failure."
Q39. "Run a test that depends on an external resource that does not have a test for direct validation."
Q40. "A test case fails due to unavailability of access to any external resource."
Q41. "Run a test with a high value for a specific parameter (e.g., number of records in the database, number of files in a folder) even if it makes it difficult to repeat."
Q42. "Run a test to validate a feature that will not be used in the production environment."
Q43. "Find a duplicate test (with the same or different wording)."
Q44. "Run a test with a conditional or repetitive structure."
Q45. "Find an empty test, with no executable instruction."
Q46. "Run a test that prints or displays results redundantly or unnecessarily."
Q47. "Run a test considering the existence of a resource, without checking its existence or availability."
What difficulties do you usually encounter when running test cases?