Journal of Software Engineering Research and Development, 2021, 9:8, doi: 10.5753/jserd.2021.1893
This work is licensed under a Creative Commons Attribution 4.0 International License.

On the test smells detection: an empirical study on the JNose Test accuracy

Tássio Virgínio [Federal Institute of Tocantins | tassio.virginio@ifto.edu.br]
Luana Martins [Federal University of Bahia | martins.luana@ufba.br]
Railana Santana [Federal University of Bahia | railana.santana@ufba.br]
Adriana Cruz [Federal University of Lavras | adriana.cruz@estudante.ufla.br]
Larissa Rocha [Federal University of Bahia / State Univ. of Feira de Santana | larissa@ecomp.uefs.br]
Heitor Costa [Federal University of Lavras | heitor@ufla.br]
Ivan Machado [Federal University of Bahia | ivan.machado@ufba.br]

Abstract

Several strategies have supported test quality measurement and analysis. For example, code coverage, a widely used one, enables verification of whether the test cases cover as many source code branches as possible. Other affordable strategies to evaluate test code quality also exist, such as test smells analysis. Test smells are poor design choices in test code implementation, and their occurrence might reduce the test suite quality. Practical, large-scale test smells identification depends on automated tool support; otherwise, test smells analysis could become a cost-ineffective strategy. In an earlier study, we proposed the JNose Test, an automated tool to detect test smells and analyze test suite quality from the test smells perspective. This study extends the previous one in two directions: i) we implemented the JNose-Core, an API encompassing the test smells detection rules; through an extensible architecture, the tool is now capable of accommodating new detection rules or programming languages; and ii) we performed an empirical study to evaluate the JNose Test effectiveness and compare it against the state-of-the-art tool, tsDetect. Results showed that the JNose-Core precision score ranges from 91% to 100%, and the recall score from 89% to 100%. It also presented a slight improvement over tsDetect in the test smells detection rules at the class level.

Keywords: Tests Quality, Test Evolution, Test Smells, Evidence-based Software Engineering

1 Introduction

Ensuring end-user satisfaction, detecting software defects before go-live, and increasing software or product quality are among the most commonly reported software testing objectives, according to the annual report of a global consulting firm (Capgemini, 2018). Recently published reports estimate the impact of poor software quality on the United States economy at over $2 trillion, referencing publicly available source material for the year 2020 (CISQ, 2021).

Such data illustrates the need for employing software testing techniques in software development processes, as they could anticipate bug identification and fixing, thus reducing their likely effects during implementation (or even when existing functionalities are under evolution) (Palomba et al., 2018; Spadini et al., 2018; Grano et al., 2019). In a well-defined Software Engineering process, test code should co-evolve with production code, as high-quality test code is essential to ease the maintenance and evolution of production and test code (Yusifoğlu et al., 2015; Guerra Calle et al., 2019).
However, it might be time-consuming and cost-ineffective (Yusifoğlu et al., 2015; Guerra Calle et al., 2019).

Several approaches have been proposed in the literature to assess the quality of test suites. For example, code coverage measurement has been widely used to check the quality of automated tests. It measures the test suite quality based on how much a test covers structural elements, such as functions, instructions, branches, and lines of code (Gopinath et al., 2014). Nonetheless, even with high code coverage, the test code might encompass poor design choices in its implementation, the so-called test smells.

The presence of smells in test code may reduce the quality of test suites and, consequently, the production code quality (Deursen et al., 2001). Additionally, poorly-written tests can be challenging to comprehend and onerous for testers to maintain and to use for detecting faults (Bavota et al., 2015; Grano et al., 2019).

The software testing literature has introduced a set of tools focused on validating the quality of test suites, mainly through metrics analysis. For example, CodeCover (available at https://codecover.org) is an open-source Java tool for code coverage executed via a graphical user interface (with the Eclipse IDE) and the command line; tsDetect (available at https://testsmells.github.io) is a command-line tool for test smells detection. Other tools use code coverage results to predict test smells, such as TeReDetect (Negar and Garousi, 2010) and TeCReVis (Koochakzadeh and Garousi, 2010). Generally, these tools produce many different data outputs, which might make it hard for testers to establish a relationship between code coverage and internal test code quality. Moreover, several types of test smells have not yet been investigated in conjunction with code coverage, but could also provide opportunities to improve test code quality.

In previous studies (Virginio et al., 2019, 2020), we introduced the JNose Test, a tool to analyze the quality of test suites from the test smells perspective. The JNose Test provides an automated test strategy focused on (i) identifying possible test design flaws, (ii) analyzing the software project quality evolution, and (iii) reducing the effort for performing quality assurance of a test suite. The JNose Test integrates a conceptual framework which encompasses strategies for test smells prevention, identification, refactoring, and visualization to improve the test code quality. The RAIDE (Santana et al., 2020) and TSVizzEvolution tools are part of this framework (available at https://raideplugin.github.io and https://github.com/arieslab/TSVizzEvolution, respectively).

In this study, we propose the JNose-Core, an API (Application Programming Interface) to detect test smells in the test code. It provides a flexible architecture to support the insertion of new test smells detection rules. The JNose Test implements the interface methods the JNose-Core provides and organizes the data flow in a web-based user interface.
In this new version, our tool: i) detects test smells at different code granularities (line, method, block, and class); ii) detects test smells more accurately according to the literature definitions; and iii) presents the outputs in a more user-friendly interface.

Additionally, we extended our previous work by validating the test smells detection rules implemented in the JNose Test tool. We conducted an empirical evaluation to investigate two objectives: (i) verify the JNose Test accuracy compared with tsDetect in terms of precision and recall at the class level, and (ii) verify the JNose Test accuracy compared with manual analysis in terms of precision and recall at a fine-grained level. The results show that at the test class level, the JNose Test obtained slightly better results than tsDetect for specific types of test smells, such as Assertion Roulette, Lazy Test, and Eager Test. When analyzing the test smells at a fine-grained level, our tool shows high accuracy in detecting the test smells location.

The remainder of this paper is structured as follows. Section 2 introduces the test smells concept and types. Section 3 presents an overview of the JNose-Core API. Section 4 presents the JNose Test, a web application for test smells detection. Section 5 describes the empirical study to evaluate the JNose Test accuracy. Section 6 presents the results. Section 7 discusses related work. Section 8 presents the threats to the validity of our study. Finally, Section 9 draws concluding remarks.

2 Background

Test code development is not a trivial task (Palomba et al., 2018; Virginio et al., 2019). In real-world practice, developers are likely to use anti-patterns during test development (Bavota et al., 2012; Junior et al., 2020). Those anti-patterns may negatively impact the test code quality and maintenance and reduce its capability for detecting software faults (Bell et al., 2018; Spadini et al., 2020).

Several studies have investigated different types of test smells. Initially, Deursen et al. (2001) defined a catalog of 11 test smells and refactorings to remove them from the test code. Next, several authors extended this catalog and analyzed the test smells effects on the production and test code (Meszaros et al., 2003; Bavota et al., 2012; Greiler et al., 2013; Bavota et al., 2015; Bell et al., 2018; Virginio et al., 2019; Spadini et al., 2020). As a result of the researchers' efforts to identify anti-patterns, Garousi and Küçük (2018) listed more than 190 test smells in a literature review.

In this study, we selected twenty-one types of test smells currently discussed in the literature (Peruma et al., 2019):

• Assertion Roulette (AR). It occurs when a test method contains non-documented assertions. If an assertion fails, it can be difficult to identify which one failed;
• Conditional Test Logic (CTL). It occurs when a test method contains conditional expressions or loop structures. Conditions within the test method may alter its behavior, which leads the test to fail;
• Constructor Initialization (CI). It occurs when a test class contains a constructor;
• Default Test (DT). It occurs when a test class is created by default;
• Dependent Test (DepT). It occurs when the test being executed depends on other tests' success;
• Duplicate Assert (DA).
It occurs when a test method tests for the same condition multiple times within the same test method;
• Eager Test (ET). It occurs when a test method checks more than one method of the production class;
• Empty Test (EpT). It occurs when a test method does not contain executable statements;
• Exception Catching Throwing (ECT). It occurs when a test method is explicitly dependent on the production method throwing an exception;
• General Fixture (GF). It occurs when the test methods only access part of the test case fixture (setup method);
• Ignored Test (IgT). It occurs when a test method is suppressed from running;
• Lazy Test (LT). It occurs when several test methods check the same production method;
• Magic Number Test (MNT). It occurs when assert statements contain numeric literals;
• Mystery Guest (MG). It occurs when a test method utilizes external resources (e.g., a file containing test data), and thus it is not self-contained;
• Print Statement (PS). It occurs when unit tests contain print statements;
• Redundant Assertion (RA). It occurs when the test method contains an assertion statement that is always true or always false;
• Resource Optimism (RO). It occurs when a test method makes optimistic assumptions about the existence and state of external resources;
• Sensitive Equality (SE). It occurs in test methods that contain an equality check using a toString() method. The test may fail when the toString() method is changed;
• Sleepy Test (ST). It occurs when the execution of a test method is paused for a certain period (e.g., to simulate an external event) and then continues its execution;
• Unknown Test (UT). It occurs when a test method does not encompass an assertion statement;
• Verbose Test (VT). It occurs when the tests use too much code to do what they are supposed to do. In other words, the test code is not clean and simple.
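To make these definitions more concrete, the hypothetical JUnit 4 test method below (invented for illustration; it does not come from any of the projects studied in this paper) exhibits several of the smells listed above at once, as indicated by the comments:

import static org.junit.Assert.assertEquals;

import java.io.File;
import java.io.FileInputStream;
import java.util.Properties;

import org.junit.Test;

public class PropertiesLoaderTest {

    @Test
    public void testLoadConfiguration() throws Exception {
        // Mystery Guest / Resource Optimism: the test depends on an external
        // file and assumes it exists without checking its state.
        File file = new File("config.properties");
        Properties props = new Properties();
        props.load(new FileInputStream(file));

        // Conditional Test Logic: a condition alters the test behavior.
        if (props.containsKey("timeout")) {
            // Print Statement: debugging output left in the test code.
            System.out.println("timeout is set");
        }

        // Assertion Roulette: two assertions without explanation messages.
        // Magic Number Test: a numeric literal used as the expected value.
        assertEquals(3, props.size());
        assertEquals("42", props.getProperty("timeout"));
    }
}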
3 JNose Core

In our previous work (Virginio et al., 2020), we introduced the first version of the JNose Test, a web application for the detection of test smells and calculation of coverage. We reused and also expanded the test smells detection rules from tsDetect (Peruma et al., 2020). Therefore, the JNose Test provides: (i) a graphical interface to facilitate the interaction between user and tool, (ii) the amount and location of the detected test smells, and (iii) support for the test smells analysis through several project versions.

When improving the detection rules from tsDetect, we faced some challenges regarding the coupling and dependency between the test framework and the test code. The test frameworks, specifically the JUnit framework (a Java library for testing source code, which has advanced to the de-facto standard in unit testing; available at https://junit.org/), require different implementations depending on the version used. For example, JUnit 4 uses the @Ignore tag to disable a test class or test method, while JUnit 5 uses the @Disabled tag. Regarding the assertions, JUnit 4 accepts an optional error message parameter as the first argument, whereas JUnit 5 takes it as the last argument in the method signature.

Therefore, to facilitate the expansion of the detection rules and their reuse by other tools, we implemented the JNose-Core API (available at https://github.com/arieslab/jnose-core). It is beneficial for the conceptual framework we are working on to evaluate the test code quality. The detection module is the framework base; the test smells it detects are the same ones that should be removed by the refactoring module (RAIDE tool) and presented to the user by the visualization module (TSVizzEvolution).

3.1 Architecture

We designed the JNose-Core as a Maven project (Maven is a software project management and comprehension tool that can manage a project's build, reporting, and documentation from a central piece of information; available at https://maven.apache.org/) to simplify and standardize the build process. Additionally, we provide a compiled version of the JNose-Core that can be imported by other projects built with Maven. The only requirement to use the compiled version is to import the library in the project's pom.xml, as Listing 1 shows. As a result, the JNose-Core detection methods become available for the importing project to instantiate and use. The JNose-Core is licensed under the GNU General Public License, and its architecture comprises four packages, as follows (Figure 1):

• core. It implements the JNoseCore, a facade class that receives an instance of the Config interface. The Config interface contains the method signatures for the test smells detection;
• detector. It implements a structure to detect the smelly elements and contains classes to support test code static analysis through an AST (Abstract Syntax Tree) generated by JavaParser (available at https://javaparser.org/);
• smell. It implements the detection rules for JUnit 4 and improves the detection rules from tsDetect (Section 2) to identify test smells at different granularity levels. One class is implemented for each type of test smell, and each uses JavaParser to collect additional information on the location and number of test smells;
• dto (data transfer object). It implements the classes responsible for transferring data among the packages.

Figure 1. JNose-Core API internal architecture

<dependency>
    <groupId>br.ufba.jnose</groupId>
    <artifactId>jnose-core</artifactId>
    <version>0.7-SNAPSHOT</version>
</dependency>
Listing 1: pom.xml configuration to use JNose-Core
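To give a flavor of how a detection rule in the smell package can be built on top of the AST that the detector package obtains through JavaParser, the following is a minimal sketch of a line-level check in the spirit of the Assertion Roulette rule described in Section 3.2 (Table 1). The class name, the hard-coded list of assertion methods, and the way results are returned are illustrative assumptions and do not correspond to the actual JNose-Core classes:

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.expr.MethodCallExpr;

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AssertionRouletteSketch {

    // JUnit 4 assertions whose optional explanation message is the first argument.
    private static final Set<String> ASSERTIONS = Set.of(
            "assertEquals", "assertTrue", "assertFalse", "assertNull",
            "assertNotNull", "assertSame", "assertNotSame", "assertArrayEquals");

    /** Returns the line numbers of assertion calls that carry no message argument. */
    public static List<Integer> detect(Path testClass) throws Exception {
        CompilationUnit cu = StaticJavaParser.parse(testClass);
        List<Integer> smellyLines = new ArrayList<>();
        for (MethodCallExpr call : cu.findAll(MethodCallExpr.class)) {
            if (ASSERTIONS.contains(call.getNameAsString())
                    && !hasMessageArgument(call)) {
                call.getBegin().ifPresent(pos -> smellyLines.add(pos.line));
            }
        }
        return smellyLines;
    }

    // Simplified heuristic: in JUnit 4 the optional message is a leading String literal.
    private static boolean hasMessageArgument(MethodCallExpr call) {
        return !call.getArguments().isEmpty()
                && call.getArgument(0).isStringLiteralExpr();
    }
}

A production-quality rule would additionally need to resolve overloaded assertions and the JUnit version in use, which is precisely the kind of refinement discussed in Section 3.2.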
3.2 Detection Rules

We revisited the test smells definitions in the literature to identify how we should improve the detection rules from tsDetect. Table 1 shows the granularity levels that we defined to detect the exact test smells location in the test code, as follows: (i) line, test smells that occur in a specific line; (ii) block, test smells that occur at a statement block level, e.g., try/catch and conditional statements; (iii) method, test smells that occur at the method level; and (iv) class, test smells that occur at the test class level.

Table 1. Test Smells detection rules.
Name | Detection Rule | Granularity
Assertion Roulette | A line with assertion statements without the explanation/message parameter | Line
Constructor Initialization | A method that is a constructor declaration | Method
Conditional Test Logic | A code block with conditional statements | Block
Duplicate Assert | A line with an assertion whose parameters equal those of another assertion inside the same test method | Line
Default Test | A method called ExampleUnitTest() or ExampleInstrumentedTest() | Method
Dependent Test | A method that depends on the previous execution of another test method | Method
Empty Test | A method that does not contain a single executable statement | Method
Eager Test | A line that contains a call to another production method | Line
Exception Catching Throwing | A block that contains either a throw statement or a catch clause | Block
General Fixture | A line with a field instantiated within the setUp() method that is not utilized by all test methods | Line
Ignored Test | A method that contains the @Ignore annotation | Method
Lazy Test | A line that calls the same production method called by another test method | Line
Mystery Guest | A method that accesses object instances of file and database classes | Method
Magic Number Test | A line with an assertion method that contains a numeric literal as an argument | Line
Print Statement | A line that invokes the print(), println(), printf(), or write() method of the System class | Line
Redundant Assertion | A line containing an assertion statement in which the expected and actual parameters are the same | Line
Resource Optimism | A method that uses an external resource without checking the state of the object | Method
Sensitive Equality | A method that contains an assertion that invokes the toString() method of an object | Method
Sleepy Test | A line that invokes the Thread.sleep() method | Line
Unknown Test | A method that uses the @Test annotation but does not contain an assertion statement | Method
Verbose Test | A method with more than 30 lines, counting non-executable statements and annotations | Method

Additionally, we made improvements in the test smells detection rules. We next detail the main modifications we performed:

• Nested Structures. We improved the rules for detecting the CTL, ECT, and MNT test smells to consider nested structures. When the tool reports a nested conditional structure as one test smell, it might be hard to identify which part of the test code needs refactoring at first glance. If the nested conditional is too long, the user may refactor only parts of it; when rerunning the tool, the user will see that the problem is still there, making the refactoring process longer. Therefore, the tool now reports one test smell for each structure;
• Empty or Non-assertive. The UT and EpT test smells have similar definitions. The UT test smell identifies methods without assertions, and the EpT test smell identifies methods without executable statements. Test methods without a body contain neither executable statements nor assertions. Therefore, we added another rule to separate both definitions: the UT test smell identifies methods that contain a body but no assertions;
• General Fixture. The GF test smell occurs when test methods use only part of the setup method, which reflects the cohesion among the test class's methods. Therefore, we improved the detection rules to show whether the setup fixtures are used by all the test class methods, which allows the user to identify the test method to which a fixture should be moved;
• Missing Structures. Each version of the test framework requires the static analysis of different code structures. The assert structures used in JUnit 3 are different from those in JUnit 4, which in turn differ from those in JUnit 5.
Therefore, to improve the detection rules for JUnit 4, we added the code structures that were missing to detect the CTL, AR, DA, and ECT test smells;
• Methods Overload. Similar to the preceding item, there are differences among the JUnit versions regarding overloaded methods. When analyzing test cases written with JUnit 3, we were not concerned about overloaded methods. However, to focus the current detection rules on JUnit 4, we needed to improve the AR and DA test smells to support the overloaded methods.
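The version differences that motivate these adjustments are small but relevant for static analysis. The snippet below contrasts the JUnit 4 and JUnit 5 forms mentioned in this section (the two classes are illustrative placeholders, shown here as two separate files); a single set of purely syntactic rules could not correctly handle both versions:

// --- JUnit4Style.java ---
// JUnit 4: the optional explanation message is the FIRST argument,
// and @Ignore suppresses a test.
import org.junit.Assert;
import org.junit.Ignore;
import org.junit.Test;

public class JUnit4Style {

    @Test
    public void totalIsComputed() {
        Assert.assertEquals("total should match", 42, 40 + 2);
    }

    @Ignore
    @Test
    public void skippedTest() { }
}

// --- JUnit5Style.java ---
// JUnit 5: the optional explanation message is the LAST argument,
// and @Disabled suppresses a test.
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

class JUnit5Style {

    @Test
    void totalIsComputed() {
        Assertions.assertEquals(42, 40 + 2, "total should match");
    }

    @Disabled
    @Test
    void skippedTest() { }
}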
4 JNose Test

The JNose Test (available at https://jnosetest.github.io) enables test code quality analysis through test smells detection and code coverage over several software project versions. Therefore, it is possible to compare whether a project's test quality has either improved or declined throughout its life cycle. The JNose Test operation involves three key processes (Figure 2): (i) Data Input, which receives the settings for the tool execution, i.e., the list of types of test smells, the analysis mode (By TestClass, By TestSmell, By TestFile, and Evolution), and the project to be analyzed; (ii) Project Analysis, which calls the JNose-Core API to perform the project analysis according to the selected analysis mode; and (iii) Data Output, which shows the execution status and the analysis results.

Figure 2. Schematic overview of the JNose Test tool and its main features

4.1 Processes Description

Java Development Kit (JDK) 11 and Maven 3 (or later) are necessary to install the JNose Test. Upon installation, the user is able to use Jetty (embedded in Maven) to build and run the JNose Test.

After starting the tool, the user must configure the Data Input (Figure 2). First, the user should import the projects to be analyzed (Figure 3a - Step 1). The JNose Test clones the repository directly from GitHub and allows the user to manage it (Figure 3a - Step 2). Second, the user selects the analysis mode, i.e., By TestClass, By TestSmells, By TestFile, or Evolution (Figure 3a - Step 3). Each analysis mode provides a menu where the user chooses the repositories to be analyzed. By default, the tool detects twenty-one types of test smells, but the user can configure this feature as well (Figure 3a - Step 4).

After completing the project import and defining the detection settings, the tool starts the Project Analysis (Figure 2). For each analysis mode, the JNose Test presents an interface with (i) a list of cloned projects (Figure 3b - Step 1), (ii) a menu with settings specific to the analysis mode (Figure 3b - Step 2), and (iii) a menu with the data output options (Figure 3b - Step 3). The Project Analysis considers the analysis mode selected by the user, described below.

(1) By TestClass. In the Data Input process, the user can enable the coverage metrics calculation and select the projects to be analyzed. Then, to analyze the project by test class, the Project Analysis calls the JNose-Core and optionally executes the Code Coverage module. Finally, the Data Output process generates a view that contains a table with the number of test smells by test class. That table presents a row for each test class, and each column represents a type of collected parameter: project name, test class and production class location, twenty-one columns for the types of test smells, the number of test class lines, the number of test methods, and five columns with coverage data. The table can be downloaded as a .csv file. Additionally, the user can view a chart, or download it as a .png file, with the amount of each test smell in the project.

(2) By TestSmells. The Project Analysis process only calls the JNose-Core to analyze the project by test smell. During the Data Input process, the user needs to select the projects to explore. Unlike the previous analysis, By TestSmells provides the exact location of each test smell. Lastly, the Data Output offers a view with the data analysis results, which can also be downloaded as a .csv file. Each row of the table represents a test smell, and it has five columns with the collected parameters: the project name, the test class location, the production class location, the test smell name, and the test smell location.

(3) By TestFile. The Project Analysis process only calls the JNose-Core to analyze the project by test file. During the Data Input process, the user should select a test class and, optionally, its respective production class. Although the production class selection is optional, the Eager Test and Lazy Test smells are not detected without it. Then, the Data Output provides a view containing a row for each detected test smell and its location.

(4) Evolution. The Project Analysis process executes the Git Mining module and the JNose-Core to analyze the project by version. During the Data Input, the user should select the projects to explore and the type of search to be applied (by commits or by tags). This analysis provides the test smell detection for each project version, in addition to data about the author who committed the test smell. The Data Output process provides a view containing the data analysis results by test smells, downloadable as a .csv file. The table rows represent the test classes by commit. The columns encompass the following parameters: project name, test class and production class location, number of test smells, commit identification, authorship, date, and message. Additionally, the user can view a chart, and download it as a .png file, with the amount of test smells in each project version or the number of test smells committed by each author. The tool also automatically calculates the authorship of a test smell by guilt, i.e., the tester who last modified the method and did not fix it.

Figure 3. JNose Test - process execution: (a) Data Input: cloning projects from GitHub; (b) Project Analysis: configuring the By TestClass analysis mode; (c) Data Output: an excerpt of the table with the By TestClass results

Different analysis modes allow different data visualizations. Therefore, the Data Output generates tables or charts depending on the analysis mode. Tables are generated for all analysis modes (Figure 3c). Charts are generated for By TestClass and Evolution. By TestClass charts present the total amount of test smells inserted in a project, and Evolution charts present the amount of test smells by project version or by author.
4.2 Tool Architecture

The JNose Test is implemented as a Java project and comprises five packages, as Figure 4 shows: (i) base, responsible for instantiating the JNose-Core interface implementation and calculating the coverage metrics; (ii) page, responsible for presenting the web pages and their content; (iii) dtolocal, responsible for encompassing the classes used in dto; (iv) entity, responsible for the persistence of the domain objects in the database; and (v) business, responsible for applying the business rules to present the results.

Figure 4. Packages of the JNose Test

The base package implements the Project Analysis (Figure 3a), which was split into three other packages, as follows:

• Coverage. It applies the rules necessary to calculate coverage. It runs the JaCoCo library (available at https://www.eclemma.org/jacoco/) to calculate code coverage for the Java language. It performs dynamic analysis of the production code branches (BC), instructions (IC), lines (LC), complexity (CC), and methods (MC) to determine which ones are either missed or covered by the tests (Virginio et al., 2019);
• Git Mining. It applies the business rules for GitHub mining. It uses the GitHub API for Java library (available at https://github-api.kohsuke.org/) to clone the projects from GitHub and extract information about the project's tags, commits, and authors;
• JNose-Core. It performs test code static analysis through an AST generated by JavaParser (available at https://javaparser.org/). Then, it extracts information about the code structure to apply the rules for the test smells detection, and it collects additional information about the location and number of test smells. The detection rules were improved from the tsDetect tool (Section 2) to identify test smells at different granularity levels (Table 1).

The JNose Test interface was implemented in the page package based on Apache Wicket (available at https://wicket.apache.org/), a framework for web application development in Java. We also used HTML5 and CSS3 to develop the web pages. This package implements the Data Input (Figure 2). The business package implements utility classes responsible for generating the results; it is possible to generate a different type of report for each analysis mode. This package implements the Data Output (Figure 2). In the dto package, we have the classes used to transfer data among the project layers. That package implements the communication among Data Input, Project Analysis, and Data Output (Figure 2). Additionally, a local database stores the data generated by those processes, with the persistence rules implemented in the entity package.

The JNose Test execution uses parallel processes, i.e., the tool creates threads for each uploaded project, for each test class, and so on. With parallel processing, the JNose Test can be used to analyze a massive set of projects in a short time (Virginio et al., 2019).
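As an illustration of the kind of information the Git Mining module can extract through the GitHub API for Java library cited above, the sketch below lists the commits that touched a given test file together with their authors. It is not the module's actual code: the repository coordinates, the file path, and the printed fields are assumptions made for this example:

import org.kohsuke.github.GHCommit;
import org.kohsuke.github.GHRepository;
import org.kohsuke.github.GitHub;

public class GitMiningSketch {

    public static void main(String[] args) throws Exception {
        // Anonymous access is enough for public repositories such as commons-io.
        GitHub github = GitHub.connectAnonymously();
        GHRepository repo = github.getRepository("apache/commons-io");

        // Walk the history of a single test file and print commit metadata,
        // similar to what the Evolution analysis stores for authorship.
        String testFile =
                "src/test/java/org/apache/commons/io/output/ProxyCollectionWriterTest.java";
        for (GHCommit commit : repo.queryCommits().path(testFile).list()) {
            System.out.printf("%s | %s | %s%n",
                    commit.getSHA1(),
                    commit.getCommitShortInfo().getAuthor().getName(),
                    commit.getCommitShortInfo().getMessage());
        }
    }
}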
4.3 Running Example

In previous work, we carried out an experimental study to verify the correlation between coverage metrics and test smells. We selected eleven software projects to perform that study, in which we collected twenty-one test smells and five coverage metrics using the JNose Test. This section presents an example considering the different analysis modes supported by the JNose Test. We used the commons-io project (release 2.7-RC1, available at https://github.com/apache/commons-io), a library of utilities to assist with I/O development. We next discuss each supported analysis mode.

4.3.1 By TestClass Analysis

We ran the JNose Test By TestClass to analyze which types of test smells achieve the highest diffusion over the commons-io project. Therefore, we took the following steps: (i) select all types of test smells; (ii) select the project path; and (iii) enable code coverage. The tool returned 58 test classes. We checked the number of classes in which each test smell was present to understand the test smell type diffusion. For example, the ECT test smell was present in 23 classes, followed by the AR test smell in 17 test classes and the ET test smell in 16 test classes. Each type of test smell can occur many times in a test class. Those three types of test smell presented the highest occurrence in the project, counting 316, 175, and 157 instances, respectively.

Table 2 shows the five test classes with the highest number of ECT, AR, and ET test smells. For example, the test class ProxyCollectionWriterTest contains the highest number of those test smells. Additionally, most test classes achieved good code coverage when considering the IC, LC, and MC coverage metrics (>70%). Therefore, even with high coverage, the test code might present low quality.

Table 2. Classes with high diffusion of test smells - By TestClass
TestFileName | ... | LOC | Met | UT | IgT | RO | ... | ST | LT | DA | ET | AR | CTL | CI | DT | EpT | ECT | GF | MG | PS | DpT | IC | BC | LC | CC | MC
ProxyCollectionTest | ... | 448 | 23 | 1 | 0 | 0 | ... | 0 | 61 | 1 | 23 | 21 | 1 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 72 | 0 | 76 | 100 | 100
TreWriterTest | ... | 448 | 23 | 1 | 0 | 0 | ... | 0 | 30 | 1 | 2 | 21 | 1 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 100 | 0 | 100 | 100 | 100
ProxyWriteTest | ... | 275 | 21 | 3 | 0 | 0 | ... | 0 | 23 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 83 | 0 | 87 | 93 | 93
BoundedReaderTest | ... | 246 | 22 | 1 | 1 | 1 | ... | 0 | 48 | 1 | 8 | 3 | 2 | 0 | 0 | 0 | 16 | 0 | 1 | 0 | 0 | 100 | 100 | 100 | 100 | 100
EndianUtilsTest | ... | 316 | 22 | 1 | 0 | 0 | ... | 0 | 46 | 8 | 20 | 15 | 1 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 100 | 100 | 100 | 100 | 100

4.3.2 By TestSmell

Once we found that the ECT, AR, and ET test smells had the highest diffusion numbers in the commons-io project test classes, we may improve the test code quality by fixing those problems. Then, we executed the JNose Test By TestSmell by taking the following steps: (i) select the ECT, AR, and ET test smells; and (ii) select the project. Table 3 shows an excerpt of the results filtered by the ProxyCollectionWriterTest test class.

Table 3. Test Smells location in ProxyCollectionWriterTest
TestFileName | ... | TestSmell | MethodLocationName | Lines
ProxyCollectionTest | ... | AR | testArrayIOExceptionOnAppendChar1 | 50, 51
ProxyCollectionTest | ... | AR | testArrayIOExceptionOnAppendChar2 | 66, 67
ProxyCollectionTest | ... | AR | testArrayIOExceptionOnAppendCharSe | 82, 83
ProxyCollectionTest | ... | ET | testArrayIOExceptionOnAppendChar1 | 50, 51
ProxyCollectionTest | ... | ET | testArrayIOExceptionOnAppendChar2 | 66, 67
ProxyCollectionTest | ... | ET | testArrayIOExceptionOnAppendCharSe | 82, 83
ProxyCollectionTest | ... | ECT | testArrayIOExceptionOnAppendChar1 | 45-52
ProxyCollectionTest | ... | ECT | testArrayIOExceptionOnAppendChar2 | 61-69
ProxyCollectionTest | ... | ECT | testArrayIOExceptionOnAppendCharSe | 77-84

4.3.3 By TestFile

In the previous example (By TestSmells), we filtered the results to present only the ones related to the ProxyCollectionWriterTest test class. In the By TestFile analysis, that class can be analyzed individually. Therefore, we executed the JNose Test by taking the following steps: (i) select the ECT, AR, and ET test smells; and (ii) select the ProxyCollectionWriterTest and ProxyCollectionWriter files. The results are the same as those of the filter presented in Table 3.

Listing 2 shows the ProxyCollectionWriterTest test class with the testArrayIOExceptionOnAppendChar1() test method (lines 39-53). We observed that the assertEquals() method is called twice within the test method (lines 50-51). Each one checks a different condition, but there is no explanation message for either. Thus, if the test method fails, there is no clue to identify which assertion caused the failure. That issue refers to the AR test smell. Moreover, those assertions are also related to the ECT test smell because they may fail when a specific exception occurs. Furthermore, a test method is supposed to check just one production class method; otherwise, the code has an ET test smell (ProxyCollectionWriter() on line 43 and append() on line 46).

37 public class ProxyCollectionWriterTest {
38
39     @Test
40     public void testArrayIOExceptionOnAppendChar1() throws IOException {
41         final Writer badW = new BrokenWriter();
42         final StringWriter goodW = mock(StringWriter.class);
43         final ProxyCollectionWriter tw = new ProxyCollectionWriter(badW, goodW, null);
44         final char data = 'A';
45         try {
46             tw.append(data);
47             fail("Expected " + IOException.class.getName());
48         } catch (final IOExceptionList e) {
49             verify(goodW).append(data);
50             assertEquals(1, e.getCauseList().size());
51             assertEquals(0, e.getCause(0, IOIndexedException.class).getIndex());
52         }
53     }
Listing 2: ProxyCollectionWriterTest test class

4.3.4 Evolution Analysis

The evolution analysis might help us identify whether the commons-io project has improved over time. We should take the following steps to perform this analysis: (i) select all test smells,
(ii) select the analysis by commit, and (iii) select the project path. The project has 2,337 commits, 52 releases, and 56 contributors from its beginning until release 2.7-RC1. We filtered the results for the five test classes with more ECT, ET, and AR test smells (Table 4). Figure 5 shows the evolution of those classes and of the project. The ProxyCollectionWriterTest, TreWriterTest, and ProxyWriterTest test classes are stable, as no test smell was either inserted or fixed. However, the BoundedReaderTest test class presented novel test smells during 2014-2016 and had them fixed during 2016-2020. We could observe that the number of test smells increased over time, which might indicate that the people involved in the project test suite development have not worked to get rid of test smells yet. In addition, authorship is calculated by fault, so the authors in this example might not have inserted all the detected test smells.

Table 4. Classes with high diffusion of test smells - Evolution
TestFileName | ... | TestSmell | CommitID | CommitName | CommitDate
ProxyCollectionWrite | ... | 153 | b739ce7c | Adam Retter | 03:39:47 2020
ProxyCollectionWrite | ... | 153 | bcb36041 | David Georg | 00:09:03 2018
TreWriterTest | ... | 101 | b739ce7c | Adam Retter | 03:39:47 2020
TreWriterTest | ... | 101 | bcb36041 | David Georg | 00:09:03 2018
ProxyWriteTest | ... | 59 | b739cc7c | Adam Retter | 03:39:47 2020
ProxyWriteTest | ... | 59 | bcb36041 | David Georg | 00:09:03 2018
BoundedReaderTest | ... | 92 | b739ce7c | Adam Retter | 03:39:47 2020
BoundedReaderTest | ... | 96 | 51f13c84 | Kristian Rose | 15:36:15 2016
BoundedReaderTest | ... | 83 | 9a9b8385 | Gary D. Greg | 01:17:05 2014
EndianUtilsTest | ... | 118 | b739ce7c | Adam Retter | 03:39:47 2020
EndianUtilsTest | ... | 117 | 8940848G | Gary D. Greg | 18:47:06 2018

Figure 5. Evolution of the commons-io project and classes with high diffusion of test smells
5 Empirical Evaluation

This empirical evaluation aims to investigate the JNose Test accuracy in detecting test smells. We designed the empirical study in four steps, as Figure 6 shows: (i) Dataset Selection, in which we defined the test classes to analyze; (ii) Oracle Definition, in which we manually detected the test smells instances; (iii) Data Collection, in which we applied the JNose Test and tsDetect to collect the test smells instances; and (iv) Data Analysis, in which we analyzed the collected data to investigate our objectives.

Figure 6. Steps to conduct the experiment

5.1 Dataset Selection

For this analysis, we used the dataset made available by Peruma et al. (2020), which contains 65 test classes extracted from GitHub projects. As we initially reused the JNose Test detection rules from tsDetect, we decided to use the same dataset they used, to perform a fair comparison between both tools and assess the JNose Test effectiveness. To build the dataset, Peruma et al. (2020) selected Android apps that were neither duplicated nor forked. Upon the smells identification in a test file, they randomly selected 65 test classes from the selected projects and followed the definitions to detect the test smells. Although tsDetect implements detection rules for twenty-one types of test smells, only nineteen were validated; it did not detect the DT and DpT test smells. The same limitation applies to our study.

Since the authors did not have access to the results of the manual detection performed by Peruma et al. (2020), we created a new oracle using the same test and production classes for this study. Even if we had access to the Peruma et al. (2020) manual detection results, we would still have to detect the test smells at a fine-grained level to validate the JNose Test. The reason for this is that the JNose Test detects the test smells exact location, rather than just their presence (as tsDetect does).

5.2 Oracle Definition

To manually detect the test smells instances, we followed a design that was not fully crossed to assign coders to the subjects, i.e., different subjects are analyzed by different subsets of coders (Hallgren, 2012). The subjects are the 65 test classes, and four authors of this study served as coders. The coders are experts in test smells with at least three years of experience. Additionally, their Java programming experience ranged from 4 to 15 years, including unit test development.

We organized the coders into two groups of two coders each, where one group analyzed 32 test classes and the other group 33 test classes. Two coders individually analyzed each test class. They collected data regarding the test smells type and location, following the definitions from Table 1. As a result, each coder generated a document with all the test smells detected. Subsequently, the coders compiled the individual records into one document after discussing the divergences.

The review process of the manually detected test smells was time- and effort-consuming (~60 minutes). The final oracle version supports the detection of eighteen types of test smells.
In addition to the non-existence of the DpT and DT test smells in the dataset, previously reported by Peruma et al. (2020), we did not detect any IgT test smell instances. The analysis process of the test classes and the discussion about the classification divergences took about 60 hours.

5.3 Data Collection

Data collection consisted of analyzing the 65 test classes in two different ways: detection with tsDetect and detection with the JNose Test tool.

Detection with tsDetect. We downloaded tsDetect version 2.0 to collect the data. It executes three modules: i) the Test File Detector, to detect the test classes; ii) the Test File Mapping, to link the test classes to production classes; and iii) tsDetect, to detect the test smells. All modules were executed sequentially by command line in the terminal. As a result, tsDetect generates a file that contains a boolean value for each type of test smell detected in the test class. Therefore, the result provided by tsDetect has a class-level granularity. The detection process took about 7 minutes, considering the tool execution time and the participants' expertise with the operating system terminal to run the necessary commands.

Detection with JNose Test. We used the JNose Test version 2.1 to detect the test smells. After running the tool, the output file encompassed each test smell detected for each test class. The test smells detection granularity followed Table 1. The automated detection with the JNose Test took about 1 minute, due to the unified process to detect the test classes, production classes, and test smells. A friendly graphical interface makes this process easier.

5.4 Data Analysis

We used the oracle to calculate the JNose Test and tsDetect accuracy against the manual analysis. The tools present distinct granularity levels to detect test smells: tsDetect indicates whether a test class contains a test smell instance, i.e., it returns a boolean value for each test smell in a class, whereas the JNose Test detects all instances of a test smell with their exact location (line, block, method, or class). Therefore, we carried out what follows:

1. We compared the JNose Test and tsDetect accuracy at the class level. We processed the JNose Test output to show boolean values at the class level to compare it with tsDetect. As the JNose Test detection rules were reused from tsDetect, our goal is to determine the extent to which we improved those detection rules. In this comparison, the accuracy is given at the class level in terms of precision and recall.
2. We compared the JNose Test and manual analysis accuracy at a fine-grained level. For example, the AR test smell is detected at the line level of granularity; therefore, we collected data at the line level both manually and automatically. Our goal is to show the JNose Test accuracy in indicating the test smells location. Therefore, we provide the accuracy value at a fine-grained level in terms of precision and recall.
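For reference, precision, recall, F1-score, and accuracy are conventionally defined as follows, where TP, FP, FN, and TN denote the true positives, false positives, false negatives, and true negatives obtained against the oracle:

\[
\mathit{Precision} = \frac{TP}{TP + FP}, \qquad
\mathit{Recall} = \frac{TP}{TP + FN},
\]
\[
\mathit{F1} = 2 \cdot \frac{\mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}, \qquad
\mathit{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]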
6 Results

This section reports the results of our empirical study. The data for replication purposes are available online (Virgínio et al., 2021).

6.1 Comparison between JNose and tsDetect

Table 5 reports the accuracy, precision, and recall obtained when detecting test smells with the JNose Test and tsDetect. This comparison was made at the test class level.

Table 5. JNose Test and tsDetect Comparison - Class-level (all values in %)
Test Smell | Accuracy JNose | Accuracy tsDetect | Precision JNose | Precision tsDetect | Recall JNose | Recall tsDetect | F1-Score JNose | F1-Score tsDetect
AR | 100 | 75.38 | 100 | 90 | 100 | 75 | 100 | 78
CI | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
CTL | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
DA | 98.46 | 96.92 | 99 | 98 | 98 | 97 | 99 | 97
ECT | 100 | 46.15 | 100 | 92 | 100 | 46 | 100 | 55
ET | 95.38 | 86.15 | 95 | 87 | 95 | 86 | 95 | 86
EpT | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
GF | 98.46 | 98.46 | 99 | 99 | 98 | 98 | 99 | 99
LT | 100 | 93.85 | 100 | 94 | 100 | 94 | 100 | 94
MG | 90.77 | 90.77 | 92 | 92 | 91 | 91 | 89 | 89
MNT | 95.38 | 90.77 | 96 | 92 | 95 | 91 | 95 | 90
PS | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
RA | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
RO | 89.23 | 89.23 | 91 | 91 | 89 | 89 | 88 | 88
SE | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
ST | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
UT | 100 | 93.85 | 100 | 94 | 100 | 94 | 100 | 94

The results obtained with tsDetect diverge from those reported by Peruma et al. (2020). That study yielded precision values from 85.71% to 100% and recall values from 95% to 100%, and it could detect nineteen types of test smells. When using our oracle, tsDetect achieved a precision from 87.71% to 100% and a recall from 46% to 100% for eighteen types of test smells. As we mentioned earlier, we did not detect any IgT test smell instances with either of the tools. Those divergences highlight the challenges of building an oracle due to the different interpretations a coder may have about the test smells definitions.

Regarding the results obtained with the JNose Test, the precision ranged from 91% to 100% and the recall from 89% to 100% when detecting eighteen types of test smells. As we reused the tsDetect detection rules, we measured the improvements we achieved. Considering the F1-Score metric, the JNose Test presented an accuracy improvement of 45% for the ECT test smell, followed by 22% for the AR test smell, 11% for the VT test smell, 9% for the ET test smell, 6% for the LT and UT test smells, 5% for the MNT test smell, and 2% for the DA test smell. The other test smells detection rules did not present any relevant improvement at the test class level.

Next, we show the reason for the divergence between the results obtained by the tools for the ECT test smell detection. The JNose Test considers three compliant solutions to handle exceptions (Listing 3): i) the use of the @Test tag with the expected parameter (lines 1-4), ii) the use of the assertThrows statement (lines 6-9), or iii) throwing the exception in the method signature (lines 11-14). As a non-compliant solution, it considers the try/catch structure within the method body (lines 16-23). tsDetect considers both the try/catch structure and the throws declaration in the method signature as non-compliant solutions (lines 11-23).

1  @Test(expected = Exception.class)
2  public void tag_usage() {
3      // Some code
4  }
5
6  @Test
7  void throws_statement_usage() {
8      assertThrows(Exception.class, executable, "Exception Message");
9  }
10
11 @Test
12 public void throws_signature_usage() throws Exception {
13     // Some code
14 }
15
16 @Test
17 public void try_catch_usage() {
18     try {
19         // Some code
20     } catch (MyException e) {
21         Assert.fail(e.getMessage());
22     }
23 }
Listing 3: (Non)Compliant Solutions for ECT considered by JNose Test

Regarding the AR test smell, we identified that tsDetect does not consider the JUnit overloaded methods when analyzing an assert statement. For example, assertEquals asserts that (Listing 4) (i) two objects are equal (lines 1-9) or (ii) two objects are equal within a positive delta (lines 11-19). The optional value is a string that describes the assertion. The tool simplifies the number of parameters expected by the assert statement: it detects as a test smell only methods with two parameters (line 14). The problem occurs because the tool always classifies the assertEquals as a non-test smell when the assert has three parameters. However, it is necessary to verify the fourth parameter to decide whether it is a test smell or not. We improved the JNose Test in this direction.

1  @Test
2  public void two_parameters() {
3      assertEquals(float expected, float actual)
4  }
5
6  @Test
7  public void three_parameters_with_message() {
8      assertEquals(String message, float expected, float actual)
9  }
10
11 @Test
12 public void four_parameters() {
13     assertEquals(String message, float expected, float actual, float delta)
14 }
15
16 @Test
17 public void three_parameters_no_message() {
18     assertEquals(float expected, float actual, float delta)
19 }
Listing 4: Solutions for AR considered by JNose Test
Additionally, there was a conflict between the EpT and UT test smells definitions. The EpT test smell is a test method without executable statements (an empty method). The UT test smell is a test method with executable statements but no assertions. tsDetect considers methods without a body as both EpT and UT. Therefore, we implemented the rules necessary to differentiate those test smells.

We performed some minor fixes to detect other types of test smells. For example, for the VT test smell, tsDetect considers a class with more than 123 lines as one verbose test. As the JNose Test detects the test smells at a fine-grained level, we defined that a test method with more than 30 lines is verbose. Therefore, we found more instances because of our definition.

6.2 JNose and Manual Analysis Comparison

Table 6 reports the accuracy, precision, and recall obtained when detecting test smells with the JNose Test and with the manual analysis. This comparison considered the granularity level defined for each test smell.

Table 6. JNose Test and Manual Analysis Comparison - Fine granularity level
Test Smell | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
AR | 100 | 100 | 100 | 100
CI | 100 | 100 | 100 | 100
CTL | 100 | 100 | 100 | 100
DA | 94.12 | 100 | 94 | 97
ECT | 100 | 100 | 100 | 100
ET | 89.13 | 100 | 89 | 94
EpT | 100 | 100 | 100 | 100
GF | 90 | 100 | 90 | 95
LT | 96.55 | 100 | 97 | 98
MG | 50 | 100 | 50 | 67
MNT | 94.74 | 100 | 95 | 97
PS | 100 | 100 | 100 | 100
RA | 100 | 100 | 100 | 100
RO | 47.06 | 84 | 47 | 60
SE | 100 | 100 | 100 | 100
ST | 100 | 100 | 100 | 100
UT | 100 | 100 | 100 | 100
VT | 100 | 100 | 100 | 100

At a fine-grained level, the JNose Test precision score ranges from 84% to 100%, and the recall ranges from 47% to 100%. At the class level, the detection difficulties related to specific cases are not evident, because the tool returns a boolean value for test smells in the whole test class. However, when we performed a more detailed test smell detection, we noticed some test code-specific characteristics that the tool does not detect.

The most divergent results between the class level and the fine granularity level are the MG and RO test smells. At the class level, those test smells have an accuracy of 90.77% and 89.23%, respectively. However, at a fine-grained level, they present an accuracy of 50% and 47.06%, respectively. Both test smells deal with external resources. A test method that makes optimistic assumptions about the existence of external resources has the RO test smell (Listing 5, lines 10-21). A test method that uses external resources has the MG test smell (Listing 5, lines 2-5).
As the JNose Test performs test code static analysis, we only considered direct calls to external resources (Listing 5, lines 1-15). However, if a test method calls a production class from any part of the project and that class accesses external resources, the test class uses external resources indirectly (Listing 5, lines 17-21). In this scenario, the MG and RO detection rules need additional work to determine the indirect calls.

1  @Test
2  public void external_File() {
3      File file = openFile("config.xml");
4      if (file.exists()) {
5          XmlPullParser config = XmlParserFactory.fromFile(file);
6          // Some code
7      }
8  }
9
10 @Test
11 public void external_File_Without_Checking() {
12     File file = openFile("config.xml");
13     XmlPullParser config = XmlParserFactory.fromFile(file);
14     // Some code
15 }
16
17 @Test
18 public void external_Resource_Indirectly() {
19     XmlReader reader = new XmlReader("xml/config.xml");
20     // Some code
21 }
Listing 5: Mystery Guest and Resource Optimism

We also identified a specific characteristic that can produce false positive instances of the DA test smell. That false positive occurs when a test method uses an assertion structure implemented by a JSON library that is similar to the assertion structure implemented by JUnit. This is because JUnit provides assertThat(String reason, T actual, M matcher), while the JSONAssert library implements assertThat(String).contains(String). When performing the static analysis, all the statements that start with assert were considered JUnit assertions. Therefore, we may improve the detection by identifying the libraries imported in the test class. However, the tool might still miss test smell instances when a test class uses another assert library.

Other types of test smells required minor fixes. The LT and ET test smells miss some instances due to default constructors. We considered that, in the same way that different test methods should not call the same production method, a production class should not be instantiated several times in different test methods. If many test methods need to instantiate the same object, that instantiation should be moved to a setup method. Therefore, we need to improve the JNose Test to detect calls to default constructors.

7 Related Work

In large test suites, software engineers barely perform manual detection of test smells. This practice is rather time-consuming and infeasible in many scenarios. Therefore, the research community has proposed automated tool support for detecting test smells.

The Test Smell Detector (TSD) detects nine types of test smells (Bavota et al., 2015). The TSD detection rules overestimate the presence of test smells in the code to ensure high recall (87%), and it returns a list of candidate affected classes. Similarly, tsDetect, the state-of-the-art tool to detect test smells, identifies twenty-one types of test smells (Section 2). It indicates whether a particular test smell appears in a test class, with a precision score ranging from 85% to 100% and a recall score from 90% to 100% (Peruma et al., 2020).

Other tools correlate test smells with structural and coverage metrics. The IntelliJ plug-in coined VITRuM (VIsualization of Test-Related Metrics) is an extension of tsDetect. It collects a set of seven types of test smells and structural metrics (Pecorelli et al., 2020).
TeReDetect (Negar and Garousi, 2010) and TeCReVis (Koochakzadeh and Garousi, 2010) use code coverage analysis, provided by CodeCover, to detect test smells related to code duplication.

Our tool uses rule-based test smells detection instead of metric- or coverage-based detection. It extends the tsDetect tool in several respects. For example, our tool provides the number of test smells identified in a test class and the method name and line of each test smell's location. Moreover, it supports the test suite analysis through several project versions by mining Git to provide information about when and by whom the test smells were introduced. Additionally, our tool supports other tools for test smells refactoring (RAIDE) (Santana et al., 2020) and visualization (TSVizzEvolution). RAIDE is an Eclipse IDE plugin to detect and refactor the AR and DA test smells. TSVizzEvolution is a test smells visualization tool that aims to help the user understand problems in the test code by using three visualization techniques (Graph View, Treemap View, and Timeline View). It represents the twenty-one types of test smells detected by the JNose Test.

8 Threats to Validity

Internal Validity. In the manual analysis to construct the oracle, there may have been divergences among the researchers' analyses. We mitigated this threat by resolving disagreements collectively. After collecting data with the JNose Test and tsDetect tools, we checked whether any test smells detected by the tools had not been considered in the manual analysis.

External Validity. Our study results may not generalize to other suites of test classes or other types of test smells. To mitigate this threat, we used the same dataset used in the study that validated the tsDetect tool (Peruma et al., 2020).

Conclusion Validity. Although the JNose Test detects twenty-one types of test smells, this study only validated eighteen of them because the dataset used did not contain the DpT, DT, and IgT test smells. On the other hand, we used the same dataset used to evaluate tsDetect (Peruma et al., 2020).

Construct Validity. Although we used only four coders to build the oracle, they were experts with more than three years of experience with test smells. They were aware of the test code of the test smells detection tools.

9 Conclusion

This paper presents the JNose Test and its API, the JNose-Core. The API supports the detection of twenty-one types of test smells and provides a flexible architecture to support the insertion of new test smells detection rules. The JNose Test tool is a web application to detect test smells and calculate coverage for Java projects.

To validate the detection rules implemented by the JNose-Core, we conducted an empirical study to compare our tool's accuracy with the state-of-the-art tool and with manual analysis. We built an oracle of test smells to perform the comparison. The oracle contains sixty-five test classes analyzed by specialists in the subject. The comparison between JNose and tsDetect was made at the class level. The results showed that JNose presented higher accuracy than tsDetect in terms of precision and recall. As we reused the detection rules from tsDetect to implement the JNose Test, the results indicate that we successfully improved them. Additionally, the JNose Test also detects test smells at a fine-grained level.
As the tsDetect does not support this feature, we could only compare the fine-grained detection against the manual analysis. The results showed high accuracy in determining the exact line location, but it still needs further improvement.

There are many opportunities for further investigation. For example, it would be interesting to validate our tool's efficiency in a real-world environment through a user study. Such a study could also consider significant usability concerns. There is also room for introducing new features in the JNose Test, both in terms of detection and refactoring and, as necessary, in terms of how it behaves in practice considering quality attributes.

Acknowledgements

This research was partially funded by INES 2.0, CNPq grants 465614/2014-0 and 408356/2018-9, and FAPESB grants JCB0060/2016 and BOL0188/2020.

References

Bavota, G., Qusef, A., Oliveto, R., De Lucia, A., and Binkley, D. (2015). Are test smells really harmful? An empirical study. Empirical Software Engineering, 20(4):1052–1094.

Bavota, G., Qusef, A., Oliveto, R., Lucia, A., and Binkley, D. (2012). An empirical analysis of the distribution of unit test smells and their impact on software maintenance. In 28th IEEE International Conference on Software Maintenance (ICSM).

Bell, J., Legunsen, O., Hilton, M., Eloussi, L., Yung, T., and Marinov, D. (2018). DeFlaker: Automatically Detecting Flaky Tests. In IEEE/ACM 40th International Conference on Software Engineering (ICSE), pages 433–444.

Capgemini (2018). World Quality Report 2018-19. https://www.capgemini.com/service/world-quality-report-2018-19/. Accessed: March 1st, 2021.

CISQ (2021). The Cost of Poor Software Quality in the US: A 2020 Report. https://www.it-cisq.org/pdf/CPSQ-2020-report.pdf. Accessed: March 1st, 2021.

Deursen, A., Moonen, L. M., Bergh, A., and Kok, G. (2001). Refactoring test code. CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands.

Garousi, V. and Küçük, B. (2018). Smells in software test code: A survey of knowledge in industry and academia. Journal of Systems and Software, 138:52–81.

Gopinath, R., Jensen, C., and Groce, A. (2014). Code coverage for suite evaluation by developers. In Proceedings of the 36th International Conference on Software Engineering (ICSE), New York, NY, USA. ACM.

Grano, G., Palomba, F., Di Nucci, D., De Lucia, A., and Gall, H. C. (2019). Scented since the beginning: On the diffuseness of test smells in automatically generated test code. Journal of Systems and Software, 156:312–327.

Greiler, M., van Deursen, A., and Storey, M. (2013). Automated detection of test fixture strategies and smells. In IEEE Sixth International Conference on Software Testing, Verification and Validation, pages 322–331.

Guerra Calle, D., Delplanque, J., and Ducasse, S. (2019). Exposing Test Analysis Results with DrTests. In International Workshop on Smalltalk Technologies, pages 1–5, Cologne, Germany. HAL.

Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1):23.

Junior, N. S., Rocha, L., Martins, L. A., and Machado, I. (2020). A survey on test practitioners' awareness of test smells. In Proceedings of the XXIII Iberoamerican Conference on Software Engineering (CIbSE 2020), pages 462–475. Curran Associates.

Koochakzadeh, N. and Garousi, V. (2010). TeCReVis: A Tool for Test Coverage and Test Redundancy Visualization. In Bottaci, L. and Fraser, G., editors, Testing – Practice and Research Techniques, pages 129–136, Berlin, Heidelberg. Springer Berlin Heidelberg.
Meszaros, G., Smith, S. M., and Andrea, J. (2003). The test automation manifesto. In Maurer, F. and Wells, D., editors, Extreme Programming and Agile Methods - XP/Agile Universe 2003, Berlin, Heidelberg. Springer Berlin Heidelberg.

Negar, K. and Garousi, V. (2010). A tester-assisted methodology for test redundancy detection. Advances in Software Engineering, 2010.

Palomba, F., Zaidman, A., and Lucia, A. D. (2018). Automatic test smell detection using information retrieval techniques. In IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 311–322, Madrid, Spain. IEEE.

Pecorelli, F., Di Lillo, G., Palomba, F., and De Lucia, A. (2020). VITRuM: A Plug-In for the Visualization of Test-Related Metrics. In Proceedings of the International Conference on Advanced Visual Interfaces, New York, NY, USA. ACM.

Peruma, A., Almalki, K., Newman, C. D., Mkaouer, M. W., Ouni, A., and Palomba, F. (2019). On the distribution of test smells in open source Android applications: An exploratory study. In Proceedings of the 29th Annual International Conference on Computer Science and Software Engineering (CASCON), Riverton, NJ, USA. IBM.

Peruma, A., Almalki, K., Newman, C. D., Mkaouer, M. W., Ouni, A., and Palomba, F. (2020). TsDetect: An Open Source Test Smells Detection Tool. ACM, New York, NY, USA.

Santana, R., Martins, L., Rocha, L., Virginio, T., Cruz, A., Costa, H., and Machado, I. (2020). RAIDE: A Tool for Assertion Roulette and Duplicate Assert Identification and Refactoring. In Proceedings of the 34th Brazilian Symposium on Software Engineering (SBES). ACM.

Spadini, D., Palomba, F., Zaidman, A., Bruntink, M., and Bacchelli, A. (2018). On the relation of test smells to software code quality. In International Conference on Software Maintenance and Evolution (ICSME), pages 1–12. IEEE.

Spadini, D., Schvarcbacher, M., Oprescu, A.-M., Bruntink, M., and Bacchelli, A. (2020). Investigating severity thresholds for test smells. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR). ACM.

Virginio, T., Martins, L., Soares, L. R., Railana, S., Costa, H., and Machado, I. (2020). An empirical study of automatically-generated tests from the perspective of test smells. In Proceedings of the XXXIV Brazilian Symposium on Software Engineering (SBES), New York, NY, USA. ACM.

Virginio, T., Santana, R., Martins, L. A., Soares, L. R., Costa, H., and Machado, I. (2019). On the influence of test smells on test coverage. In Proceedings of the XXXIII Brazilian Symposium on Software Engineering (SBES), pages 467–471, New York, NY, USA. ACM.

Virgínio, T., Martins, L., Santana, R., Cruz, A., Rocha, L., Costa, H., and Machado, I. (2021). On the test smells detection: an empirical study on the JNose Test accuracy [Dataset]. Available at: https://doi.org/10.5281/zenodo.4570751.

Yusifoğlu, V. G., Amannejad, Y., and Can, A. B. (2015). Software test-code engineering: A systematic mapping. Information and Software Technology, 58:123–147.