Journal of Software Engineering Research and Development, 2021, 9:4, doi: 10.5753/jserd.2021.816  This work is licensed under a Creative Commons Attribution 4.0 International License.. An Approach for the Generation of Multi­Objective Algorithms Applied to the Integration and Test Order Problem Giovani Guizzo [ University College London | g.guizzo@ucl.ac.uk] Thainá Mariani [ Federal University of Paraná | marianithaina@gmail.com] Silvia Regina Vergilio [ Federal University of Paraná | silvia@inf.ufpr.br] Aurora Pozo [ Federal University of Paraná | aurora@inf.ufpr.br] Abstract Multi­Objective Evolutionary Algorithms (MOEAs) have been successfully applied to solve hard real software engineering problems. However, to choose and design a MOEA is considered a difficult task, since there are several parameters and components to be configured. These aspects directly impact the generated solutions and the perfor­ mance of MOEAs. In this sense, this paper proposes an approach for the automatic generation of MOEAs applied to the Integration and Test Order (ITO) problem. Such a problem refers to the generation of optimal sequences of units for integration testing. The approach includes a set of parameters and components of different MOEAs, and is imple­ mented with two design algorithms: Grammatical Evolution (GE) and Iterated Racing (irace). Evaluation results are presented, comparing the MOEAs generated by both design algorithms. Furthermore, the generated MOEAs are compared to two well­known MOEAs used in the literature to solve the ITO problem. Results show that the MOEAs generated with GE and irace perform similarly, and both outperform traditional MOEAs. The approach can reduce efforts spent to design and configure MOEAs, and serves as basis for implementing solutions to other software engineering problems. Keywords: Multi­Objective Evolutionary Optimization, Grammatical Evolution, Iterated Racing, Software Testing 1 Introduction The testing activity is usually performed in different phases. In the unit testing phase, each module (unit) is individually tested. After this, in the integration phase, the units are com­ bined and tested in order to identify software design faults related to the interaction of units. In many cases, there are de­ pendency relations between the units, that is, to test a unit A, another unit B is required. When dependency cycles among units exist it is necessary to break the cycle and to construct a stub for B, if B is not available. The stubbing process may be an expensive and error prone task. Hence, to reduce stubbing costs, it is very important to determine the best sequence of units to be tested and integrated. Additionally, there are some factors to take into consideration, such as the number of re­ quired methods, attributes, and parameters to be emulated in the stub, besides other ones associated with the software de­ velopment (Assunção et al., 2014), making this task a multi­ objective problem that cannot be easily solved by the tester in a short time. Such problem is known in the literature as In­ tegration and Testing Order problem (ITO) and appears in different contexts such as, component based development, object­oriented development, aspect­oriented development, and software product line engineering. The ITO problem has been investigated in the SBSE (Search­Based Software Engineering) field (Harman et al., 2012) by using different search­based algorithms (Wang et al., 2011). The most promising ones are the Multi­ Objective Evolutionary Algorithms (MOEAs) (Assunção et al., 2014; Vergilio et al., 2012a), as they can solve hard real­world problems impacted by many conflicting objec­ tives. Moreover, MOEAs are widely used for software engi­ neering problems in general. Surveys of the SBSE field (Har­ man et al., 2012; Colanzi et al., 2019) report that evolutionary algorithms are the preferred and most used ones. In spite of this preference and large use, the design of a MOEA is not always easy. MOEAs are distinguished by dif­ ferent components, which directly affect the generated solu­ tions. Furthermore, they have a wide range of parameters to be configured (Eiben and Smit, 2011). For example, MOEAs have as components the search operators, the replacement and archiving procedures. They also have as parameters: pop­ ulation size, crossover and mutation probabilities. In this work, the word design refers to the choice and implementa­ tion of the best components and parameters configuration. In this way, the great number of alternatives makes the MOEA design a hard task. The best combination of components and parameters may depend on the problem being solved. Moreover, it is not always possible to know a priori what is the best choice among existing MOEAs. In the SBSE liter­ ature, the best MOEA for a problem is usually determined by conducting evaluation experiments, which requires ef­ fort and increases costs. Regarding the ITO problem, As­ sunção et al. (2014) conducted experiments with three dif­ ferent MOEAs ­ NSGA­II, SPEA and PAES ­ and no one of them proved to be the best to solve the problem considering all instances and contexts. To reduce such difficulties faced by software engineers, in this paper, we propose an approach for the automatic gener­ ation of MOEAs applied to the ITO problem. The approach includes an offline training process, performed by a design algorithm that receives as input an instance of the ITO prob­ lem reported in the literature (Assunção et al., 2014). The output is a MOEA that can be used by the tester on other instances. To allow the use of different design algorithms the approach encompasses a set of MOEA parameters and mailto:g.guizzo@ucl.ac.uk mailto:marianithaina@gmail.com mailto:silvia@inf.ufpr.br mailto:aurora@inf.ufpr.br Guizzo et al. 2020 components to be combined. It is implemented with irace (from Iterated Racing algorithms) (López­Ibáñez et al., 2011) and Grammatical Evolution (GE) (Ryan et al., 1998). irace defines a “parameter space” in which the parameters, their types, ranges and constraints are defined. GE is a type of Ge­ netic Programming (GP) (Koza, 1992) that uses a grammar containing a set of rules and values to guide the evolutionary process in the generation of programs. In this sense, by us­ ing the grammar of GE or the parameter space of irace, it is possible to map the components and parameters to be used in the automatic design. The approach is evaluated with seven systems and pro­ duces results statistically better (in terms of hypervolume) than MOEAs commonly used to solve the ITO problem in the literature. A comparison between both design algorithms shows that they present similar results. These results provide some evidence for the benefits of automating the design and configuration of algorithms in SBSE. Furthermore, our re­ sults can be seen as evidence for the effectiveness of GE and the competitive results between GE and irace, which are ongoing matters of discussion in the community (Whigham et al., 2017b; Ryan, 2017; Whigham et al., 2017a; O’Neill and Nicolau, 2017). An empirical evaluation of our approach was conducted using seven testing instances of the ITO problem in order to answer the following research questions: • RQ1 – How are the results of the MOEAs generated by our approach in comparison to the MOEAs used in the literature? and, • RQ2 – How are the results of the MOEAs generated using GE in comparison to the results of the MOEAs generated using irace? RQ1 concerns comparing MOEAs generated by our ap­ proach with the algorithms NSGA­II and SPEA, used in the ITO problem literature. RQ2 concerns comparing the de­ sign algorithms GE and irace. As a result, we observe that both design algorithms have similar performance and our ap­ proach generates MOEAs that are better or similar than the traditional ones, considering some statistical tests and hyper­ volume, the main quality indicator used in the optimization field to compare MOEAs (Zitzler et al., 2003, 2007). Some works on automatic design of Evolutionary Algo­ rithms (EAs) are based on GE (Lourenço et al., 2012, 2013, 2015) and irace (Bezerra et al., 2014, 2015). These papers use different parameters and components, which are success­ fully combined to evolve better algorithms. One of these pa­ pers addresses the automatic design of MOEAs specialized in some benchmark instances using irace (Bezerra et al., 2014, 2015). With respect to the use of such algorithms in SBSE, the work of Mariani et al. (2016) proposes a GE­based hyper­heuristic, named GEMOITO, for generating MOEAs to solve the ITO problem. The encouraging results of all the aforementioned works motivated the work herein described. In this sense, our work has the following main contributions: • Use of a set of components and parameters for MOEAs design that includes elements not used in related work; • Introduction of an approach that can be used to solve software engineering problems formulated as permuta­ tion problems. The main idea is to ease the design of MOEAs, reducing efforts; and • Application and evaluation of the approach to the ITO problem. Such an evaluation reports results comparing two design algorithms: GE and irace. We have not found works comparing them. This paper is organized as follows. Section 2 reviews sub­ jects related to this work: the ITO problem, automatic design of algorithms, and MOEAs. Section 3 introduces the pro­ posed approach, describing its main elements. Section 4 re­ ports the conducted empirical evaluation and the discussion of the results. Section 5 contains related work. And finally, Section 6 presents the conclusion and describes some future work. 2 Background This section introduces the ITO problem and reviews back­ ground on automatic design of algorithms and MOEAs. 2.1 Integration and Testing Order problem A testing strategy usually includes a set of phases with dis­ tinct goals. First of all, the unit testing focuses on each unit, the smallest part of a system to be tested. After this, an in­ tegration testing phase is performed to find problems in the interaction among the units. In this phase, the units are com­ bined and tested according to the testing plans. During such integration it is necessary to determine an order to integrate and test the units. Such an order impacts the sequence in which units are developed; the design and execution of test cases; the order in which integration faults are revealed; and the number of required stubs for the units that possibly are not available, but from which the unit being tested depends on. The stubbing process can be expensive and error­prone. The cost can be impacted by many factors such as number of attributes and parameters to be emulated, number of return types and so on. When there is a cyclic dependency between two classes, such a dependency needs to be broken and a stub needs to be built. In this sense, it is important that the class associated with the smaller cost, given for instance by the number of attributes and operations, appears first in the in­ tegration order. Due to this, we find in the literature many strategies to solve the Integration and Testing Order Prob­ lem (ITO) (Wang et al., 2011; Briand et al., 2002a). The most promising works are based on Multi­Objective Evolutionary Algorithms (MOEAs) (Vergilio et al., 2012a; Assunção et al., 2014). The proposed strategies are generally based on graphs that represent the dependencies among the units. A depen­ dency cycle in the graph, which needs to be broken, corre­ sponds to a required stub. The problem is to determine the best sequence of units associated with minimal stubbing cost. In this work, we use the formulation of the ITO problem proposed by Assunção et al. (2014), as well as the same benchmark. A static model representing the dependencies amongtheunitsandcosts(associatedwiththenumberofpub­ lic classes attributes and methods to be emulated) is provided Guizzo et al. 2020 to the algorithm by using matrices. These matrices are gener­ ated previously and reused during the optimization process. If the tester wants to use a new program, this program static model must be generated and given as input (as described in Assunção et al. (2014)). Since the ITO problem is related to permutations of units, which form testing orders, the chromosome is represented by a vector of integers where each vector position corresponds to a class. The size of the chromosome is equal to the number of units of each system. Thus, let each unit be represented by a number, an example of a valid solution for a problem with five units is {2, 4, 3, 1, 5}. In this example, the first unit to be tested and integrated would be the unit represented by number 2. The second, the unit represented by number 4, and so on. Each unit can appear only once. We use two objective functions to evaluate the solu­ tions (Assunção et al., 2014), which measure the dependen­ cies between server and client units. Hence, considering that: (i) a unit is a module to be tested that can be either classes or aspects; (ii) mi and mj are two coupled modules; and (iii) the operation term represents class methods, aspect methods and/or aspect advice, the used coupling measures are defined as follows: • Number of attributes (A) = The number of attributes lo­ cally declared in mj when references or pointers to in­ stances of mj appear in the argument list of some oper­ ations in mi, as the type of their return value, in the list of attributes (data members) of mi, or as local param­ eters of operations of mi (adapted from Vergilio et al. (2012a); Briand et al. (2002a)). This complexity mea­ sure counts the (maximum) number of public attributes that would have to be handled in the stub if the depen­ dency were broken. • Number of methods (O) = The number of opera­ tions/methods (including constructors) locally declared in mj that are invoked by operations of mi (adapted from Briand et al. (2002a)). This complexity measure counts the number of public operations that would have to be emulated if the dependency were broken. In the literature, the ITO problem is usually defined in terms of the testing process cost (Assunção et al., 2014; As­ sunção et al., 2013). By reducing the number of attributes and operations to be emulated, the tester can reduce the overall cost of the integration testing activity, i.e., these metrics serve as surrogates for the cyanstubbing process cost. It is hard to estimate the actual human effort for emulating such items, but using metrics in such granularity, number of attributes and number of operation/methods, as opposed to number of stubs can provide the tester a better way of estimating the cost. Moreover, the ITO problem formulation, commonly found in the literature, does not address the fault revealing capability of the testing process, just the cost of stubbing. The input data of the MOEAs consist of matrices that are read from a text file. They are matrices associated to (i) de­ pendencies between units; (ii) measure A; and (iii) measure O. Then it is necessary to establish a trade­off between these values. The dependency matrix is used to define precedence constraints and the others to calculate the fitness of each so­ lution, where the sum of dependencies between the classes for each measure corresponds to an objective. It is impor­ tant to highlight that there are classes with a great number of operators and few attributes, and vice­versa; the goal is to minimize both objectives. 2.2 Automatic Design of Algorithms The design of an algorithm is related to all decisions taken during its definition and considering a specific prob­ lem (Eiben and Smit, 2011). Algorithms used for solving hard optimization problems commonly have several parame­ ters and components to be defined by the user (López­Ibáñez et al., 2011). The chosen values have a great influence on the algorithm performance, but there is no generic set of values, since the optimal set depends on the problem being solved (Eiben and Smit, 2011). These values are then usually chosen based on common wisdom of the community(López­ Ibáñez et al., 2011), since finding the appropriate values for the parameters and components is one of the greatest chal­ lenges in the evolutionary computation field (Eiben and Smit, 2011). In the EAs’ context, examples of parameters can be the population size, crossover and mutation probabilities. Exam­ ples of components are the search operators, such as selec­ tion, crossover and mutation ones. In this sense, the auto­ matic design of algorithms can be very useful, since there is a great number of alternatives for these parameters and compo­ nents. To this end, we find in the literature different methods for the automatic design of EAs. More details about existing methods can be found in the surveys of Eiben and Smit (2011, 2012). Our approach works with irace and GE. We chose such algorithms because they are widely used in the literature of automatic configuration of meta­heuristics (Eiben and Smit, 2011, 2012). The GE algorithm has already been used for generating EAs (Lourenço et al., 2012, 2013, 2015) and pre­ sented encouraging results. This is one of the main motiva­ tions for its usage. Another advantage of GE is that it allows a flexible and context­free definition of programs to be gen­ erated by using a grammar. On the other hand, irace uses a very interesting mechanism based on statistical tests that has shown good results for multi­objective algorithms (López­ Ibáñez and Stützle, 2012; Bezerra et al., 2014, 2015). Both algorithms are described in the following. 2.2.1 irace The Iterated Race algorithm (here called irace) works by sequentially evaluating the candidate configurations and ex­ cluding the statistically worse. It consists of three main steps, repeated until a stopping criterion is satisfied, according to López­Ibáñez et al. (2011): 1. sampling new configurations according to a particular distribution; 2. selecting the best configurations from the newly sam­ pled ones by means of racing; and 3. updating the sampling distribution in order to bias the sampling towards the best configurations. Guizzo et al. 2020 To sample new configurations (Step1), the irace algo­ rithm considers two types of parameters: numerical and cat­ egorical, and the sampling distribution of each parameter de­ pends on the parameter type. Numerical parameters have a normal distribution and categorical parameters have a dis­ crete distribution. The update (Step 3) changes the sampling by updating the mean and standard deviation for normal dis­ tributions, and the probabilities for the discrete distributions. This update process guides the distribution in order to in­ crease the probability of selecting the parameters used in the best configurations when generating new ones (López­ Ibáñez et al., 2011). In order to select the best configurations (Step 2), the can­ didate ones are evaluated at each step on a single instance. After each step, the candidates that are significantly worse than at least another one are excluded. The race is repeated with the surviving configurations, and continues until a stop­ ping criterion is met. This criterion is generally related to a number of surviving configurations, a number of used in­ stances or a predefined computational budget (López­Ibáñez et al., 2011). In this work we use the irace package1 of the R Project2, the same one used in Bezerra et al. (2014, 2015). In order to be executed, the irace package requires three inputs: a configuration file, the set of instances to be used and the parameter space. The parameter space is where the param­ eters used in the automatic configuration, their types, ranges and constraints should be defined (López­Ibáñez et al., 2011). Moreover, the statistical analysis is performed using the non­ parametric Friedman statistical test (Derrac et al., 2011). Figure 1 presents an example of a parameter space defined in a parameter file for the irace package. This example is for the automatic configuration of an Ant Colony Optimiza­ tion (ACO) algorithm, and is included in the irace package along with other configuration files as a usage sample. Figure 1. Example of parameter space defined for the irace package Each parameter has a name, a switch value, a type, values and conditions (optional). The name of the parameter is an identifier for later usage in the definition of the conditions. A condition (after the “∥” delimiter) is used to define con­ straints among the parameters. For instance, in the example, the parameter “nnls” is only used when a local search (pa­ rameter “localsearch”) has received a value between 1 and 3. The type identifier “c” represents the categorical parameters, whereas “r” (real) and “i” (int) represent numerical parame­ ters. While categorical parameters can only receive as value what is given in the “value” field, a numerical parameter can 1https://cran.r-project.org/web/packages/irace/index. html 2https://www.r-project.org/ receive any value in the specified ranges in its value field. The “switch” field of a parameter is what the algorithm actu­ ally gives as argument to execute. What the irace package does is to build a sequence of ar­ guments and values. Each argument sequence is what forms the algorithm configuration. In this sense, a script is defined in the package configuration file to receive this argument se­ quence and execute the algorithm using such arguments. In the end, the irace algorithm reads the result printed by the generated algorithm and updates its state. Because a statistical test is used to compare the algorithms of each irace run, at the end of its execution, irace returns a set of algorithms statistically equal. However, in this pa­ per, we only use the one with the best hypervolume and ex­ ecute the designed algorithm several times instead. We only choose the best algorithm because the procedure of gener­ ating MOEAs in our approach needs an output of only one algorithm. 2.2.2 Grammatical Evolution A GE algorithm can be considered a type of GP, since it is similarly used to evolve programs (Ryan et al., 1998). How­ ever, while a conventional GP algorithm typically uses a tree as representation for an individual and applies search opera­ tors to those trees (Koza, 1992), a GE algorithm uses an array of integers or bits and evolves the solutions similarly to a con­ ventional EA (Ryan et al., 1998). Moreover, aside from the usual parameters of an EA (e.g. population size, maximum number of fitness evaluations and others), a GE receives a grammar file, usually in BNF, to map each solution into a program. This mapping is called genotype­phenotype map­ ping (GPM) (Barros et al., 2013). The evolution is applied to the chromosome (genotype level), but only the program (phenotype level) can be ex­ ecuted and evaluated by a fitness function. Therefore, the GPM procedure is needed by the GE algorithm to transform each chromosome into an executable program. One of the advantages of GPM is that the genotype space can be freely explored, maintaining the validity of the phenotype (Barros et al., 2013). Furthermore, this allows a design of pheno­ type neutral crossover and mutation operators, where differ­ ent genotypes can be mapped to the same phenotype. Next, we present more details of how this is done, and the default structure of a GE algorithm. The common solution representation used by the GE al­ gorithms is an array of integers or bits. If the array of bits is used, it is first mapped to an array of integers and, then, this array of integers is mapped to a program. It is possible to skip this step and use an array of integers directly. Never­ theless, the algorithm reads the grammar file, interprets the grammatical rules and then uses the integer values of the ar­ ray to decide which values are assigned to each rule. If the algorithm reaches the end of the integer array, but still needs more genes to map into production rules, then the wrapping process is applied. Wrapping consists in consuming genes (when needed) starting from the beginning of the array when the end of the array is reached. To illustrate the GPM process, the following BNF grammar (Figure 2) is given for evolving simple mathematical expressions: https://cran.r-project.org/web/packages/irace/index.html https://cran.r-project.org/web/packages/irace/index.html https://www.r-project.org/ Guizzo et al. 2020 ⟨expr⟩ ::= ⟨var⟩ | ⟨expr⟩ ⟨op⟩ ⟨expr⟩ ⟨var⟩ ::= x | y ⟨op⟩ ::= * | / | + | - Figure 2. Example of grammar for mathematical expressions The items between ⟨ and ⟩ are non­terminal rules, ∥ rep­ resents the logical operator OR, ::= means that the rule can take any of the next options and the remaining items are ter­ minal nodes. For instance, the rule ⟨var⟩ can take either the value x or y when mapped to a program. On the other hand, ⟨expr⟩ can take the value of a single ⟨var⟩ or a composition of ⟨expr⟩ ⟨op⟩ ⟨expr⟩. The choice between each option is given by the genes of the chromosome array of each individ­ ual. In this work we use an integer array of variable size as representation for the solutions. This is actually an existing strategy (Ryan et al., 1998) that might help the algorithm to eliminate useless genes or reinsert new genes into the chro­ mosomes as the evolution proceeds. For this end, the GE algorithm employs two distinguishable search operators: i) gene duplication operator; and ii) gene pruning/deletion op­ erator. The duplication operator selects a random subarray of the chromosome and copies it to the end of the chromosome. The prune operator, on the other hand, selects an index to truncate the array. These operators are usually applied with the same probability as the mutation operator and as addi­ tional steps in the evolution. Summarizing, Algorithm 1 presents the pseudocode of a conventional GE algorithm. As the algorithm shows, it is very similar to an EA, aside from the evaluation of the pop­ ulation (GE has the GPM procedure) and the application of the duplication and prune operators. Algorithm 1: Pseudocode of a GE algorithm 1 Input: GF – Grammar File; 2 begin 3 population ← Initialize the population; 4 programs ← Map population to programs using GF ; 5 Execute programs; 6 Assign a fitness value to the solutions of population according to the output of their respective programs; 7 while stop condition is not achieved do 8 matingP opulation ← Select parents; 9 of f spring ← Recombine matingP opulation; 10 Apply gene prune operator to the solutions of of f spring; 11 Apply gene duplication operator to the solutions of of f spring; 12 Apply mutation operator to the solutions of of f spring; 13 programs ← Map of f spring to programs using GF ; 14 Execute programs; 15 Assign a fitness value to the solutions of of f spring according to the output of their respective programs; 16 population ← Perform replacement; 17 end 18 return Best program of population; 19 end 2.3 Multi­Objective Evolutionary Algorithms Multi­objective optimization problems have more than one objective to be optimized Coello et al. (2007). In this kind of problem, usually two or more objectives are in conflict and cannot be optimized at the same time, i.e., by optimizing one objective, the value of the other is degraded. Many real worldproblemsaremulti­objective.Forexample,acardriver might assume two objectives when taking the best route be­ tween two points: time needed to complete the route and travel cost. In this context, Pareto dominance concept is em­ ployed (Coello et al., 2007). In a minimization problem, a solution x is said to dominate (≺) a solution y if ∀z ∈ Z : z(x) ≤ z(y) and if ∃z ∈ Z : z(x) < z(y), where z is an objective in the set of considered objectives Z. If these con­ ditions are not satisfied for both x and y, then such solutions are said to be non­dominated. The set of all possible non­ dominated solutions in the search space of the problem be­ ing optimized is called the Pareto front (P F ). In most cases it is not possible to determine the (P F ) for a given problem. Hence, the algorithms try to find an approximation of this front, here called best­known Pareto front (P Fknown). Dif­ ferently from a mono­objective EA that yields a single so­ lution at the end of its execution, the result of a MOEA is a set of non­dominated solutions. Thus, engineers usually have to decide which solution better fits their needs and/or prefer­ ences. To evaluate the performance of a MOEA the most used quality indicator is the hypervolume (Zitzler et al., 2003). The hypervolume of a P F is the area dominated by this front with respect to a reference point. Considering a two objec­ tive optimization problem, each point of the Pareto Front de­ fines a rectangle in the search space. The hypervolume cor­ responds to the area formed by the sum of all rectangles. In mono­objective EAs the engineer can easily decide which solutions are the best ones according to their fitness values in any given moment. However, MOEAs cannot do this in a straightforward way, because they have several ob­ jectives to evaluate and potentially a great number of non­ dominated solutions. Therefore, some strategies may be ap­ plied to help the decisions that MOEAs take during the evo­ lution process. Usually a fitness calculation is applied to the cyansolutions so that the comparison between them becomes possible. For instance, the SPEA2 algorithm (Zitzler et al., 2001) uses the concepts of Combined Dominance Strength (how many solutions a solution dominates and how many solutions dominate it) and K­th Nearest Neighbour (the dis­ tance to the k­th nearest neighbour) to assign a fitness value for each solution, whereas NSGA­II (Deb et al., 2002) uses the concepts of Dominance Depth (rank of the sub­front) and Crowding Distance (density estimation). Notice that the fitness evaluation in both examples uses two kinds of evaluations: i) a convergence assessment (Com­ bined Dominance Strength and Dominance Depth); and ii) a diversity assessment (K­th Nearest Neighbour and Crowd­ ing Distance). The convergence regards how close the solu­ tions of a front are to a reference front (usually optimal or best­known). A good convergence in the search process can guide the algorithm on finding solutions closer to the refer­ Guizzo et al. 2020 ence Pareto front and can improve the overall performance of the algorithm. On the other hand, a good search diversity can prevent the algorithm from falling into a local optimum and provide a better exploration of the search space. The idea is to balance both factors during the optimization process in order to optimize the resulting output. However, this might be a difficult task and may come along with some drawbacks (e.g. computational cost). In addition, there are other parameters and components that must be taken into account when designing a MOEA. Some of the most distinguishable parameters are: population size, archiving size, and mutation and crossover probabili­ ties. The archiving size regards to the size of the archive used by some MOEAs (Coello et al., 2007) as an external population to support the evolution process. Furthermore, the MOEAs components can have different implementations. For instance, the replacement strategy can be generational or ranking based. All these details contribute to increment the al­ gorithm complexity, and require a lot of effort from a novice and non­expert engineer in the optimization field. Unfortunately, these details largely influence the perfor­ mance of the algorithms, and their design and tuning are an optimization problem itself (Eiben and Smit, 2011). This serves as the main motivation for our approach proposed in the next section. 3 Proposed Approach The proposed approach uses offline training to automatically design a MOEA specialized in the ITO problem. The training can be performed by using two different algorithms. Figure 3 shows how the proposed approach works3. Figure 3. Proposed Approach Two inputs are used by the approach: the training instance and the set of components and parameters, which are de­ scribed in detail, respectively in Sections 3.2 and 3.3. The instance of the problem is provided by the user, who also chooses a design algorithm that can be either GE or irace. Another input is the setofcomponentsand parameters,which are defined in a representation compatible with the selected design algorithm (grammar or parameter space). This set is pre­defined, but can be modified or extended if desired. Then, the design algorithm is executed with the training instance, and at the end the best MOEA is returned. This MOEA is used by the tester to solve other ITO instances. The design algorithms work with a population where each individual is a MOEA. The fitness is given by some indica­ tor calculated by using the corresponding fronts obtained by 3Source code of our approach at https://github.com/ GiovaniGuizzo/jMetalGrammaticalEvolution. each individual to the ITO problem. In this work we use hy­ pervolume indicator (Zitzler et al., 2003). It is important to emphasise that the GE and irace algo­ rithms work on a higher level of the search. Instead of trying to find the solution for the problem directly, these two algo­ rithms try to generate the best MOEA that can in turn solve the problem. Hence, these two algorithms search for MOEAs in the “MOEA space” using conventional search methods with which they were proposed (López­Ibáñez et al., 2011; Ryan et al., 1998). Next, we describe the structure of the gen­ erated MOEAs by detailing their main components and the representation used by each design algorithm. 3.1 MOEA Algorithm 2 shows the structure of a standard MOEA ma­ nipulated and returned by the approach. The components of each step and their respective parameters are selected by the design algorithm. These steps are: initialization of the popu­ lation (Line 2), evaluation of the population (Lines 3 and 9), selection of parents (Line 6), crossover operator (Line 7), mu­ tation operator (Line 8), replacement (Line 10) and archiving of the individuals (Lines 4 and 11). Algorithm 2: Template of a designed MOEA 1 begin 2 population ← Initialize the population; 3 Evaluate (population); 4 Archive (population); 5 while stop condition is not achieved do 6 matingP opulation ← Selection (population); 7 of f springP opulation ← Crossover (matingP opulation); 8 of f springP opulation ← Mutation (of f springP opulation); 9 Evaluate (of f springP opulation); 10 Replacement (of f springP opulation, population); 11 Archive (of f springP opulation); 12 end 13 end We divide the fitness assignment into three independent components, one for selection, another for replacement and one for archiving. By changing these assignments separately, the design algorithms can find better MOEAs by focusing on one kind of search at each step. For instance, a selection component can focus on mating more diversified parents, whereas a replacement component can focus on making the most converged solutions to survive. 3.2 Training Instance A training instance of the ITO problem must be given by the user, so that the design algorithms can execute the generated MOEAs. This instance must contain the matrices mentioned in Section 2. This information is later used to formulate a per­ mutation problem, where each unit is represented by its ID. During the problem solving, the order in which the units ap­ pear in the chromosome determines the order in which they are integrated and tested. Because ITO is a permutation prob­ lem, the MOEA components and parameters used in the pro­ https://github.com/GiovaniGuizzo/jMetalGrammaticalEvolution https://github.com/GiovaniGuizzo/jMetalGrammaticalEvolution Guizzo et al. 2020 posed approach are focused on the permutation representa­ tion. 3.3 Components and Parameters We chose the parameters and components based on experi­ ments and tuning conducted in related work (Assunção et al., 2014; Guizzo et al., 2015; Briand et al., 2002a). For instance, in Guizzo et al. (2015) the authors used three permutation crossover operators in their online operator selection, which are also included here. The parameters and components are categorized regarding the following MOEA steps: population initialization, selec­ tion, mating, replacement and archiving. Moreover, some fit­ ness assignment strategies are used in the selection, replace­ ment and archiving to guide the evolution. In this sense, the Fitness Assignment component is defined as a mechanism to identify which solution is the best one according to all ob­ jectives, but its three usages are completely independent and can vary according to the best outcome. Most components and parameters are generic enough to be used for any problem and by any algorithm (e.g. crossover and selection operator, type of population initialisation), but a few components are more specific and were extracted from existing algorithms (e.g. replacement strategy, archiv­ ing of solutions, fitness evaluation mechanism). We have extracted and implemented the components and parame­ ters from the following algorithms: NSGA­II (Deb et al., 2002), SPEA (Zitzler et al., 2001), SPEA2 (Zitzler et al., 2001), Multi Objective Genetic Algorithm (MOGA) (Fon­ seca and Fleming, 1993), Pareto Achieved Evolution Strat­ egy (PAES) (Knowles and Corne, 2000), and Indicator Based Evolutionary Algorithm (IBEA) with hypervolume (Zitzler et al., 2003). Hence, our approach is able (although unlikely) to generate each of those algorithms by joining their compo­ nents together during the evolutionary process. If the tester wants to adapt our approach, for instance by including the components of the their own algorithm, they just need to im­ plement the components and add them to the grammar (ex­ plained in more details in Section 3.4). Next, we present all these elements in detail. 3.3.1 Population This element is composed by a Population Size and an Initial­ ization procedure. Population Size specifies the number of individuals in the population. Initialization defines how the first population is initialized. It can be done by using Random or Parallel Diversification (Talbi, 2009). The latter aims at generating diversified solutions by initializing the individu­ als in a way that an integer number cannot be repeated at the same position in another individual of the population. 3.3.2 Selection The Selection element is related to the selection of parents to be recombined and defines the Source and Selection Opera­ tor components. Source specifies from where the parents are extracted. That way, the parents can be selected only from the current population, or from the archive and population combined. Selection Operator specifies the strategy used to select the parents. Random randomly selects two solutions to be recombined. K­Tournament performs k number of bi­ nary tournaments (comparisons) between random solutions and chooses the best two to be recombined. In such a case, the parameter k is also selected. Roulette Wheel gives a prob­ ability based on the fitness value of a parent, and performs a selection complying with the probabilities of each parent. Ranking classifies the solutions based on the fitness value and selects the best ones to be recombined. The K­Tournament, Roulette Wheel and Ranking selection operators use the fitness of each solution to aid the selection. That way, they use the Fitness Assignment component. 3.3.3 Fitness Assignment The Fitness Assignment element encompasses the Conver­ gence Strategy component to assess the quality regarding the convergence of the solutions, and the Diversity Strategy com­ ponent for the tie­breaking of solutions with the same conver­ gence value. Another possibility is the usage of only one kind of strategy, either convergence or diversity for the evaluation. If no Convergence Strategy component is selected, then the Diversity Strategy component becomes the primary metric for fitness assignment. There are four possible components for the Convergence Strategy. Dominance Depth (NSGA­II) (Deb et al., 2002) as­ sesses the convergence quality of the solutions using Pareto fronts. The first Pareto front has all the non­dominated solu­ tions of the population. The second Pareto has all the non­ dominated solutions excluding the ones in the first Pareto front. Such process is performed until there are no more solu­ tions. The Dominance Depth of a solution is the Pareto front number in which it is present, thus the lower the Dominance Depth value, the better. Dominance Strength (SPEA) (Zitzler et al., 2001) computes the number of solutions that a solu­ tion dominates. If a solution dominates many others (greater strength), then it may indicate that this solution dominates a great area of the objective space. Raw Fitness (SPEA2) (Zit­ zler et al., 2001) assesses the convergence quality by com­ puting the sum of the strength values of all the solutions that dominate a solution. Thus, the lower this value, the less likely a solution is to be on an easily dominated area of the objec­ tive space. Dominance Rank (MOGA) (Fonseca and Flem­ ing, 1993) computes the number of solutions that dominates a solution, thus the lower this value, the better. Regarding the Diversity Strategy, there are four possibil­ ities. Crowding Distance (NSGA­II) (Deb et al., 2002) is based on the distance between the neighbours solutions in the objective space. A low crowding distance value means that the solution is in a crowded area of the objective space, and possibly brings low diversity to the search (e.g. if used as a parent for recombination). K­th Nearest Neighbour (SPEA2) (Zitzler et al., 2001) assesses the distance from a solution to its kth nearest neighbour solution in the objective space. The parameter k is defined as in (Zitzler et al., 2001): k = √ N + N, where N is the population size and N is the archive size. Similar to Crowding Distance, it is necessary Guizzo et al. 2020 to maximize the value the K­th Nearest Neighbour diversity strategy. Adaptive Grid (PAES) (Knowles and Corne, 2000) divides the objective space into grids to trace the crowding degree of different regions. It is possible to diversify the non­dominated solutions and help to remove excessive non­ dominated solutions located in the crowded grids. The adap­ tive grid value of a solution is the number of solutions in its grid, thus the lower the value, the more isolated the solution is. Hypervolume Contribution (Zitzler et al., 2003) is based on the hypervolume quality indicator. Briefly, the hypervol­ ume contribution of a solution measures how much a solution strictly contributes to the hypervolume of the front in which it is contained. Therefore, the greater the hypervolume con­ tribution of a solution, the bigger the space dominated only by this solution. 3.3.4 Mating Mating Strategy (Talbi, 2009) defines the way and how many off­springs are created in each generation. Generational One Child and Generational Two Children generate N children at each generation. The former generates one in each recombi­ nation and the latter generates two. Steady State generates only one offspring at each generation. Even though Steady State is usually defined as a replacement strategy (Eiben and Smith, 2003), here it is defined as a reproduction strategy and the replacement component is responsible to insert the gener­ ated solution in the population according to the replacement strategy. Mating Operators are composed by the Crossover Oper­ ator, Mutation Operator (Eiben and Smith, 2003) and their probabilities. Since in this paper we are addressing a permu­ tation problem, all the crossover and mutation operators used here are for permutation representations. The Crossover Operator creates one or more off­springs by combining the genes of two parents. In this approach, no crossover or one of four crossover operators can be selected. Single Point Crossover selects a random point and cuts the parents in this point. One half of each parent is merged into different children. Two Points Crossover has a similar strat­ egy, but two random points are chosen to cut the parents. The sub­array inside each cutting point of a parent is copied to a solution, and the remaining genes are selected from the other parent. Partially Mapped Crossover (PMX) also cuts the par­ ents in two points, but it uses a more complex cyclic proce­ dure to select the genes of the sub­arrays. Cycle Crossover works by dividing the elements into cycles. A cycle is a sub­ set of elements where each element always occurs paired with another element of the same cycle when two parents are aligned. The off­springs are created by selecting alternate cy­ cles from each parent. A Mutation Operator applies some kind of transformation intheindividual. Inthisapproach,cyanthealternativestothat component are no mutation or one of the four following mu­ tation operators: Swap Mutation randomly selects two genes and swaps their values. Insert Mutation randomly selects a random value and moves it to a random position in the chro­ mosome. Scramble Mutation randomly selects two genes and shuffles the values between them. Inversion Mutation ran­ domly selects two genes and inverts the order of the values appearing between such genes. 3.3.5 Replacement Replacement represents the strategy to define which individ­ uals survive and compose the next generation. The Genera­ tional strategy defines that the parents are always replaced by the off­springs in the next generation. This replacement takes into account the idea of elitism. Briefly, the elitism forces the survival of a predetermined number of best parents for the next generation. The Elitism Size is also selected by the algorithm. If the elitism is indeed selected, then the Fitness Assignment component is used to determine the best parents for survival. The Ranking replacement creates a ranking of the solutions, based on the fitness values that are measured by the Fitness Assignment component. That way, the best so­ lutions are selected to survive, regardless of being parents or children. 3.3.6 Archiving The Archiving element is related to the archiving proce­ dure used to store the solutions. This is employed by some algorithms such as SPEA2 (Zitzler et al., 2001) and PAES (Knowles and Corne, 2000). Sometimes the archive is called external population, but the purpose is the same: to aid the search process with another source of solutions for repro­ duction, or simply to store the best solutions found so far. At each generation, the archive is updated with the newly gener­ ated solutions. In this work the Ranking component is always used to rank the solutions using the Fitness Assignment com­ ponent, and the best solutions are kept in the archive. It is able to store dominated and non­dominated solutions. Moreover, the Archive Size parameter is used to define how many solu­ tions are stored in the archive. In this work, a set of popula­ tion size percentages is used. If such value is zero, no archive is used by the MOEA. 3.4 Design Algorithms For the generation, execution, evaluation and selection of MOEAs the tester can choose between GE and irace. These algorithmshavethesamepurpose: togenerateMOEAsbased on several trials. However, they require different representa­ tions for the elements and components presented in previous section and specialized for the ITO (permutation) problems. The GE algorithm generates each MOEA using the following grammar. ⟨GA⟩ ::= ⟨populationSize⟩ ⟨initialization⟩ ⟨selection⟩ ⟨mating⟩ ⟨replacement⟩ ⟨archive⟩ ⟨populationSize⟩ ::= 50 | 100 | 150 | 200 | 250 | 300 ⟨initialization⟩ ::= Random | Parallel Diversification ⟨selection⟩ ::= ⟨selectionOperator⟩ ⟨source⟩ ⟨fitnessAssignment⟩ ⟨selectionOperator⟩ ::= K Tournament ⟨tournamentSize⟩ | Random | Roulette Wheel | Ranking ⟨tournamentSize⟩ ::= 2 | 4 | 6 | 8 | 10 ⟨source⟩ ::= Population | Archive and Population Guizzo et al. 2020 ⟨fitnessAssignment⟩ ::= ⟨convergenceStrategy⟩ ⟨diversityStrategy⟩ ⟨convergenceStrategy⟩ ::= λ | Dominance Rank | Dominance Strength | Dominance Depth | Raw Fitness ⟨diversityStrategy⟩ ::= λ | Crowding Distance | K-th Nearest Neighbour | Adaptive Grid | Hypervolume Contribution ⟨mating⟩ ::= ⟨matingOperators⟩ ⟨matingStrategy⟩ ⟨matingOperators⟩ ::= ⟨crossoverOperator⟩ ⟨crossoverProbability⟩ ⟨mutationOperator⟩ ⟨mutationProbability⟩ ⟨crossoverOperator⟩ ::= λ | Two Points Crossover | Single Point Crossover | PMX Crossover | Cycle Crossover ⟨crossoverProbability⟩ ::= 1.0 | 0.95 | 0.9 | 0.8 | 0.5 ⟨mutationOperator⟩ ::= λ | Swap Mutation | Insert Mutation | Scramble Mutation | Inversion Mutation ⟨mutationProbability⟩ ::= 0.01 | 0.02 | 0.05 | 0.1 | 0.2 | 0.5 | 0.7 | 0.8 | 0.9 | 1.0 ⟨matingStrategy⟩ ::= Steady State | Generational Two Children | Generational One Child ⟨replacement⟩ ::= Generational ⟨elitismSize⟩ ⟨fitnessAssignment⟩ | Ranking ⟨fitnessAssignment⟩ ⟨elitismSize⟩ ::= 0 | N * 0.01 | N * 0.05 | N * 0.1 | N * 0.5 ⟨archive⟩ ::= Ranking ⟨fitnessAssignment⟩ ⟨archiveSize⟩ ⟨archiveSize⟩ ::= 0 | N | N * 1.5 | N * 2 Based on this grammar, the symbol λ means empty com­ ponent. If an alternative for a component is not wanted by the tester, then it can be removed from the grammar and it will not be used by the GE algorithm. Similarly, if the tester al­ readyhasaMOEAandonlywants toconfiguresomeparame­ ters, then it is necessary to set the components of the MOEA in the grammar without other alternatives. For instance, to apply the NSGA­II algorithm and automatically configure it, the tester must set the replacement rule with only the “Rank­ ing” alternative, the fitness assignment component with only “Dominance Depth” and “Crowding Distance”, and any other element that must not change. Similarly, the parameter space of irace is defined with all the parameters and components, and can be changed if desired. The parameter space used by our approach is struc­ tured as follows. populationSize "--populationSize " c(50,100,150,200,250,300) initialization "--initialization " c("Random","Parallel Diversified Initialization") selectionOperator "--selectionOperator " c("K Tournament","Random","Roulette Wheel","Ranking") tournamentSize "--tournamentSize " c(2,4,6,8,10) | selectionOperator == "K Tournament" selectionSource "--selectionSource " c("Population","Archive and Population") selectionRankingStrategy "--selectionRanking " c(" ","Dominance Rank", "Dominance Strength", "Dominance Depth", "Raw Fitness") | selectionOperator != "Random" selectionDiversityStrategy "--selectionDiversity " c(" ","Crowding Distance", "K-th Nearest Neighbour", "Adaptive Grid", "Hypervolume Contribution") | selectionOperator != "Random" crossoverOperator "--crossoverOperator " c(" ","Two Points Crossover", "Single Point Crossover", "PMX Crossover", "Cycle Crossover") crossoverProbability "--crossoverProbability " c(1.0,0.95,0.9,0.8,0.5) | crossoverOperator != " " mutationOperator "--mutationOperator " c(" ","Swap Mutation","Insert Mutation","Scramble Mutation", "Inversion Mutation") mutationProbability "--mutationProbability " c(0.01,0.02,0.05,0.1,0.2,0.5,0.7,0.8,0.9,1.0) | mutationOperator != " "reproduction "--reproduction " c("Steady State","Generational Two Children","Generational One Child") replacement "--replacement " c("Generational","Ranking") elitismSize "--elitismSize " c(0,0.01,0.05,0.1,0.5) | replacement == "Generational" replacementRankingStrategy "--replacementRanking " c(" ","Dominance Rank","Dominance Strength", "Dominance Depth", "Raw Fitness") | replacement == "Ranking" || (replacement == "Generational" && elitismSize != "0") replacementDiversityStrategy "--replacementDiversity " c(" ","Crowding Distance","K-th Nearest Neighbour","Adaptive Grid", "Hypervolume Contribution") | replacement == "Ranking" || (replacement == "Generational" && elitismSize != "0") archiveSize "--archiveSize " c(0,1.0,1.5,2.0) archiveRankingStrategy "--archiveRanking " c(" ","Dominance Rank","Dominance Strength", "Dominance Depth", "Raw Fitness") | archiveSize != "0" archiveDiversityStrategy "--archiveDiversity " c(" ","Crowding Distance","K-th Nearest Neighbour", "Adaptive Grid", "Hypervolume Contribution") | archiveSize != "0" The grammar used by GE, as well as the parameter space of irace, were created to ensure the generation of only valid combinations. This is one of the advantages of using GE al­ gorithms: they can only generate what the grammar allows, making the resulting solutions always syntactically correct. 4 Empirical Evaluation As mentioned before, the evaluation of our experimental evaluation was guided by two research questions. RQ1 com­ pares the MOEAs generated using our approach with the tra­ ditional MOEAs used in the ITO literature. RQ2 compares both design algorithms GE and irace. To answer the questions, we use eight real world systems, also explored in related work (Assunção et al., 2014; Guizzo et al., 2015). One was used for training and seven for testing. They are implemented in Java and AspectJ. Table 1 shows the number of units, dependencies and LOC of each system. The number of dependencies directly impacts the number of existing solutions for the problem. AJHSQLDB (first row) is used only for the training. Table 1. Systems used in the empirical evaluation Name Units Dependencies LOC Language Training AJHSQLDB 301 1338 68550 AspectJ Testing AJHotDraw 321 1592 18586 AspectJ HealthWatcher 117 399 5479 AspectJ TollSystems 77 188 2496 AspectJ BCEL 45 289 2999 Java JBoss 150 367 8434 Java JHotDraw 197 809 20273 Java MyBatis 331 1271 23535 Java The empirical evaluation is performed in two phases. In the training phase, the proposed approach is executed in the training instance AJHSQLDB to automatically generate a set of MOEAs. Then, in the testing phase, the generated MOEAs are executed in all testing instances of the problem and are evaluated. Guizzo et al. 2020 To answer RQ1, we execute two MOEAs, successfully used in related work to solve the ITO problem (Assunção et al., 2014): i) NSGA­II (Deb et al., 2002); and ii) SPEA2 (Zitzler et al., 2001). In order to perform a fair com­ parison, we use the proposed approach with GE to tune these algorithms. This is another advantage of this approach: it can also be used to tune algorithms and not only to design new ones. For this tuning, we adapted a grammar for each algo­ rithm, fixing some key components and parameters of the al­ gorithms, and only letting the other ones vary. We fixed the replacement as Ranking with Dominance Depth and Crowd­ ing Distance for the NSGA­II grammar, whereas for the SPEA2 grammar we fixed this component as Generational with no elitism. If we let the GE tune all elements of NSGA­ II and SPEA2, then they could lose their main features and become totally different algorithm implementations. All the components were implemented with jMetal (Durillo and Ne­ bro, 2011). The approach was executed once for each algo­ rithm and for the same amount of evaluations (10,000). In the end, the best configuration was selected and then used in the testing phase for 60,000 fitness evaluations, as well as the generated MOEAs. Table 2 shows the configuration of each algorithm chosen by the approach. Table 2. Parameters of NSGA­II and SPEA2 Parameter NSGA­II SPEA2 Population Size 50 50 Maximum Fitness Evaluations 60,000 60,000 Crossover Operator PMX PMX Crossover Probability 100% 95% Mutation Operator Swap Swap Mutation Probability 1% 5% Archive Size ­ 50 4.1 Training Phase The system AJHSQLDB was chosen as the training instance of our approach, because it is the biggest instance. In prelim­ inary experiments, we observed that using an easily solvable instance usually results in a weak training. Furthermore, we use only one instance of the problem for the training because otherwise it would increase the cost of such a phase. We ac­ knowledge that the use of several instances would increase the training quality, but this is a subject for a future work only focused on this trade­off. We execute each design algorithm (GE and irace) 10 times. That way, each run returns one MOEA, for a total of 20 MOEAs. For the GE configuration, we define the param­ eters based on the literature (Lourenço et al., 2013). Table 3 presents such configuration. The irace algorithm requires as parameter only a training budget (number of MOEA evaluations), for which we use the same amount given to GE: 10,000. In addition, each gener­ ated MOEA is executed by the algorithms for 2,000 fitness evaluations. We use few fitness evaluations for the training due to the great computational time needed for this task. Sum­ marizing, each design algorithm is executed 10 times with 10,000 evaluations in each independent run, and each gener­ ated MOEA receives a budget of 2,000 fitness evaluations Table 3. GE parameters Parameter Value Population Size 100 Number of MOEA Evaluations 10,000 Number of Fitness Evaluations by MOEA 2,000 Crossover Operator Single Point Crossover Crossover Probability 90% Mutation Operator Integer Mutation Mutation Probability 1% Selection Operator Binary Tournament Pruning Operator Probability 1% Duplication Operator Probability 1% Minimum of Genes in the Initial Population 10 Maximum of Genes in the Initial Population 20 Replacement Strategy Ranking during the training phase. We analyse the parameters values and the components of the MOEAs returned in all runs. Based on the 20 obtained MOEAs, we compute the frequencies that the values appear. Table 4 shows, for each parameter and component and for each design algorithm, the values that appear more often in the MOEAs. As seen in Table 4, some components and parameters are clearly dominant, since they appear in the design of the best MOEAs very often, regardless of which design algorithm is being used. For example, Steady State mating, Ranking se­ lection operator and Dominance Strength for the selection procedure are used for all the 20 MOEAs. Other compo­ nents and parameters such as Parallel Diversified initializa­ tion, 100% mutation probability, Archive and Population se­ lection source and Ranking archiving are used for almost ev­ ery obtained MOEA. Even though the design algorithms obtain similar MOEAs, some differences can be noted by analysing these frequen­ cies. For instance, using GE, the best MOEAs always use Inversion Mutation, whereas using irace the best MOEAs always use Swap Mutation. In addition, irace presents greater crossover probabilities and a convergence strategy for the archiving procedure, while the GE algorithm gener­ ates MOEAs with the lowest crossover probability 50% of the time and it does not use a convergence strategy for archiv­ ing more than half of the time. 4.2 Testing Phase In order to answer the research questions, the 20 MOEAs generated by our approach in the training phase, and the tra­ ditional algorithms NSGA­II and SPEA2 are executed using all the seven testing instances of the problem. For each instance and MOEA, 30 independent runs are ex­ ecuted for 60,000 fitness evaluations. In the comparison, we use the hypervolume indicator (Zitzler et al., 2003) to assess the quality of each front obtained after each execution. We do not know the real Pareto fronts of the problems, thus this indicator can be used because it does not require such front. Furthermore, if the hypervolume value of a front A is greater than the value of a front B, then A is not worse than B. These characteristics make hypervolume suitable for the context of this experimentation. In order to compare the algorithms, we calculate the hy­ pervolume value of the 30 runs of each algorithm in each Guizzo et al. 2020 Table 4. Frequency of most used parameters values and components. Some frequencies are not multiples of 10 because in some executions the corresponding components/parameters were not applicable, since they depended on other components that were not selected. Category Parameter/Component GE irace Value Frequency Value Frequency Population Population Size 50 50% 50 50% Initialization Parallel Diversified 80% Parallel Diversified 80% Mating Mating Strategy Steady State 100% Steady State 100% Crossover Operator Single Point 40% Single Point 50% None 40% ­ ­ Crossover Probability 0.5 50% 0.8 33.33% ­ ­ 0.95 33.33% Mutation Operator Inversion Mutation 100% Swap Mutation 100% Mutation Probability 1.0 80% 1.0 90% Selection Source Archive and Population 80% Archive and Population 70% Selection Operator Ranking 100% Ranking 100% Convergence Strategy Dominance Strength 100% Dominance Strength 100% Diversity Strategy Hypervolume Contribution 40% K­th Nearest Neighbour 30% ­ ­ Hypervolume Contribution 30% Replacement Convergence Strategy Dominance Rank 40% Dominance Strength 50% Diversity Strategy Hypervolume Contribution 50% K­Th Nearest Neighbour 40% ­ ­ None 40% Archiving Type Ranking 70% Ranking 60% Archive Size N * 2 42.85% N * 1.5 50% Convergence Strategy None 57.14% Dominance Rank 33.33% Diversity Strategy K­th Nearest Neighbour 28.57% Adaptive Grid 50% Hypervolume Contribution 28.57% ­ None 28.57% ­ instance, and used the Kruskal­Wallis statistical test at 95% of confidence (Derrac et al., 2011) to address statistical dif­ ferences on the hypervolume values. Moreover, for each in­ stance, we calculate the rank of the algorithms based on their mean hypervolume. In this ranking process, algorithms with­ out statistical difference are considered tied. In the end, we calculate themeanhypervolumeandmeanrankforeachalgo­ rithm across all instances. We also compute the Friedman sta­ tistical test at 95% of confidence over the seven mean hyper­ volume values of each algorithm to determine if there is any overall statistical difference between their results. Finally, we compute the Vargha­Delaney’s Â12 effect size (Vargha and Delaney, 2000). We use multiple group p­value analysis (Kruskal­Wallis and Friedman) based on multiple groups (algorithms) due to the number of algorithms used in the experimentation. Using Mann­Whitney U instead would result in a large amount of pair p­values. While Kruskal­Wallis is used to compute dif­ ferences in a given experimental subject, Friedman is used on the means of the algorithms to give an overall statistical anal­ ysis across multiple systems. Furthermore, for the post­hoc test we use the suggestion of Siegel and Castellan (1988). It is also the default post­hoc technique used for such statistical tests in R. For the sake of succinctness, we select and report only some of the generated algorithms. In the next paragraphs we present the results for the best, median and worst algo­ rithms of each method according to their mean rank in terms of hypervolume. The best, median, and worst algorithms ob­ tained by the GE algorithm are named GE_Best, GE_Median, and GE_Worst respectively. Similarly, the best, median, and worst algorithms obtained by irace are I_Best, I_Median, and I_Worst respectively. Table 5 presents the mean hypervolume for each instance and algorithm, and its mean rank in parentheses4. The last row presents the overall mean hypervolume and mean rank. The last column presents the p­value obtained using the Kruskal­Wallis test, except in the last row where the Fried­ man test result is shown. The best values (greatest for hyper­ volume and lowest for ranks) and the statistically equivalent values to the best ones are highlighted in bold. In general, the results obtained by the best and median gen­ erated algorithms are better than the ones obtained by the conventional ones. GE_Best and I_Best are able to obtain the best or equivalent to the best results according to the Kruskal­ Wallis test in almost every instance of the problem. Even the worst generated algorithms (GE_Worst and I_Worst) per­ form better than NSGA­II and obtain competitive results to SPEA2 in overall. I_Best performs better in general, which results in the best mean rank (3.14). GE_Best obtains the second best re­ 4Data as used for the test can be downloaded at https: //github.com/GiovaniGuizzo/jMetalGrammaticalEvolution/ blob/master/results.7z. https://github.com/GiovaniGuizzo/jMetalGrammaticalEvolution/blob/master/results.7z https://github.com/GiovaniGuizzo/jMetalGrammaticalEvolution/blob/master/results.7z https://github.com/GiovaniGuizzo/jMetalGrammaticalEvolution/blob/master/results.7z Guizzo et al. 2020 Table 5. Hypervolume means and ranks System GE GE GE IRACE IRACE IRACE NSGA­II SPEA2 p­value Best Median Worst Best Median Worst AJHotDraw 0.88 (2.00) 0.87 (2.00) 0.55 (5.50) 0.79 (2.50) 0.64 (5.00) 0.56 (5.50) 0.27 (7.50) 0.51 (6.00) 2.2E­16 HealthWatcher 1.00 (4.00) 1.00 (4.00) 0.98 (4.00) 0.99 (4.00) 0.99 (4.00) 0.94 (4.50) 0.92 (7.50) 0.99 (4.00) 1.1E­09 TollSystems 0.90 (5.00) 1.00 (2.50) 0.95 (4.00) 0.93 (4.00) 0.90 (5.00) 0.88 (5.00) 0.77 (6.50) 0.94 (4.00) 1.4E­08 BCEL 0.76 (2.00) 0.67 (6.00) 0.46 (7.50) 0.74 (2.50) 0.74 (2.50) 0.63 (6.50) 0.72 (3.50) 0.69 (5.50) 2.2E­16 JBoss 0.96 (4.50) 1.00 (4.50) 0.97 (4.50) 0.96 (4.50) 0.98 (4.50) 0.91 (4.50) 0.90 (4.50) 0.96 (4.50) 0.008 JHotDraw 0.69 (3.00) 0.73 (2.00) 0.52 (6.00) 0.72 (2.50) 0.55 (5.50) 0.51 (6.00) 0.45 (6.50) 0.60 (4.50) 1.2E­14 MyBatis 0.70 (2.00) 0.61 (5.00) 0.45 (7.50) 0.75 (2.00) 0.69 (2.00) 0.49 (7.00) 0.61 (5.00) 0.59 (5.50) 2.2E­16 Mean 0.84 (3.21) 0.84 (3.71) 0.70 (5.57) 0.84 (3.14) 0.78 (4.07) 0.71 (5.57) 0.66 (5.86) 0.75 (4.86) 3.606E­4 sults with a mean rank difference of only 0.07 to I_Best and with no statistical difference according to the Friedman and Kruskal­Wallis tests. The only system (TollSystem) in which GE_Best is statistically worse than the other algorithms is in fact the smallest one. The best known solution for this sys­ tem is always found by GE_Median. Furthermore, the me­ dian and worst algorithms obtained by the GE algorithm per­ form better than the median and worst algorithms of irace respectively. The Friedman test presents differences only between GE_Best and NSGA­II, GE_Median and NSGA­II, and GE_Best and I_Worst. Even though I_Best obtains the best results (equal hypervolume to GE_Best and GE_Median, but better rank), the Friedman test result shows no statistical dif­ ferences between I_Best and any other algorithm. Perhaps with more systems in the experiment, the statistical power would be higher and we could see more differences between the algorithms. With only seven systems, the more conser­ vative statistical assessment of the non­parametric Friedman test takes place, preventing us from assuming and report other differences. Figures4­6showthefrontsobtainedbythealgorithms.We present only the fronts of the three more complex systems based on the number of dependencies presented on Table 1. The other instances are omitted because the algorithms found similar solutions for them. Figure 4. Fronts found by the algorithms for AJHotDraw Based on the presented fronts, for the AJHotDraw in­ stance, the solutions found by NSGA­II and SPEA2 are dom­ inated by the ones found using the other algorithms. More­ over, the GE algorithms obtain better solutions. For the My­ Batis instance, the best solutions are obtained by I_Best and GE_Best. For the JHotDraw instance, I_Best found two non­dominated solutions, and one of them is also found by GE_Best and GE_Worst. In general, with exception of JHot­ Figure 5. Fronts found by the algorithms for JHotDraw Figure 6. Fronts found by the algorithms for MyBatis Draw, the algorithms are able to find diversified solutions in the fronts. In order to provide another way to analyse the differences between the algorithms, we present the Vargha­Delaney’s Â12 effect size (Vargha and Delaney, 2000; Arcuri and Briand, 2014). The Â12 test measures the probability of an effect magnitude regarding the difference between a group X and another group Y , where an Â12 value of 1.0 means that X always outperforms Y , 0.0 means that Y always out­ performs X, and 0.5 means that X and Y are equally good. According to Vargha and Delaney (2000), the Â12 values can be read as follows: Â12 magnitude =   Large if 0.71 < Â12 ≤ 1.0 M edium if 0.64 < Â12 ≤ 0.71 Small if 0.56 < Â12 ≤ 0.64 N egligible if 0.44 ≤ Â12 ≤ 0.56 Small if 0.36 ≤ Â12 < 0.44 M edium if 0.29 ≤ Â12 < 0.36 Large if Â12 < 0.29 This statistical test is computed over the 30 hypervolume values of each algorithm and compared to the other ones in each instance. Because this is a binary comparison, there Guizzo et al. 2020 would be 196 effect size values to report: 28 values for each of the seven instances. In this sense, for a cleaner view of such data, we summarize these results in Figures 7 and 8, and in Table 6. The figures present respectively the box plot for the Â12 effect size values obtained in the comparisons be­ tween I_Best and the other algorithms, and between GE_Best and the other algorithms in all instances. We chose these two algorithms because they were the best obtained algorithms by each generation algorithm. Any value above 0.5 is a bet­ ter Â12 in favour of the main algorithm. Figure 7. Box plot of the effect size values for I_Best Figure 8. Box plot of the effect size values for GE_Best I_Best performs similarly to GE_Best and, as expected, both perform better than NSGA­II and SPEA2. Furthermore, the differences between the best, median and worst algo­ rithms become clearer. For example, in Figure 7 we can see that GE_Best obtained closer results to I_Best, then GE_Median and then GE_Worst. This “ladder” is also vis­ ible in Figure 8 regarding I_Best, I_Median and I_Worst. Table 6 presents the number of non­negligible favourable effect size values minus the number of non­negligible un­ favourable effect size values for each algorithm in a given row when compared to another algorithm in the respec­ tive column. For example, GE_Best (first row) obtained 5 favourable, 1 negligible, 1 unfavourable effect size values when compared to GE_W (fourth column), hence the differ­ ence is 4 (5 favourable minus 1 unfavourable). For each cell, the maximum value is 7 and the minimum is ­7, representing, respectively, that the row algorithm was significantly better or worse than the column algorithm in all systems. The last column shows the sum of the differences. Table 6. Non­negligible effect size counts per algorithm. Alg. GE_B GE_MGE_WI_B I_M I_W NSGA­ II SPEA2To­ tal GE_B — 0 4 0 5 6 7 4 26 GE_M 0 — 6 2 3 7 4 5 27 GE_W ­4 ­6 — ­3 ­3 1 3 ­2 ­14 I_B 0 ­2 2 — 2 5 6 4 17 I_M ­5 ­3 3 ­2 — 6 7 1 7 I_W ­6 ­7 ­1 ­5 ­6 — 2 ­4 ­27 NSGA­II ­7 ­4 ­3 ­6 ­7 ­2 — ­3 ­32 SPEA2 ­4 ­5 2 ­4 ­1 4 3 — ­5 We could observe some ties in the number of effect size counts between some of the algorithms. GE_Best tied with GE_Median and I_Best, whereas I_Best only tied with GE_Best and not with GE_Median. Furthermore, GE_Median also obtained the best total sum of count dif­ ferences, even though its hypervolume average is the same and the ranking is lower than GE_Best and I_Best. This in­ dicates that these algorithms performed very similarly in the testing set used in this work, which can also be observed in the proximity between their Pareto fronts shown in Figures 4­ 6. NSGA­II shows the worst results in the effect size com­ parisons, whereas SPEA2 is more competitive to the other algorithms by overcoming the worst generated algorithms. Comparing the generated algorithms and the traditional ones, we observe that all generated MOEAs are able to out­ perform NSGA­II and obtain competitive results to SPEA2. Additionally, some algorithms performed significantly bet­ ter than the traditional ones, specially I_Best, GE_Best and GE_Median. In this sense, we can answer RQ1 by asserting that the proposed approach is capable of generating MOEAs that are better than conventional ones in the literature. Regarding RQ2, it is not possible to conclude which de­ sign algorithm is better for this problem. irace obtains the best overall MOEA, but GE generates a better set of al­ gorithms considering the median and worst generated algo­ rithms. In other words, both best and median algorithms of GE are able to statistically outperform NSGA­II and some­ times SPEA2, which occurs less often for the MOEAs gen­ erated by irace. However, irace generated the best algo­ rithm considering hypervolume values, rankings, and statis­ tical differences across all systems. 4.3 Further Discussion An advantage of GE is that its grammar provides a flexi­ ble way of design algorithms. By using such a grammar, the tester can create complex structures by changing the arrange­ ment of the grammar rules. This enables not only the gener­ ation of MOEAs with a common template, but also MOEAs using different procedures (e.g. multiple crossover and muta­ tion). On the other hand, GE needs more configuration (num­ ber of parameters) than irace. Considering the structure of the generated MOEAs, proba­ bly a tester would not think about designing such uncommon Guizzo et al. 2020 algorithms, since they are quite different from well­known MOEAs presented in the literature. Table 7, presents the com­ ponents and parameters used by the best and worst algo­ rithms generated by GE and IRACE. For instance, GE_Best uses a mutation operator with 100% of probability and no crossover operator. Of course, these characteristics are de­ pendent on the training instance we used, but they are contra­ dictory to the “common wisdom” of low mutation and great crossover probabilities. Different training scenarios could provide a more human­like designed algorithm, but yet the manual design of MOEAs may not be so powerful when com­ pared to automated mechanisms such as the approach herein proposed. In fact, as stated by Eiben and Smit (2011), the de­ sign and configuration of an EA is an optimization problem itself, hence optimization algorithms may be successfully ap­ plied in this context. Beyond the promising results of the proposed approach, it also requires reduced effort to design and configure a multi­ objective algorithm. In a conventional scenario, the tester might avoid the boring tasks of configuring or tuning the MOEA,andmaysimplyuseacommonMOEAwitharbitrary parameters and components, which can yield not so good re­ sults. The proposed approach can solve this problem by au­ tomatically generating MOEAs that present the best results. The downside of the approach is the great amount of com­ putational resources used in the training phase. In this paper we executed the design algorithms for 10,000 evaluations, and assigned 2,000 training evaluations for each MOEA, re­ sulting in 20 million total fitness evaluations each training run. Each training run took 113 hours on average to execute duetothegreatnumberoffitnessevaluationsduringthetrain­ ing. A more cheaper strategy could be adopted by reducing from 10,000 to 5,000 fitness evaluations, and from 2,000 to 1,000 training budget for each MOEA, however, we would still be using 5 million fitness evaluations for the training (around 28 hours for training time). This elevated cost is not something exclusive of our approach, but rather something that is inevitably inherent to offline design algorithms (Bez­ erra et al., 2014; Guizzo et al., 2015; Burke et al., 2012). However, one should bear in mind that the 20 million fit­ ness evaluations were also assigned to the tuning of NSGA­ II and SPEA2. Even when we assign the same amount of resources, automatically tuning an existing algorithm results in suboptimal results. If we consider the manual effort of de­ signing a MOEA or tuning an existing one, then this effort mustbeaddedtothecomputationalresourcesspent inthetun­ ing process. In such case, the tester would have to perform experiments and compare the algorithms manually, which is already done automatically by our approach. All in all, letting our approach design and configure a new algorithm seems like the best option, despite its cost. Although a great number of MOEAs evaluations were used for the training, we discovered that the best algorithm is found, on average for the 30 training executions, around the 7,120th evaluation out of 10,000. In some executions the best algorithm was found as early as the first 3,000 evaluations, but sometimes it was not found until the last 100 evaluations. This data will be important for future work, where we are go­ ing to investigate a better stopping criterion for the training in order to reduce the cost of this phase. 4.4 Threats to Validity The experimentation test was only performed using seven systems. In order to minimise this external threat, we ex­ tracted systems of different sizes and characteristics from related work (Assunção et al., 2014; Mariani et al., 2016; Guizzo et al., 2015). It is worth mentioning that the systems are implemented with two different paradigms: object and as­ pect oriented. All of these details help mitigate the threat and improve the generalisation of our results. We only used AJHSQLDB as the training instance on both GE and irace. Even though in an earlier and preliminary experimentation we observed similar results using MyBatis as training instance, this may affect the final output. We ex­ pect a degradation on the results when using smaller training instances, thus this is something we intend to test in future work. The results are susceptible to the components and param­ eters used in the training phase. In this sense, a greater num­ ber of components and parameters would result in different MOEAs and, probably, better algorithms. In order to min­ imise this threat and improve generalisation, we used a com­ prehensive set of components from multiple different algo­ rithms in the literature. Another possible threat is that due to the execution costs, in the training phase we execute each design algorithm (GE and irace) 10 times. This number was chosen based on other works in the literature of automatic design algorithms (??). We think it is enough to show the behaviour of our approach and offer a preliminary evaluation. During the testing phase we used only the hypervolume indicator due to its compatibility to our scenario and because it is considered the only Pareto­dominance compliant indica­ tor (Zitzler et al., 2007). We tried to mitigate this threat by also presenting the Pareto fronts found during the evaluation and making our experimental results available. 5 Related Work This section reviews studies addressing the ITO problem and after related work on automatic design of algorithms. 5.1 Studies addressing the ITO problem In the literature the ITO problem has been addressed by many studies and different contexts. The work of Hewett and Ki­ jsanayothin (2009) is focused on the integration order of com­ ponents that can be a subsystem, module, or object­oriented class. The authors use a graph­based algorithm to solve the problem and reduce the number of required stubs. The heuris­ tics were applied in test dependency graphs and the results showed an improvement of the results when compared to other approaches. The authors also stated that their algorithm was faster to be executed than these approaches. Other works such as Abdurazik and Offutt (2009) and the ones reviewed by Briand et al. (2003) also use graph­based algorithms and other similar heuristics to solve this problem in the Object Oriented (OO) context. The approach of Jiang et al. (2019) in­ troduces a strategy to consider the control coupling complex­ ity estimated by using the concept of transitive relationship Guizzo et al. 2020 Table 7. Components and parameters of the best and worst algorithms generated by GE and IRACE Component/Parameter GE_Best GE_Worst IRACE_Best IRACE_Worst Population size 50 100 50 50 Initialization Random Random Parallel Diversified Parallel Diversified Selection Ranking Ranking Ranking Ranking → Converg. Strategy Dominance Strength Raw Fitness Dominance Strength Dominance Strength → Divers. Strategy K­th Nearest Neighbour – Adaptive Grid K­th Nearest Neighbor → Source Archive and Population Population Population Archive and Population Mating Strategy Steady State Steady State Steady State Steady State Crossover Operator ­ PMX Crossover (80%) Single Point Crossover (80%) Single Point Crossover (90%) Mutation Operator Swap Mutation (100%) Swap Mutation (100%) Swap Mutation (100%) Swap Mutation (100%) Replacement Generational Generational Ranking Generational → Elitism Size 5 10 0 1 → Converg. Strategy Raw Fitness Dominance Strength Raw Fitness Dominance Strength → Divers. Strategy K­th Nearest Neighbour Crowding Distance K­th Nearest Neighbour – Archive Ranking – – Ranking → Converg. Strategy Raw Fitness – – Dominance Strength → Divers. Strategy Hypervolume Contribution – – Adaptive Grid → Archive Size 75 – – 75 betweenclasses.OtherworkssuchasBansaletal. (2009)pro­ pose hybrid algorithms combining graph­based and genetic algorithms also in OO context. The work of Ré et al. (2007) proposed an extension to encompass Aspect­Oriented (AO) programs. Even though the mentioned works can solve the integra­ tion and testing order problem, sometimes they may only find sub­optimal solutions (Assunção et al., 2014; Briand et al., 2002b). To overcome this situation, some works such as Assunção et al. (2013); Briand et al. (2002b); Vergilio et al. (2012b) use search­based approaches. In Briand et al. (2002b) the author proposed a way to optimally solve this problem by using GAs. However, the authors used an aggre­ gation of coupling measures as fitness function, in contrast to Assunção et al. (2013) and Vergilio et al. (2012b), where multi­objective approaches were used and good results were obtained. Cabral et al. (2010) also used a multi­objective search­based algorithm, more specifically Ant Colony Opti­ mization (ACO), and obtained better results than an approach with aggregation of objectives. We can observe that MOEAs obtained the best results to solve the ITO problem, and this means that it a suitable prob­ lem for applying our MOEA design approach. From all exist­ ing works addressing the ITO problem, the work of Assunção et al. (2014) introduced an approach that can be applied in dif­ ferent contexts of the problem, such as AO and OO programs, because of this we used such a formulation in our work. 5.2 Works on automatic design of algorithms In the literature we find works on automatic design and con­ figuration of meta­heuristics that use different techniques (Bezerra et al., 2014, 2015; López­Ibáñez and Stützle, 2012; Smit et al., 2010; Dréo, 2009). Some of them are focused only on the automatic configuration of parameters of mono­ objective algorithms (Smit et al., 2010; Dréo, 2009). Related work are the ones that use GE and irace. We can find in the literature some works that use GE for the automatic design and configuration of heuris­ tics (Lourenço et al., 2012, 2013, 2015; Burke et al., 2012; Mascia et al., 2014; Marshall et al., 2014b,a; Hutter et al., 2007). They focus on simple heuristics (Marshall et al., 2014b,a) and meta­heuristics, such as local search algo­ rithms (Burke et al., 2012; Mascia et al., 2014; Hutter et al., 2007) and EAs (Lourenço et al., 2012, 2013, 2015). We would like to mention the works of Lourenço et al. (2012, 2013, 2015) that propose a GE approach to automatically design and configure mono­objectives EAs. Their grammar encompasses the main components and parameters of EAs, such as population size, crossover and mutation operators. The EAs returned by the approach obtained better results when compared to traditional ones on the Royal Roads prob­ lem. However, we did not find works applying GE in the context of MOEAs, or even for other types of multi­objective algorithms. Regarding the use of irace, relate work (Bezerra et al., 2014, 2015; López­Ibáñez and Stützle, 2012) also consider EAs and a set of components and parameters. The studies of Bezerra et al. (2014, 2015) are the only ones addressing the context of MOEAs. They use many MOEAs components and parameters, such as replacement, archiving and fitness assignment. As result, the automatically generated MOEAs could outperform traditional ones in instances of combinato­ rial and continuous problems. Our work is mainly inspired by Bezerra et al. (2014, 2015) and Lourenço et al. (2012), because the idea of using GE and irace for the automatic design of EAs seems very promis­ ing. Furthermore, in the automatic design of MOEAs, there are many parameters and components that can be explored. However, our approach differs from them by using a differ­ ent set of components and parameters, such as initialization procedures, source of parents selection, replacements proce­ dures and the possibility of elitism. The set herein proposed is broader and more suitable for the MOEAs context. Another difference is the application of automatic design of MOEAs in the SBSE area. None of the works mentioned above generate algorithms specialized for solving software engineering problems, such as ITO. Regarding the ITO problem, a great number of approaches found in the literature use search­based algorithms (Wang Guizzo et al. 2020 et al., 2011; Briand et al., 2002a). The most promising are based on MOEAs (Assunção et al., 2014; Vergilio et al., 2012a; Guizzo et al., 2015). However, to implement and con­ figure a MOEA solution can be very difficult for the tester, who is not usually an expert on the optimization field. To ease such a task, the work of Mariani et al. (2016) introduces a hyper­heuristic based on GE to derive MOEAs for solv­ ing the ITO problem. Such a hyperheuristic was also used to select products for the Sofware Product Line (SPL) test­ ing (Jakubovski Filho et al., 2017). These works found good results and serve as motivation to propose our approach, that now encompasses a set of elements and components that al­ low the use of GE and irace and comparison of both design algorithms. As far as we know, there is no work in the litera­ ture that applies and compares both. 6 Concluding Remarks This paper presented an approach for the automatic design of MOEAs applied to the ITO problem. For this purpose, the main components and parameters of MOEAs were identified and a set of alternatives were assigned for them. They were formallymappedinagrammar(GE)andinaparameterspace (irace) to be used by the design algorithms. An empirical evaluation was conducted in two phases. The first was the training phase, where the GE and irace al­ gorithms were executed using one instance of ITO, gener­ ating ten MOEAs each. In the testing phase, the twenty ob­ tained MOEAs were executed using all seven instances of the problem. Furthermore, the traditional MOEAs NSGA­II and SPEA2 were also executed using such instances. Their parameters were tuned using the proposed approach. The MOEAs generated by the approach with both design algorithms were compared using the hypervolume quality in­ dicator and three statistical tests: Vargha­Delaney’s Â12 Ef­ fect Size, Kruskal­Wallis and Friedman. They obtained sim­ ilar results and also some similarities regarding the most se­ lected components. In addition, the generated MOEAs are able to outperform the traditional ones with statistical differ­ ence. In short, the introduced approach can successfully gener­ ate MOEAs to better solve the ITO problem, improving the performance of the traditional algorithms and reducing ef­ fort, freeing the tester from tasks like: choice among existing MOEAs, implementation and configuration. We intend to apply the approach for other ITO instances, testing contexts and problems. Some mentioned limitations should be addressed, for example, to use other training in­ stances and quality indicators in the fitness evaluation. In addition to this, the approach serves as a basis for future works that can extend the proposed set of elements to cover other desired characteristics of the MOEAs and to be used in other domains, contributing to the SBSE practitioners to implement solutions for other SE problems. Ideally, the gen­ erated MOEAs could be generalized to other SE problems without the need of retraining. Acknowledgement This work is supported by the Brazilian funding agencies Co­ ordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Conselho Nacional de Desenvolvimento Cien­ tífico e Tecnológico (CNPq) under the grant: 305968/2018­1. We would like to thank Gian M. Fritsche for his help on the setup of irace. References Abdurazik, A. and Offutt, J. (2009). Using coupling­based weights for the class integration and test order problem. Computer Journal, 52(5):557–570. Arcuri, A. and Briand, L. (2014). A hitchhiker’s guide to sta­ tistical tests for assessing randomized algorithms in soft­ ware engineering. Software Testing, Verification and Reli­ ability, 24(3):219–250. Assunção, W. K. G., Colanzi, T. E., Vergilio, S. R., and Pozo, A. T. R. (2013). Generating integration test orders for aspect oriented software with multi­objective algorithms. Revista de Informática Teórica e Aplicada, 20(2):301– 327. Assunção, W. K. G., Colanzi, T. E., Vergilio, S. R., and Pozo, A.(2014). Amulti­objectiveoptimizationapproachforthe integration and test order problem. Information Sciences, 267(0):119–139. Bansal, P., Sabharwal, S., and Sidhu, P. (2009). An investiga­ tion of strategies for finding test order during integration testing of object oriented applications. In Proceedings of the ICM2CS, pages 1–8. Barros, R. C., Basgalupp, M. P., Cerri, R., da Silva, T. S., and de Carvalho, A. C. (2013). A Grammatical Evolution Approach for Software Effort Estimation. In Genetic and Evolutionary Computation Conference, pages 1413–1420. Bezerra, L., López­Ibáñez, M., and Stützle, T. (2015). Au­ tomatic component­wise design of multi­objective evolu­ tionary algorithms. IEEE Transactions on Evolutionary Computation, 20:1–1. Bezerra, L. C. T., López­Ibáñez, M., and Stützle, T. (2014). Automatic Design of Evolutionary Algorithms for Multi­ Objective Combinatorial Optimization. In Parallel Prob­ lem Solving from Nature, volume 8672 of Lecture Notes in Computer Science, pages 508–517. Springer International Publishing. Briand, L., Labiche, Y., and Wang, Y. (2003). An investiga­ tion of graph­based class integration test order strategies. Software Engineering, IEEE Transactions on, 29(7):594– 607. Briand, L. C., Feng, J., and Labiche, Y. (2002a). Using ge­ netic algorithms and coupling measures to devise optimal integration test orders. In International Conference on Software Engineering and Knowledge Engineering, pages 43–50. Briand, L. C., Feng, J., and Labiche, Y. (2002b). Using ge­ netic algorithms and coupling measures to devise optimal integration test orders. In Proceedings of the 14th SEKE, pages 43–50. ACM. Guizzo et al. 2020 Burke, E. K., Hyde, M. R., and Kendall, G. (2012). Gram­ matical Evolution of Local Search Heuristics. IEEE Trans­ actions on Evolutionary Computation, 16(3):406–417. Cabral, R. d. V., Pozo, A., and Vergilio, S. R. (2010). A pareto ant colony algorithm applied to the class integra­ tion and test order problem. In Petrenko, A., Simão, A., and Maldonado, J. C., editors, Testing Software and Sys­ tems, volume 6435 of Lecture Notes in Computer Science, pages 16–29. Springer Berlin Heidelberg. Coello, C. A. C., Lamont, G. B., and Veldhuizen, D. A. V. (2007). Evolutionary Algorithms for Solving Multi­ Objective Problems Second Edition. Springer Science, 2nd edition. Colanzi, T., Assunção, W., Farah, P., Vergilio, S., and Guizzo, G. (2019). A review of ten years of the sympo­ sium on search­based software engineering. In Nejati, S. and Gay, G., editors, Search­Based Software Engineering, pages 42–57, Cham. Springer International Publishing. Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA­ II. IEEE Transactions on Evolutionary Computation, 6(2):182–197. Derrac, J., García, S., Molina, D., and Herrera, F. (2011). A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation, 1(1):3–18. Dréo, J. (2009). Using Performance Fronts for Parameter Setting of Stochastic Metaheuristics. In Genetic and Evo­ lutionary Computation Conference, pages 2197–2200. Durillo, J. J. and Nebro, A. J. (2011). jMetal: A Java frame­ work for multi­objective optimization. Advances in Engi­ neering Software, 42(10):760–771. Eiben, A. and Smit, S. (2011). Parameter tuning for config­ uring and analyzing evolutionary algorithms. Swarm and Evolutionary Computation, 1(1):19–31. Eiben, A. and Smit, S. (2012). Evolutionary Algorithm Pa­ rameters and Methods to Tune Them. In Autonomous Search, pages 15–36. Springer Berlin Heidelberg. Eiben, A. E. and Smith, J. E. (2003). Introduction to evolu­ tionary computing. Springer Science & Business Media. Fonseca, C. M. and Fleming, P. J. (1993). Genetic Algo­ rithms for Multiobjective Optimization: Formulation Dis­ cussion and Generalization. In International Conference on Genetic Algorithms, pages 416–423. Guizzo, G., Fritsche, G. M., Vergilio, S. R., and Pozo, A. T. R. (2015). A Hyper­Heuristic for the Multi­Objective Integra­ tion and Test Order Problem. In Genetic and Evolutionary Computation Conference, pages 1343–1350. Harman, M., Mansouri, S. A., and Zhang, Y. (2012). Search­ based Software Engineering: Trends, Techniques and Ap­ plications. ACM Computing Surveys, 45(1):1–61. Hewett, R. and Kijsanayothin, P. (2009). Automated test or­ der generation for software component integration testing. In Proceedings of the 24th ASE, pages 211–220. Hutter, F., Hoos, H. H., and Stützle, T. (2007). Automatic Algorithm Configuration Based on Local Search. In AAAI Conference on Artificial Intelligence, pages 1152–1157. Jakubovski Filho, H. L., Lima, J. A. P., and Vergilio, S. R. (2017). Automatic generation of search­based algorithms applied to the feature testing of software product lines. In Proceedings of the 31st Brazilian Symposium on Software Engineering, SBES’17, pages 114–123, New York, NY, USA. ACM. Jiang, S., Zhang, M., Zhang, Y., Wang, R., Yu, Q., and Ke­ ung, J. W. (2019). An integration test order strategy to consider control coupling. IEEE Transactions on Software Engineering, pages 1–1. Knowles, J. D. and Corne, D. W. (2000). Approximating the Nondominated Front Using the Pareto Archived Evolution Strategy. Evolutionary Computation, 8(2):149–172. Koza, J. R. (1992). Genetic Programming: On the Program­ ming of Computers by Means of Natural Selection. MIT Press. López­Ibáñez, M., Dubois­Lacoste, J., Stützle, T., and Bi­ rattari, M. (2011). The IRACE package, Iterated Race for Automatic Algorithm Configuration. Technical Report TR/IRIDIA/2011­004, IRIDIA, Université Libre de Brux­ elles, Belgium. López­Ibáñez, M. and Stützle, T. (2012). The Automatic Design of Multiobjective Ant Colony Optimization Algo­ rithms. IEEE Transactions on Evolutionary Computation, 16(6):861–875. Lourenço, N., Pereira, F. B., and Costa, E. (2012). Evolv­ ing evolutionary algorithms. In Genetic and Evolutionary Computation Conference, pages 51–58. Lourenço, N., Pereira, F. B., and Costa, E. (2013). The Im­ portance of the Learning Conditions in Hyper­heuristics. In Genetic and Evolutionary Computation Conference, pages 1525–1532. Lourenço, N., Pereira, F. B., and Costa, E. (2015). The Op­ timization Ability of Evolved Strategies. In Progress in Artificial Intelligence, volume 9273 of Lecture Notes in Computer Science, pages 226–237. Springer International Publishing. Mariani, T., Guizzo, G., Vergilio, S., and Pozo, A. (2016). Grammatical evolution for the multi­objective integration and test order problem. In Genetic and Evolutionary Com­ putation Conference, pages 1069–1076. Marshall, R. J., Johnston, M., and Zhang, M. (2014a). De­ veloping a Hyper­Heuristic Using Grammatical Evolution and the Capacitated Vehicle Routing Problem. In Sim­ ulated Evolution and Learning, volume 8886 of Lecture Notes in Computer Science, pages 668–679. Springer In­ ternational Publishing. Marshall, R. J., Johnston, M., and Zhang, M. (2014b). Hyper­ heuristics, Grammatical Evolution and the Capacitated Ve­ hicle Routing Problem. In Genetic and Evolutionary Com­ putation Conference, GECCO Comp ’14, pages 71–72. Mascia, F., nez, M. L.­I., Dubois­Lacoste, J., and Stützle, T. (2014). Grammar­based generation of stochastic local search heuristics through automatic algorithm configura­ tion tools. Computers & Operations Research, 51:190– 199. O’Neill, M. and Nicolau, M. (2017). Distilling the salient features of natural systems: Commentary on “On the map­ ping of genotype to phenotype in evolutionary algorithms” by Whigham, Dick and Maclaurin. Genetic Programming Guizzo et al. 2020 and Evolvable Machines, 18(3):379–383. Ré, R., Lemos, O. A. L., and Masiero, P. C. (2007). Minimiz­ ing stub creation during integration test of aspect­oriented programs. In Proceedings of the 3rd Workshop on Testing Aspect­oriented Programs, WTAOP ’07, pages 1–6, New York, NY, USA. ACM. Ryan, C. (2017). A rebuttal to Whigham, Dick, and Maclau­ rin by one of the inventors of Grammatical Evolution: Commentary on “On the Mapping of Genotype to Pheno­ type in Evolutionary Algorithms” by Peter A. Whigham, Grant Dick, and James Maclaurin. Genetic Programming and Evolvable Machines, 18(3):385–389. Ryan, C., Collins, J. J., and Neill, M. (1998). Grammatical evolution: Evolving programs for an arbitrary language. In Genetic Programming, volume 1391 of Lecture Notes in Computer Science, pages 83–96. Springer Berlin Heidel­ berg. Siegel, S. and Castellan, N. J. (1988). Nonparametric Statis­ tics for the Behavioral Sciences. McGraw­Hill interna­ tional editions. Statistics series. McGraw­Hill. Smit, S. K., Eiben, A. E., and Szlávik, Z. (2010). An MOEA­ based Method to Tune EA Parameters on Multiple Objec­ tive Functions. In International Joint Conference on Com­ putational Intelligence, pages 261–268. Talbi, E.­G. (2009). Metaheuristics: From Design to Imple­ mentation. Wiley Publishing. Vargha, A. and Delaney, H. D. (2000). A Critique and Im­ provement of the ”CL” Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132. Vergilio, S., Pozo, A., Árias, J., Cabral, R., and Nobre, T. (2012a). Multi­objective Optimization Algorithms Ap­ plied to the Class Integration and Test Order Problem. International Journal on Software Tools for Technology Transfer, 14(4):461–475. Vergilio, S. R., Pozo, A., Árias, J. a. C. G., da Veiga Cabral, R., and Nobre, T. (2012b). Multi­objective optimization algorithms applied to the class integration and test order problem. International Journal on Software Tools for Technology Transfer, 14(4):461–475. Wang, Z., Li, B., Wang, L., and Li, Q. (2011). A Brief Survey on Automatic Integration Test Order Generation. In Inter­ national Conference on Software Engineering and Knowl­ edge Engineering, pages 254–257. Whigham, P. A., Dick, G., and Maclaurin, J. (2017a). Just because it works: a response to comments on “On the Mapping of Genotype to Phenotype in Evolutionary Algo­ rithms”. Genetic Programming and Evolvable Machines, 18(3):399–405. Whigham, P. A., Dick, G., and Maclaurin, J. (2017b). On the mapping of genotype to phenotype in evolutionary al­ gorithms. Genetic Programming and Evolvable Machines, 18(3):353–361. Zitzler, E., Brockhoff, D., and Thiele, L. (2007). The hy­ pervolume indicator revisited: On the design of pareto­ compliant indicators via weighted integration. In Obayashi, S., Deb, K., Poloni, C., Hiroyasu, T., and Mu­ rata, T., editors, Evolutionary Multi­Criterion Optimiza­ tion, pages 862–876, Berlin, Heidelberg. Springer Berlin Heidelberg. Zitzler, E., Laumanns, M., and Thiele, L. (2001). SPEA2: im­ proving the strength Pareto evolutionary algorithm. Tech­ nical report, Department of Electrical Engineering, Swiss Federal Institute of Technology. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C. M., and da Fonseca, V. G. (2003). Performance assessment of multiobjective optimizers: an analysis and review. IEEE Transactions on Evolutionary Computation, 7(2):117– 132. Introduction Background Integration and Testing Order problem Automatic Design of Algorithms irace Grammatical Evolution Multi-Objective Evolutionary Algorithms Proposed Approach MOEA Training Instance Components and Parameters Population Selection Fitness Assignment Mating Replacement Archiving Design Algorithms Empirical Evaluation Training Phase Testing Phase Further Discussion Threats to Validity Related Work Studies addressing the ITO problem Works on automatic design of algorithms Concluding Remarks