Journal of Software Engineering Research and Development, 2021, 9:7, doi: 10.5753/jserd.2021.1049
This work is licensed under a Creative Commons Attribution 4.0 International License.

Representation of software design using templates: impact on software quality and effort

Silvana Moreno [ Universidad de la República, Uruguay | smoreno@fing.edu.uy ]
Vanessa Casella [ Universidad de la República, Uruguay | vcasella@fing.edu.uy ]
Martín Solari [ Universidad ORT Uruguay | martin.solari@ort.edu.uy ]
Diego Vallespir [ Universidad de la República, Uruguay | dvallesp@fing.edu.uy ]

Abstract

As a practice, software design seeks to contribute to developing quality software. During this software development stage, the requirements are translated into a representation of the software (also known as design), whose quality can be evaluated and improved. For undergraduate students, design is difficult to understand and to carry out. In fact, building a good design seems to require a certain level of cognitive development that few students achieve. The aim of this study is to know the effort dedicated to software detailed design and the effect on software quality when graduating students use templates to represent their design. We conducted a controlled experiment where students develop eight projects following a defined process and record data from its execution in a software tool. We found that the use of design templates did not improve the quality of the code, measured as the defect density in the unit test phase. Also, the use of templates did not reduce the number of code smells in the analyzed code. Regarding the effort, students who used templates dedicated greater development effort to designing than to coding. Meanwhile, students who did not use templates dedicated four times less effort to designing than to coding.

Keywords: detailed design, software quality, graduating students

1 Introduction

Software design is one of the most important components to ensure the success of a software system (Hu, 2013). Between the requirements analysis phase and the software building phase, software design has two main activities: architectural design and detailed design. During architectural design, high-level components are structured and identified. During detailed design, every component is specified in detail (Bourque and Fairley, 2014). This work is focused specifically on detailed design.

Design is a difficult discipline for undergraduate students to understand, and success (i.e. building a good design) seems to require a certain level of cognitive development that few students achieve (Carrington and K Kim, 2003; Hu, 2013; Linder et al., 2006). Students' ability to build a good design is related to abstraction, understanding, reasoning and data-processing ability (Kramer, 2007; Leung and Bolloju, 2005; Siau and Tan, 2005).

Building quality software is increasingly relevant. We highly depend on software in our daily lives and its quality has a great impact. A quality software design allows us to build quality software, with fewer defects and greater maintainability. Industry practitioners are aware of the importance of software design quality and they use clean code practices, reviews and tools, among others, to contribute in this regard (Brown et al., 1998; Fowler, 2018; Stevenson and Wood, 2018).

Knowing how undergraduate students design is of interest to several authors (Chen et al., 2005; Eckerdal et al., 2006a,b; Loftus et al., 2011; Tenenberg, 2005).
Most of their studies found that students do not manage to produce a good software design. Some of the problems detected are lack of consistency between design artifacts and code, incomplete designs, and the lack of understanding of what kind of information to include when designing software (Eckerdal et al., 2006a,b; Loftus et al., 2011).

In this work, we study the software design practice of graduating students. We conducted an experiment within the context of some courses over three consecutive years to know the effort dedicated to software design and the effect that the representation of design using specific templates has on software quality. We use the term graduating for our students because, in fact, they are in the fourth year of the degree of the School of Engineering of Universidad de la República, in Uruguay. The curriculum of the School of Engineering is a five-year degree, similar to the IEEE/ACM proposal for the Computer Science undergraduate curriculum (Joint Task Force on Computing Curricula - ACM and IEEE Computer Society, 2013). Students have already passed courses where detailed software design is taught: design principles, artifacts and design diagrams, UML, design patterns, etc.

This work is an extension of the article published at the Iberoamerican Conference on Software Engineering (CIbSE) 2020: "The representation of detailed design using templates and their effects on software quality". Our article was selected for publication in a special issue of the Journal of Software Engineering Research and Development (JSERD). Below, we detail the extension of our work with respect to the CIbSE article.

The work presented at CIbSE 2020 aims to know the effect on software quality when graduating students use templates to represent the detailed design. In this work we present an empirical study where students develop 8 projects following a defined process and record data from the execution in a tool. We found that the use of design templates did not improve the quality of the code, measured as the defect density in the unit test phase. Neither did the use of templates manage to reduce the number of code smells present in the analyzed code.

The extension carried out in this work consists, on the one hand, of expanding and deepening aspects that, for space reasons, are not in the CIbSE article. On the other hand, we add a new research question and its analysis, which allows knowing the effort that the use of design templates implies.

Specifically, a new section explaining the experimental design in depth was added. The analysis of external quality was expanded and deepened: descriptive statistics were added and analyzed, and tables were added with the data of the average defect density in UT for the students. In addition, a statistical analysis was added within the between-group analysis that checks the homogeneity of the groups studied (TRD, noTRD). Threats to validity were expanded, grouping them by type (construct, internal, external, conclusion), and the Discussion and Conclusions sections were expanded.
A research question was added that seeks to know the effort that students dedicate to design, and how that effort varies after the use of templates. To answer this question, the relationship between the effort dedicated to the design phase and the effort dedicated to the coding phase was studied. De­ scriptive and statistical analyses were presented as part of the analysis of results. The results obtained are discussed and re­ lated to those previously obtained in the discussion section. The document is structured as follows: Section 2 presents related works; section 3 presents the research methodology; section 4 presents the results, and section 5 is discussed; threats to validity are mentioned in section 6, and section 7 presents the conclusions and future work. 2 Related Work Software design is an important activity to ensure the qual­ ity of a software system (Hu, 2013; Taylor, 2011). It involves identifying and abstractly describing the software system and its relationships. Good designs help develop robust, main­ tainable software and with few defects (Pierce et al., 1991; Sommerville, 2016). Detailed software design is a creative activity, which can be done in different ways: implicitly, in the developer’s mind before coding, on a sketch on paper, through diagrams, using both formal and informal languages or tools (Chemuturi, 2018). Software quality is the degree to which a software product meets stakeholders’ needs both explicit and implicit. Qual­ ity models represent quality in terms of a set of elements of the model and their relationships (Nistala et al., 2019). These models define internal and external software quality attributes. The internal ones are those that do not depend on the software execution (static), while the external ones are those that are applicable to the execution. In recent years, the use of clean code practices and tools has contributed to improved design quality (Stevenson and Wood, 2018). Code smells, anti patterns and design flaws can be used to measure the quality of a software design (Mar­ tin, 2002; Gibbon, 1997; Brown et al., 1998; Fowler, 2018). SonarQube (Campbell and Papapetrou, 2013) and FindBugs (Ayewah et al., 2008) are some of the tools used to measure the quality of the code by detecting bad smells. Current industry practices require practitioners with the necessary skills to understand and build good software de­ signs. However, students have difficulties designing. Build­ inggooddesignsrequiresacertainlevelofcognitivedevelop­ ment that few students achieve (Carrington and K Kim, 2003; Hu, 2013; Linder et al., 2006). This cognitive development is related to the ability to recognize design patterns, architec­ tural design styles, and related data and actions that can be extracted into appropriate design abstractions (Hu, 2013). In fact, for students, learning to design is more difficult than learning to code. This difficulty occurs because for most programming languages, students get compiler feedback and run time errors. However, this does not happen with design (Karasneh et al., 2015). Object­Oriented Design (OOD) is one of the most widely­ used design approaches in the industry and one of the sub­ jects normally taught in universities (Flores and Medinilla, 2017). By using OO modeling diagrams and languages, static and dynamic models of software systems can be created. 
Sev­ eral empirical studies analyze the understanding and bene­ fits of using UML diagrams (Budgen et al., 2011; Fernández­ Sáez et al., 2013; Arisholm et al., 2006; Gravino et al., 2015; Torchiano et al., 2017). In some studies, students failed to obtain design benefits using UML diagrams (Gravino et al., 2015; Torchiano et al., 2017). Gravino et al. found that students who use UML di­ agrams to design do not make significant improvements in their source code comprehension tasks compared to students who do not use them. Also, students who use diagrams spend twice as much time on the same source code comprehension task than as students who do not use them. When analyzing the experience factor, they find that the most experienced stu­ dents achieve an improvement in the understanding of the source code (Gravino et al., 2015; Soh et al., 2012). For industry professionals, the use of UML continues to be resisted to a certain degree (Stevenson and Wood, 2018). A survey conducted to on 50 software professionals indicates that although the quality of the software is an important as­ pect, the use of UML is selective (informal, only for a while, then it is discarded) and with low frequency (Petre, 2013). The use of Model­Driven Development (MDD) methodol­ ogy to design software has shown improvements in software quality. Panach et al. conducted an experiment and found that students using MDD achieve better quality products (mea­ sured through test cases) than students using the traditional software development method (Panach et al., 2021). Undergraduate students’ design skills are reported by pre­ vious studies examining artifacts produced by them to learn how they design software (Chen et al., 2005; Eckerdal et al., 2006a,b; Loftus et al., 2011; Tenenberg, 2005). These studies use the same requirements specification for which students must produce a design. The studies use different approaches: designs produced individually, designs made in groups, and designs produced at different levels of training. In general, all the works mentioned agree on the fact that graduating students are not capable of designing a soft­ ware system. Lack of consistency between design artifacts and code, incomplete designs, and lack of understanding of what kind of information to include when designing software are some of the major difficulties reported (Eckerdal et al., Moreno et al. 2021 2006a,b; Loftus et al., 2011). We believe, just as Loftus et al. (Loftus et al., 2011), that students do not precisely know what to do when they have to design software. Besides, several authors analyzed the ar­ tifacts produced and they agree on the fact that students do not know how to design (Chen et al., 2005; Eckerdal et al., 2006a,b; Loftus et al., 2011; Tenenberg, 2005). This moti­ vated the work presented in this paper, in which we pro­ vide students with design templates as a support tool for de­ sign representation. Unlike Gravino and Torchiano, who an­ alyzed the benefits of using diagrams in code comprehen­ sion (Gravino et al., 2015; Torchiano et al., 2017), our ap­ proach tries to analyze the effort dedicated to designing and coding; and the impact of the use of templates on software quality. We studied quality from two perspectives: defects on the code and code smells. We also analyzed the effort as the time in minutes that students dedicate to the design and code phases. 
The focus of our research is the OOD at the class level, including source code organization, the identification and re­ lationship between classes, and the interaction of users with the system. As Kitchenham pointed out (Kitchenham and Pfleeger, 1996), this corresponds, to the “Product View”, an examination of the inside of a software product. We used an approach focused on objects because a large part of the current software is developed using that technology (Group, 2015). 3 Research Methodology We studied the effect of design in software quality when grad­ uating students represent their design using a specific set of templates and the effort they dedicate to the design activity. We conducted three experiments within the context of three consecutive undergraduate courses, from 2015 to 2017. 3.1 Course context The course Principles and Foundations of Personal Software Process (PF­PSP) have the same format every year and lasts 9 weeks. In the first week (week 1), a base process is taught, and the dynamics of the practical work to be done throughout the remaining eight weeks are explained. Students participate in the course on a voluntary basis. The base process is a defined and disciplined process that intends to help the software development tasks and to col­ lect product and process metrics. The process has different phases, scripts that guide the work in each phase, and logs that are used to collect data (see Figure 1). The base process is divided into the following phases: plan, design, code, compile, unit test (UT), and postmortem. To follow the process, students are provided with a set of scripts. Scripts are a one page guide that establishes the inputs, out­ puts and activities to be carried out in each phase. Scripts help students guide the development activities but without demanding how they must be carried out. In each phase of the process, students must log the time dedicated to the phase, as well as data on the defects he or she removes (injection phase, removal phase, type of defect and time spent to correct it). In the postmortem phase students log the size in line of code (LOC) of the program built. Figure 1. Base Process The practical work consists of each student developing 8 small projects following the base process and recording the process data in a tool. Students carry out the projects individ­ ually and consecutively. Project 2 does not begin until project 1 has been completed, and so on with the remaining projects. From week 2 to week 9, one project is assigned per week. At the beginning of each week, a teacher sends the student the requirements of each project. Each student’s submission must contain the code that solves the problem, the test cases executed, and the export of the data that was registered in the tool. Once the student submits the solution, the teacher reviews the work and sends corrections back to the student if necessary. Students carry out the projects at home and have a teacher assigned, who will be responsible for assigning the projects, correcting them and answering questions. Before starting project 1, each student must choose the programming language to use throughout the course. Our in­ terest is to collect data of the execution of the development process with the use of a programming language familiar to the student. Projects are small in size and of low and similar difficulty, so the design phase refers to detailed design (i.e. identifying classes, attributes, operations, program scenarios, status diagram, and pseudocode). 
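To make the logging described above concrete, the following is a minimal sketch of the kind of records the base process collects. It is only an illustration, not the actual course tool: the type and field names are ours, chosen to mirror the data described in Section 3.1 (time per phase, and for each defect its type, injection phase, removal phase and fix time).

```python
from dataclasses import dataclass
from collections import defaultdict

# Phases of the base process (Section 3.1).
PHASES = ["plan", "design", "code", "compile", "unit test", "postmortem"]

@dataclass
class TimeLogEntry:
    project: int    # project number, 1 to 8
    phase: str      # one of PHASES
    minutes: int    # effort logged for that phase

@dataclass
class DefectLogEntry:
    project: int
    defect_type: str      # defect type category
    injection_phase: str  # phase in which the defect was injected
    removal_phase: str    # phase in which the defect was removed
    fix_minutes: int      # time spent correcting the defect

def effort_by_phase(time_log):
    """Total minutes per phase for one student, e.g. design (TDLD) and code (TCOD) effort."""
    totals = defaultdict(int)
    for entry in time_log:
        totals[entry.phase] += entry.minutes
    return dict(totals)

# Tiny usage example with placeholder values.
log = [TimeLogEntry(5, "design", 40), TimeLogEntry(5, "code", 90)]
print(effort_by_phase(log))  # {'design': 40, 'code': 90}
```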
The nature of project 2 is different from the other projects. In project 2, students have to build a size­measuring software, while in the remaining projects, they must produce mathe­ matical solutions (standard deviation, Simpson’s rule, corre­ lation parameters). Previous studies show that process mea­ sures and product measures in project 2 have greater diffi­ culty than in the rest of the projects (i.e., project 2 is an out­ lier), and it is usually discarded in statistical analysis (Grazi­ oli et al., 2014b; Moreno and Vallespir, 2018). Therefore, we excluded the data of this project from the analyzes presented in this article. However, it is relevant to mention that project 2 is an integral part of our course, and it is used for students from projects 3 to 8 to count the lines of code they produce in each project. Percentiles 5 and 95 of the data collected for all the stu­ dents throughout the 8 projects are 26 LOC and 242 LOC respectively. Each replication of the experiment corresponds to an in­ stance of a different run of the course. Students who par­ ticipated in one course do not participate again in a later course. The teachers participating were the same throughout the three courses (2015­2017). Moreno et al. 2021 3.2 Goals and research questions The aims of the experiment are to know the effect on soft­ ware quality when students represent their designs using tem­ plates, and to study the effort they dedicate to the design ac­ tivity. Templates are documents with a predefined structure in which students have to represent their designs. The templates we used allow to describing the detailed de­ sign of a project. We used four templates, a brief description of each of them is presented below: • Operational template: specifies the interaction between the program and the users. The content may look similar to a use­case description. • Functional template: the behavior of the program’s invo­ cations and returns are specified in this template. Vari­ ables, functions, classes and methods are described. Fig­ ure 2 presents an example of the use of this template for project 6. • Logical template: in this template, the pseudocode of each method that appears in the functional template is registered. • State template: it can be used to define the transactions and conditions of the program’s internal states. The con­ tent is similar to state machine diagrams. The selected templates emerge from the Personal Pro­ cess (PSP) framework(Humphrey, 1995). The PSP consid­ ers a design to be complete when it defines all four di­ mensions (internal­static, internal­dynamic, external­static, external­dynamic). The way to correspond to each of the four dimensions is by using the four templates (Operational, Func­ tional, Logical, State). Completing the four templates allows describing the designs entirely and precisely (Humphrey, 1995). Several studies have shown an improvement in devel­ oper performance with templates insertion (Hayes and Over, 1997; Prechelt and Unger, 2001; Gopichand et al., 2010). In the experiment context, we proposed the following re­ search questions and the corresponding research hypotheses: RQ1: Is there an improvement in the quality of the products when students represent the design using templates? RQ2: What is the relation between the effort dedicated to designing and the effort dedicated to coding? Are there any variations in effort when students use templates? To answer RQ1, we analyzed the external and internal quality of the software developed in each project. 
To study the external quality, we considered the following research hypothesis: H1.0: Representing software design using design templates, does not change the software defect density in unit testing H1.1: Representing software design using design templates, changes the software defect density in unit testing To study the internal quality, we descriptively analyzed certain code smells introduced by students when producing software (Fowler, 2018). We are interested in knowing if the use of templates to represent software design prevents stu­ dents from incurring into some type of code smells. To answer RQ2, we studied the time spent on the design and code phases. We analyzed the following research hypoth­ esis: H2.0: The time spent on designing equals the time spent on coding. H2.1: The time spent on designing does not equal the time spent on coding. 3.3 Experimental design Our design is a repeated measures design with one factor (the base process) and two levels: with templates to represent the software design and without templates to represent the soft­ ware design. Response variables considered in this experi­ ment are internal and external software quality, and the effort dedicated by the students to the design and code phases. Our experimental design implies that students develop 8 projects. The base process introduces practices in the first 2 projects that allow for guiding the work and measure the pro­ cess. Therefore, during the first or second project (depend­ ing on the subject), they are already following the process adequately. People have high variability among themselves when applying software development techniques or processes (Humphrey, 2005). When high variability among people ex­ ists in an experiment with human subjects, a within­subjects design is preferable to a between­subjects experiment (Senn, 2002). Moreover, in repeated measures experiments, sub­ jects serve as their own control (Jones and Kenward, 2014). This reinforces the choice of our design, in which each stu­ dent carries out several projects. The effect of students’ learning throughout these 8 exer­ cises could be a problem in our experimental design. How­ ever, this was previously studied from different approaches, and the results indicate in both studies that repetition of pro­ gramming did not contribute to performance improvements (Grazioli et al., 2014b; Grazioli and Nichols, 2012; Grazioli et al., 2014a). As we already mentioned, to evaluate the external quality, we considered the defect density in the unit test phase of the base process. That is to say, the number of defects detected in that phase are counted and divided between the LOCs of the project. To evaluate the internal quality, we analyzed the code smells in which students incur. Knowing the number of code smells present in the product’s source code gives us an idea of the maintenance costs in the future (Fowler, 2018). The effort in design and code is measured as the time in minutes that the student dedicates to the phase in question. The experimental design is presented in Figure 3. All stu­ dents apply the base process in projects 1 to 4, in which sub­ mitting the design representation to the teachers is not re­ quired. When students finished project 4, they were divided randomly into two groups: the control group and the experi­ mental group. The control group, called “without templates to represent the design” (noTRD), continues to apply the base process throughout projects 5 to 8. 
The experimental group, called "with templates to represent the design" (TRD), started to apply the templates from project 5 to 8.

Figure 2. Functional Template

The TRD group attends a theoretical class where the four design templates are presented and explained (and examples are shown). The submission of the design representation for this group was mandatory (except for the state template, which is optional). When a student submitted the project, the assigned teacher checked the completeness of the templates and their consistency with the code. In this way, the risk of students designing one solution and then coding a different one is reduced. However, the fact that the design is complete and verifiable is not controlled.

Our experimental design allows us to study the behavior of the groups before and after the use of the templates. On the one hand, we propose to analyze the TRD (representing design using templates) and noTRD (representing design without templates) groups during projects 1 to 4 to confirm they are homogeneous groups; that is, that the quality of the software developed is similar in both groups during projects 1 to 4 (when students do not use templates in any of the groups). On the other hand, we are interested in knowing if students who use templates develop better-quality software. We propose studying the TRD and noTRD groups during projects 5 to 8 to know if representing the design using templates has some effect on the software quality.

3.4 Operation

The experiment was replicated in the course for three years: 2015, 2016, and 2017. The number of students that took part in the experiment was 25, 17, and 19 respectively. Out of the 61 students participating in the experiment, 29 are part of the TRD group and 32 of the noTRD group. This imbalance between the groups is due to the imbalance generated when students were assigned to the TRD and noTRD groups in each of the three replications.

4 Analysis and Results

To answer RQ1, "Is there any improvement in the quality of the products when students represent the design using templates?", we analyzed the quality from the internal and external points of view.

4.1 External Quality

We measured the external quality as the defect density in UT, that is, the number of defects in UT per KLOC. To analyze the external quality, we defined the following research hypotheses:

H1.0: Representing software design using design templates does not change the software defect density in UT
H1.1: Representing software design using design templates changes the software defect density in UT

We analyzed the external quality in two ways: intra groups and between groups. Between groups refers to knowing if there is a significant difference in quality between the TRD group and the noTRD group. Intra group refers to studying the quality of the software in the TRD group before and after the use of templates.

Between groups

The analysis between groups consists, on the one hand, of analyzing the TRD and noTRD groups during projects 1, 3 and 4; and on the other hand, of analyzing the TRD and noTRD groups during projects 5 to 8. Due to the difficulty of project 2 compared with the rest of the projects, we decided not to include this project's data in the analysis.

During projects 1, 3 and 4, both groups apply the base process, so comparing the software quality of both groups during those projects allows confirming that they are homogeneous groups, and thus establishing the experimental frame.
For this analysis, we defined the following research hypotheses:

H1.0: Median (Def. density in UT of noTRD) = Median (Def. density in UT of TRD)
H1.1: Median (Def. density in UT of noTRD) ≠ Median (Def. density in UT of TRD)

Figure 3. Experimental design

Each sample corresponds to the average defect density in UT of a student considering projects 1, 3 and 4:

$$ 1000 \times \frac{\sum_{n=1}^{4} \#defectsUT_n}{\sum_{n=1}^{4} \#LOC_n} \qquad (1) $$

where n varies between 1, 3 and 4.

During the analysis, we detected that the data from a student of the TRD group was not accurate, that is, the process followed had not been accurately recorded. So, data from that student was eliminated from the analysis, and 28 students remained as part of the TRD group.

The descriptive statistics of the TRD and noTRD groups considering projects 1, 3 and 4 are presented in Table 1. The values of the mean and interquartile range indicate there does not seem to be great variability between the groups. To confirm this, we applied the Mann-Whitney test for independent samples, since the samples correspond to different students.

Table 1. Mean and interquartile range in projects 1, 3 and 4
Group   Mean    Interquartile range
TRD     30.22   25.54
noTRD   32.88   28.9

The result indicates a p-value = 0.3467, with which we cannot reject the null hypothesis (significance = 0.05). This result does not allow us to affirm that there is a difference in quality between the TRD and noTRD groups. We can assert that both groups have a similar or homogeneous behavior. This gives us more confidence to study the software quality between the TRD and noTRD groups after the use of templates, eliminating the possibility that the result is due to the behavior of the groups rather than to using or not using templates.

Studying the TRD and noTRD groups during projects 5 to 8 aims to know if representing the design using templates has some effect on the software quality. For the analysis between groups during projects 5 to 8, we defined the following research hypotheses:

H1.0: Median (Def. density in UT of noTRD) = Median (Def. density in UT of TRD)
H1.1: Median (Def. density in UT of noTRD) ≠ Median (Def. density in UT of TRD)

Table 2 presents the average defect density in UT for the 28 students of the TRD group and the 32 students of the noTRD group in projects 5 to 8.

Table 2. Average defect density in UT for the students of the TRD group and noTRD group in projects 5 to 8
Group Student Defect density | Group Student Defect density
TRD 1 8.83    | noTRD 1 27.98
TRD 2 23.16   | noTRD 2 24.86
TRD 3 33.78   | noTRD 3 23.59
TRD 4 40.76   | noTRD 4 14.35
TRD 5 83.33   | noTRD 5 21.37
TRD 6 16.10   | noTRD 6 12.19
TRD 7 5.74    | noTRD 7 22.79
TRD 8 13.02   | noTRD 8 43.33
TRD 9 28.07   | noTRD 9 27.02
TRD 10 12.5   | noTRD 10 36.46
TRD 11 9.49   | noTRD 11 38.98
TRD 12 19.70  | noTRD 12 16.80
TRD 13 11.70  | noTRD 13 37.65
TRD 14 36.85  | noTRD 14 18.93
TRD 15 20.53  | noTRD 15 18.25
TRD 16 22.93  | noTRD 16 22.98
TRD 17 11.80  | noTRD 17 47.12
TRD 18 37.45  | noTRD 18 30.21
TRD 19 26.05  | noTRD 19 35.03
TRD 20 5.03   | noTRD 20 27.84
TRD 21 23.35  | noTRD 21 12.22
TRD 22 17.36  | noTRD 22 24.57
TRD 23 10.08  | noTRD 23 15.65
TRD 24 42.75  | noTRD 24 41.17
TRD 25 33.43  | noTRD 25 44.89
TRD 26 28.63  | noTRD 26 20.35
TRD 27 44.02  | noTRD 27 38.80
TRD 28 23.88  | noTRD 28 51.54
              | noTRD 29 7.85
              | noTRD 30 27.89
              | noTRD 31 24.24
              | noTRD 32 25.49

The values of the mean and of the interquartile range shown in Table 3 indicate low variability between the groups. That is to say, the use of templates by the TRD group does not produce a significant difference in the defect density compared to the noTRD group not using templates.

Table 3. Mean and interquartile range in projects 5 to 8
Group   Mean    Interquartile range
TRD     24.65   21.2
noTRD   27.57   16.9

To study the behavior of both groups we used hypothesis tests. The samples are independent because they correspond to different students; thus, the Mann-Whitney test is applied. Results indicate p-value = 0.165; therefore, the null hypothesis cannot be rejected. Thus, we cannot affirm that the students who use the templates manage to develop software with a lower UT defect density than students who do not use templates.

Intra groups

As already mentioned, intra groups refers to knowing if students of the TRD group improve the software quality after the use of templates to prepare the design. To know this, the defect density in UT of the TRD group is analyzed in projects 1 to 4 (without project 2) and in projects 5 to 8. Studying the behavior of the same group allows knowing if there is a change in the software quality after the use of templates. We define the following research hypotheses:

H1.0: Median (Def. density in UT of TRD134) = Median (Def. density in UT of TRD58)
H1.1: Median (Def. density in UT of TRD134) ≠ Median (Def. density in UT of TRD58)

where TRD134 are the students of the TRD group during projects 1, 3 and 4, and TRD58 are the same students of the TRD group during projects 5 to 8.

Table 4 presents the defect density in UT for the students of the TRD group in projects 1, 3 and 4, and for the same students in projects 5 to 8.

Table 4. Defect density in UT for the students of the TRD group in projects 1, 3 and 4, and in projects 5 to 8
Group Student Def. density (projects 1, 3 and 4) Def. density (projects 5 to 8)
TRD 1 2.22 8.83
TRD 2 7.22 23.16
TRD 3 35.33 33.78
TRD 4 14.24 40.76
TRD 5 95.74 83.33
TRD 6 17.85 16.10
TRD 7 10.14 5.74
TRD 8 21.18 13.02
TRD 9 15.54 28.07
TRD 10 39.80 12.5
TRD 11 13.79 9.49
TRD 12 18.31 19.70
TRD 13 10.23 11.70
TRD 14 60.60 36.85
TRD 15 32.60 20.53
TRD 16 25.83 22.93
TRD 17 51.09 11.80
TRD 18 48.78 37.45
TRD 19 39.63 26.05
TRD 20 15.56 5.03
TRD 21 30.70 23.35
TRD 22 25.77 17.36
TRD 23 9.72 10.08
TRD 24 32.71 42.75
TRD 25 10.05 33.43
TRD 26 42.70 28.63
TRD 27 16.87 44.02
TRD 28 102.04 23.88

The descriptive statistics presented in Table 5 indicate some variability in defect density. Even though the means are similar, it seems that using templates (from project 5 on) to represent the design achieves products with fewer defects.

Table 5. Mean and interquartile range of defect density in UT for the TRD group
Projects     Mean    Interquartile range
1, 3 and 4   30.22   25.5
5 to 8       24.65   21.2

To statistically study the data, we applied the Wilcoxon test (signed rank test) for paired samples (because for this analysis the data come from the same students). Results indicate a value of V = 138 and p-value = 0.1438. Since the p-value is higher than 0.05 (significance level), it is not possible to reject the null hypothesis. This indicates that we cannot affirm that students improve the quality of their software by using design templates.
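As an illustration of how the defect density of equation (1) and the two statistical comparisons of this section can be computed, the following sketch uses SciPy. The numeric values are placeholders rather than the study's data, and the scripts actually used in the study may differ.

```python
from scipy.stats import mannwhitneyu, wilcoxon

def avg_defect_density_ut(defects_ut, loc, projects=(1, 3, 4)):
    """Average defect density in UT (defects/KLOC), equation (1):
    1000 * sum of UT defects / sum of LOC over the selected projects."""
    return 1000 * sum(defects_ut[n] for n in projects) / sum(loc[n] for n in projects)

# Example for a single student (placeholder numbers): UT defects and LOC per project.
defects_ut = {1: 2, 3: 1, 4: 3}
loc = {1: 80, 3: 120, 4: 150}
print(avg_defect_density_ut(defects_ut, loc))  # defects per KLOC

# Between groups (independent samples: different students), two-sided Mann-Whitney test.
trd = [30.2, 25.1, 41.0, 18.7]     # one density value per TRD student (illustrative)
notrd = [33.0, 29.4, 27.8, 45.2]   # one density value per noTRD student (illustrative)
u_stat, p_between = mannwhitneyu(trd, notrd, alternative="two-sided")

# Intra group (paired samples: the same TRD students before and after templates).
before = [30.2, 25.1, 41.0, 18.7]  # projects 1, 3 and 4 (illustrative)
after = [24.0, 26.3, 35.5, 12.9]   # projects 5 to 8 (illustrative)
v_stat, p_intra = wilcoxon(before, after)

print(p_between, p_intra)
```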
4.2 Internal Quality

To evaluate the internal quality, we carried out an analysis of the code smells introduced by students when developing the course projects. The aim of this analysis is to investigate if the use of design templates prevents students from incurring in certain code smells. The analysis presented is preliminary and exploratory, seeking to obtain initial results that allow us to generate new research hypotheses.

The code smell types depend on the programming language. As students can choose the language in which they develop their projects, this analysis has to be done taking into account the different languages used. With the aim of doing an initial analysis that added value to our research, the students who developed their projects with Java, C#, C, C++ and Ruby were selected, excluding those who developed with PHP and Python. We excluded PHP and Python because they do not have many code smells in common with the other languages; if we had included them, the number of code smells to analyze would have been reduced too much. So, both languages were excluded for this initial analysis. This left a total of 45 students for the analysis: 19 from 2015, 14 from 2016, and 12 from 2017.
Of those 45 students, 21 belong to the TRD group (9 in 2015, 6 in 2016 and 6 in 2017) and 24 to the noTRD group (10 in 2015, 8 in 2016 and 6 in 2017).

To detect the code smells, the SonarQube tool (http://www.sonarqube.org) was used, since it is a free software tool that supports a variety of programming languages, is constantly updated by the community, and has extensive documentation, among other advantages. We selected 16 code smell types for the analysis. These are common to the programming languages we chose and are detectable by SonarQube. The code smell types are:

1) "if ... else if" statements must end with an "else" clause;
2) "switch"/"case" statements must not be nested;
3) "switch"/"case" statements must not have too many "case"/"when" clauses;
4) the cognitive complexity of functions or methods must not be too high;
5) collapsible "if" statements must be merged;
6) the "if", "for", "while", "switch" and "try" control-flow statements must not be nested too deeply;
7) expressions must not be too complex;
8) files must not have too many lines of code;
9) functions or methods must not have too many lines of code;
10) functions or methods must not have too many parameters;
11) lines of code must not be too long;
12) functions or methods must not be empty;
13) statements must be on separate lines;
14) two branches in one conditional structure must not have the exact same implementation;
15) unused parameters of a function or method must be eliminated;
16) unused local variables must be eliminated.

A more detailed description of each one is not provided for article-length reasons.

Table 6 shows the percentage of students that incurred in at least one code smell, segmented by project (from 1 to 8) and by group (noTRD and TRD). Code smells 3, 8 and 12 are not present in any of the projects analyzed.

When comparing the noTRD and TRD groups in the table, from project 5 on (after using templates) a great variability arises, both when it is considered per project and when it is considered per code smell. For code smells 4, 7, 10 and 13, one group is better for certain projects, and the other group is better for certain other projects. For code smells 1, 2, 5, 6, 9 and 14, the difference between groups is very small. To sum up, changes after using templates are not observed for any of these code smells.

For code smell 11, a much lower percentage is observed in projects 5 and 7, and a lower percentage in project 8, on the part of the group using templates. In project 6, both groups have almost identical behavior. From the point of view of templates, maybe it is the pseudocode template that is helping the students decrease the introduction of this code smell.

Code smells 15 and 16 show a similar behavior. In both cases, the TRD group almost does not incur in them, while the noTRD group does, sometimes in a high percentage. Number 15 refers to unused parameters in methods, and 16 to unused local variables. Clearly, these types of code smells can be avoided with good software design. From the point of view of the use of templates, maybe the development of pseudocode (logical template) and the functional template are preventing the students of the TRD group from incurring in these code smells. Anyway, it is necessary to manually analyze the templates submitted by the students and to interview them to know better if this is happening for the reasons already described. This has not been done yet.
However, when analyzing the table considering only the data of the TRD group throughout the 8 projects, we do not see that the use of templates improves the internal quality. It is worth noting that this group normally did not incur in code smells 15 and 16 (or did so in a very low percentage). Observing projects 1 to 4 and projects 5 to 8 separately, we do not see any difference between them. That means the behavior of this group before using templates and while using them does not change for these code smells. So, the difference presented in the previous analysis between the TRD and noTRD groups does not seem to respond to the use of templates. Something similar happens with code smell 11: results do not show a decrease of this code smell when using templates.

It can be observed that in project 8 the percentage of occurrence of code smells 4, 9 and 10 increases significantly for both groups. This increase makes us think that project 8 is more complex for the students. These three code smells indicate that the code developed is too complex and too long for its comprehension. That is, the use of templates did not help the students elaborate a less complex and more understandable design.

Putting both analyses together, we conclude that the use of templates does not improve the internal quality. More precisely, the use of templates does not seem to have an effect on the code smells in which the students incur when designing software.
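For reference, code smell issues like those analyzed above can be retrieved from a SonarQube server through its web API after a project has been scanned. The sketch below is only illustrative and is not necessarily how the data for this study were extracted: the server URL, project key and token are placeholders, and parameter details may vary across SonarQube versions.

```python
import requests

SONAR_URL = "http://localhost:9000"    # placeholder server URL
PROJECT_KEY = "student-project"        # placeholder project key
TOKEN = "<token>"                      # placeholder authentication token

# Query open issues of type CODE_SMELL for one analyzed project.
resp = requests.get(
    f"{SONAR_URL}/api/issues/search",
    params={"componentKeys": PROJECT_KEY, "types": "CODE_SMELL", "ps": 500},
    auth=(TOKEN, ""),
)
resp.raise_for_status()
issues = resp.json().get("issues", [])

# Count occurrences per rule, e.g. the rule flagging unused local variables (code smell 16).
per_rule = {}
for issue in issues:
    per_rule[issue["rule"]] = per_rule.get(issue["rule"], 0) + 1

for rule, count in sorted(per_rule.items()):
    print(rule, count)
```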
4.3 Effort dedicated to designing and coding

To answer RQ2, "What is the relation between the effort dedicated to designing and the effort dedicated to coding? Are there any variations in effort when students use templates?", we analyzed the following hypothesis test:

H2.0: Median (TCOD) <= Median (TDLD)
H2.1: Median (TCOD) > Median (TDLD)

As part of the base process, each student registered the time spent in the design phase (TDLD) and the time spent in the code phase (TCOD) for each project. To know the effort dedicated to designing and to coding by the group that uses the templates and the group that does not use them, we analyzed both groups independently during projects 5 to 8. That is, on the one hand, we carried out the analysis of the TRD group during projects 5 to 8, and on the other hand, of the noTRD group during projects 5 to 8. For each student, we calculated the time spent in design and the time spent in code for projects 5 to 8. The calculation for each pair of data is the following:

$$ \left( \sum_{n=5}^{8} TDLD_n ,\; \sum_{n=5}^{8} TCOD_n \right) \qquad (2) $$

where $TDLD_n$ is the time spent in the design phase for project n, $TCOD_n$ is the time spent in the code phase for project n, and n varies from 5 to 8.

Table 7 presents the 28 data pairs (TDLD, TCOD) for the TRD group, and the 32 data pairs (TDLD, TCOD) for the noTRD group. Table 8 presents the mean and the interquartile range for the TRD group and the noTRD group. The mean value of the TRD group shows that the use of templates takes more design time compared with the group that did not use templates. Furthermore, the design time in the case of TRD exceeds the time spent on coding. Regarding the mean of TCOD, even though it is similar in the TRD and noTRD groups, a decrease in the TRD group is observed. Although the decrease is not significant, the use of templates might have helped coding in less time.

To determine the statistical test that best fits the problem to be solved, the distribution of the data was previously studied. When applying the Kolmogorov-Smirnov test to the TRD group, a significance value of 0.00478 is obtained, indicating that the values do not fit a normal distribution. Applying the Kolmogorov-Smirnov test to the noTRD group returns 7.713e-12 as a significance value; therefore, these values do not fit a normal distribution either. As the data of both groups do not follow a normal distribution, Wilcoxon's test for paired samples is used. The samples of each group are paired since the sampled pairs (TDLD, TCOD) correspond to the same student. We executed the test for the TRD group and for the noTRD group independently.

For the noTRD group, we proposed to know the value of X such that TCOD = X*TDLD. We analyzed the following hypothesis test:

H2.0: Median (TCOD of noTRD) <= Median (X*TDLD of noTRD)
H2.1: Median (TCOD of noTRD) > Median (X*TDLD of noTRD)

When executing the test for the noTRD group with X=1, the null hypothesis is rejected (p-value = 4.169e-07; the significance level is taken as 0.05), confirming that the coding time is greater than the designing time. To know how much greater, that is, the relationship between these times (TCOD = X*TDLD), we applied the test again, multiplying TDLD by an integer value X until the null hypothesis could not be rejected. Table 9 presents the results of the Wilcoxon test. The results indicate that for X=1, X=2 and X=3 the null hypothesis is rejected, so the coding time is greater than 3 times the design time. For X=4, the null hypothesis cannot be rejected (p-value = 0.541). In other words, students who did not use templates generally spent at least 3 times more time on coding than on designing.

In the case of the TRD group, the mean value shows that students tend to dedicate more time to design than to code. Therefore, we carried out the analysis in an inverse way, calculating X such that X*TCOD = TDLD. We analyzed the following hypothesis test:

H2.0: Median (X*TCOD of TRD) >= Median (TDLD of TRD)
H2.1: Median (X*TCOD of TRD) < Median (TDLD of TRD)

When executing the Wilcoxon test for the TRD group with X=1, the null hypothesis is rejected (p-value = 0.0007155), confirming that the coding time is less than the designing time. To know how many times more time students spent on designing, we applied the test again, multiplying TCOD by an integer value X until the null hypothesis could not be rejected. Table 10 presents the results of the Wilcoxon test applied to the TRD group. The results indicate that for X=2 the null hypothesis cannot be rejected (p-value = 0.998). So, students who use templates spend more time designing than coding, but not twice as much.
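The iterative, scaled Wilcoxon comparison used above for the noTRD group (TCOD versus X*TDLD) can be sketched as follows; for the TRD group the roles of TCOD and TDLD are simply swapped. This is a minimal illustration, not the study's analysis script; the example pairs are the first eight noTRD rows of Table 7, used only to make the sketch runnable.

```python
from scipy.stats import wilcoxon

def largest_rejected_factor(tcod, tdld, alpha=0.05, max_x=10):
    """Increase the integer factor X while the one-sided paired Wilcoxon test
    still rejects H0: median(TCOD) <= median(X * TDLD)."""
    last_rejected = 0
    for x in range(1, max_x + 1):
        scaled = [x * t for t in tdld]
        _, p = wilcoxon(tcod, scaled, alternative="greater")  # is TCOD > X*TDLD?
        if p < alpha:
            last_rejected = x
        else:
            break
    return last_rejected

# First eight noTRD (TDLD, TCOD) pairs from Table 7, for illustration only.
tdld = [60, 44, 51, 63, 16, 53, 100, 67]
tcod = [172, 369, 446, 350, 245, 302, 427, 289]
# Prints the largest X for which the test still rejects H0 on this small subset.
print(largest_rejected_factor(tcod, tdld))
```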
This result indicates that the group that used templates ded­ icated a greater effort to design than the group that did not use templates. To confirm that the relationship between design­ ing time and coding time previously obtained by the TRD group is due to the use of templates and not to another factor dependent on the group, we studied the relationship (TCOD, TDLD) but in this case during projects 1, 3 and 4 (without using templates). Table 11 presents the mean and the interquartile range of the pairs (TDLD, TCOD) for the TRD group in projects 1, 3 and 4. The values of the descriptive statistics of the TRD group in projects 1, 3 and 4 are similar to those of thenoTRDgroup. In other words, during projects in which students design with­ Moreno et al. 2021 Table 6. Percentage of students who incur at least one code smell by code smell type and student group Code smell Group Project 1 2 3 4 5 6 7 8 1 noTRD 4% 29% 0% 4% 13% 13% 4% 13% TRD 19% 19% 10% 0% 5% 5% 5% 5% 2 noTRD 0% 0% 0% 0% 0% 0% 0% 0% TRD 0% 0% 0% 0% 0% 0% 0% 5% 4 noTRD 8% 58% 0% 13% 30% 46% 29% 50% TRD 24% 43% 5% 10% 10% 43% 24% 95% 5 noTRD 4% 21% 0% 0% 0% 0% 0% 0% TRD 0% 24% 10% 0% 0% 5% 0% 5% 6 noTRD 13% 63% 8% 29% 30% 38% 13% 42% TRD 38% 67% 29% 29% 33% 52% 57% 62% 7 noTRD 0% 25% 0% 0% 0% 4% 8% 0% TRD 0% 19% 0% 0% 0% 5% 0% 5% 9 noTRD 0% 4% 8% 17% 10% 21% 21% 67% TRD 0% 10% 19% 14% 10% 29% 38% 71% 10 noTRD 0% 0% 0% 0% 0% 0% 8% 54% TRD 0% 0% 5% 0% 0% 0% 19% 38% 11 noTRD 4% 46% 42% 8% 40% 4% 46% 75% TRD 0% 29% 29% 0% 14% 5% 24% 62% 13 noTRD 0% 0% 0% 0% 10% 0% 0% 4% TRD 5% 0% 5% 0% 0% 0% 5% 19% 14 noTRD 0% 8% 0% 0% 10% 0% 0% 0% TRD 0% 0% 0% 0% 0% 0% 0% 0% 15 noTRD 0% 0% 8% 4% 20% 0% 13% 17% TRD 0% 0% 0% 0% 5% 0% 0% 0% 16 noTRD 8% 13% 8% 8% 40% 8% 17% 29% TRD 5% 5% 10% 10% 0% 0% 10% 10% Moreno et al. 2021 Table 7. Data pairs for the TRD group and the noTRD group TRD group noTRD group TDLD TCOD TDLD TCOD 178 263 60 172 748 217 44 369 940 621 51 446 522 249 63 350 178 61 16 245 204 221 53 302 163 371 100 427 295 212 67 289 665 265 64 243 175 272 23 464 626 329 31 350 407 169 65 460 757 407 23 248 238 228 18 184 392 269 132 347 288 249 163 225 212 210 140 197 278 150 116 354 573 274 69 205 518 199 33 229 336 398 193 226 453 108 58 329 401 222 103 206 330 360 83 168 515 493 43 241 327 242 92 187 160 169 21 481 296 213 107 304 35 236 205 468 64 224 168 194 Table 8. Mean and the interquartile range for noTRD and TRD groups Group Mean Interquartile range TRD TDLD 399.1 287.5 TRD TCOD 265.7 26.7 noTRD TDLD 19.5 15.7 noTRD TCOD 292.8 132 Table 9. Wilcoxon test for the noTRD group in projects 5 to 8 X=1 X=2 X=3 X=4 4.169e­07 4.088e­05 0.03861 0.541 Table 10. Wilcoxon test for the TRD group in projects 5 to 8 X=1 X=2 0.0007155 0.998 Table 11. Mean and the interquartile range of the pairs (TDLD, TCOD) for the TRD group in projects 1, 3 and 4 Mean Interquartile range TDLD 43 41.5 TCOD 242 118 out using templates, the time spent on design is significantly less than the time spent on coding. Table 12 presents the results of executing Wilcoxon’s test to analyze the relation TCOD = X*TDLD of the TRD group in projects 1, 3 and 4. Table 12. Wilcoxon test for the TRD group in projects 1, 3 and 4 X=1 X=2 X=3 X=4 X=5 3.725e­09 3.725e­08 0.0002701 0.01245 0.09678 The results indicate that for X=5, the null hypothesis can­ not be rejected (p­value = 0.09678). Students of TRD group in projects 1, 3 and 4 generally spent at least 4 times more time on coding than on designing. 
This result shows that there is an increase in the time dedicated to design after the students of the TRD group begin to use design templates. 5 Discussion In the context of our experiment, we found that design repre­ sentation using templates produced an increase in time spent designing (we were expecting this). However, it did not help to develop better­quality software products, nor from an in­ ternal point of view, neither from an external point of view. Results show that the use of templates did not improve nei­ ther the number of defects the developed code has (measured as defects density in UT), nor the internal quality (measured as the number of code smells in the code). These results are related to those reported by Gravino (Gravino et al., 2015), where the use of UML diagrams did not achieve any improve­ ment in the comprehension of the source code vis­à­vis not using them. In addition, the analysis of the relation between effort dedi­ cated to coding and effort dedicated to designing showed that the use of templates produced an increase in design time. Stu­ dents who did not use the templates tended to spent 3 times more on code than on design. Students who use templates spent more time designing than coding. Moreover, students in both groups spent similar time in coding and before us­ ing templates the students in TRD group behave similar to noTRD group. We can conclude then, that using templates to represent design increases the effort dedicated to design but does not have a significant positive effect on quality or in reducing coding time. This can be due to several factors that we must analyze in the future. It could be, among other reasons, that students are not used to these templates and so they did not get the expected benefit; it could be that they just filled the templates but, in that moment, they did not care to think or de­ velop a quality system; it could be that students do not know how to design (as found in other studies); or as mentioned by Chaiyo (Chaiyo and Ramingwong, 2013), it could be that the templates are difficult to use by students. We believe that students do not have the habit of designing and thinking of a solution before coding. Although we think that the use of templates would be helpful, we believe that the students filled them in to achieve the goal without thinking of a design solution. Rather, we believe that the usual stu­ dent practice is code­and­fix. Even though more analysis is Moreno et al. 2021 needed, we agree with several authors on the fact that grad­ uating students have difficulties to design and they do not seem to understand what type of information to include to de­ sign software (Eckerdal et al., 2006a,b; Loftus et al., 2011). 6 Threats to validity Most empirical studies are threatened by the way research is conducted (Wohlin et al., 2012). This section describes the threats to validity we have detected. Internal validity threats: Investigating with students in­ volves several threats. On the one hand, the fact that the con­ text of the experiment is a course implies that the students does not develop naturally. We tried to minimize this threat with a non­graded course, that is, the student approved or failed. Besides, we remarked the importance of monitoring and registering the process just as it was, and we emphasized that students’ assessments would not be done according to results, defects found, or efforts made. On the other hand, there is a threat that students share in­ formation or solutions to projects. 
In this sense, the assigned teachers reviewed the submissions and compared them be­ tween students to ensure there were no duplicate submis­ sions. In addition, students carry out their projects at home, which causes limited control by teachers. To reduce this threat, we introduced supervision, corrections, and feedback between the student and the assigned teacher. Besides, for the analysis, we did a data aggregation of the three courses, knowing that the different courses can have influence on the data collected for being a hierarchical model. We tried to reduce this threat through the use of a defined and disciplined process the students followed, and keeping the same material and the same teachers throughout the three courses. External validity threats: experimenting with students of a course has the advantage that they are available and are will­ ing to participate in experiments, and the disadvantage that their characteristics cannot be generalized. In our experiment, students took part of the PF­PSP course voluntarily and did not know that they were part of an experiment until they fin­ ished the course. This reduces to the minimum the bias they might have when feeling part of a research. Conversely, the results obtained in this experiment cannot be generalized to the students practice of design in other contexts. Construct validity threats: this kind of threat is related to the way in which the response variables were measured. In our experiment, we measured effort as the time in min­ utes that the student spends on the phase and the quality as the number of defects in UT and the number of code smells in which students incur. To ensure a correct data recording, we used a data recording tool and framework that allows a disciplined and measurable process to be followed. Conclusion validity threats: The number of students in the research constitutes a threat to the statistical conclusion. 61 students participated during the three replications. This causes the statistical analysis to be carried out using non­ parametric tests whose statistical power is lower than the parametric tests. As a measure to this threat, we completed the non­parametric tests with descriptive statistics. 7 Conclusions This work is one step further towards the understanding of the software design practice. The results of our experiment show that graduating students do not improve the software quality when using templates for design representation. How­ ever, using templates produces a significant increase in the time spent on the design phase without reducing coding time. We analyzed the software quality from the internal and ex­ ternal points of view, and from the effort dedicated to design. On the one hand, we statistically proved that using templates for design representation does not improve the external soft­ ware quality, measured as the defect density in unit testing. From the internal quality perspective, the use of templates does not have a significant positive effect on the code smells in which students incur when designing software. Regarding the effort, students who used templates dedi­ cate a greater effort to designing than to coding (which is not double). Meanwhile, students that did not use templates dedicated four times less effort to designing than to coding. Our results are related to those mentioned by Gravino and Torchiano (Gravino et al., 2015; Torchiano et al., 2017), where the use of UML diagrams to design does not make significant improvements in their source code comprehen­ sion tasks. 
Also, regarding effort, students who use diagrams spend twice as much time on the same source code compre­ hension task than students who do not use them. Gravino ana­ lyzes the experience factor, and they find that the most experi­ enced students achieve an improvement in the understanding of the source code (Gravino et al., 2015). Although we did not analyze the experience factor of the graduating students, it could be an analysis to be performed in the future. Our research focuses on graduating students, most of them working in the Uruguayan software industry as junior engi­ neers. These engineers usually perform programming tasks, which include low­level design. The results obtained in our experimentcannotbegeneralizedtoall juniordevelopersand even less to senior developers. Our results raises new questions about the practice of soft­ ware design: What do students usually design? What kind of information do they include when designing? Is it possible for them to produce their designs mentally, without repre­ senting them? Do they know the effect of a good design in software quality? Continuing with this line of research, in 2018, we executed an experiment that sought to know how students usually de­ sign. Students performed the same 8 projects during this ex­ periment and delivered the design representation made in a natural way (without templates). Although we have not yet fi­ nalized the data analysis, we have found that students do not deliver complete designs in a preliminary analysis. In gen­ eral, they use informal/natural language and incomplete class diagrams in a few cases. Studying the students’ habitual be­ havior when designing software should help identify poten­ tial problems in the design practices and find better ways of teaching skills for developing quality software. In 2019 and 2020, no experiments could be performed, but in 2021 we Moreno et al. 2021 are replicating the 2019 experiment to have more data. As future work, we will finish the above­mentioned analysis to identify potential problems in the design practices and find better ways of teaching skills for developing quality software. Also, we plan to analyze the designs produced with the tem­ plates to know what students design and conduct interviews with students to know their experience using templates. On the other hand, we find it interesting to experiment with some simple MDD tool to know the effect on software qual­ ity. References Arisholm, E., Briand, L. C., Hove, S. E., and Labiche, Y. (2006). The impact of uml documentation on software maintenance: an experimental evaluation. IEEE Transac­ tions on Software Engineering, 32(6). Ayewah, N., Pugh, W., Hovemeyer, D., Morgenthaler, J. D., and Penix, J. (2008). Using static analysis to find bugs. IEEE software, 25(5). Bourque, P. and Fairley, R. E. (2014). Guide to the Software Engineering Body of Knowledge ­ SWEBOK v3.0. IEEE Computer Society, 2014 version edition. Brown, W. H., Malveau, R. C., McCormick, H. W., and Mowbray, T. J. (1998). AntiPatterns:refactoringsoftware, architectures, and projects in crisis. John Wiley & Sons, Inc. Budgen, D., Burn, A. J., Brereton, O. P., Kitchenham, B. A., and Pretorius, R. (2011). Empirical evidence about the uml: a systematic literature review. Software:Practiceand Experience, 41(4):363–392. Campbell, G. A. and Papapetrou, P. P. (2013). SonarQube in Action. Manning Publications Co, 2013 version edition. Carrington, D. and K Kim, S. (2003). Teaching software de­ sign with open source software. 
Carrington, D. and K Kim, S. (2003). Teaching software design with open source software. In 33rd Annual Frontiers in Education.
Chaiyo, Y. and Ramingwong, S. (2013). The development of a design tool for personal software process (PSP). In 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, pages 1–4.
Chemuturi, M. (2018). Software Design: A Comprehensive Guide to Software Development Projects. CRC Press/Taylor & Francis Group.
Chen, T.-Y., Cooper, S., McCartney, R., and Schwartzman, L. (2005). The (relative) importance of software design criteria. SIGCSE Bull., 37(3):34–38.
Eckerdal, A., McCartney, R., Moström, J. E., Ratcliffe, M., and Zander, C. (2006a). Can graduating students design software systems? In SIGCSE Bull., pages 403–407. ACM, Association for Computing Machinery.
Eckerdal, A., McCartney, R., Moström, J. E., Ratcliffe, M., and Zander, C. (2006b). Categorizing student software designs: Methods, results, and implications. Computer Science Education, 16(3):197–209.
Fernández-Sáez, A., Genero, M., and Chaudron, M. (2013). Empirical studies concerning the maintenance of UML diagrams and their use in the maintenance of code: A systematic mapping study. Information and Software Technology, 55:1119–1142.
Flores, P. and Medinilla, N. (2017). Conceptions of the students around object-oriented design: A case study. In XII Jornadas Iberoamericanas de Ingeniería de Software e Ingeniería del Conocimiento.
Gibbon, C. A. (1997). Heuristics for object-oriented design. PhD thesis, University of Nottingham.
Gopichand, M., Swetha, V., and Ananda Rao, A. (2010). Software defect detection and process improvement using personal software process data. In International Conference on Communication Control and Computing Technologies, pages 794–799.
Gravino, C., Scanniello, G., and Tortora, G. (2015). Source-code comprehension tasks supported by UML design models: Results from a controlled experiment and a differentiated replication. Journal of Visual Languages & Computing, 28:23–38.
Grazioli, F. and Nichols, W. (2012). A cross course analysis of product quality improvement with PSP. In Team Software Process Symposium 2012, pages 76–89.
Grazioli, F., Nichols, W., and Vallespir, D. (2014a). An analysis of student performance during the introduction of the PSP: An empirical cross-course comparison. In Team Software Process Symposium 2013, pages 11–21.
Grazioli, F., Vallespir, D., Pérez, L., and Moreno, S. (2014b). The impact of the PSP on software quality: Eliminating the learning effect threat through a controlled experiment. Advances in Software Engineering, 2014.
Group, S. (2015). The CHAOS report. The Standish Group.
Hayes, W. and Over, J. (1997). The personal software process (PSP): An empirical study of the impact of PSP on individual engineers. Technical Report CMU/SEI-97-TR-001, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA.
Hu, C. (2013). The nature of software design and its teaching: an exposition. ACM Inroads, 4(2).
Humphrey, W. (2005). PSP: A Self-Improvement Process for Software Engineers. Addison-Wesley Professional.
Humphrey, W. S. (1995). A Discipline for Software Engineering. Addison-Wesley Longman Publishing Co., Inc.
Joint Task Force on Computing Curricula - ACM and IEEE Computer Society (2013). Computer Science Curricula 2013: Curriculum Guidelines for Undergraduate Degree Programs in Computer Science. Association for Computing Machinery, New York, NY, USA.
Jones, B. and Kenward, M. G. (2014). Design and Analysis of Cross-Over Trials. Chapman and Hall/CRC, 3rd edition.
Karasneh, B., Jolak, R., and Chaudron, M. R. V. (2015). Using examples for teaching software design: An experiment using a repository of UML class diagrams. In 2015 Asia-Pacific Software Engineering Conference.
Kitchenham, B. and Pfleeger, S. L. (1996). Software quality: the elusive target. IEEE Software, 13(1):12–21.
Kramer, J. (2007). Is abstraction the key to computing? Commun. ACM, 50(4):36–42.
Leung, F. and Bolloju, N. (2005). Analyzing the quality of domain models developed by novice systems analysts. In 38th Hawaii International Conference on System Sciences.
Linder, S. P., Abbott, D., and Fromberger, M. J. (2006). An instructional scaffolding approach to teaching software design. Journal of Computing Sciences in Colleges, 21.
Loftus, C., Thomas, L., and Zander, C. (2011). Can graduating students design: revisited. In Proceedings of the 42nd ACM Technical Symposium on Computer Science Education. ACM.
Martin, R. C. (2002). Agile Software Development: Principles, Patterns, and Practices. Prentice Hall.
Moreno, S. and Vallespir, D. (2018). ¿Los estudiantes de pregrado son capaces de diseñar software? Estudio de la relación entre el tiempo de codificación y el tiempo de diseño en el desarrollo de software. In Conferencia Iberoamericana de Ingeniería de Software 2018.
Nistala, P., Nori, K. V., and Reddy, R. (2019). Software quality models: A systematic mapping study. In 2019 IEEE/ACM International Conference on Software and System Processes, pages 125–134.
Panach, J. I., Dieste, O., Marín, B., España, S., Vegas, S., Pastor, O., and Juristo, N. (2021). Evaluating model-driven development claims with respect to quality: A family of experiments. IEEE Transactions on Software Engineering, 47(1):130–145.
Petre, M. (2013). UML in practice. In 35th International Conference on Software Engineering.
Pierce, K., Deneen, L., and Shute, G. (1991). Teaching software design in the freshman year. In Software Engineering Education. Springer Berlin Heidelberg.
Prechelt, L. and Unger, B. (2001). An experiment measuring the effects of personal software process (PSP) training. IEEE Transactions on Software Engineering, 27(5):465–472.
Senn, S. (2002). Cross-over Trials in Clinical Research. John Wiley & Sons, Ltd, 2nd edition.
Siau, K. and Tan, X. (2005). Improving the quality of conceptual modeling using cognitive mapping techniques. Data & Knowledge Engineering, 55(3). Special issue on quality in conceptual modeling.
Soh, Z., Sharafi, Z., Van den Plas, B., Cepeda Porras, G., Guéhéneuc, Y.-G., and Antoniol, G. (2012). Professional status and expertise for UML class diagram comprehension: An empirical study. In IEEE International Conference on Program Comprehension.
Sommerville, I. (2016). Software Engineering. Pearson.
Stevenson, J. and Wood, M. (2018). Recognising object-oriented software design quality: a practitioner-based questionnaire survey. Software Quality Journal, 26.
Taylor, R. N. (2011). Conference welcome message. In Proc. 33rd International Conference on Software Engineering. Association for Computing Machinery.
Tenenberg, J. (2005). Students designing software: a multinational, multi-institutional study. Informatics in Education, 4.
Torchiano, M., Scanniello, G., Ricca, F., Reggio, G., and Leotta, M. (2017). Do UML object diagrams affect design comprehensibility? Results from a family of four controlled experiments. Journal of Visual Languages & Computing, 41.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in Software Engineering. Springer Science & Business Media.