Journal of Software Engineering Research and Development, 2020, 8:4, doi: 10.5753/jserd.2020.602
This work is licensed under a Creative Commons Attribution 4.0 International License.

Reducing the Discard of MBT Test Cases

Thomaz Diniz [Federal University of Campina Grande | thomaz.morais@ccc.ufcg.edu.br]
Everton L. G. Alves [Federal University of Campina Grande | everton@computacao.ufcg.edu.br]
Anderson G. F. Silva [Federal University of Campina Grande | andersongfs@splab.ufcg.edu.br]
Wilkerson L. Andrade [Federal University of Campina Grande | wilkerson@computacao.ufcg.edu.br]

Abstract

Model-Based Testing (MBT) is used for generating test suites from system models. However, as software evolves, its models tend to be updated, which may lead to obsolete test cases that are often discarded. Test case discard can be very costly since essential data, such as execution history, are lost. In this paper, we investigate the use of distance functions and machine learning to help reduce the discard of MBT tests. First, we assess the problem of managing MBT suites in the context of agile industrial projects. Then, we propose two strategies to cope with this problem: (i) a pure distance function-based strategy. An empirical study using industrial data and ten different distance functions showed that distance functions could be effective for identifying low impact edits that lead to test cases that can be updated with little effort. Moreover, we showed that, by using this strategy, one could reduce the discard of test cases by 9.53%; (ii) a strategy that combines machine learning with distance values. This strategy can classify the impact of edits in use case documents with accuracy above 80%; it was able to reduce the discard of test cases by 10.4% and to identify test cases that should, in fact, be discarded.

Keywords: MBT, Test Case Discard, Suite Evolution, Agile Development

1 Introduction

Software testing plays an important role since it helps us gain confidence that the software works as expected (Pressman, 2005). Moreover, testing is fundamental for reducing risks and assessing software quality (Pressman, 2005). On the other hand, testing activities are known to be complex and costly. Studies found that nearly 50% of a project's budget is related to testing (Kumar & Mishra, 2016).

In practice, a test suite can combine manually and automatically executed test cases (Itkonen et al., 2009). Although automation is always desired, manually executed test cases are still very important. Itkonen et al. (2009) state that manual testing still plays an important role in the software industry and cannot be fully replaced by automatic testing. For instance, a tester that runs manual tests tends to better exercise a GUI and find new faults. On the other hand, manual testing is often costly (Harrold, 2000).

To reduce the costs related to testing, Model-Based Testing (MBT) can be used. It is a strategy where test suites are automatically generated from specification models (e.g., use cases, UML diagrams) (Dalal et al., 1999; Utting & Legeard, 2007). By using MBT, sound tests can be extracted before any coding, and without much effort.

In agile projects, requirements are often volatile (Beck & Gamma, 2000; Sutherland & Sutherland, 2014). In this scenario, test suites are used as safety nets for avoiding feature regression. Discussions on the importance of test case reuse are not new (Von Mayrhauser et al., 1994).
In software engineering, software reuse is key for reducing development costs and improving quality. This is also valid for testing (Frakes, 1994). A test case that finds faults can be a valuable investment (Myers et al., 2011). Good test cases should be stored as a reusable resource to be used in the future (Cai et al., 2009). In this context, an always up-to-date test suite is mandatory. A recent work proposed CLARET, a lightweight specification artifact for enabling the use of MBT in agile projects (N. Jorge et al., 2018). With CLARET, one can both specify requirements using use cases and generate MBT suites from them.

However, a different problem has emerged. As the software evolves (e.g., bug fixes, requirement changes, refactorings), both its models and test suite need revisions. Since MBT test suites are generated from requirement models, in practice, as requirements change, the requirement artifacts are updated, new test suites are generated, and the newer suites replace the old ones. Therefore, test cases that were impacted by the edits, instead of being updated, are often considered obsolete and discarded (Oliveira Neto et al., 2016).

Although one may find it easy to generate new suites, regression testing is based on a stable test suite that evolves. Test case discarding implies the loss of important historical data (e.g., execution time, the link between faults and tests, fault-discovering time). Test case historical data is an important tool for assessing system weaknesses and better managing them; therefore, one should not neglect it. For instance, most defect prediction models are based on historical data (He et al., 2012). Moreover, for some strategies that optimize testing resource allocation, historical data is key (Noor & Hemmati, 2015; Anderson et al., 2014). By discarding test cases, and their historical data, a project may miss important information for both improving the project and guiding its future actions. Moreover, in a scenario where previously detected faults guide development, missing tests can be a huge loss. Finally, test case discard and poor testing are known signs of bad management and eventually lead to software development waste (Sedano et al., 2017).

However, part of a test suite may become obsolete due to model updates with little impact. Those test cases could be easily reused with little effort, consequently reducing testing discards. Nevertheless, manual analysis is tedious, costly, and time-consuming, which often prevents its applicability in the agile context. In this sense, there is a need for an automatic way of detecting reusable and, in fact, obsolete test cases.

Distance functions map a pair of strings to a number that indicates the similarity level between the two versions (Cohen et al., 2003). In a scenario where manual test cases evolve due to requirement changes, distance functions can be an interesting tool to help us classify the impact of the changes on a test case.

In this paper, first, we assess and discuss the practical problem of model evolution in MBT suites. To cope with this problem, we propose and evaluate two strategies for automatically classifying model edits and tests, aiming at avoiding unnecessary test discards. The first is based on distance functions, while the second combines machine learning and distance values.
This work is an extension of our previous one (Diniz et al., 2019), including the following contributions:

• A study using historical data from real industrial projects that investigates the impact of model evolution on MBT suites. We found that 86% of the test cases become obsolete between two consecutive versions of a requirement file, and those tests are often discarded. Moreover, 52% of the found obsolete tests were caused by low impact syntactic edits and could become fully updated with the revision of 25% of their steps.

• An automatic strategy based on distance functions for reclassifying reusable test cases from the obsolete set. This strategy was able to reduce test case discard by 9.53%.

• An automatic strategy based on machine learning and distance functions for classifying test cases and model change impact. This strategy can classify the impact of edits in use case documents with accuracy above 80%; it was able to reduce the discard of test cases by 10.4% and to identify test cases that should, in fact, be discarded.

This paper is organized as follows. In Section 2, we present a motivational example. The needed background is discussed in Section 3. Section 4 presents an empirical investigation for assessing the challenges of managing MBT suites during software evolution. Sections 5 and 6 present the strategy for classifying model edits using distance functions and the performed evaluation, respectively. Section 7 introduces the strategy that combines machine learning and distance values. Section 8 presents a discussion comparing results from both strategies. In Section 9, some threats to validity are discussed. Finally, Sections 10 and 11 present related works and the concluding remarks.

2 Motivational Example

Suppose that Ann works in a project and wants to benefit from MBT suites. Her project follows an agile methodology where requirement updates are expected to be frequent. Therefore, she decides to use CLARET (N. Jorge et al., 2018), an approach for specifying requirements and generating test suites.

The following requirement was specified using CLARET's DSL (Listing 1): "In order to access her email inbox, the user must be registered in the system and provide a correct username and password. In case of an incorrect username or password, the system must display an error message and ask for new data.". In CLARET, an ef [flow #] mark refers to a possible exception flow, and a bs [step #] mark indicates a returning point from an exception/alternative to the use case's basic flow.

From this specification, the following test suite can be generated: S1 = {tc1, tc2, tc3}, where tc1 = [bs:1 → bs:2 → bs:3 → bs:4], tc2 = [bs:1 → bs:2 → bs:3 → ef[1]:1 → bs:3 → bs:4], and tc3 = [bs:1 → bs:2 → bs:3 → ef[2]:1 → bs:3 → bs:4].

Suppose that in the following development cycle, the use case (Listing 1) was revisited and updated due to both requirement changes and for improving readability. Three edits were performed: (i) the message in line 9 was updated to "displays a successful message"; (ii) the system message in line 12 was updated to "alerts that username does not exist"; and (iii) both the description and the system message in exception 2 (line 14) were updated to "Incorrect username/password combination" and "alerts that username and/or password are incorrect", respectively.

Since steps from all execution flows were edited (basic, exception 1, and exception 2), Ann discards S1 and generates a whole new suite.
However, part of S1's tests was not much impacted and could be reused with little or no update. For instance, only edit (iii), in fact, changed the semantics of the use case, while (i) and (ii) are updates that do not interfere with the system's behavior. Therefore, only test cases that exercise the steps changed by (iii) should in fact be discarded (tc3). Moreover, test cases that exercise steps changed by (i) and/or (ii) could be easily reused and/or updated (tc1 and tc2). We believe that an effective and automatic analyzer would help Ann to decide when to reuse or discard test cases, and therefore reduce the burden of losing important testing data.

1  systemName "Email"
2  usecase "Log in User" {
3    actor emailUser "Email User"
4    preCondition "There is an active network connection"
5    basic {
6      step 1 emailUser "launches the login screen"
7      step 2 system "presents a form with username and password fields and a submit button"
8      step 3 emailUser "fills out the fields and click on the submit button"
9      step 4 system "displays a message" ef[1,2]
10   }
11   exception 1 "User does not exist in database" {
12     step 1 system "alerts that user does not exist" bs 3
13   }
14   exception 2 "Incorrect password" {
15     step 1 system "alerts that the password is incorrect" bs 3
16   }
17   postCondition "User successfully logged"
18 }

Listing 1: Use Case specification using CLARET.

3 Background

This section presents the MBT process, the CLARET notation, and the basic idea behind the two strategies used for reducing test case discard: distance functions and machine learning.

3.1 Model-Based Testing

MBT aims to automatically generate and manage test suites from software specification models. MBT may use different model formats to perform its goals (e.g., Labeled Transition Systems (LTS) (Tretmans, 2008), UML diagrams (Bouquet et al., 2007)). As MBT test suites are derived from specification artifacts, their test cases tend to reflect the system behavior (Utting et al., 2012). Utting & Legeard (2007) discuss a series of benefits of using MBT, such as sound test cases, high fault detection rates, and test cost reduction. On the other hand, regarding MBT limitations, we can list the need for well-built models, huge test suites, and a great number of obsolete test cases during software evolution.

Figure 1 presents an overview of the MBT process. The system models are specified through a DSL (e.g., UML) and a test generation tool is used to create the test suite. However, as the system evolves, edits must be performed on its models to keep them up-to-date. If any fault is found, the flow goes back to system development. These activities are repeated until the system is mature for release. Note that previous test suites are discarded, and important historical data may be lost in this process.

Figure 1. MBT Process.

3.2 CLARET

CLARET (N. Jorge et al., 2017, 2018) is a DSL and tool that allows the creation of use case specifications using natural language. It was designed to be the central artifact for both requirement engineering and MBT practices in agile projects. Its toolset works as a syntax checker for use case description files and provides visualization mechanisms for use case revision. Listing 1 presents a use case specification using CLARET.

From the use case description in Listing 1, CLARET generates its equivalent Annotated Labeled Transition System (ALTS) model (Tretmans, 2008) (Figure 2).
Transition labels starting with [c] indicate pre or post conditions, while the ones starting with [s] and [e] are regular and exception execution steps, respectively.

Figure 2. ALTS model of the use case from Listing 1.

CLARET's toolset includes a test generation tool, LTS-BT (Labeled Transition System-Based Testing) (Cartaxo et al., 2008). LTS-BT is an MBT tool that uses LTS models as input and generates test suites by searching for valid graph paths. The generated tests are reported in XML files that can be directly imported into a test management tool, TestLink¹. The test cases reported in Section 2 were collected from LTS-BT.

3.3 Distance Functions

Distance functions are metrics for evaluating how similar, or different, two strings are (Coutinho et al., 2016). Distance functions have been used in different contexts (e.g., (Runkler & Bezdek, 2000; Okuda et al., 1976; Lubis et al., 2018)). For instance, Coutinho et al. (2016) use distance functions for reducing MBT suites.

There are several distance functions (e.g., (Hamming, 1950; Han et al., 2007; Huang, 2008; De Coster et al., 1; Levenshtein, 1966)). For instance, the Levenshtein function (Levenshtein, 1966; Kruskal, 1983), described by the equation below, compares two strings (a and b) and calculates the number of operations required to transform a into b, and vice-versa; where $1_{(a_i \neq b_j)}$ is the indicator function equal to 0 when $a_i = b_j$ and equal to 1 otherwise, and $\mathrm{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b:

$$
\mathrm{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1,j) + 1\\
\mathrm{lev}_{a,b}(i,j-1) + 1\\
\mathrm{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise}
\end{cases}
$$

To illustrate its use, consider two strings a = "kitten" and b = "sitting". Their Levenshtein distance is three, since three operations are needed to transform a into b: (i) replacing 'k' by 's'; (ii) replacing 'e' by 'i'; and (iii) inserting 'g' at the end. A more detailed discussion about the Levenshtein and other functions, as well as an open-source implementation of them, is available².

¹http://testlink.org/
²https://github.com/luozhouyang/python-string-similarity
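To make the recurrence concrete, the minimal Python sketch below evaluates it with the usual dynamic-programming formulation. It is an illustration of the definition only, not the implementation reused in this work (the footnoted library provides production versions of all ten functions).

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming evaluation of the lev_{a,b} recurrence."""
    prev = list(range(len(b) + 1))            # lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1       # indicator 1_{(a_i != b_j)}
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# The example from the text: "kitten" -> "sitting" requires three operations.
assert levenshtein("kitten", "sitting") == 3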
3.4 Machine Learning

Machine Learning is a branch of Artificial Intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention (Michie et al., 1994). By providing ways for building data-driven models, machine learning can produce accurate results and analyses (Zhang & Tsai, 2003).

The learning process begins with observations or data (examples), looks for data patterns, and makes future decisions. By applying machine learning, one aims to allow computers to learn without human intervention and to adjust their actions accordingly.

Machine learning algorithms are often categorized as supervised or unsupervised. Supervised machine learning algorithms (e.g., linear regression, logistic regression, neural networks) use labeled examples from the past to predict future events. Unsupervised machine learning algorithms (e.g., k-Means clustering, Gaussian mixture models) are used when the training data is neither classified nor labeled; they infer a function to describe a hidden structure from unlabeled data.

The use of machine learning in software engineering has grown in the past years. For instance, machine learning methods have been used for estimating development effort (Srinivasan & Fisher, 1995; Baskeles et al., 2007), predicting software fault-proneness (Gondra, 2008), fault prediction (Shepperd et al., 2014), and improving code quality (Malhotra & Jain, 2012).

4 Analysing the Impact of Model Evolution in MBT Suites

To understand the impact of model evolution on MBT suites, we observed two projects (SAFF and BZC) from industrial partners. Both systems were developed in the context of a cooperation between our research lab and two different companies, Ingenico do Brasil Ltda and Viceri Solution Ltda. The SAFF project is an information system that manages status reports of embedded devices, and BZC is a system for optimizing e-commerce logistics activities. The projects were run by two different teams. Both teams applied agile practices and used CLARET for use case specification and generation of MBT suites.

Both projects use manually executed system-level black-box test cases for regression purposes. In this sense, test case historical data is very important since it can help to keep track of the system evolution and to avoid functionality regression. However, the teams reported that they often discard test cases when the related steps in the system use cases are updated in any form, which they refer to as a practical management problem. Therefore, we mined the projects' repositories, traced each model change (use case update), and analyzed its impact on the generated suites.

Table 1. Summary of the artifacts used in our study.
System | #Use Cases | #Versions | #Edits
SAFF   | 13 | 42 | 415
BZC    | 15 | 37 | 103
Total  | 28 | 79 | 518

Our goal in this study was to better understand the impact of model updates on the test suites and to measure how much of a test suite is discarded. To guide this investigation, we defined the following research questions:

• RQ1: How much of a test suite is discarded due to use case edits?
• RQ2: What is the impact of low (syntactic) and high (semantic) model edits on a test suite?
• RQ3: How much of an obsolete test case needs revision to be reused?

4.1 Study Procedure

For each CLARET file (use case model), we collected the history of its evolution in a time frame. In the context of our study, we consider a use case evolution to be any edit found between two consecutive versions of a CLARET file. Our study worked with 28 use cases, a total of 79 versions, and an average of 5 step edits per CLARET file. Table 1 presents a summary of the collected data. After that, we collected the test suites generated for each version of the CLARET files.

Figure 3. Procedure.

We extracted a change set for each pair of adjacent versions of a CLARET file (uc, uc'). In our analysis, we considered two kinds of edits/changes: i) step update: any step in the base version (uc) that had its description edited in the delta version; and ii) step removal: any step that existed in uc but not in uc'. We did not consider step additions. Since our goal was to investigate reuse in a regression testing scenario, we considered suites generated using only the base version (uc). Consequently, no step addition could be part of the generated tests.

After that, we connected the change set to the test suites. For that, we ran a script that matched each edited step to the test cases it impacted. We say a test case is impacted by a modification if it includes at least one modified step from the change set. Thus, our script clustered the tests based on Oliveira Neto et al.
(2016)'s classification: Obsolete – test cases that include updated or removed steps; these tests have different actions or system responses when compared to their previous version; and Reusable – test cases that exercise only unmodified parts of the model specification; all their actions and responses remained the same when compared to their previous version. Figure 3 summarizes our study procedure.

To observe the general impact of the edits, we measured how much of a test suite was discarded due to use case edits. Being s_total the number of test cases generated from a use case (uc), s_obs the number of obsolete test cases found, and N the number of pairs of use cases, we define the Average number of Obsolete Test Cases (AOTC) by Equation 1.

$$AOTC = \left(\sum \frac{s\_obs}{s\_total}\right) \times \frac{1}{N} \quad (1)$$

Then, we manually analyzed each element from the change set and classified it as low impact (syntactic edit), high impact (semantic edit), or a combination of both. For this analysis, we defined three versions of the AOTC metric: AOTC_syn, the average number of obsolete test cases due to low impact edits; AOTC_sem, the average number of obsolete test cases due to high impact edits; and AOTC_both, which considers tests with both low and highly impacted steps.

Finally, to investigate how much of a test case needs revision, for each test, we measured how many steps were modified. For this analysis, we defined the AMS (Average Modified Steps) metric (Equation 2), which measures the proportion of steps that need revision due to model edits. Being tc_total the number of steps in a given test case, tc_c the number of steps that need revision, and N the number of test cases:

$$AMS = \left(\sum \frac{tc\_c}{tc\_total}\right) \times \frac{1}{N} \quad (2)$$

4.2 Results and Discussion

Our results evidence that MBT test suites can be very sensitive to any model evolution. A great number of test cases were discarded. On average, 86% (AOTC) of a suite's tests became obsolete between two consecutive versions of a use case file (Figure 4). This result exposes one of the main difficulties of using MBT in agile projects: requirement files are very volatile. Thus, since test cases are derived from these models, any model edit leads to a great impact on the generated tests. Although a small number of test cases were reused, the teams in our study found the MBT suites useful. They mentioned that a series of unnoticed faults were detected, and the burden of creating tests was reduced. Thus, we can say that MBT suites are effective, but there is still a need for solutions for reducing test case discard due to model updates.

RQ1: How much of a test suite is discarded due to use case edits? On average, 86% of the test cases became obsolete between two versions of a use case model.

We manually analyzed each obsolete test case. Figure 5 summarizes this analysis. As we can see, 52% of the cases became obsolete due to low impact (syntactic) edits in the use case models, while 21% were caused by high impact (semantic) changes, and 12% by both syntactic and semantic changes in the same model. Therefore, more than half of the obsolete set refers to edits that could be easily revised and turned reusable without much effort (e.g., a step rephrasing, typo fixing).

Figure 4. Reusable and obsolete test cases (86% obsolete, 14% reusable).

Figure 5. Tests that became obsolete due to a given change type (52% syntactic, 21% semantic, 12% both).
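For concreteness, the AOTC and AMS metrics defined in Section 4.1 (Equations 1 and 2) can be computed as in the minimal Python sketch below; the per-use-case and per-test counts in the example are illustrative values, not the study's data.

def aotc(obsolete_counts, total_counts):
    """Equation 1: mean fraction of obsolete tests over N pairs of use case versions."""
    fractions = [obs / total for obs, total in zip(obsolete_counts, total_counts)]
    return sum(fractions) / len(fractions)

def ams(changed_steps, total_steps):
    """Equation 2: mean proportion of test steps that need revision over N test cases."""
    fractions = [c / t for c, t in zip(changed_steps, total_steps)]
    return sum(fractions) / len(fractions)

# Illustrative values: three use case pairs and three obsolete test cases.
print(aotc([8, 5, 9], [10, 6, 10]))   # ~0.84 -> about 84% of each suite became obsolete
print(ams([2, 1, 3], [8, 10, 12]))    # 0.20 -> about 20% of the steps need revision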
RQ2: What is the impact of low (syntactic) and high (semantic) model edits on a test suite? 52% of the found obsolete tests were caused by low impact use case edits (syntactic changes), while 21% were due to high impact edits (semantic changes), and 12% by a combination of both.

We also investigated how much of an obsolete test case would need to be revised to avoid discarding it. It is important to highlight that this analysis was based only on the number of steps that require revision; we did not measure the complexity of the revision. Figure 6 shows the distribution of the found results. As we can see, both medians were similar (25%). Thus, a tester often needs to review 25% of a test case to turn it reusable, regardless of the impact of the model evolution.

As most low impact test cases relate to basic syntactic step updates (e.g., fixing typos, rephrasing), we believe the costs of revisiting them can be minimal. For the highly impacted tests (semantic changes), it is hard to infer the costs, and, in some cases, discarding those tests can still be a valid option. However, a test case discard can be harmful and should be avoided.

RQ3: How much of an obsolete test case needs revision to be reused? In general, a tester needs to revisit 25% of the steps of an obsolete test, regardless of the impact of the model edits.
Figure 6. Proportion of the test cases modified by edit type (y-axis: proportion of changed steps in test cases (%); x-axis: obsolete test type – semantic, syntactic).

5 Distance Functions to Predict the Impact of Test Case Evolution

The study described in Section 4 evidences the challenge of managing MBT suites during software evolution. To test whether distance functions can help to cope with this problem, we ran a second empirical study. The goal of this study was to analyze the use of distance functions to automatically classify changes in use case documents that could impact MBT suites.

5.1 Subjects and Functions

This study was also run in the context of the industrial projects SAFF and BZC. It is important to remember that both projects used agile methodologies to guide the development, and updates to the requirement artifacts were frequent. Moreover, their teams used both CLARET (N. Jorge et al., 2017), for use case specification, and LTS-BT (Cartaxo et al., 2008), for generating MBT suites.

As our study focuses on the use of distance functions, we selected a set of ten of the most well-known functions that have been used in different contexts: Hamming (Hamming, 1950), LCS (Han et al., 2007), Cosine (Huang, 2008), Jaro (De Coster et al., 1), Jaro-Winkler (De Coster et al., 1), Jaccard (Lu et al., 2013), Ngram (Kondrak, 2005), Levenshtein (Levenshtein, 1966), OSA (Damerau, 1964), and Sorensen-Dice (Sørensen, 1948). To perform systematic analyses, we normalized their results so that their values range from zero to one: values near zero indicate high similarity (small distances), while values near one indicate low similarity. We reused open-source implementations of all ten functions³⁴. To customize and analyze the edits in the context of our study, we created our own tool and scripts, which were verified through a series of tests.

We mined the projects' repositories and collected all use case edits. Each of these edits would then impact the test cases. We call "impacted" any test case that includes steps that were updated during model maintenance. However, we aim to use distance functions to help us classify these edits and avoid test case discard.

³https://github.com/luozhouyang/python-string-similarity
⁴https://rosettacode.org/wiki/Category:Programming_Tasks
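The sketch below illustrates one plausible way to perform such a normalization, assuming absolute edit distances are divided by the longer string's length and similarity-style scores (e.g., Jaccard, Cosine) are converted via 1 − similarity; the small function registry and the normalization scheme itself are illustrative assumptions, since the actual implementations were reused from the libraries cited in the footnotes.

from typing import Callable, Dict

# Hypothetical registry: each entry returns a raw score for a pair of step descriptions.
RAW_FUNCTIONS: Dict[str, Callable[[str, str], float]] = {
    "hamming": lambda a, b: sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b)),
    "jaccard": lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b)) if a or b else 1.0,
}

def normalized_distance(name: str, a: str, b: str) -> float:
    """Map a raw score to a [0, 1] distance (0 = identical, 1 = completely different)."""
    raw = RAW_FUNCTIONS[name](a, b)
    if name == "jaccard":               # similarity-style score
        return 1.0 - raw
    longest = max(len(a), len(b)) or 1  # edit-distance-style score
    return raw / longest

v1 = "Show page that requires new data."
v2 = "Show page that requires new terminal data."
print({name: round(normalized_distance(name, v1, v2), 2) for name in RAW_FUNCTIONS})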
Table 2. Classification of edits.
Step description (Version 1) | Step description (Version 2) | Classification
"Extract data on offline mode." | "Show page that requires new data." | high impact
"Show page that requires new data." | "Show page that requires new terminal data." | low impact
"Click on Edit button" | "Click on the Edit button" | low impact

To guide our investigation, we defined the following research questions:

• RQ4: Can distance functions be used to classify the impact of edits in use case documents?
• RQ5: Which distance function presents the best results for classifying edits in use case documents?

5.2 Study Setup and Procedure

Since all use case documents were CLARET files, we reused the data collected in the study of Section 4. Therefore, a total of 79 pairs of use case versions were analyzed in this study, with a total of 518 edits. Table 1 summarizes the data.

After that, we manually analyzed each edit and classified it as either low impact or high impact. A low impact edit refers to changes that do not alter the system behavior (a purely syntactic edit), while a high impact edit refers to changes in the system's expected behavior (a semantic edit). Table 2 exemplifies this classification. While the edit in the first line changes the semantics of the original requirement, the next two refer to edits performed for improving readability and fixing typos. During our classification, we found 399 low impact and 27 high impact edits for the SAFF system, and 92 low and 11 high impact edits for BZC. This result shows that use cases often evolve through basic description improvements, which may not justify the great number of discarded test cases in MBT suites.

After that, for each edit (original and edited versions), we ran the distance functions using different configuration values and observed how they classified the edits compared to our manual validation.

5.3 Metrics

To help us evaluate the results and answer our research questions, we used three of the most well-known metrics for checking binary classifications: Precision, which is the rate of relevant instances among the found ones; Recall, which calculates the rate of relevant retrieved instances over the total of relevant instances; and Accuracy, which combines Precision and Recall. These metrics have been used in several software engineering empirical studies (e.g., (Nagappan et al., 2008; Hayes et al., 2005; Elish & Elish, 2008)). Equations 3, 4 and 5 present those metrics, where TP refers to the number of cases in which a distance function classified an edit as low impact and the manual classification confirms it; TN refers to the number of matches regarding high impact edits; FP refers to when the automatic classification reports low impact edits when in
This result gives us evidence that low distance values can relate to low impact edits and, therefore, can be used for predicting low impact changes in MBT suites. On the other hand, we could not find a strong relationship between high impact edits and distance values. Therefore we can answer RQ4 stating that distance functions, in general, can be used to classify low impact. RQ4: Can distance functions be used to classify the im- pact of edits in use case documents? Low impact edits are often related to lower distance values. Therefore, dis- tance functions can be used for classifying low impact ed- its. Figure 7. Box-plot for low impact distance values. As for automatic classification, we need to define an effec- tive impact threshold, for each distance function, we run an exploratory study to find the optimal configuration for using each function. By impact threshold, we mean the distance value for classifying an edit as low or high impact. For in- stance, consider a defined impact threshold of x% to be used with function f. When analyzing an edit from a specification document, if f provides a value lower than x, we say the edit Figure 8. Box-plot for high impact distance values. is low impact, otherwise it is high impact. Therefore, we de- sign a study where, for each function, we vary the defined impact threshold and we observed how it would impact Pre- cision and Recall. Our goal with this analysis is to identify the more effective configuration for each function. We range the impact threshold between [0; 1]. To find this optimal configuration, we consider the inter- ception point between the Precision and Recall curves, since it reflects a scenario with less mistaken classifications (false positives and false negatives). Figure 9 presents the analysis for the Jaccard functions. Its optimal configuration is high- lighted (impact threshold of 0.33) – the green line refers to the Precision curve, the blue line to the Recall curve, and the red circle shows the point both curves meet. Figure 10 presents the analysis for the other functions. Figure 9. Best impact threshold for the Jaccard function. Table 3 presents the optimal configuration for each func- tion and the respective precision, recall, and accuracy val- ues. These results reinforce our evidence to answer RQ4 since all functions presented accuracy values greater than 90%. Moreover, we can partially answer RQ5, since now Diniz et al. 2020 Figure 10. Exploratory study for precision and recall per distance function. we found, considering our dataset, the best configuration for each distance function. To complement our analysis, we went to investigate which function performed the best. First, we run proportion tests considering both the func- tions all at once and pair-to-pair. Our results show, with 95% of confidence, could not find any statistical differ- ences among the functions. This means that distance func- tion for automatic classification of edits impact is effec- tive, regardless of the chosen function (RQ5). Therefore, in practice, one can decide which function to use based on convenience aspects (e.g., easier to implement, faster). RQ5: Which distance function presents the best results for classifying edits in use case documents? Statistically, all ten distance functions performed similarly when classi- fying edits from use case documents. 6 Case Study To reassure the conclusions presented in the previous section, and to provide a more general analysis, we ran new studies considering a different object, TCOM. 
Table 3. Best configuration for each function and respective precision, recall and accuracy values.
Function | Impact Threshold | Precision | Recall | Accuracy
Hamming | 0.91 | 94.59% | 94.79% | 90.15%
Levenshtein | 0.59 | 95.22% | 95.42% | 91.31%
OSA | 0.59 | 95.22% | 95.42% | 91.31%
Jaro | 0.28 | 95.01% | 95.21% | 90.93%
Jaro-Winkler | 0.25 | 95.21% | 95.21% | 91.12%
LCS | 0.55 | 94.99% | 94.79% | 90.54%
Jaccard | 0.33 | 95.22% | 95.42% | 91.31%
NGram | 0.58 | 95.41% | 95.21% | 91.31%
Cosine | 0.13 | 95% | 95% | 90.73%
Sørensen–Dice | 0.47 | 94.99% | 94.79% | 90.54%

6 Case Study

To reassure the conclusions presented in the previous section, and to provide a more general analysis, we ran new studies considering a different object, TCOM. TCOM is an industrial software system also developed in the context of our cooperation with Ingenico do Brasil Ltda. It controls the execution and manages the testing results of a series of hardware parts. It is important to highlight that a different team ran this project, but in a similar environment: CLARET use cases for specification and generated MBT suites. The team also reported similar problems concerning volatile requirements and frequent test case discards.

Table 4. Summary of the artifacts for the TCOM system.
System | #Use Cases | #Versions | #Edits
TCOM | 7 | 32 | 133

First, similar to the procedure applied in Section 5.2, we mined TCOM's repository and collected all versions of its use case documents and their edits. Table 4 summarizes the collected data from TCOM. Then, we manually classified all edits as low or high impact to serve as validation for the automatic classification. Finally, we ran all distance functions considering the optimal impact thresholds (Table 3, second column) and calculated Precision, Recall and Accuracy for each configuration (Table 5).

Table 5. TCOM – Evaluating the use of the found impact thresholds for each function and respective precision, recall and accuracy values.
Function | Impact Threshold | Precision | Recall | Accuracy
Hamming | 0.91 | 87.59% | 94% | 84.96%
Levenshtein | 0.59 | 87.85% | 94% | 85.71%
OSA | 0.59 | 87.85% | 94% | 85.71%
Jaro | 0.28 | 89.52% | 94.00% | 87.22%
Jaro-Winkler | 0.25 | 94.00% | 89.52% | 87.22%
LCS | 0.55 | 89.62% | 95% | 87.97%
Jaccard | 0.33 | 89.52% | 94% | 87.22%
NGram | 0.58 | 87.85% | 94% | 85.71%
Cosine | 0.13 | 88.68% | 94% | 86.47%
Sørensen–Dice | 0.47 | 88.68% | 94% | 86.47%

As we can see, the found impact thresholds presented high precision, recall, and accuracy values when used in a different system and context (all above 84%). This result gives us evidence that distance functions are effective for automatic classification of edits (RQ4) and that the found impact thresholds performed well for a different experimental object (RQ5).

In a second moment, we used this case study to evaluate how our approach (using distance functions for automatic classification) can help reduce test discards:

• RQ6: Can distance functions be used for reducing the discard of MBT tests?

To answer RQ6, we considered TCOM's MBT test cases generated from its CLARET files. Since all distance functions behave similarly (Section 5.4), in this case study we used only Levenshtein's function to automatically classify the edits and to check the impact of those edits on the tests.

Table 6. Example of a low impacted test case.
Original version: ... step 1: operator presses the terminal approving button. step 2: system goes back to the terminal profiling screen. ...
Updated version: ... step 1: operator presses the terminal approving button. step 2: system redirects the terminal to its profiling screen. ...

In a common scenario, which we want to avoid, any test case that contains an updated step would be discarded.
Therefore, in the context of our study, we used the following strategy: "only test cases that contain high impact edits should be discarded, while test cases with low impact edits are likely to be reused with no or little updating". The rationale behind this decision is that low impact edits often imply little to no changes to the system behavior. Considering system-level black-box test suites (such as the ones from the projects used in our study), those tests should be easily reused.

Following this strategy, we first applied Oliveira Neto et al.'s classification (Oliveira Neto et al., 2016), which divided TCOM's tests among three sets: obsolete – test cases that include impacted steps; reusable – test cases that were not impacted by the edits; and new – test cases that include new steps. A total of 1477 MBT test cases were collected from TCOM, where 333 were found new (23%), 724 obsolete (49%), and 420 reusable (28%). This data reinforces Silva et al. (2018)'s conclusions, showing that, in an agile context, most of an MBT test suite becomes obsolete quite fast.

In a common scenario, all "obsolete" test cases (49%) would be discarded throughout the development cycles. To cope with this problem, we ran our automatic analysis and reclassified the 724 obsolete test cases among low impacted – test cases that include unchanged steps and updated steps classified by our strategy as "low impact"; highly impacted – test cases that include unchanged steps and "high impact" steps; and mixed – test cases that include at least one "high impact" step and at least one "low impact" step.

From this analysis, 109 test cases were low impacted. Although this number seems low (15%), those test cases would be wrongly discarded when in fact they could be easily turned into reusable ones. For instance, Table 6 shows a simplified version of a "low impacted" test case from TCOM. As we can see, only step 2 was updated, to better phrase a system response. This was an update for improving specification readability, but it does not have any impact on the system's behavior. We believe that low impacted test cases could be easily reused with little or no effort. In our case study, most of them need small updates that could be easily done in a single updating round, or even during test case execution. From the tester's point of view, this kind of update may not be urgent and should not lead to a test case discard.

The remaining test cases were classified as follows: 196 "highly impacted" (27%) and 419 "mixed" (58%). Tables 7 and 8 show examples of highly impacted and mixed tests, respectively.

Table 7. Example of a highly impacted test case.
Original version: ... step 3: operator presses camera icon. step 4: system redirects to photo capture screen. ... step 9: operator takes a picture and presses the Back button. ...
Updated version: ... step 3: operator selects a testing plan. step 4: system redirects to the screen that shows the selected tests. ... step 9: operator sets a score and presses Ok. ...

Table 8. Example of a mixed test case.
Original version: ... step 2: operator presses button CANCEL to mark there is no occurrence description. ... step 7: operator presses the button SEND. ...
Updated version: ... step 2: operator presses the button CANCEL to mark there is no occurrence description. ... step 7: operator takes a picture of the hardware. ...

In Table 7, we can see that steps 3, 4, and 9 were drastically changed, which suggests a test case that requires much effort to become reusable.
On the other hand, in the test in Table 8, we have both an edit for fixing a typo (step 2) and an edit with a requirement change (step 7).

To check whether our classification was in fact effective, we present its confusion matrix (Table 9). In general, our classification was 66% effective (Precision). A smaller precision was expected, when compared to the classification precision from Section 5, since here we consider all edits that might affect a test case, while in Section 5 we analyzed and classified each edit individually. However, as we can see, our classification was highly effective for low impacted test cases, and most mistaken classifications relate to mixed tests (tests that combine low and high impact edits). Those were, in fact, test cases that were affected to a great degree by different types of use case edits.

Table 9. Confusion Matrix.
             | Predicted Low | Predicted High | Predicted Mixed | Total
Actual Low   | 69 | 3 | 37 | 109
Actual High  | 4 | 37 | 155 | 196
Actual Mixed | 21 | 27 | 371 | 419
Total        | 94 | 67 | 563 | 724

Back to our strategy, we believe that highly impacted or mixed classifications indicate test cases that are likely to be discarded, since they refer to tests that would require much effort to be updated, while low impacted tests can be reused with little to no effort. Overall, our strategy correctly classified the test cases in 66% of the cases (Precision). Regarding low impacted tests, we correctly classified 63% of them. Therefore, from the 724 "obsolete" test cases, our strategy automatically inferred that 9.53% of them should not be discarded.

We believe this rate can get higher when we better analyze the mixed set. A mixed test combines low and high impact edits. However, when we manually analyzed those cases, we found several examples where, although high impact edits were found, most test case impacts were related to low impact edits. For instance, there was a test case composed of 104 execution steps where only one of those steps needed revision due to a high impact use case edit, while the number of low impact edits was seven. In a practical scenario, although we still classify it as a mixed test case, we would say the impact of the edits was still quite small, which may indicate a manageable revision effort. Thus, we state that mixed tests need better analysis before discarding. The same approach may also work for highly impacted tests when related to a low number of edits.

Finally, we can answer RQ6 by saying that an automatic classification using distance functions can, in fact, reduce the number of discarded test cases by at least 9.53%. However, this rate tends to be higher when we consider mixed tests.

RQ6: Can distance functions be used for reducing the discard of MBT tests? The use of distance functions can reduce the number of discarded test cases by at least 9.53%.

7 Combining Machine Learning and Distance Values

In the previous sections, we showed that distance functions alone could help to identify low impact edits that lead to test cases that can be updated with little effort. Moreover, the case study in Section 6 showed that this strategy could reduce test case discard by 9.53%. Although very promising, we believe those results could be improved, especially regarding the classification of high impact edits. In this sense, we propose a complementary strategy that combines distance values and machine learning.

To apply machine learning and avoid test case discard, we used Keras (Gulli & Pal, 2017). Keras is a high-level Python neural networks API that runs on top of TensorFlow (Abadi et al., 2016).
It focuses on efficiency and productivity; therefore, it allows easy and fast prototyping. Moreover, it has strong adoption in both the industry and the research community (Géron, 2019).

Keras provides two types of models: Sequential, and Model with the functional API. We opted to use a Sequential model due to its easy configuration and effective results. A sequential model is composed of a linear stack of layers. Each layer contains a series of nodes and performs calculations. A node is activated only when a certain threshold is achieved. In the context of our model, we used the Rectified Linear Unit (ReLU) and Softmax functions. Both are known to be a good fit for classification problems (Agarap, 2018). Dense layer nodes are connected to all nodes from the next layer, while Dropout layer nodes are more selective.

Our model classifies whether two versions of a given test case step refer to a low or high impact edit. Although techniques such as Word2Vec (Rong, 2014) could be used to transform step descriptions into numeric vectors, due to previous promising results (Sections 5 and 6), we opted to use a classification based on a combination of different distance values. Therefore, to be able to use the model, we first pre-process the input (two versions of a test step), run the ten functions (Hamming (Hamming, 1950), LCS (Han et al., 2007), Cosine (Huang, 2008), Jaro (De Coster et al., 1), Jaro-Winkler (De Coster et al., 1), Jaccard (Lu et al., 2013), Ngram (Kondrak, 2005), Levenshtein (Levenshtein, 1966), OSA (Damerau, 1964), and Sorensen-Dice (Sørensen, 1948)), and collect their distance values. Those values are then provided to our model, which starts with a Dense layer, followed by four hidden layers, and returns as output a size-two probability array O. O's first position refers to the probability of a given edit being classified as high impact, while the second refers to the probability of a low impact edit. The highest of those two values is the final classification of our model.

Suppose two versions of the test step s (s and s'). First, our strategy runs the ten distance functions considering the pair (s; s') and generates its model input set (e.g., I = [0.67; 0.87; 0.45; 0.78; 0.34; 0.6; 0.5; 0.32; 0.7; 0.9]). This set is then provided to our model, which generates the output array (e.g., O = [0.5; 0.9]). For this example, O indicates that the edits that transformed s into s' are high impact with 50% probability, and low impact with 90% probability. Therefore, our final classification is that the edit was low impact.

For training, we created a dataset with 78 instances of edits randomly collected from both the SAFF and BZC projects. To avoid possible bias, we worked with a balanced training dataset (50% low impact and 50% high impact edits). Moreover, we reused the manual classification discussed in Sections 4 and 5 as reference answers for the model. On a notebook with an Intel Core i3 and 4GB of RAM, the training phase was performed in less than 10 minutes.
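A minimal sketch of such a network, assuming the Keras Sequential API, is shown below. The number of units per layer, the use of Dense/Dropout hidden layers, the optimizer, the training hyperparameters, and the toy data are illustrative assumptions; only the overall shape (a Dense input layer, four hidden layers, ReLU activations, and a two-way softmax output over the ten distance values) is fixed by the description above.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Input: ten normalized distance values for a pair of step versions.
# Output: O = [P(high impact), P(low impact)] - the larger value wins.
model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(10,)),  # input Dense layer
    layers.Dense(32, activation="relu"),                     # four hidden layers
    layers.Dropout(0.2),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(2, activation="softmax"),                   # size-two probability array O
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder training data standing in for the 78 labeled distance vectors.
X = np.random.rand(78, 10)
y = keras.utils.to_categorical(np.random.randint(0, 2, size=78), num_classes=2)
model.fit(X, y, epochs=50, batch_size=8, verbose=0)

# Classify one edit from its ten distance values.
O = model.predict(np.array([[0.67, 0.87, 0.45, 0.78, 0.34, 0.6, 0.5, 0.32, 0.7, 0.9]]))
print("low impact" if O[0][1] >= O[0][0] else "high impact")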
7.1 Model Evaluation

Similar to the investigation described in Sections 5 and 6, we conducted an investigation to validate whether the strategy that combines machine learning with distance values is effective and promotes the reduction of test case discard. For that, we set the following research questions:

• RQ7: Can the combination of machine learning and distance values improve the classification of edits' impact in use case documents?
• RQ8: Can the combination of machine learning and distance values reduce the discard of MBT tests?

To answer RQ7 and evaluate our model, we first ran it against two data sets: (i) the combined model edits of the SAFF and BZC projects (Table 1); and (ii) the model edits collected from the TCOM project (Table 4). While the first provides a more comprehensive set, the second allows us to test the model in a whole new scenario. It is important to highlight that both data sets contain only real model edits performed by the teams. Moreover, they contain both low and high impact edits. Again, we reused our manual classification to validate the model's output.

Table 10. Results of our model evaluation.
Dataset | Precision | Recall | Accuracy
SAFF+BZC | 81% | 97% | 80%
TCOM | 94% | 99% | 95%

Table 10 presents the results of this evaluation. As we can see, our strategy performed well for predicting the impact of the edits, especially for TCOM, where it provided an accuracy of 95%. These results give us evidence of our model's efficiency.

RQ7: Can the combination of machine learning and distance values improve the classification of edits' impact in use case documents? Our strategy was able to classify edits with accuracy above 80%, an improvement of 7% when compared to the classification using only distance functions. This result reflects its efficiency.

To answer RQ8, we considered TCOM's MBT test cases generated from its CLARET files. We reused the manual classification from Section 6 and ran our strategy to automatically reclassify the 724 obsolete test cases among low impacted – test cases that include unchanged steps and updated steps classified by our strategy as "low impact"; highly impacted – test cases that include unchanged steps and "high impact" steps; and mixed – test cases that include at least one "high impact" step and at least one "low impact" step.

From the 109 actual low impacted test cases, our strategy was able to detect 75 (69%), an increase of 6% when compared to the classification using a single distance function. Those would be test cases that should be easily revised to avoid discarding, as the model changes were minimal (Figure 6). Table 9 presents the confusion matrix for our model classification. Out of the 724 obsolete test cases (according to Oliveira Neto et al.'s classification (Oliveira Neto et al., 2016)), our model would help a tester to automatically save 10.4% from discarding.

As we can see, overall, our classification was 69% effective, an increase of 3% when compared to the classification using a single distance function (Table 9). Although this improvement may seem small, it is important to remember that those would be the actual tests saved from a wrong discard. On the other hand, we can see a great improvement in the high impact classification (from 19% to 86%). This indicates that, differently from the strategy using a single distance function, our model can be a great help to automatically identify both reusable and in fact obsolete test cases. On the other hand, the classification of mixed test cases performed worse (from 88% to 61%). However, we believe that mixed test cases are the ones that require a manual inspection to check whether they are worth updating for reuse or should be discarded.

It is important to highlight that our combined strategy was able to improve the performance rates for the most important classifications (low and high impacted), which are related to major practical decisions (to discard or not a test case).
Moreover, when wrongly classifying a test case, our model often sets it as mixed, for which we recommend a manual inspection. Therefore, our automatic classification tends to be accurate and not misleading.

Finally, we can answer RQ8 by saying that the combined strategy was, in fact, effective for reducing the discard of MBT tests. The rate of saved tests was 10.4%. Moreover, it improved, when compared to the strategy using a single distance function, the detection rate by 6% for low impacted test cases and by 67% for high impact ones.

RQ8: Can the combination of machine learning and distance values reduce the discard of MBT tests? Our combined strategy helped us to reduce the discard of test cases by 10.4%, an increase of 0.9%. More importantly, it also correctly identifies test cases that should in fact be discarded.

8 General Discussion

In the previous sections, we proposed two different strategies for predicting the impact of model edits and avoiding test case discard: (i) a pure distance function-based strategy; and (ii) a strategy that combines machine learning with distance values. Both were evaluated in a case study with real data. The first strategy applies a simpler analysis, which may imply lower costs. Though simple, it was able to correctly identify 63% of the low impacted test cases and to rescue 9.53% of the test cases that would be discarded. However, it did not perform well when classifying highly impacted tests (19%). Our second approach, though more complex (it requires a set of distance values as inputs to the model), generated better results for classifying low impacted and highly impacted test cases, with 68% and 86% precision, respectively. Moreover, it helped us to avoid the discard of 10.4% of the test cases. Therefore, if running several distance functions for each model edit is not an issue, we recommend the use of (ii), since it is in fact the best option for automatically classifying test cases that should be reused (low impacted) or discarded (highly impacted). Moreover, regarding time, our prediction model's responses were almost instantaneous. Regarding mixed tests, our suggestion is always to inspect them to decide whether they are worth updating.

9 Threats to Validity

Most of the threats to the validity of the drawn conclusions refer to the number of projects, use cases, and test cases used in our empirical studies. Those numbers were limited to the artifacts created in the context of the selected projects. Therefore, our results cannot be generalized beyond the three projects (SAFF, BZC, and TCOM). However, it is important to highlight that all used artifacts are from real industrial systems from different contexts.

As for conclusion validity, our studies deal with a limited data set. Again, since we chose to work with real, instead of artificial, artifacts, the data available for analysis were limited. However, the data was validated by the team engineers and by the authors.

One may argue that since our study deals only with CLARET use cases and test cases, our results are not valid for other notations. However, CLARET resembles traditional specification formats (e.g., UML Use Cases). Moreover, CLARET test cases are basically a sequence of pairs of steps (user input – system response), which relates to most manual testing at the system level.

Regarding internal validity, we collected the change sets from the projects' repositories, and we manually classified each change according to its impact.
This manual validation was performed by at least two of the authors and, when needed, the projects' members were consulted. Moreover, we reused open-source implementations of the distance functions5. These implementations were also validated by the first author.

5 https://github.com/luozhouyang/python-string-similarity

10 Related Work

The practical gains of regression testing are widely discussed (e.g., Aiken et al., 1991; Leung & White, 1989; Wong et al., 1997; Ekelund & Engström, 2015). In the context of agile development, this testing strategy plays an important role by working as a safety net while changes are performed (Martin, 2002). Parsons et al. (2014) investigate regression testing strategies in agile development teams and identify factors that can influence the adoption and implementation of this practice. They found that the investment in automated regression testing is positive, and that tools and processes are likely to be beneficial for organizations. Our strategies (distance functions alone, and distance functions combined with machine learning) are automatic ways to enable the preservation of regression test cases.

Ali et al. (2019) propose a test case prioritization and selection approach for improving regression testing in agile projects. Their approach prioritizes test cases by clustering the ones that frequently change. Here, we see a clear example of the importance of preserving test cases.

Some works relate agile development to Model-Based Testing, demonstrating the general interest in these topics. Katara & Kervinen (2006) introduce an approach to generate tests from use cases. Tests are translated into sequences of events called action-words. This strategy requires an expert to design the test models. Puolitaival (2008) presents a study on the applicability of MBT in agile development, pointing to the need for technically skilled practitioners and for specific adaptations when performing MBT activities. Katara & Kervinen (2006) also discuss how MBT can support agile development, emphasizing the need for automation so that MBT artifacts remain manageable and can be applied with little effort.

Cartaxo et al. (2008) propose a strategy/tool for generating test cases from ALTS models and selecting different paths. Since the ALTS models reflect use cases written in natural language, the generated suites exhibit the problems evidenced in our study (a great number of obsolete test cases) as the model evolves.

Oliveira Neto et al. (2016) discuss a series of problems related to keeping MBT suites updated during software evolution. To cope with this problem, they propose a test selection approach that uses test case similarity as input when collecting test cases that focus on recently applied changes. Oliveira Neto et al.'s approach marks as obsolete all test cases that are impacted in any way by edits in the requirement model. However, as our study found, a great part of those tests may be only slightly impacted and could be easily reused, avoiding the discard of testing artifacts.

The test case discard problem is not restricted to CLARET artifacts. Other similar cases are discussed in the literature (e.g., Oliveira Neto et al., 2016; Nogueira et al., 2007). Moreover, this problem is even greater for MBT test cases derived from artifacts written in non-controlled language (Pinto et al., 2012). Other works also deal with test case evolution (e.g., Katara & Kervinen, 2006; Pinto et al., 2012).
They discuss the problem and/or propose strategies for updating the testing code. Those strategies do not apply to our context, as we work with the evolution of MBT test suites generated from use case models.

Distance functions have been used in different software engineering scenarios (e.g., Runkler & Bezdek, 2000; Okuda et al., 1976; Lubis et al., 2018). For instance, Runkler & Bezdek (2000) use the Levenshtein function to automatically extract keywords from documents. In the context of MBT, Coutinho et al. (2016) investigated the effectiveness of a series of distance functions when combined with similarity-based suite reduction strategies. Although obtained in a different context, their results agree with ours in that all distance functions performed in a similar way.

The use of machine learning techniques in software engineering is not new. Baskeles et al. (2007) propose a model for estimating development effort, aiming at overcoming problems related to budget and schedule overruns. Gondra (2008) uses an artificial neural network to determine the importance of software metrics for predicting fault-proneness. Durelli et al. (2019) present a systematic mapping study on machine learning applied to software testing. From 48 selected primary studies, they found that machine learning has been used mainly for test case generation, refinement, and evaluation. For instance, Strug & Strug (2012) use a KNN learner to reduce the set of mutants to be executed in mutation testing; it predicts whether a test can kill certain mutants. Fraser & Walkinshaw (2015) propose an approach based on machine learning algorithms to evaluate test suites using behavioral coverage; it receives data from a test generation tool and predicts the behavior of the program for the given inputs. Zhu et al. (2008) propose a model for estimating test execution effort based on testing data such as the number of test cases, test complexity, and knowledge of the system under test. Chen et al. (2011) present a machine learning approach for selecting regression test cases; their learner clusters similar test cases based on an input function and constraints. Our work differs from the others since it uses machine learning and distance functions to predict the impact of a given use case update and to avoid the discard of MBT test cases.

11 Concluding Remarks

In this paper, we describe a series of empirical studies run on industrial systems to evaluate the use of distance functions for automatically classifying the impact of edits in use case files. Our results showed that distance functions are effective in identifying low impact edits. Therefore, we propose two variations of their use: as a classification strategy by itself (Section 5), and combined with a machine learning model (Section 7). We also found that low impact edits often refer to test cases that can be updated with little effort. Our strategies helped to identify both low impact and high impact test cases. We believe these results can help testers to better work with MBT artifacts in the context of software evolution and to avoid the discard of test cases.

As future work, we plan to expand our study with a broader set of systems. We also consider developing a tool that, using distance functions, can help testers to identify and update low impact test cases.
Finally, we plan to investigate the use of different approaches (e.g., other machine learning techniques, dictionaries) to improve our classification rates and better help testers when updating highly impacted test cases.

Acknowledgements

This research was partially supported by a cooperation between UFCG and two companies, Viceri Solution LTDA and Ingenico do Brasil LTDA, the latter stimulated by the Brazilian Informatics Law n. 8.248, 1991. The second and fourth authors are supported by the National Council for Scientific and Technological Development (CNPq)/Brazil (processes 429250/2018-5 and 315057/2018-1). The first and third authors were supported by UFCG/CNPq and CAPES, respectively.

References

Abadi M., et al., 2016, in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp 265–283
Agarap A. F., 2018, arXiv preprint arXiv:1803.08375
Aiken L. S., West S. G., Reno R. R., 1991, Multiple regression: Testing and interpreting interactions. Sage
Ali S., Hafeez Y., Hussain S., Yang S., 2019, Software Quality Journal, pp 1–27
Anderson J., Salem S., Do H., 2014, in Proceedings of the 11th Working Conference on Mining Software Repositories. pp 142–151
Baskeles B., Turhan B., Bener A., 2007, in 2007 22nd International Symposium on Computer and Information Sciences. pp 1–6
Beck K., Gamma E., 2000, Extreme programming explained: embrace change. Addison-Wesley Professional
Bouquet F., Grandpierre C., Legeard B., Peureux F., Vacelet N., Utting M., 2007, in Proceedings of the 3rd International Workshop on Advances in Model-Based Testing. pp 95–104
Cai L., Tong W., Liu Z., Zhang J., 2009, in 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing. pp 103–108
Cartaxo E. G., Andrade W. L., Neto F. G. O., Machado P. D., 2008, in Proceedings of the 2008 ACM Symposium on Applied Computing. pp 1540–1544
Chen S., Chen Z., Zhao Z., Xu B., Feng Y., 2011, in 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation. pp 1–10
Cohen W. W., Ravikumar P., Fienberg S. E., et al., 2003, in IIWeb. pp 73–78
Coutinho A. E. V. B., Cartaxo E. G., de Lima Machado P. D., 2016, Software Quality Journal, 24, 407
Dalal S. R., Jain A., Karunanithi N., Leaton J. M., Lott C. M., Patton G. C., Horowitz B. M., 1999, in Proceedings of the 1999 International Conference on Software Engineering (IEEE Cat. No.99CB37002). pp 285–294, doi:10.1145/302405.302640
Damerau F. J., 1964, Commun. ACM, 7, 171
De Coster X., De Groote C., Destiné A., Deville P., Lamouline L., Leruitte T., Nuttin V., 1
Diniz T., Alves E. L., Silva A. G., Andrade W. L., 2019, in Proceedings of the XXXIII Brazilian Symposium on Software Engineering. pp 337–346
Durelli V. H., Durelli R. S., Borges S. S., Endo A. T., Eler M. M., Dias D. R., Guimaraes M. P., 2019, IEEE Transactions on Reliability, 68, 1189
Ekelund E. D., Engström E., 2015, in 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). pp 449–457
Elish K. O., Elish M. O., 2008, Journal of Systems and Software, 81, 649
Frakes W., 1994, in Proceedings of 1994 3rd International Conference on Software Reuse. pp 2–3
Fraser G., Walkinshaw N., 2015, Software Testing, Verification and Reliability, 25, 749
Géron A., 2019, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media
Gondra I., 2008, Journal of Systems and Software, 81, 186
Gulli A., Pal S., 2017, Deep learning with Keras. Packt Publishing Ltd
Hamming R. W., 1950, The Bell System Technical Journal, 29, 147
Han T. S., Ko S.-K., Kang J., 2007, in International Workshop on Machine Learning and Data Mining in Pattern Recognition. pp 585–600
Harrold M. J., 2000, in Proceedings of the Conference on the Future of Software Engineering. pp 61–72
Hayes J. H., Dekhtyar A., Sundaram S., 2005, in ACM SIGSOFT Software Engineering Notes. pp 1–5
He Z., Shu F., Yang Y., Li M., Wang Q., 2012, Automated Software Engineering, 19, 167
Huang A., 2008, in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand. pp 9–56
Itkonen J., Mantyla M. V., Lassenius C., 2009, in 2009 3rd International Symposium on Empirical Software Engineering and Measurement. pp 494–497, doi:10.1109/ESEM.2009.5314240
Katara M., Kervinen A., 2006, in Haifa Verification Conference. pp 219–234
Kondrak G., 2005, in Consens M. P., Navarro G., eds, Lecture Notes in Computer Science Vol. 3772, SPIRE. Springer, pp 115–126, http://dblp.uni-trier.de/db/conf/spire/spire2005.html#Kondrak05
Kruskal J. B., 1983, SIAM Review, 25, 201
Kumar D., Mishra K., 2016, Procedia Computer Science, 79, 8
Leung H. K., White L., 1989, in Proceedings. Conference on Software Maintenance-1989. pp 60–69
Levenshtein V. I., 1966, Soviet Physics Doklady, 10, 707
Lu J., Lin C., Wang W., Li C., Wang H., 2013, pp 373–384, doi:10.1145/2463676.2465313
Lubis A. H., Ikhwan A., Kan P. L. E., 2018, International Journal of Engineering & Technology, 7, 17
Malhotra R., Jain A., 2012, Journal of Information Processing Systems, 8, 241
Martin R. C., 2002, Agile software development: principles, patterns, and practices. Prentice Hall
Michie D., Spiegelhalter D. J., Taylor C., et al., 1994, Neural and Statistical Classification, 13, 1
Myers G. J., Sandler C., Badgett T., 2011, The art of software testing. John Wiley & Sons
N. Jorge D., Machado P., L. G. Alves E., Andrade W., 2017, in Proceedings of the 24th Tools Session / 8th Brazilian Conference on Software: Theory and Practice, doi:10.1109/RE.2018.00041
N. Jorge D., Machado P., L. G. Alves E., Andrade W., 2018, pp 336–346, doi:10.1109/RE.2018.00041
Nagappan N., Murphy B., Basili V., 2008, in 2008 ACM/IEEE 30th International Conference on Software Engineering. pp 521–530
Nogueira S., Cartaxo E., Torres D., Aranha E., Marques R., 2007, in 1st Brazilian Workshop on Systematic and Automated Software Testing
Noor T. B., Hemmati H., 2015, in 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE). pp 58–68
Okuda T., Tanaka E., Kasai T., 1976, IEEE Transactions on Computers, 100, 172
Oliveira Neto F. G., Torkar R., Machado P. D., 2016, Information and Software Technology, 80, 124
Parsons D., Susnjak T., Lange M., 2014, Software Quality Journal, 22, 717
Pinto L. S., Sinha S., Orso A., 2012, in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. p. 33
Pressman R., 2005, Software Engineering: A Practitioner's Approach, 6 edn. McGraw-Hill, Inc., New York, NY, USA
Puolitaival O.-P., 2008, Adapting model-based testing to agile context: Master's thesis. VTT Technical Research Centre of Finland
Rong X., 2014, arXiv preprint arXiv:1411.2738
Runkler T. A., Bezdek J. C., 2000, in Ninth IEEE International Conference on Fuzzy Systems. FUZZ-IEEE 2000 (Cat. No. 00CH37063). pp 636–640
Sedano T., Ralph P., Péraire C., 2017, in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). pp 130–140
Shepperd M., Bowes D., Hall T., 2014, IEEE Transactions on Software Engineering, 40, 603
Silva A. G., Andrade W. L., Alves E. L., 2018, in Proceedings of the III Brazilian Symposium on Systematic and Automated Software Testing. SAST '18. ACM, New York, NY, USA, pp 49–56, doi:10.1145/3266003.3266009, http://doi.acm.org/10.1145/3266003.3266009
Sørensen T., 1948, Biol. Skr., 5, 1
Srinivasan K., Fisher D., 1995, IEEE Transactions on Software Engineering, 21, 126
Strug J., Strug B., 2012, in IFIP International Conference on Testing Software and Systems. pp 200–214
Sutherland J., Sutherland J., 2014, Scrum: the art of doing twice the work in half the time. Currency
Tretmans J., 2008, in Formal Methods and Testing. Springer, pp 1–38
Utting M., Legeard B., 2007, Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA
Utting M., Pretschner A., Legeard B., 2012, Software Testing, Verification and Reliability, 22, 297
Von Mayrhauser A., Mraz R., Walls J., Ocken P., 1994, in Proceedings 1994 IEEE International Conference on Computer Design: VLSI in Computers and Processors. pp 484–491
Wong W. E., Horgan J. R., London S., Agrawal H., 1997, in Proceedings of the Eighth International Symposium on Software Reliability Engineering. pp 264–274
Zhang D., Tsai J. J., 2003, Software Quality Journal, 11, 87
Zhu X., Zhou B., Hou L., Chen J., Chen L., 2008, in 2008 The 9th International Conference for Young Computer Scientists. pp 1193–1198