Journal of Software Engineering Research and Development, 2022, 10:8, doi: 10.5753/jserd.2022.2133
This work is licensed under a Creative Commons Attribution 4.0 International License.

Accessibility Mutation Testing of Android Applications

Henrique Neves da Silva [Federal University of Paraná | henriqueneves@ufpr.br]
Silvia Regina Vergilio [Federal University of Paraná | silvia@inf.ufpr.br]
André Takeshi Endo [Federal University of São Carlos | andreendo@ufscar.br]

Abstract

Smart devices and their apps are present in many everyday activities and play an important role for people with disabilities. However, making apps more accessible is still a challenge for developers. Automated accessibility testing tools can help in this task but present some limitations. They produce reports on accessibility faults that usually cover only a subset of the app, because they depend on the test set available. To help improve and/or assess the generated test suites, as well as to contribute to increasing the performance of accessibility testing tools, this work introduces a mutation testing approach. The approach includes a set of mutation operators derived from faults corresponding to the negation of the WCAG standard's principles and success criteria. It also includes a process to analyse the mutants with respect to the original app. Evaluation results with 7 open-source apps show the approach is applicable in practice and contributes to significantly improving the number of faults revealed by the test suites accompanying the apps.

Keywords: Mobile Apps, Mutation Testing, Accessibility

1 Introduction

In the last decade, we have observed a growing number of smartphones, and studies show this number is expected to increase even more in the next years (Cisco, 2017). Smart devices and their apps have become a key component in people's daily lives. This is no different for people with disabilities. For instance, people with some visual impairment have relied on smartphones as a vital means to foster independence in carrying out various tasks, such as understanding text document structure, communicating through social media apps, identifying products on supermarket shelves, and moving between obstacles (Acosta-Vargas et al., 2020).

The World Health Organization (WHO) estimated that more than one billion people, around 15% of the world's population, are affected by some form of disability (Hartley, 2011). It is therefore fundamental to engineer software so that all the advantages of technology are accessible to every individual. Mobile accessibility refers to making websites and apps more accessible to people with disabilities when they use smartphones and other mobile devices (W3C, 2019). Progress has been made with accessibility because of mandates from government regulations (e.g., U.S. Section 508 of the Rehabilitation Act), standards (such as the British Broadcast Corporation Standards, the Brazilian Accessibility Model, and the Web Content Accessibility Guidelines), widespread industrial awareness, technological advances, and accessibility-related lawsuits (Yan and Ramachandran, 2019). However, developers still face the challenge of providing more accessible software on mobile devices. According to Ballantyne et al.
(2018), much of the research on software accessibility is dedicated to the Web and its sites (Grechanik et al., 2009; Wille et al., 2016; Abuaddous et al., 2016), even though there is a recurring effort on the accessibility of mobile apps (Vendome et al., 2019). Moreover, studies point to the lack of adequate tools, guides, and policies to design, evaluate, and test accessibility in mobile apps (Acosta-Vargas et al., 2020).

Automated accessibility testing tools are usually based on existing guidelines. One of the most popular standards is the WCAG (W3C's Web Content Accessibility Guidelines) guide (Kirkpatrick et al., 2018). The WCAG guide covers recommendations for people with blindness and low vision, deafness and hearing loss, limited movement, cognitive limitations, and speech and learning disabilities. WCAG encompasses several guidelines, each one related to different success criteria, grouped into four accessibility principles. Some tools produce, given a set of executed test cases, a report of accessibility violations for the app. Examples of these tools are Accessibility Google Scanner (Google, 2020), Espresso (Google, 2018), A11y Ally (Toff, 2018), and MATE (Eler et al., 2018). They can perform static or dynamic analysis (Silva et al., 2018). A limited number of violations can be checked by static tools, while dynamic analysis tends to be more costly. Another limitation is that the accessibility faults checked by the tools are limited by the test cases used. They cover only a subset of the app due to weak test scripts or limited input test data generation algorithms (Silva et al., 2018). Tools generally used for test data generation, such as Monkey (Moher et al., 2009), Sapienz (Mao et al., 2016), Stoat (Su et al., 2017), and APE (Gu et al., 2019), are focused on functional behavior, code coverage, or crashes. In this sense, this work hypothesizes that a mutation approach specific to accessibility testing can help in the improvement and/or assessment of generated test suites and contribute to increasing the performance of accessibility testing tools.

The idea behind mutation testing is to derive versions of the program under test P, called mutants. Each mutant describes a possible fault and is produced by a mutation operator (Jia and Harman, 2011). The objective is to generate test cases capable of distinguishing P from its mutants, that is, test cases that, when executed on a mutant m, produce an output different from that of P. If P's result is correct, it is free from the fault described by m. If the output is different, m is said to be killed. At the end, a measure called the mutation score is calculated, based on the number of mutants killed. This measure can be used to design test cases, or to evaluate the quality of an existing test suite and decide whether a program has been tested enough.

Mutation testing has proven effective in different domains and contexts (Jia and Harman, 2011). More recently, it has been used to test non-functional properties such as performance regarding execution time (Lisper et al., 2017) and energy consumption (Jabbarvand and Malek, 2017).
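Returning to the mutation score mentioned above: the formula is not spelled out in this paper, but as a point of reference it is commonly defined in the mutation testing literature as

$$MS(P, T) = \frac{DM(P, T)}{M(P) - EM(P)}$$

where $DM(P, T)$ is the number of mutants killed by the test set $T$, $M(P)$ is the total number of generated mutants, and $EM(P)$ is the number of equivalent mutants, i.e., mutants that behave exactly like $P$ for every input and therefore cannot be killed.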
There are some initiatives exploring mutation testing of Android apps (Wei, 2015; Deng et al., 2015; Jabbarvand and Malek, 2017; Luna and El Ariss, 2018; Escobar-Velásquez et al., 2019), but these works are not focused on accessibility testing.

Given the context and motivation described above, this paper introduces a mutation approach for the accessibility testing of Android apps. The underlying fault model is related to non-compliance with WCAG principles and success criteria. We propose a set of 6 operators that remove selected code elements, the ones most commonly used in apps, whose absence may imply accessibility violations. We also define a mutant analysis process that uses tools' accessibility reports to distinguish killed mutants. The process is implemented using the reports produced by Espresso (Google, 2018), and evaluated with 7 open-source apps. The results show our approach is applicable in practice and contributes to improving the quality of the test suites accompanying the selected apps. We observe a significant improvement in the number of faults revealed by using the mutant-adequate test suites.

In this way, the present work introduces a mutation approach that encompasses a set of mutation operators and a mutation process implemented by a tool. The approach (i) can be used as a criterion for test data generation and/or assessment, helping developers measure the quality of their test suites or generate tests from an accessibility perspective; (ii) can be explored to evaluate the accessibility tools available in the market and in academia; and (iii) contributes to the emergent area of mutation testing for non-functional properties and represents a first step towards accessibility mutation testing, serving as a basis to direct future research and encourage the academic community to create tools that further explore this field of research.

The remainder of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 introduces our mutation testing approach. Section 4 details the evaluation and its main results. Section 5 discusses the threats to validity, and Section 6 concludes the paper.

2 Related Work

Related work can be classified into two main categories: mutation testing of apps (Section 2.1) and accessibility testing (Section 2.2).

2.1 Mutation testing of Android Apps

In the literature, there are some mutation approaches for Android apps. Deng et al. (2015) define 4 classes of mutation operators specific to the Android context. The proposed workflow differs from the traditional mutation testing process: once the mutants are generated, each mutant m must be installed on the Android emulator. The test cases are implemented with the frameworks Robotium (Reda, 2019) or JUnit (Gamma and Beck, 2019). While Deng's approach requires the app source code, Wei (2015) proposes muDroid, a tool that requires only the APK file of the app.

Linares-Vásquez et al. (2017) define a list of 38 mutation operators, implemented by the tool MDroid+ (Moran et al., 2018). First, a static analysis of the Java code using Abstract Syntax Trees (AST) is performed to find a Potential Fault Profile (PFP) that describes a source code location that can be changed by an operator. PFPs are used to apply the transformation corresponding to each operator in the Java code or XML file. MDroid+ creates a clone of the Android project and applies a single mutation to a PFP specified in the cloned project, resulting in a mutant.
Finally, a report is generated associating the name of the created clone with the applied operator. The tool does not offer a way to compile and execute the mutants, nor does it calculate the mutation score. In a follow-up study, Escobar-Velásquez et al. (2019) introduce MutAPK, which requires as input the APK of the Android app and implements the same operators as MDroid+ (Linares-Vásquez et al., 2017; Moran et al., 2018). The corresponding implementation considers the SMALI representation. Like MDroid+, MutAPK does not include a mutant analysis strategy. Both allow the creation of customized mutation operators.

Some works have explored aspects of a specific nature within the Android platform. The Edroid tool (Luna and El Ariss, 2018) implements 10 mutation operators oriented to vary configuration files and GUI elements. The analysis of the mutants is done manually: if the mutant's UI components are distinguished from the original, the mutant is classified as dead.

µDroid is a mutation tool to identify energy-related problems (Jabbarvand and Malek, 2017). The tool implements a total of 50 mutation operators corresponding to 28 classes defined as energy consumption anti-patterns. µDroid has a fully automated mutation testing process. While the test is performed on the original app, energy consumption is monitored. When the test is executed on the mutant, the energy consumption of the original app is compared to that of the mutant; if the consumption profiles differ enough, the mutant is considered dead.

Most tools could be extended to offer integrated support for the mutation testing process, mainly automatic mutant execution and analysis. Most of them only generate mutants and do not offer automatic support for the analysis of the mutant output, which is mainly conducted manually. In addition, there are some initiatives exploring mutation testing of apps for non-functional properties, such as energy consumption, but they do not address accessibility faults. Based on elicited results about mutation testing of mobile apps (Silva et al., 2021), and to the best of our knowledge, there is no mutation approach for mobile accessibility testing and evaluation.

2.2 Accessibility evaluation of Android Apps

There are few studies on the accessibility assessment of mobile apps. This small number of studies is due to the lack of adequate tools, guides, and policies to evaluate apps (Acosta-Vargas et al., 2020; Eler et al., 2018). Such guides are generally used as oracles to check whether the app meets accessibility requirements during an accessibility evaluation, which can be conducted manually or by automated tools. Below, we present some works that analyse those guides and report the main accessibility problems, as well as automated tools that take them into consideration.

Ballantyne et al. (2018) compile a super-set of guides and normalize them to eliminate redundancy. The result lists 11 categories of testable accessibility elements: Text, Audio, Video, GUI Elements, User Control, Flexibility and Efficiency, Recognition instead of Recalling, Gestures, System Visibility, Error Prevention, and Tangible Interaction. Damaceno et al. (2018) perform a similar mapping that identifies 68 problems associated with different aspects of the interaction of people with visual impairments on mobile devices. These problems are mapped into 7 groups: Buttons, Data Entry, Gesture-based interaction, Screen size, User feedback, and Voice command.
The group with the most problems is related to gesture-based interaction.

Vendome et al. (2019) elaborate a taxonomy of accessibility problems by mining 13,817 Android apps from GitHub. The authors observe that 36.96% of the projects did not have elements with descriptive label attributes, and only 2.08% imported at least one accessibility API. The main categories listed in the fault model are: support for visual limitation, support for motor limitation, hearing limitation, and other aspects of accessibility.

Alshayban et al. (2020) present the results of a large-scale study to understand accessibility from three complementary perspectives: apps, developers, and users. First, they analyze the prevalence of accessibility violations in over 1,000 Android apps. Then they investigate developer sentiments through a survey. Finally, they investigate user ratings and app popularity. Their analysis revealed that inaccessibility rates for apps developed by big companies are relatively similar to the inaccessibility rates for other apps.

The works of Acosta-Vargas et al. (2019, 2020) evaluate the use of WCAG 2.1 and the Accessibility Google Scanner, a tool that suggests accessibility improvements for Android apps. The authors conclude that the WCAG guide helps achieve digital inclusion on mobile platforms. However, accessibility problems must be fixed before the application goes into production, and they recommend the use of WCAG throughout the development cycle.

The most recent version of WCAG, 2.1, includes suggestions for web access via a mobile device (Kirkpatrick et al., 2018). WCAG principles are grouped into 4 categories: (i) Perceivable, that is, "the information must be presentable to users in ways they can perceive"; (ii) Operable, "User interface components and navigation must be operable."; (iii) Understandable, "Information and the operation of user interface must be understandable."; and (iv) Robust, "Content must be robust enough that it can be interpreted by a wide variety of user agents, including assistive technologies". These principles are the core tenets of accessibility. To follow the accessibility principles, we must achieve the success criteria defined within their respective guideline and principle.

Automated tools commonly use the WCAG success criteria as testable statements to check for guideline violations. They can perform static or dynamic analysis (Silva et al., 2018). Static analysis can quickly analyze all assets of an app (Google, 2018), but it cannot find violations that can only be detected during runtime (e.g., low color contrast). In contrast, dynamic analysis tends to be time consuming. In this sense, Eler et al. (2018) define a set of accessibility criteria and implement MATE (Mobile Accessibility TEsting), a tool that automatically explores and verifies the accessibility of mobile apps.

Developers can also manually assess accessibility properties using the Google Scanner (Google, 2020). It allows testing apps and getting suggestions on how to improve accessibility (to help those who have limited vision, speech, or movement). First, the app is activated; then it displays the main handling instructions. Finally, with the mobile app running, Google Scanner highlights each GUI element on the screen and the accessibility property it has not fulfilled. The A11y Ally app (Toff, 2018) checks the accessibility of the running app.
Integrated via the command line, A11y Ally generates a JSON file at the end of its execution. This file contains the list of GUI elements and the accessibility criteria that have been violated. The framework Espresso (Google, 2018) allows the recording of automated tests that assess the accessibility of the mobile app. The accessibility of a GUI element (or simply widget) is checked only if a test action triggers/interacts with the widget in question.

The tools for accessibility testing and evaluation present some limitations. The most noticeable one is that the kind and number of accessibility violations determined by the tools depend on the test set used to execute the app and produce the reports. In this sense, the use of mutants describing potential accessibility faults can guide test data generation and help in the improvement or assessment of an existing test set regarding this non-functional property.

3 A Mutation Approach for Accessibility Testing

This section introduces our approach and describes its main elements, which are usually required for any mutation approach: (i) the underlying fault model, related to accessibility faults; (ii) the mutation operators; (iii) the mutation testing process, adopted to analyze the mutants; and (iv) automation aspects, essential to allow the use of the approach in practice.

3.1 Fault Model

In this stage, we searched the literature for different accessibility guides that establish good practices and for experiments that used them (see Section 2.2). In general, a guide summarizes the main recommendations for making the presented content of a mobile app more accessible. As a result of our search, we observe that the WCAG guide was adopted as a reference to build mobile accessibility guides such as eMAG (Brazilian Government, 2007), the List of Accessibility Guidelines for Mobile Applications (Ballantyne et al., 2018), the BBC Mobile Accessibility Guideline (BBC, 2017), and the SiDi Accessibility Guideline (SejaSidier, 2015). The WCAG guide was chosen for the following reasons: i) as mentioned before, it encompasses success criteria written as testable statements; ii) it is constantly updated, and each new version of the guide maintains compliance with its previous one; and iii) it has been considered by many authors as the most popular guide (Acosta-Vargas et al., 2019, 2020).

Once the success criteria are known, we can start building a fault model by negating these criteria. An unsatisfied criterion may imply one or more accessibility faults, as exemplified in Table 1.

Table 1. Negating WCAG success criteria

Principle       | Success criterion                         | Success criterion denial
Perceivable     | Content description for non-text elements | Absence of content descriptions
Operable        | Recommended touch area size               | Not recommended touch area size
Understandable  | Labels or instructions                    | Absence of labels or instructions
Robust          | Status messages                           | Absence of status messages

As observed in Table 1, the denial of the criterion "Labels or instructions" causes one or more faults related to the absence of a label. Within Android mobile development, different code elements characterize the use of a label for a GUI element. These code elements can be either XML attributes or Java methods. For instance, one way to satisfy the success criterion "Labels or instructions" is setting the XML attributes :hint and :labelFor, or using the Java methods setHint and setLabelFor.
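As an illustration, the sketch below shows how these two code elements can be set programmatically. The activity, layout, and view IDs (LoginActivity, R.layout.activity_login, R.id.username_label, R.id.username_field) are hypothetical and serve only to make the snippet self-contained.

```java
import android.os.Bundle;
import android.widget.EditText;
import android.widget.TextView;
import androidx.appcompat.app.AppCompatActivity;

public class LoginActivity extends AppCompatActivity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_login); // hypothetical layout

        TextView usernameLabel = findViewById(R.id.username_label);
        EditText usernameField = findViewById(R.id.username_field);

        // "Labels or instructions": screen readers such as TalkBack
        // announce the label when the associated field gains focus.
        usernameLabel.setLabelFor(R.id.username_field);

        // The hint is spoken by the screen reader while the field is empty.
        usernameField.setHint("Enter your e-mail address");
    }
}
```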
Such elements are key to the generation of mutants that capture the faults of our model. In this way, more than one mutation operator can be derived from the negation of a criterion, such as "Labels or instructions". Each mutation operator, in its turn, can be applied to more than one element in the code, generating distinct mutants.

To select the code elements and propose the mutation operators of our approach, we refer to the work of Silva et al. (2020). This work maps the WCAG principles and success criteria to code elements of the native Android API, and analyzes the prevalence of the mapped elements in 111 open source mobile apps. The study identifies code elements that impact accessibility, and shows that apps which adopt different types of code elements tend to have a smaller density of accessibility faults. This means that code elements associated with WCAG are related to accessibility faults and justify mutation operators based on these code elements.

3.2 Mutation Operators

The main objective in defining the accessibility mutation operators is to make sure that the test suite created by the tester exploits all, or at least most, of the app's GUI elements, as well as checks the correct use of the code elements related to the accessibility success criteria. In this way, the operators can be used to guide the generation of test cases or to assess the quality of existing ones. To this end, and following the work of Silva et al. (2020), we selected a set E of code elements, those most adopted in apps, to propose an initial set of operators. These operators are defined considering aspects of Android apps' accessibility and can be improved in the future by adding other code elements and success criteria. The selected code elements are presented in Table 2; they correspond to the most used ones in the apps for each principle (Silva et al., 2020). The table also shows the corresponding mutation operator.

The labelFor element is a label that accompanies a View object. It can be defined via the XML file or the Java language. In general, it provides a description and exploration labels for some screen elements. The hint element is a temporary label assigned to editable fields only. It is necessary for TalkBack, or any other screen reader, to correctly report what information the app needs. We can set or change a TextView font size with the element textSize; the recommended dimension type for text is "sp", for scaled pixels (e.g., 15sp). The element inputType specifies the input type for each text field so that the system displays the appropriate soft input method (e.g., an on-screen keyboard). By default, the app looks for the closest element to receive the next focus, but the next element is not always the most logical one. In these cases, we need to give the app a custom navigation order; we can define the next view to focus on using the code element nextFocusDownId. The element importantForAccessibility describes whether or not a view is important for accessibility. If the value is set to "yes", the view fires accessibility events and is reported to accessibility services (e.g., TalkBack) that query the screen.

The idea of the operators is to remove the corresponding code element e ∈ E when present. We opted for statement deletion operators, as previous studies gave evidence that such operators produce fewer yet effective mutants (Delamaro et al., 2014). For each code element removed, we have a unique generated mutant.
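To make the deletion idea concrete, the sketch below applies a "Missing hint" style deletion to the text of a layout file. This is a simplified, regex-based illustration, not the actual implementation (which builds on MDroid+ and its AST-based analysis); the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of a deletion-style accessibility operator: each
// occurrence of android:hint in a layout XML yields one mutant in which
// only that occurrence is removed.
public class MissingHintOperator {
    private static final Pattern HINT =
            Pattern.compile("\\s*android:hint=\"[^\"]*\"");

    public static List<String> mutate(String layoutXml) {
        List<String> mutants = new ArrayList<>();
        Matcher m = HINT.matcher(layoutXml);
        while (m.find()) {
            // Remove a single attribute occurrence; one mutant per deletion.
            mutants.add(layoutXml.substring(0, m.start())
                    + layoutXml.substring(m.end()));
        }
        return mutants;
    }
}
```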
Table 3 lists the operators and their acronyms; each operator removes the corresponding code element of Table 2, in either its XML attribute or Java method form. It is important to emphasize that if a mutation operator cannot be applied to the app source code, this may indicate that the project/developer team gives low priority to accessibility. Conversely, even when the developer has taken care to define the accessibility code elements in the app, it is very important to ensure that the test set includes a test that performs an action and interacts with the corresponding GUI element, checking that the elements are defined properly.

Table 2. Selected code elements and corresponding WCAG principles and success criteria.

Principle       | Success criteria        | XML Attribute              | Java Method                  | Mutation Operator
Perceivable     | Resize Text             | :textSize                  | setTextSize                  | Missing textSize
Perceivable     | Identify Input Purpose  | :inputType                 | setInputType                 | Missing inputType
Operable        | Keyboard; Focus Order   | :nextFocusDownId           | setNextFocusDownId           | Missing nextFocusDownId
Understandable  | Label or Instructions   | :labelFor                  | setLabelFor                  | Missing labelFor
Understandable  | Label or Instructions   | :hint                      | setHint                      | Missing hint
Robust          | Status Messages         | :importantForAccessibility | setImportantForAccessibility | Missing importantForAccessibility

Table 3. Mutation Operator Description

MTS - Missing textSize
MIT - Missing inputType
MNFD - Missing nextFocusDownId
MLF - Missing labelFor
MH - Missing hint
MIA - Missing importantForAccessibility

3.3 Mutation Process

The testing process for the application of the proposed operators is depicted in Figure 1. It encompasses three steps. The first one is the mutant generation using the accessibility mutation operators defined; this step produces a set M of mutant apps. In the second step, the original app and the mutants in M are executed with a test set T, which can be designed with the tester's preferred strategy. However, for the mutant analysis our process requires that T is implemented and executed by using an accessibility checker tool, such as the ones reported in Section 2.2. The third step, mutant analysis, allows calculating the mutation score by comparing the accessibility reports produced by an accessibility checker for the original and mutant apps. If the accessibility logs differ, that is, different accessibility faults are encountered, the mutant can be considered dead. The accessibility report generated by Espresso contains some temporal information that may cause a non-deterministic output. To correct this, we post-process the output so that only the essential information is taken into account, namely the code element ID and its reported accessibility issue.

Therefore, if the original app's accessibility log is the same as that of the mutant app, resulting in a live mutant, the test suite probably needs to be revised and improved. If the score is not satisfactory, the tester can add new test cases or modify existing ones in T so that more mutants are killed.

Figure 1. Testing process of the proposed approach: (1) mutant generation from the Android app, producing the set M of mutant apps; (2) execution of the test set T on the app and on M, producing accessibility logs of the visited screens; and (3) mutant analysis, yielding the mutation score; the tester may then decide to improve the score by modifying T.

3.4 Implementation

To evaluate and use our approach, we implemented a prototype tool named AccessibilityMDroid. It receives as input the source code of the Android app under test.
AccessibilityMDroid implements the proposed operators by extending MDroid+ (Moran et al., 2018), which is used for mutant generation (Step 1). To build and execute the tests, as well as to produce the accessibility log (Step 2), the Espresso framework is used. We chose tests implemented with Espresso because it is the default framework for GUI testing in Android Studio and includes embedded accessibility checking. As T is executed, the AccessibilityCheck class allows us to check for accessibility faults. At the end of the run, Espresso generates a log of the accessibility problems, used in Step 3. The tool compares the logs automatically, and a list of killed mutants is produced.

To illustrate our approach, we use a sample app built with Android Studio. A piece of code for this app is presented in Figure 2. With the application of operator MH (Missing hint), which removes the hint code element from the GUI element, Line 22 (in red) disappears in the mutant m.

Figure 2. A mutant generated by operator MH

@Test
public void loginTest() {
    var appCompatEditText = onView(allOf(
        withId(R.id.username),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 1),
        isDisplayed()));

    appCompatEditText.perform(replaceText("email"), closeSoftKeyboard());

    var appCompatEditText2 = onView(allOf(
        withId(R.id.password),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 2),
        isDisplayed()));

    appCompatEditText2.perform(replaceText("123456"), closeSoftKeyboard());

    var appCompatEditText3 = onView(allOf(
        withId(R.id.password), withText("123456"),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 2),
        isDisplayed()));

    appCompatEditText3.perform(pressImeActionButton());
}

Figure 3. Test case using Espresso

1 AppCompatEditText{id=2131230902,res-name=nickname}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
2 AppCompatEditText{id=2131230902,res-name=nickname}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
3 AppCompatEditText{id=2131230917,res-name=password}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
4 AppCompatEditText{id=2131230917,res-name=password}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
5 AppCompatEditText{id=2131230917,res-name=password}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).

Figure 4. Accessibility log for the original app

@Test
public void loginTest() {
+   onView(withId(R.id.nickname)).perform(typeText("nick"),
+       closeSoftKeyboard());
    var appCompatEditText = ...
}

Figure 5. Changed test

Suppose that for this app a test, as depicted in Figure 3, is available. When T is executed with Espresso on mutant m (Step 2), a log is generated. This log is compared to the log generated by executing T on the original app (Step 3).
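Espresso enables these checks through its AccessibilityChecks class (a single AccessibilityChecks.enable() call in the test setup). The comparison itself can be sketched as below; this is a simplified illustration of Step 3, not the actual AccessibilityMDroid code, and the regular expression is an assumption based on the log format shown in Figures 4 and 6.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of the mutant analysis (Step 3): normalize each accessibility
// log to (resource name, issue) pairs, discarding volatile details such
// as measured pixel sizes, and flag the mutant as killed if the
// normalized logs differ.
public class LogComparator {
    // Assumed log entry shape: "...res-name=<name>}: <issue>. <details>"
    private static final Pattern ENTRY =
            Pattern.compile("res-name=(\\w+)\\}: ([^.]+)");

    static Set<String> normalize(List<String> log) {
        return log.stream()
                .map(ENTRY::matcher)
                .filter(Matcher::find)
                .map(m -> m.group(1) + " -> " + m.group(2))
                .collect(Collectors.toSet());
    }

    static boolean isKilled(List<String> originalLog, List<String> mutantLog) {
        return !normalize(originalLog).equals(normalize(mutantLog));
    }
}
```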
From the difference between the two accessibility logs, it is possible to determine the mutant's death. In this case, T was not enough to show the difference between the original app and the mutant: as both produce the same log in Figure 4, the mutant is still alive. The tester now tries to improve T and realizes that the existing tests do not interact with one of the app's input fields.

1 + AppCompatEditText{id=2131230902,res-name=nickname}: View is missing speakable text needed for a screen reader
2 AppCompatEditText{id=2131230902,res-name=nickname}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
3 AppCompatEditText{id=2131230902,res-name=nickname}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
4 AppCompatEditText{id=2131230917,res-name=password}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
5 AppCompatEditText{id=2131230917,res-name=password}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).
6 AppCompatEditText{id=2131230917,res-name=password}: View falls below the minimum recommended size for touch targets. Minimum touch target size is 48x48dp. Actual size is 331.4x45.0dp (screen density is 2.6).

Figure 6. Accessibility log for the mutant m

After changes in T (illustrated in Figure 5), Step 2 is executed again, and the log for m is now the one in Figure 6; it differs from the original one in the first line. By employing a similar procedure to kill accessibility mutants, T achieves a higher mutation score, covers more GUI elements, and potentially reveals other accessibility faults.

4 Evaluation

The main goal of the proposed operators is to serve as a guide for the evaluation and improvement of test suites regarding accessibility faults. To evaluate these aspects properly, as well as our implementation using Espresso, we formulated three research questions as follows.

RQ1: How applicable are the accessibility mutation operators? This question investigates whether the proposed operators and process are applicable in practice. To answer it, we evaluate the approach's application cost by analysing the number of mutants generated by each operator, as well as the number of required test cases.

RQ2: How adequate are existing test suites with respect to accessibility mutation testing? This question evaluates the use of the proposed operators as an evaluation criterion. They are used for quality assessment of the test suites accompanying the selected open source apps with respect to accessibility. To this end, we analyse the ability of existing tests to kill the mutants generated by our approach.

RQ3: How much do the mutation operators contribute to revealing new accessibility faults? This question looks at the effectiveness of mutant-adequate test suites when revealing accessibility violations.
4.1 Study Setup

We sampled open source apps from F-Droid (https://www.f-droid.org), last updated in 2019/2020, containing Espresso test suites. We refer to the test suite accompanying each project as T. We removed apps that failed to build and whose tests were not compatible with the accessibility checking feature. The replication package is available at: https://osf.io/vfs2d/.

The seven apps are: AlarmClock, an alarm clock for Android smartphones and tablets that brings a pure alarm experience; AnyMemo, a spaced repetition flashcard learning software; Authorizer, a password manager for Android; Equate, a unit converting calculator; KolabNotes, a note taking app; Piwigo, a photo gallery app for the web; and PleesTracker, a sleep tracker.

For each app, we used AccessibilityMDroid to generate the mutants, run T, produce the accessibility logs for each mutant, and compare those with the original log. In this way, we obtained the set of mutants killed by T. After this, we manually inspected the alive mutants and realized that, many times, some of the test cases in T exercised the mutated code but produced no difference in the log due to some Espresso limitations (e.g., a limited set of accessibility criteria that is detected and printed in the accessibility log). In this case, we marked the corresponding mutant as covered. Other mutants were marked as "unreachable" since their mutations are related to widgets that are not reachable in the app (e.g., dead code). So, we counted the number of mutants generated, killed, covered, and unreachable by T.

Then, we extended T so that all mutants were killed or at least covered. We refer to this extended test suite as xT. The inclusion of a test case was conducted in the following way: (i) pick an alive mutant (not covered, not killed by T); (ii) manually record a test that exercises the mutation using Espresso Test Recorder in Android Studio and, if needed, refactor the test code to make it repeatable (the code generated by Espresso Test Recorder may be too specific and fail in re-runs); (iii) analyze if the mutant is killed by the new test; if not, mark it as covered. The mutant information was collected again for xT.

As cost indicators, we collected the number of test cases of a test suite, #TC(T), and its size, given by the number of lines of test code, LoC(T). As for effectiveness, we counted per test suite the number of accessibility faults reported by the Espresso accessibility check.

Table 4 shows the information on the seven selected apps. Authorizer is the app with the greatest value of LoC (28,286), while AnyMemo has the most activities (#Act.): 30. AlarmClock is the app with the smallest number of LoC (1,349), and Equate has only 2 activities. The table also shows the number of test cases (#TC) and LoC for the original set T and the extended one xT. Notice that AlarmClock has 41 tests and 1,068 lines of test code (LoC(T)). Kolabnotes has only one test, yet AnyMemo has the smallest LoC(T) (76). Concerning xT, AlarmClock and Authorizer require more tests (both 43) and more LoC(xT) (1,341 and 1,700, respectively). PleesTracker has the smallest number of test cases (5) and LoC(xT) (345). However, Authorizer required the most additional test cases, 32, while Piwigo required only one.

Table 4. Selected apps

App∗          | LoC    | #Act. | #TC(T) | LoC(T) | #TC(xT) | LoC(xT)
AlarmClock    | 1,349  | 5     | 41     | 1,068  | 43      | 1,341
AnyMemo       | 19,751 | 30    | 3      | 76     | 13      | 932
Authorizer    | 28,286 | 7     | 11     | 652    | 43      | 1,700
Equate        | 5,826  | 2     | 6      | 511    | 9       | 709
Kolabnotes    | 11,025 | 9     | 1      | 494    | 6       | 884
Piwigo        | 4,744  | 7     | 8      | 408    | 9       | 579
PleesTracker  | 1,868  | 5     | 2      | 89     | 5       | 345
∗ The app's name is a clickable link to the GitHub project.

4.2 Analysis of Results

Table 5 summarizes the main results of the evaluation and is used in this section to answer our RQs.
This table shows the number of mutants that were generated (columns G), killed by some test (columns K), covered but alive (columns C), and unreachable (columns U). Notice that results are shown for 4 out of the 6 operators described in Table 3; operators MLF and MNFD did not generate any mutant for the selected apps. For each app, two rows are presented: one with the results obtained by T and the other by xT. The last four columns list the totals for all operators, while the last rows bring the totals for all apps.

Table 5. Summary of the results per operator

Android App  |    | MTS           | MIT           | MH            | MIA           | Total
             |    | G   K  C   U  | G   K  C   U  | G   K  C   U  | G   K  C  U   | G    K   C    U
AlarmClock   | T  | 12  -  9   -  | 1   -  -   -  | 1   -  -   -  | -   -  -  -   | 14   -   9    -
             | xT |     -  12     |     -  1      |     1  -      |     -  -      |      1   13
AnyMemo      | T  | 64  -  14  11 | 22  -  -   -  | -   -  -   -  | -   -  -  -   | 86   -   14   11
             | xT |     1  52     |     -  18     |     -  -      |     -  -      |      1   70
Authorizer   | T  | 18  -  1   -  | 27  -  3   -  | 18  -  3   -  | 9   -  2  -   | 72   -   9    -
             | xT |     -  18     |     -  27     |     6  12     |     -  9      |      6   66
Equate       | T  | 3   -  -   -  | 2   -  -   -  | 2   1  -   1  | -   -  -  -   | 7    1   -    1
             | xT |     -  3      |     -  2      |     1  -      |     -  -      |      1   5
Kolabnotes   | T  | 23  -  8   -  | 13  -  3   -  | 12  -  -   -  | -   -  -  -   | 48   -   11   -
             | xT |     -  23     |     -  13     |     8  4      |     -  -      |      8   40
Piwigo       | T  | 1   -  -   -  | 3   -  3   -  | 1   1  -   -  | -   -  -  -   | 5    1   3    -
             | xT |     -  1      |     -  3      |     1  -      |     -  -      |      1   4
PleesTracker | T  | 24  -  8   -  | -   -  -   -  | -   -  -   -  | -   -  -  -   | 24   -   8    -
             | xT |     -  24     |     -  -      |     -  -      |     -  -      |      -   24
Total        | T  | 145 -  40  11 | 68  -  9   -  | 34  2  3   1  | 9   -  2  -   | 256  2   54   12
             | xT |     1  133    |     -  64     |     17 16     |     -  9      |      18  222

Number of mutants Generated (G), Killed (K), Covered but alive (C), and Unreachable (U) by the original test suite T and the extended one xT; in the xT rows, only K and C are reported. The mutation operators are: Missing textSize (MTS), Missing inputType (MIT), Missing hint (MH), and Missing importantForAccessibility (MIA).

For instance, for the app AnyMemo, the operator MTS generated 64 mutants, 11 of them unreachable. The test set T was not capable of killing any mutant but covered 14. The set xT covered 52, that is, 38 additional mutants could be covered. Considering all operators, only one mutant could be killed by xT, and 70 mutants were covered out of 84 generated mutants. For this app, four mutants change a screen that is reached only when integrated with a third-party app. As exercising these mutants would require other tools beyond Espresso, we were not able to cover them. However, they cannot be classified as unreachable. Because of this, the sum of killed, covered but alive, and unreachable mutants is not equal to the number of generated mutants for this app, as it is for all of the other ones.

RQ1 – Approach applicability. To answer RQ1, we evaluate the number of mutants generated by each operator. We observe in Table 5 that operator MTS generated the most mutants (145 in total), followed by MIT (68), MH (34), and MIA (9). MTS generated mutants for all apps, MIT for 6, and MH for 5 apps. Operator MIA generated mutants only for Authorizer. In total, 256 mutants were generated, with AnyMemo having the most mutants (86) and Piwigo the fewest (5). This means that the selected apps contain more code elements associated with the principle Perceivable (operators MTS and MIT), which may indicate that: (i) developers are more worried about content descriptions for non-text elements than about the principle Robust (operator MIA generated mutants for only one app) or Operable (operator MNFD did not generate any mutant); (ii) User Experience (UX) and User Interface (UI) documents include a more significant amount of code elements of the Perceivable principle in their guidelines.

Operators MIT and MIA generated mutants that were not killed; only one mutant of MTS was killed, and 17 out of the 34 mutants generated by MH were killed. The process using Espresso was capable of distinguishing the great majority of the mutants generated by removing the code element :hint. Analysing the alive mutants, we identified 222 as covered and 12 as unreachable. Unreachable mutants were generated mainly for AnyMemo and are related to implementation smells like dead code.

For a deeper analysis, Table 6 contains the number of mutants generated by each operator divided by the KLoC of each app. The last two columns present information regarding the effort required to add new test cases so that an accessibility mutant-adequate test suite is obtained. The last row brings the average values. We can see that the operators generate a mean of 5.42 mutants per KLoC and, in the worst case, 12.8 for PleesTracker.

Table 6. Efforts to build xT

              | #Mutants / app KLoC                  |       |
App           | MTS  | MIT  | MH  | MIA   | Total    | A-TC  | A-LoC
AlarmClock    | 8.9  | 0.7  | 0.7 | 0.0   | 10.37    | 2     | 273
AnyMemo       | 3.2  | 1.1  | 0.0 | 0.0   | 4.35     | 10    | 856
Authorizer    | 0.6  | 0.9  | 0.6 | 0.3   | 2.58     | 32    | 1048
Equate        | 0.5  | 0.3  | 0.3 | 0.0   | 1.20     | 3     | 198
Kolabnotes    | 2.0  | 1.1  | 1.0 | 0.0   | 4.35     | 5     | 390
Piwigo        | 0.2  | 0.6  | 0.2 | 0.0   | 1.05     | 1     | 171
PleesTracker  | 12.8 | 0.0  | 0.0 | 0.0   | 12.8     | 3     | 256
Average       | 4.0  | 0.67 | 0.4 | 0.043 | 5.42     | 8     | 456
A-TC stands for the number of test cases added to T to obtain xT. A-LoC stands for the number of LoC added to T to obtain xT.

Notice that a greater number of mutants is generated for the largest apps in terms of LoC and number of activities: AnyMemo, Authorizer, and Kolabnotes. Given that the proposed operators only remove code elements, the number of mutants tends to be equal to the number of existing elements associated with the accessibility WCAG success criteria. Due to this characteristic, it is unlikely that the operators generate equivalent mutants. This is an advantage, because the identification of such mutants is usually costly. Moreover, we have not found either stillborn or trivial mutants: the first are mutants that do not compile, and the second are mutants that crash at initialization.

We also measured the effort of adding new test cases, considering the values in Table 4. As Table 6 shows, Authorizer demanded the most effort, requiring 32 additional tests (with 1,048 A-LoC), followed by AnyMemo, which required 10 additional tests (with 856 A-LoC), and Kolabnotes, with 5 tests (390 A-LoC). These apps are the greatest in terms of size.

Response to RQ1: The number of mutants is related to the size of the app, mainly to the number of GUI elements and code elements associated with the accessibility success criteria. Operators MTS and MIT, related to the principle Perceivable, produce more mutants, while no mutant is generated by operator MNFD, related to the Operable principle. Moreover, we did not observe any stillborn, trivial, or equivalent mutants.
Implications: The operators are deletion-style and depend on the use of accessibility-related code elements. The number of generated mutants grows proportionally to the number of accessibility code elements used in the app. Operators MTS and MIT generated more mutants, which may indicate that code elements related to the principle Perceivable are the most used in the selected apps. Our set of operators represents a first proposal; we intend to improve it with other kinds of operators, for instance, operators that add or modify code elements, and other code elements and success criteria could also be considered.

The proposed operators do not generate equivalent mutants due to their conception characteristics. We did not observe any stillborn or trivial mutant. This is important because such mutants imply additional cost, and they are very common in Android mutation testing (Linares-Vásquez et al., 2017).

We observed Espresso's limited ability to detect accessibility faults and, as a consequence, a reduced number of mutants was killed. Because of this, other accessibility testing tools should be used in future versions of AccessibilityMDroid. We also intend to implement mechanisms to automatically determine covered mutants. Mutant analysis is a drawback of most mutation testing approaches for Android apps: the great majority do not offer an automatic way to perform this task; they do not even provide a way to consider a mutant killed.

RQ2 – Adequacy of existing test suites. RQ2 evaluates the adequacy of the test suites concerning the proposed operators. The answer can shed some light on the quality of the test cases regarding accessibility faults and on whether developers worry about testing such a non-functional property. To answer this question, Table 7 brings the percentage of mutants killed and covered by T, per app. Unreachable mutants were not considered. On average, the original sets were capable of killing only 5.23% of the mutants. The killed percentage reaches 20% for Piwigo, the app with the fewest mutants, but is equal to zero for five apps. The percentages of covered mutants are better, 30.24% on average. The best percentages were achieved by AlarmClock (64.3%) and Piwigo (60%); the other five apps achieved percentages lower than 35%.

Table 7. Adequacy results of original test suites

App           | Killed  | Covered
AlarmClock    | 0.0%    | 64.3%
AnyMemo       | 0.0%    | 18.67%
Authorizer    | 0.0%    | 12.5%
Equate        | 16.67%  | 0%
Kolabnotes    | 0.0%    | 22.91%
Piwigo        | 20%     | 60%
PleesTracker  | 0.0%    | 33.33%
Average       | 5.23%   | 30.24%

Response to RQ2: The existing test suites of the studied apps killed or covered only a small fraction of the accessibility-related mutants. In other words, they had a low mutation score.

Implications: In general, there are opportunities to improve the quality of GUI tests in mobile apps. While code coverage and mutation testing have better support at the unit test level, more tool support is required at the GUI level. As the accessibility mutants demand better test coverage at the GUI level, the results herein presented helped to expose those weaknesses.

RQ3 – Accessibility faults. By answering RQ2, we observed that the existing tests obtained a small coverage of accessibility mutants, and new tests are required to obtain adequate test suites. However, it is important to know if such additional tests and effort improve the test quality in terms of accessibility faults revealed. RQ3 aims to answer this question.
Table 8 shows the number of accessibility faults pointed out by Espresso when the original (T) and extended (xT) test sets are used; the last column shows the percentage of improvement. For T, AlarmClock has the most accessibility faults (126), while PleesTracker has only 2 faults. On average, we have 45.28 accessibility faults per app. Concerning the mutant-adequate test suites xT, Piwigo has the most faults (447), and PleesTracker presented the best percentage of improvement (3,650%), while the smallest percentage of improvement was obtained for AlarmClock. On average, xT revealed 186.4 accessibility faults. The improvements varied from 3.2% to 3,650%.

Table 8. Accessibility faults detected by T and xT

App           | #faults(T) | #faults(xT) | Improv.
AlarmClock    | 126        | 130         | 3.2%
AnyMemo       | 24         | 355         | 1,479%
Authorizer    | 65         | 201         | 209.2%
Equate        | 19         | 27          | 42.1%
Kolabnotes    | 43         | 70          | 62.8%
Piwigo        | 38         | 447         | 1,076.3%
PleesTracker  | 2          | 75          | 3,650%
Average       | 45.28      | 186.4       | 931.8%

Response to RQ3: Mutant-adequate test suites contribute to meaningful improvements in the number of accessibility faults detected. On average, the extended test suites improved the number of accessibility faults revealed by around 932% with respect to the original test suites.

Implications: The results gave evidence that the use of the mutation operators contributed to an increase in the number of revealed accessibility faults. We anticipate that the quality of the test suite is improved too, beyond the accessibility point of view.

5 Threats to Validity

There are some threats to the validity of our study.

Sample selection. It is not easy to guarantee the representativeness of the apps. In addition, the adopted sample has only Android native apps with Espresso test suites. To mitigate this, we selected the apps from F-Droid, a diverse set of open-source apps with recent updates. F-Droid has been used in other studies (Mao et al., 2016; Zeng et al., 2016; Gu et al., 2019).

Limited oracle. The mutant analysis strategy is linked to the Espresso tool. However, the proposed approach is also compatible with other tools that monitor the running app and produce accessibility logs, like MATE (Eler et al., 2018) and A11y Ally (Toff, 2018); we plan to integrate them in the future.

Manual determination of covered elements. This task was performed manually and is subject to errors. To minimize this threat, the analysis was carefully conducted and double-checked.

Flaws in the implementation. There may be implementation errors in any of the tools or routines used in our study, like the MDroid+ extension, the Android emulator management, and Espresso.

The number of mutation operators. The set of accessibility mutation operators proposed represents only a fraction of all accessibility violations that can occur in a mobile app. We created this initial set of deletion operators to validate the proposed tool; it was tested and shown to be effective in practice.

6 Concluding Remarks

This paper presented an approach for accessibility mutation testing of Android apps. First, we defined a set of six accessibility mutation operators for Android apps. Then, for an Android app, we generated the mutants. Based on the original test suite, we checked which mutants are killed or at least covered. Following our approach, we extended the original test suite to cover more mutants. The empirical results show that the original test suites cover only a small part of the accessibility-related mutants.
Besides, mutant-adequate test suites contribute to meaningful improvements in the number of accessibility faults detected.

As future work, we plan to extend the tool support to handle APK files and commercial apps (closed source). The mutation operators may also be described more generically so that the approach can be extended to include other mobile development languages and frameworks (e.g., Swift, React Native, Kotlin). Another direction is to experiment with different oracles (e.g., MATE (Eler et al., 2018)), besides the accessibility check of Espresso we used in this study. Finally, different accessibility mutation operators can be defined, now focused on including and changing code elements.

Acknowledgment

This work is partially supported by CNPq (Andre T. Endo grant nr. 420363/2018-1 and Silvia Regina Vergilio grant nr. 305968/2018-1).

References

Abuaddous, H. Y., Jali, M. Z., and Basir, N. (2016). Web accessibility challenges. International Journal of Advanced Computer Science and Applications (IJACSA).

Acosta-Vargas, P., Salvador-Ullauri, L., Jadán-Guerrero, J., Guevara, C., Sanchez-Gordon, S., Calle-Jimenez, T., Lara-Alvarez, P., Medina, A., and Nunes, I. L. (2020). Accessibility assessment in mobile applications for android. In Nunes, I. L., editor, Advances in Human Factors and Systems Interaction, pages 279–288, Cham. Springer International Publishing.

Acosta-Vargas, P., Salvador-Ullauri, L., Perez Medina, J. L., Zalakeviciute, R., and Perdomo, W. (2019). Heuristic method of evaluating accessibility of mobile in selected applications for air quality monitoring. In International Conference on Applied Human Factors and Ergonomics, pages 485–495. Springer.

Alshayban, A., Ahmed, I., and Malek, S. (2020). Accessibility issues in android apps: State of affairs, sentiments, and ways forward. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE '20, page 1323–1334, New York, NY, USA. Association for Computing Machinery.

Ballantyne, M., Jha, A., Jacobsen, A., Hawker, J. S., and El-Glaly, Y. N. (2018). Study of Accessibility Guidelines of Mobile Applications. In Proceedings of the 17th International Conference on Mobile and Ubiquitous Multimedia, pages 305–315. ACM.

BBC (2017). The BBC Standards and Guidelines for Mobile Accessibility. https://www.bbc.co.uk/accessibility/forproducts/guides/mobile.

Brazilian Government (2007). Accessibility Model in Electronic Government. https://www.gov.br/governodigital/pt-br/acessibilidade-digital/modelo-de-acessibilidade.

Cisco (2017). Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2017–2022 White Paper - Cisco. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-738429.html.

Damaceno, R. J. P., Braga, J. C., and Mena-Chalco, J. P. (2018). Mobile device accessibility for the visually impaired: problems mapping and recommendations. Universal Access in the Information Society, 17(2):421–435.

Delamaro, M. E., Offutt, J., and Ammann, P. (2014). Designing deletion mutation operators. In 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation, pages 11–20.

Deng, L., Mirzaei, N., Ammann, P., and Offutt, J. (2015). Towards mutation analysis of Android apps. In Proceedings of the Eighth International Conference on Software Testing, Verification and Validation Workshops, ICSTW, pages 1–10. IEEE.

Eler, M. M., Rojas, J. M., Ge, Y., and Fraser, G. (2018).
Automated Accessibility Testing of Mobile Apps. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pages 116–126.

Escobar-Velásquez, C., Osorio-Riaño, M., and Linares-Vásquez, M. (2019). MutAPK: Source-Codeless Mutant Generation for Android Apps. In 2019 IEEE/ACM International Conference on Automated Software Engineering (ASE).

Gamma, E. and Beck, K. (2019). The new major version of the programmer-friendly testing framework for Java. https://junit.org.

Google (2018). Espresso. https://developer.android.com/training/testing/espresso.

Google (2018). Improve your code with lint checks. https://developer.android.com/studio/write/lint.

Google (2020). Accessibility Scanner. https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.auditor&hl=en_U.

Grechanik, M., Xie, Q., and Fu, C. (2009). Creating gui testing tools using accessibility technologies. In 2009 International Conference on Software Testing, Verification, and Validation Workshops, pages 243–250.

Gu, T., Sun, C., Ma, X., Cao, C., Xu, C., Yao, Y., Zhang, Q., Lu, J., and Su, Z. (2019). Practical GUI Testing of Android Applications via Model Abstraction and Refinement. In Proceedings of the 41st International Conference on Software Engineering, ICSE '19, page 269–280. IEEE Press.

Hartley, S. D. (2011). World Report on Disability (WHO). Technical report, WHO and World Bank.

Jabbarvand, R. and Malek, S. (2017). µDroid: an energy-aware mutation testing framework for Android. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE, pages 208–219. ACM.

Jia, Y. and Harman, M. (2011). An analysis and survey of the development of mutation testing. IEEE Trans. Software Eng., 37(5):649–678.

Kirkpatrick, A., Connor, J. O., Campbell, A., and Cooper, M. (2018). Web Content Accessibility Guidelines (WCAG) 2.1. https://www.w3.org/TR/WCAG21/.

Linares-Vásquez, M., Bavota, G., Tufano, M., Moran, K., Di Penta, M., Vendome, C., Bernal-Cárdenas, C., and Poshyvanyk, D. (2017). Enabling Mutation Testing for Android Apps.
In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE, pages 233–244, New York, NY, USA. ACM.

Lisper, B., Lindstrom, B., Potena, P., Saadatmand, M., and Bohlin, M. (2017). Targeted mutation: Efficient mutation analysis for testing non-functional properties. In Proceedings - 10th IEEE International Conference on Software Testing, Verification and Validation Workshops, (ICSTW), pages 65–68.

Luna, E. and El Ariss, O. (2018). Edroid: A Mutation Tool for Android Apps. In Proceedings of the 6th International Conference in Software Engineering Research and Innovation, CONISOFT, pages 99–108. IEEE.

Mao, K., Harman, M., and Jia, Y. (2016). Sapienz: Multi-objective automated testing for android applications. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, page 94–105, New York, NY, USA. Association for Computing Machinery.

Moher, D., Liberati, A., Tetzlaff, J., and Altman, D. G. (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. BMJ, 339.

Moran, K., Tufano, M., Bernal-Cárdenas, C., Linares-Vásquez, M., Bavota, G., Vendome, C., Di Penta, M., and Poshyvanyk, D. (2018). MDroid+: A mutation testing framework for android. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, pages 33–36. ACM.

Reda, R. (2019). RobotiumTech: Android UI Testing. https://github.com/RobotiumTech/robotium.

SejaSidier (2015). Guide to the Development of Accessible Mobile Applications. http://www.sidi.org.br/guiadeacessibilidade/index.html.

Silva, C., Eler, M. M., and Fraser, G. (2018). A survey on the tool support for the automatic evaluation of mobile accessibility. In Proceedings of the 8th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-Exclusion, DSAI 2018, page 286–293. ACM.

Silva, H. N., Endo, A. T., Eler, M. M., Vergilio, S. R., and Durelli, V. H. R. (2020). On the Relation between Code Elements and Accessibility Issues in Android Apps. In Proceedings of the V Brazilian Symposium on Systematic and Automated Software Testing, SAST.

Silva, H. N., Prado Lima, J. A., Endo, A. T., and Vergilio, S. R. (2021). A mapping study on mutation testing for mobile applications. Software Testing, Verification and Reliability.

Su, T., Meng, G., Chen, Y., Wu, K., Yang, W., Yao, Y., Pu, G., Liu, Y., and Su, Z. (2017). Guided, stochastic model-based GUI testing of android apps. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE, Paderborn, Germany, September 4-8, pages 245–256.

Toff, D. (2018). A11y Ally. https://github.com/quittle/a11y-ally.

Vendome, C., Solano, D., Liñán, S., and Linares-Vásquez, M. (2019). Can Everyone use my app? An Empirical Study on Accessibility in Android Apps. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 41–52.

W3C (2019). W3C Accessibility Standards Overview. https://www.w3.org/WAI/standards-guidelines/.

Wei, Y. (2015). MuDroid: Mutation Testing for Android Apps. Technical report, UCL-UK. Undergraduate Final Year Individual Project.

Wille, K., Dumke, R. R., and Wille, C. (2016). Measuring the accessability based on web content accessibility guidelines. In 2016 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA), pages 164–169.

Yan, S.
and Ramachandran, P. G. (2019). The current status of accessibility in mobile apps. ACM Transactions on Accessible Computing, 12.

Zeng, X., Li, D., Zheng, W., Xia, F., Deng, Y., Lam, W., Yang, W., and Xie, T. (2016). Automated test input generation for android: Are we really there yet in an industrial case? In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, page 987–992.