IIUM Engineering Journal, Vol. 2, No. 1, 2001  S. A. Hameed and A. A. Abbasi  45

DEVELOPING AN INTELLIGENT GENERATOR FOR SEMI-ACTUAL TEST DATA

S. A. Hameed, A. M. A. Al-Abbasi
Centre of Measurement and Evaluation, University of Bahrain, Bahrain
shihab@iiu.edu.my, amajid@iiu.edu.my

Abstract: Generating realistic test data is one of the most difficult and expensive parts of applying software-testing techniques. Many current test data generators reduce the user's confidence in the generated test data and in the testing process. They do so because they focus on the viewpoints of the developer and the database administrator regardless of the user's concerns, and on data type and structure regardless of meaning. This paper proposes a model of an intelligent generator for semi-actual test data, with the aim of increasing users' confidence in software testing. The model uses samples of real data as resource data, together with a set of efficient generation techniques based on statistical methods such as permutations, combinations, sampling, and statistical distributions. The suitable structure and generation technique are selected by one of the intelligent soft computing techniques, such as fuzzy logic, neural networks, heuristics, or genetic algorithms. The generated test data is validated against the data specifications and then checked with one of the normality testing techniques, so that it stays close to the real world or environment of the testing process. The model therefore offers the ability to simulate real environments.

Keywords: Software Testing, Test Data Generation, Semi-Actual Data, Intelligent Generator, Simulation.

1. INTRODUCTION

During the 1990s, the primary challenge and goal of software engineering was to produce quality software and to reduce the cost of computer-based solutions implemented in software [1, 2]. Software testing is one of the essential tools for improving software quality.
It is also one of the more complicated problems in the software development life cycle: it is expensive (40-50% of the total software development cost) and labor intensive [3, 4]. Software is now applied in critical situations to control valuable machinery, handle money, and safeguard human lives. Failure in such situations can be disastrous, so efficient software testing is needed to reduce the risk [5, 6]. Software testing requires set(s) of test data, and automating and improving test data generation reduces the cost of software development and testing. Unfortunately, automatic test data generation still faces many problems, which can be summarized as follows:

- Shortage and inefficiency of some test data generation techniques or tools.
- Duplicated or conflicting descriptions of the same or similar data items used in applications that have similar goals.
- A high ratio of meaningless items in the generated data, which may not reflect the specifications, culture, and environment of the population under test.
- An inefficient set of generation techniques based mainly on a random number generator (RND) or similar functions.
- A focus on the developer or database administrator viewpoints regardless of user concerns.
- Minimal participation of the user in the test data generation process.
- The user's lack of confidence in the generated test data, in the testing process, and consequently in the application under test.

To overcome these problems, this paper proposes a model of an intelligent generator for semi-actual test data. It is an improvement on previous work [7]. The proposed model uses intelligent soft computing approaches, such as fuzzy logic, neural networks, heuristics, or genetic algorithms [8], which provide approximate solutions to selected problems. These approaches are suited to selecting the data item structures and the generation technique for the proposed model.
The generated test data is then checked with the normality test(s) to confirm that it satisfies the required specifications.

2. GOAL AND OBJECTIVES

The main goal is to generate suitable set(s) of semi-actual test data that support the software testing process and improve user confidence in testing software applications. The generator is intended to:

- Generate different volumes, types, and structures of test data.
- Generate different sets of test data that contain a high ratio of meaningful, semi-actual items reflecting the specifications or environment of the population under test.
- Offer a unified description of meaningful data items to eliminate duplicated or conflicting descriptions of data items.
- Provide a set of efficient and powerful generation techniques based mainly on several statistical methods. These techniques give the user the flexibility to generate different types and structures of meaningful data.
- Offer a set of normality testing techniques to ensure the quality of the generated data.
- Use a suitable intelligent soft computing approach, such as fuzzy logic, neural networks, or genetic algorithms, to select the suitable data item structure and data generation technique.
- Increase the user's confidence in the test data, the testing, and the application under test by allowing the user more participation than recent test data generation tools do, through the selection of the required list and its specifications, the resource data, and the generation technique.
- Contribute to the software industry by supporting the development of reliable software products.

3. TEST DATA GENERATOR MODULES

The intelligent generator for semi-actual test data consists of several modules, which are described below.

3.1 Setup Specifications

Setting up the specifications is a very important preparation step before generation. The main activities of this step are shown in Fig. 1 and can be summarized as:

- Specifying the required list to be generated from the meaningful data items (MDI) sub-library.
- Specifying, or selecting, the suitable list structure from the MDI structures sub-library using one of the intelligent selection techniques.
- Determining the list and field specifications, or selecting them from the default specifications sub-library.
- Specifying the resource data, or selecting it from the default values sub-library.
- Specifying, or selecting, the suitable generation technique(s) using one of the intelligent selection techniques.
- Specifying the output file and device.

This module is the interface between the user and the meaningful data generation model, through which the required list and its related specifications and restrictions are set up. It gives different users flexibility and participation in selecting or inserting the requirements for generating list(s) of meaningful data.

3.2 Data Descriptions

This module offers a unified description of the meaningful data items used by a set of applications that have the same or a similar goal. The main steps for describing meaningful data and constructing the library can be summarized as:

- Preparing a unified list: collect all data items and their related structures, from the set of applications that have the same or a similar goal, into unified list(s).
- Sorting the unified list: sort the data items it contains.
- Eliminating duplicated items: select one data item from each set of duplicated items, delete the others, and link all structures of the deleted items to the selected one.
- Eliminating similar items: select one item from each set of items with similar meaning, delete the others, and link all structures of the deleted items to the selected one. The resulting list of data items is called a pure list.
- Creating a meaningful data (MD) library: store all data items of the pure list in an MD items sub-library and their structures in an MD item structures sub-library.
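The library construction steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `build_md_library`, the representation of a structure as a tuple of field names, the `synonyms` map used to recognize similar-meaning items, and the sample applications are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def build_md_library(applications, synonyms=None):
    """Build the two sub-libraries from per-application data items.

    applications: dict mapping an application name to a list of
        (item_name, structure) pairs (hypothetical representation).
    synonyms: dict mapping a similar-meaning item name to the item
        name that should survive (an assumption; the paper does not
        say how similar items are recognized).
    """
    synonyms = synonyms or {}

    # Step 1: collect every item and its structure into one unified list.
    unified = [pair for items in applications.values() for pair in items]

    # Step 2: sort the unified list by item name (case-insensitive).
    unified.sort(key=lambda pair: pair[0].lower())

    # Steps 3-4: keep one item per duplicate or similar-meaning group
    # and link the structures of the removed items to the survivor.
    structures = defaultdict(list)
    for name, structure in unified:
        key = name.strip().lower()
        survivor = synonyms.get(key, key)
        if structure not in structures[survivor]:
            structures[survivor].append(structure)

    # Step 5: the pure list becomes the MDI sub-library and the linked
    # structures become the MDI structures sub-library.
    return sorted(structures), dict(structures)


# Hypothetical input: two applications describing the same data items.
apps = {
    "payroll": [("Name", ("first", "last")), ("Salary", ("amount",))],
    "hr": [("name", ("first", "middle", "last")),
           ("Full Name", ("title", "first", "last"))],
}
items, structs = build_md_library(apps, synonyms={"full name": "name"})
print(items)
print(structs["name"])
```

In this sketch the three descriptions of a person's name collapse into one item that keeps all three structures, which is the effect the duplicate and similar-item elimination steps aim for.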
The result of this step is a meaningful data library, which contains:

- Meaningful data items (MDI) sub-library: contains the set of possible meaningful data items that could be generated by this model.
- MDI structures sub-library: contains all possible structures for each element in the MDI sub-library.

The user can modify the library contents according to the application goal and environment. The main advantage of this step is that it eliminates duplicated or conflicting descriptions of meaningful data items used by similar applications. It is a step towards standardizing the data descriptions used in such applications.

3.3 Default Specifications and Values

This module represents the second part of the meaningful data library. It consists of two optional components:

- MDI default-specifications sub-library (optional): contains the default specifications of each element in the MDI sub-library.
- MDI default samples sub-library (optional): contains the default values or samples for each simple type of the elements in the MDI sub-library.

The defaults help non-professional users to select the specifications and the values for the required data item.

3.4 Sample of Resource Data

The data generation process requires a set of resource data. This could be a sample of real data taken from the actual environment, set(s) of assumed data prepared by experts or professionals who work in the environment, default pre-saved data, pre-generated data, or sets of alphabets or boundary values. Resource data is an important factor in this model and affects the quality of the generated data. The model focuses on using a sample of real or assumed data as the main resource, so that the generated data reflects the specifications or culture of the population. The other data resources are used as supporting resources.
The resource data should be validated against the data specifications before it is used by the generation engine. Using a sample of real or assumed data as resource data increases:

- The ratio of meaningful (semi-actual) data, which reflects the specification, environment, and culture of the population under test.
- The user's participation in software development and testing. This supports the current trend of giving the user more participation and eliminates the bias of the developer or database administrator in selecting the test data.
- The user's confidence in the test data, the testing process, and the testing results.

3.5 Generation Techniques

This module uses a set of efficient generation techniques based mainly on several statistical methods. It offers the ability to generate lists of meaningful data of different structures and volumes. The techniques include permutation with replacement, permutation without replacement, permutation with partial replacement, and permutation from multiple groups. These permutation techniques produce $N^K$, $N!/(N-K)!$, $(N-1)^K + (n-1)^{K+1}$, and $N_1 \times N_2 \times N_3 \times \dots$ permutations respectively [9, 10]. Besides the permutation techniques, there are several statistical distributions, including the discrete binomial and multinomial distributions and the continuous normal and Gamma distributions, and random, systematic, or sequential sampling techniques [11, 12]. The generation procedures are supported by hashing, sorting, and searching techniques [13, 14]. This set of generation techniques gives users the flexibility to generate a suitable volume of data. The generated data can be numeric, character, or Boolean and can represent simple, compound, or composite structures. The suitable generation technique(s) are selected by the intelligent selection techniques.
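As a small illustration of how three of the permutation techniques and two of the sampling techniques behave, the sketch below uses Python's standard `itertools` and `random` modules. The sample names, the helper `systematic_sample`, and the fixed seed are assumptions made for the example; permutation with partial replacement is left out because the text does not define it precisely enough to implement.

```python
import itertools
import math
import random

# Hypothetical resource data: small samples of real-looking values.
first_names = ["Ali", "Sara", "Omar"]   # N = 3
last_names = ["Hassan", "Yusuf"]        # second group, N2 = 2
K = 2

# Permutation with replacement: every ordered K-tuple, N**K results.
with_replacement = list(itertools.product(first_names, repeat=K))

# Permutation without replacement: N! / (N - K)! results.
without_replacement = list(itertools.permutations(first_names, K))

# Permutation from multiple groups: one value per group, N1 * N2 results.
multiple_groups = list(itertools.product(first_names, last_names))


def systematic_sample(population, step, start=0):
    """Systematic sampling: take every `step`-th element from `start`."""
    return population[start::step]


# Random sampling without repetition; seeded so runs are repeatable.
rng = random.Random(0)
random_sample = rng.sample(multiple_groups, 3)

print(len(with_replacement), len(without_replacement), len(multiple_groups))
print(systematic_sample(multiple_groups, 2))
print(random_sample)
```

The lengths of the three generated lists match the counts given in the text for the corresponding techniques, which is a quick way to sanity-check a generation routine before it is used on larger resource data.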
The generation techniques are built as functions or routines, stored in a special library, and ready to be called by the generation engine.

3.6 Generation Engine

The generation engine is the main processing part of the MD generation model. It generates the required list(s) of data based on the list specifications and restrictions that the user entered in the setup step. This module uses information from the MD library, the resource data, the generation techniques, and the output requirements to generate the required data. Two main strategies are used for data generation. The first is the one-phase strategy, which can generate any list provided that all fields in the selected structure are of simple type; it applies sampling or permutation techniques directly and generates the data in one phase. The second is the multiple-phase strategy, which is suitable for composite lists and generates their components in several phases. The result of each phase is used as resource data for the next phases, and generation continues until the required list is produced. The main steps of the meaningful test data generation phases are shown in Fig. 2 and can be summarized as:

i. Preparation phase:
- Select the required list to be generated.
- Select the suitable structure for the selected list, using one of the intelligent selection approaches.
- Set up the list specifications.
- Set up the field specifications for each field in the structure.
- Specify or insert the sample data for the field(s).
- Validate the sample data against the field specifications.
- Store all specifications and sample data in temporary files to be used by the next phase.

ii. Generation phase:
- Read the preparation files.
- Call the main generation technique, chosen by one of the intelligent selection approaches, to generate a list of raw data.
- Filter the raw list according to the list specification.
- Test the generated data using the normality testing techniques.
- Customize the generated list using the customizing techniques to obtain the exact volume of data.
- Store the generated list in the specified output file on the specified output device.

3.7 Validation and Statistical Test

This module performs two kinds of checks. The first is validation of the generated set(s) of data against the list and field specifications, which is a descriptive test of the generated data. The second is statistical testing, such as normality testing and the correlation coefficient. A common statistical technique judges whether an assumed model adequately describes the observed data; that is, it tests the distributional assumptions built into the model. The power of any statistical test depends strongly on the amount of information available. It is also influenced by how the data items are used, that is, by the nature of the test in terms of its criterion and critical region. The tests for evaluating the assumed normality are [15]:

- The W-test for normality.
- An approximate analysis of variance test for normality.
- The probability plot correlation coefficient test for normality.
- The D-test for normality.

The multiple correlation coefficient (R) measures the linear association between the dependent variable Y and the independent variables x1, x2, ..., xk.

3.8 Output

This module stores the generated data in the specified form, in the required output file and on the required device, according to the specifications inserted by the user. The output device could be one of the I/O devices such as a hard disk, floppy disk, CD, screen, or printer. The purpose of this module is to store the result in the required file and device for later use.

Fig. 1: Setup specifications for Data Item
[Figure: setup options for a data item. List level: insert or select the list name (from the MD items sub-library), list type, and list structure (from the MDI structures sub-library, default); size of the generated list (quantity) with a +/- margin; whether the list has a primary key; main and supporting generation techniques (sequential, systematic, or random sampling, and statistical distributions); relations with other lists (not related, relational, multiple related); output file and output device. Field level: field name, field type (integer, real, character, Boolean), field format, field length (character type only), lower and upper limits (numeric type only), key type (not-key, primary key, composite), constant or variable, relations and constraints with other fields (>, >=, <, <=, =, <>), resource of sample data (default sample or inserted by the user), and sample size.]

Fig. 2: Flow of semi-actual test data generator
[Figure: flow from selecting the required list and its structure, through setting up list and field specifications, selecting or inserting sample data, validating sample data, and storing the preparation phase results in preparation files, to calling the main generation technique, filtering the raw generated list, calling a secondary generation technique to customize the raw list, and outputting the generated results. Supporting components: MDI sub-library, MDI structures sub-library, MDI default specifications sub-library, MDI default values sub-library, library of generation techniques (routines), intelligent selection (fuzzy logic, neural network, heuristic, genetic algorithm), validation, and normality test and correlation coefficient.]

Fig. 3: Generation of Different Levels of Test Data

4. CONSTRUCTING DIFFERENT LEVELS OF DATA

To identify meaningful data, the different levels that represent it need to be described.
These levels can be classified mainly into the single list, the multiple-dimension list, and integrated lists.

4.1 Single List Level

A single list is constructed from a set of records that have the same type and structure. The list takes the type of its base record; lists are therefore classified into simple, similar-compound, not-similar-compound, composite, and relational types. The different record structures are briefly described below:

- A simple record consists of one field of integer, real, character, or Boolean type.
- A similar-compound record consists of the same simple field duplicated many times.
- A not-similar-compound record consists of many simple fields of different types.
- A composite record is an aggregation of many simple, compound, or composite fields.
- A relational record is a compound or composite record with relationship(s) between some of its fields.

The construction of the different records is shown in Fig. 3. (Fig. 3 panels: alphabet of character, numeric, and other symbols; sample of real data; data structure and specification; record level; list level; multi-dimension table level; set of integrated lists, with relationships between fields and between lists.)

4.2 Multiple-Dimension List Level

This level is constructed from a collection of lists that have the same type and structure. The table takes the type of its base list; tables are therefore classified into simple, similar-compound, not-similar-compound, composite, and relational. A multi-dimension table is generated by repeating the list generation process a specific number of times, according to the required table.
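The relationship between the record, list, and table levels can be illustrated with a short Python sketch. It is only an assumption-laden example: the helper names `generate_list` and `generate_table`, the use of a dictionary per record, the random choice from each field's sample, and the sample values are hypothetical and are not taken from the paper.

```python
import random


def generate_list(samples, size, rng):
    """Generate a list of records that share one structure.

    samples: dict mapping each field name to its sample of resource
        values; one field gives a simple record, several fields of
        different types give a not-similar-compound record.
    """
    return [
        {field: rng.choice(values) for field, values in samples.items()}
        for _ in range(size)
    ]


def generate_table(samples, rows_per_list, list_count, seed=0):
    """Generate a multi-dimension table by repeating list generation."""
    rng = random.Random(seed)
    return [generate_list(samples, rows_per_list, rng)
            for _ in range(list_count)]


# Hypothetical not-similar-compound record: character, integer, Boolean.
samples = {
    "name": ["Ali", "Sara", "Omar"],
    "age": [21, 34, 47],
    "active": [True, False],
}
table = generate_table(samples, rows_per_list=4, list_count=3)
print(len(table), len(table[0]))
print(table[0][0])
```

Every list in the table shares the structure of its base record, which mirrors the statement that a table takes the type of its base list.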
4.3 Integrated or Multiple Related Lists Level

Some software applications, especially relational databases, contain related lists or tables, that is, lists with relationship(s) between them. Generating this type requires specifying the relationships between the lists in the order in which they are used in the generation process.

5. CONCLUSION

The proposed model overcomes some of the current problems in test data generation. It gives users, especially experts or professionals, the flexibility to insert or select the required list of data to be generated, its structure and specifications, its components and their specifications, the generation technique(s), and the resource of sample data. Using more powerful generation techniques based on statistical methods, together with a real or assumed sample of data, contributes to generating test data with a high ratio of semi-actual and meaningful items. The generated data reflects the specification or environment of the population under test. The intelligent selection of the data item structure and the generation technique increases the efficiency of selection and the overall performance of the model. Constructing the meaningful data library increases the efficiency of the test data generation process. The validation, the normality test, and the correlation coefficient increase the quality and reliability of the generated data. Together, these results increase the user's confidence in the generated test data, the testing process, the testing results, and consequently in the application under test. The ability to generate different types and structures of data while taking into consideration the relationships between data fields makes this model suitable for generating test data for different applications, especially database applications that are used in testing control systems.

REFERENCES

[1] B. T. Mynatt, Software Engineering with Student Project Guidance.
USA, Prentice-Hall International Editions, 1990.
[2] R. S. Pressman, Software Engineering: A Practitioner's Approach, 4th ed., Singapore, McGraw-Hill Book Company, 1997.
[3] R. C. Ferguson and B. Korel, "The Chaining Approach for Software Test Data Generation", ACM Transactions on Software Engineering and Methodology, 5(1), pp. 63-86, 1996.
[4] A. J. Offutt, Z. Jin and J. Pan, "The Dynamic Domain Reduction Approach to Test Data Generation", Software Practice and Experience, to appear in 1999.
[5] A. J. Offutt, "An Integrated Automatic Test Data Generation System", Journal of Systems Integration, 1(3), pp. 391-409, 1991.
[6] S. A. Hameed, A. Deraman and A. Hamdan, A Framework for Database Test Data Generator, Technical Report FTSM / MEI LT-48, Universiti Kebangsaan Malaysia, 1998.
[7] S. A. Hameed, Meaningful Test Data Generation Based on Statistical Methods, Ph.D. Thesis, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 2000.
[8] L. H. Tsoukalas and R. E. Uhrig, Fuzzy and Neural Approaches in Engineering, USA, John Wiley & Sons, Inc., 1997.
[9] J. L. Devore, Probability and Statistics for Engineering and the Sciences, 3rd ed., USA, Brooks/Cole Publishing Company, 1991.
[10] R. W. Hamming, The Art of Probability for Scientists and Engineers, USA, Addison-Wesley Publishing Company, 1991.
[11] A. Agresti and B. Finlay, Statistical Methods for the Social Sciences, San Francisco, Dellen Publishing Company, 1986.
[12] L. L. Lapin, Statistics: Meaning and Methods, Harcourt Brace Jovanovich, Inc., 1975.
[13] M. A. Weiss, Data Structures and Algorithm Analysis in C, 2nd ed., Addison-Wesley Longman, Inc., 1997.
[14] C. A. Shaffer, A Practical Introduction to Data Structures and Algorithm Analysis, International Edition, Prentice-Hall International, Inc., 1997.
[15] A. A. Aziz, Simulation Systems for Statistical Tests, Ph.D. Thesis, University of Essex, UK, 1987.

BIOGRAPHIES

Shihab A. Hameed was Asst. Prof.
at the Electrical & Computer Eng. Department, Faculty of Engineering, IIUM University (currently at the University of Bahrain). He obtained his Ph.D. in software engineering / software testing from UKM University (Malaysia). He has over twenty years of industrial and educational experience in software development and as an academician. He has published many research papers both locally and internationally.

Abdulmajid A. Al-Abbasi was Assoc. Prof. at the Science in Engineering Department, Faculty of Engineering, IIUM University. He is currently working at the University of Bahrain. He obtained his Ph.D. in Simulator Systems from Essex University (UK) in 1988. His research interests are random number generation and analysis. He has several research papers published internationally.