Test-Case Generation for SQL Nested Queries with Existential Conditions Electronic Communications of the EASST Volume 55 (2012) Proceedings of the XII Spanish Conference on Programming and Computer Languages (PROLE 2012) Test-Case Generation for SQL Nested Queries with Existential Conditions Rafael Caballero José Luzón-Martı́n Antonio Tenorio 15 pages Guest Editors: Marı́a-del-Mar Gallardo Managing Editors: Tiziana Margaria, Julia Padberg, Gabriele Taentzer ECEASST Home Page: http://www.easst.org/eceasst/ ISSN 1863-2122 http://www.easst.org/eceasst/ ECEASST Test-Case Generation for SQL Nested Queries with Existential Conditions Rafael Caballero1 ∗ José Luzón-Martı́n 2 Antonio Tenorio 3 1 rafa@sip.ucm.es 2 jose.11.10.88@gmail.com 3 senrof21@gmail.com Departamento de Sistemas Informáticos y Computación Universidad Complutense de Madrid, Spain Abstract: This paper presents a test-case generator for SQL queries. Starting with a set of related SQL views that can include existential subqueries in the conditions, the technique finds a database instance that can be used as a test-case for the target view. The proposal reduces the problem of generating the test-cases to a Constraint Satisfaction Problem using finite domain constraints. In particular, we present a new approach for existential conditions that makes possible to find test-cases for a wider set of queries. The soundness and correctness of the technique with respect to a simple operational semantics for SQL queries without aggregates is proven. The theoretical ideas have been implemented in an available prototype. Keywords: SQL, Test-cases, constraints, finite domains 1 Introduction This paper presents a technique for producing test-cases that allows the checking of a set of corre- lated SQL [SQL92] views. Test-cases are particularly useful in the context of databases, because checking if queries which are defined over real-size database instances are valid is a very dif- ficult task. Therefore, the goal is to automatically obtain a small database instance that allows the user to readily check if the defined SQL views work as expected. There are many different possible coverage criteria for test-cases (see [ZHM97, AO08] for a general discussion). In the particular case of SQL [CT04, ST09], it has been shown that good criteria like the Full Predicate Coverage (FPC) can be obtained by transforming the initial SQL query into a set of independent queries, where each one is devoted to checking a particular component of the query and then ob- taining a positive test-case for each query in this set. A positive test-case is a test case following the simple criterium of being non-empty database instances such that the query/relation studied produces a non-empty result. As proposed in [CGS10], negative test-cases can be defined in terms of positive test-cases, thus, negative test cases can also be obtained with a positive test- case generator. A first step towards a solution was presented in [CGS10], where the problem was reduced to a Constraint Satisfaction Problem (CSP in short). Although self-contained, this paper must be seen as an improvement with respect to this previous work. More specifically, the main contribution is that queries including existential subqueries are allowed. These subqueries are introduced in SQL conditions by means of the reserved word exists and play an important role in ∗ Work partially supported by the Spanish projects STAMP (TIN2008-06622-C03-01), Prometidos-CM (S2009TIC- 1465) and GPD (UCM-BSCH-GR35/10-A-910502) 1 / 15 Volume 55 (2012) mailto:rafa@sip.ucm.es mailto:jose.11.10.88@gmail.com mailto:senrof21@gmail.com Test-Case Generation for SQL defining SQL views. In [CGS10] the treatment of these subqueries was very limited (only if the exists occurred as part of a conjunction at the outer level). Moreover, the operational semantics used in that work (the Extended Relation Algebra, ERA in short) does not allow subqueries, a very important feature that could not be included in the theoretical results. We overcome this limitation in the current paper by: 1. Defining a new SQL Operational Semantics (called SOS), presented in Section 2 that allows existential subqueries as part of the conditions in the where clause. We prove the equivalence of SOS and ERA over the set of basic queries without aggregates. 2. Defining a CSP that includes representation of existential subqueries (Section 3), and prov- ing its correctness and completeness. 3. Implementing a new prototype (Section 4) that generates the test-cases using a constraint solver. The work is concluded in Section 5, which summarizes the results and discusses future work. 2 SQL operational semantics A table schema is of the form T (A1,...,An), with T the table name and Ai attribute names for i = 1...n. We will refer to a particular attribute A by using the notation T.A. Each attribute A has an associated type (integer, string, . . . ) represented by type(T.A). In this paper we only consider integer types, although the inclusion of other data types will be considered in future work. An instance of a table schema T (A1,...,An) will be represented as a finite multiset of functions (called rows) {|µ1, µ2,..., µm|} such that dom(µi) ={T.A1,...,T.An}, and µi(T.A j)∈ type(T.A j) for every i = 1,...,m, j = 1,...,n. Observe that we qualify the attribute names in the domain by table names. This is done because in general we will be interested in rows that combine attributes from different tables, usually as result of cartesian products. In the following, it is useful to consider each attribute T.Ai in dom(µ) as a logic variable, and µ as a logic substitution. The concatenation of two rows µ1, µ2 with disjoint domain is defined as the union of both functions represented as µ1 �µ2. Given a row µ and an expression e we use the notation eµ to represent the value obtained applying the substitution µ to e. If dom(µ) = {T.A1,...,T.An} and ν = {U.A1 7→ T.A1,...,U.An 7→ T.An} (i.e., ν is a table renaming) we will use the notation µU to represent the substitution composition ν ◦ µ . The previous concepts for concatenations and substitutions can be extended to multisets of rows in a natural way. For instance, given the multiset of rows S and the row µ , Sµ represents the application of µ to each member of the multiset. Analogously, given two multisets S1, S2, we define S1 �S2 ={ν ◦µ | ν ∈ S1, µ ∈ S2} which constitutes the natural representation of cartesian products in relational databases. Ob- serve that although our representation for rows is different from the representation used usually in relational databases (tuples), it is possible to define an isomorphism l ·m between the two representations. If R is a relation and µ ∈ R is defined as µ = {|T1.A1 7→ a1,...Tn.An 7→ an|}, Proc. PROLE 2012 2 / 15 ECEASST then lµm = (a1,...,an) where the order in the tuple is the lexicographic order of the attribute names. In order to define the inverse transformation we assume that every relation has a well- defined schema with attribute names (T1.A1,...,Tn.An) with the attribute name positions in lexi- cographic order. Obviously, if R is either a table or a view, Ti = R for i = 1...n, but in the case of queries we can assume different relation names as prefixes. Then if t = (a1,...,an) is a tuple in R we define lµm−1 = T1.A1 7→ a1,...Tn.An 7→ an. It can be checked that the transformation is distributive with respect to ∏ and �, this means that lt1 ×t2m−1 = lt1 m−1 �l t2m−1 and lµ1 �µ2m = lµ1 m×l µ2m. Definition 1 SQL view and query syntax. The general form of a view is: create view V(A1, . . . , An) as Q, with Q a query and V.A1,...V.An the name of the view attributes. Queries can be: 1. Basic queries Q = select e1, . . . , en from R1 B1 , . . . , Rm Bm where C; with R j tables or views for j = 1...m, ei, i = 1...n expressions involving constants, pre- defined functions and attributes of the form B j.A, 1 ≤ j ≤ m, and A an attribute of R j. The condition C must have one of the following forms: C ::= false | true | e1 �e2 |C1 and C2 |C1 or C2 |notC1 |exists(Q) with e1,e2 arithmetic expressions, �∈ {=,<>,>,<,>=,<=}, C1, C2 SQL conditions and Q an SQL query. In the rest of the paper the logic values true and false are represented respectively by > and ⊥ in order to be distinguished from the SQL reserved words true and false. 2. Unions of the form Q1 union Q2, with Q1 and Q2 queries. 3. Intersections of the form Q1 intersects Q2, with Q1 and Q2 queries. A database schema D is a tuple (T ,C ,V ), where T is a finite set of tables, C a finite set of database constraints and V a finite set of views. A database instance d of a database schema is a set of table instances, one for each table in T verifying C (thus we only consider valid instances). To represent the instance of a table T in a database instance d we will use the notation d(T ). In this paper primary key and foreign key constraints are considered. They can be formally defined as follows: Definition 2 Let D be a database schema and d a database instance. Let T be a table in D such that d(T ) ={|µ1,..., µn|}, then: 1. If T has a primary key constraint defined over the columns C1,...,Cp, then we say that d is a valid instance with respect to this constraint if for every i = 1...n the logic formula n∧ j=1, j 6=i ( p∨ k=1 µi(T.Ck) 6= µ j(T.Ck)) holds (that is, it is evaluated to >). 3 / 15 Volume 55 (2012) Test-Case Generation for SQL 2. If T has a foreign key constraint for the columns C1,...,C f referring the columns C′1,...C ′ f from table T2. Then suppose that d(T2) = {|ν1,...,νn′|}. Then d is a valid instance with respect to this foreign key if for every µi, i = 1...n the following logic formula holds: n′∨ j=1, j 6=i ( f∧ k=1 µi(T.Ck) = ν j(T2.C ′ k)) That is, in the case of the primary key we require that every pair of rows must differ in at least one attribute of the primary key. In the case of the foreign key, every row µi in T requires the existence of another tuple ν j in T2 such that they have the same values over the attributes defining the primary case in each case. A symbolic database instance ds is a database instance whose rows can contain logic variables. We say that ds is satisfied by a substitution µ when (ds µ) is a database instance. µ must substitute all the logic variables in ds by domain values. Next we present a small example that is used in the rest of the paper. Example 1 A board game for multiple players. Players are included into the following table create table player { id int primary key }; The board can contain pieces from different players, but only one piece at each position. The board state is represented by the following table: create table board { int x, int y, int id, primary key x,y; foreign key(id) references player(id); }; where x, y are the position in the board and id the player identifier. We are interested in detecting the ids of the players that are still playing (that is, they have at least one piece on the board), and all their pieces are threatened: for every piece of the player there is a piece of a different player which is either in the same x or in the same y. The views defining such pieces are the following: create view nowPlaying(id) as select p.id from player p where exists (select b.id from board b where b.id=p.id); create view checked(id) as select p.id from player p where exists (select n.id from nowPlaying n where n.id = p.id) and not exists (select b1.id from board b1 Proc. PROLE 2012 4 / 15 ECEASST where b1.id = p.id and not exists (select b2.id from board b2 where (b2.x - b1.x) * (b2.y-b1.y)=0 and (b1.id <> b2.id))); The syntactic dependency tree for a view, table or query R, with R in the root of this tree, is defined as: • If R is a table it has no children. • If R is a view the only child is the syntactic dependency tree of its associated query. • If R is a query the children are the syntactic dependency trees of the relations occurring in the from section, plus one child for each subquery occurring in the where section. Now we are ready to define our SQL operational semantics (SOS for short). Definition 3 Let D a relational database schema and d be an instance of D . Then the SQL operational semantics, SOS(d) is defined as follows: 1. For any table T defined in D , 〈T,d〉= d(T ). 2. For any simple query Q select e1, . . . , en from R1 B1 , . . . , Rm Bm where C; define: 〈Q,d〉={sQ(µ) | ν1 ∈〈R1,d〉,...,νm ∈〈Rm,d〉, µ = ν1B1 �···�νmBm,〈Cµ,d〉} where sQ(µ) = {Q.A1 7→ (e1 µ),...,Q.An 7→ (en µ)}, and 〈C,d〉 is defined as follows • If C ≡ false then 〈C,d〉=⊥ • If C ≡ true then 〈C,d〉=> • If C ≡ e, with e an arithmetic expression involving constants then 〈C,d〉= e • If C ≡ e1 �e2, with � a relational operator, then 〈C,d〉= (〈e1,d〉�〈e2,d〉) . • If C ≡C1 and C2 then 〈C,d〉= 〈C1,d〉∧〈C2,d〉 • If C ≡C1 or C2 then 〈C,d〉= 〈C1,d〉∨〈C2,d〉 • If C ≡not C1 then 〈C,d〉=¬〈C,d〉 • If C ≡exists Q then 〈C,d〉= 〈Q,d〉 6= /0 3. If Q is a view with attributes A1,...,An defined by a query of the form Q1 union Q2, then 〈Q,d〉= 〈Q1,d〉∪〈Q2,d〉 with ∪ the union multiset operator. 4. Analogously, if Q is of the form Q1 intersection Q2 then, 〈R,d〉= 〈Q1,d〉∩〈Q2,d〉), with ∩ the intersection multiset operator. 5. For any view V with definition create view V(E1, . . . , En) as Q, 〈V,d〉 = 〈Q,d〉{V.E1 7→ Q.A1,...,V.En 7→ Q.An}. 5 / 15 Volume 55 (2012) Test-Case Generation for SQL The following observations can be useful for understanding SOS. Queries are treated as anony- mous views, and the name Q is always given as table name to the rows in the result. Analogously Ai is always the name given to the i-th expression in the select clause. This is useful for instance for ensuring that unions and intersections provide homogeneous rows, and also for the renaming applied by views in the last item. The expression (e1 �e2)µ in the second case represents the logic value > or ⊥ obtained after replacing each attribute by its value in µ and evaluating the relational expression in usual logic. In well-defined SQL queries all the attributes in the expres- sions correspond exactly to one attribute of a relation defined in the query. Well-defined exis- tential subqueries must verify that after replacing the external attributes by values they become well-defined queries. This property is exploited in the definition used in the existential condition C(µ) = (〈Qµ,d〉 6= /0), which must be read as: first replace in Q all the external attributes by their values, then obtain their semantics, and finally check whether the result is different from the empty multiset. Using this definition we can define a positive test-case (PTC) for a view V as a non-empty database instance d such that 〈V〉 6= /0. In previous papers we have used the Extended Relational Algebra (ERA from now on) [GUW08] as suitable operational semantics for SQL. However ERA does not allow existential subqueries, a key point of this work that is solved in SOS. Defining ERA and the transformation of SQL into ERA is beyond the scope of this paper, and we only mention some of the basic operations: • Unions and intersections. The union of R and S, is a multiset R∪S in which the row µ occurs n + m times. The intersection of R and S, R∩S, is a multiset in which the row µ occurs min(n,m) times. • Projection. The expression πe1 7→A1,...,en 7→An(R) produces a new relation producing for each row µ ∈ R a new row {A1 7→ e1 l µm,...,An 7→ en l µm}. That is, we substitute in each ei the attribute names by their values in µ and then evaluate ei. The resulting multiset has the same number of rows as R. • Selection. Denoted by σC(R), where C is the condition that must be satisfied for all rows in the result. The selection operator on multisets applies the selection condition to each row occurring in the multiset independently. • Cartesian products. Denoted as R×S, each row in the first relation is paired with each row in the second relation. • Renaming. The expression ρS(R) changes the name of the relation R to S. A sound requirement is that both ERA and SOS must define the same semantics over the com- mon SQL fragment. This is important because ERA is assumed as the standard SQL semantics. Theorem 1 Let D be a database schema and d a database instance. Let R be a relation in D or a query defined over relations in D. Assume that the relations in the syntactic dependency tree of R do include neither aggregates not existential subqueries. Then η ∈〈R,d〉 with cardinality k iff lηm∈ ERA(R,d) with cardinality k. Proc. PROLE 2012 6 / 15 ECEASST Proof. Using induction on the depth of the syntactic dependency tree for R, and distinguishing cases depending on the form of R: - R is a query of the form select e1, . . . , en from R1 B1 , . . . , Rm Bm where C;, and C does not contain subqueries. Then according to Definition 3 〈Q,d〉={sQ(µ) | ν1 ∈〈R1,d〉,...,νm ∈〈Rm,d〉, µ = ν1B1 �···�νmBm,〈Cµ,d〉} (1) And it is easy to check that in the case of conditions without subqueries 〈Cµ,d〉= C′µ where C′ is obtained from C by replacing true by >, false by ⊥, and by ∧, or by ∨ and not by ¬. Analogously in ERA the query can expressed as: ERA(Q,d) = ∏ e1 7→Q.A1,...,en 7→Q.A1 σC′(ρB1(ERA(R1,d))×···×ρBm(ERA(Rm,d))) (2) with C′ defined as above. Now assume that η ∈〈Q,d〉 with cardinality k. This means that there exists µ1,..., µk such that η = sQ(µi) for i = 1...k. According to 1 this implies that µi = ν i 1 B1 �···�ν im Bm , with ν ij ∈〈R j,d〉 for j = 1...m (3) Then applying the induction hypothesis to 3 we have that lν ijm∈ ERA(R j,d) with the same car- dinality for i = 1...n, j = 1...m, which means that ρ ν ij B j ∈ ρB j (ERA(R j,d)) for i = 1...n, j = 1...m, and therefore ρ µi ∈ (ρB1(ERA(R1,d))×···×ρBm(ERA(Rm,d))). From 1 we have that 〈Cµ,d〉 = C′µi holds, which means that ρ µi ∈ σC′(ρB1(ERA(R1,d))×···×ρBm(ERA(Rm,d))). Finally, taking into account that η = sQ(µi) for i = 1...k, and considering that lηm = lsQ(µi)m = {Q.A1 7→ (e1 l µim),...,Q.An 7→ (en l µim)} = ∏e1 7→Q.A1,...,en 7→Q.A1(lµim) we have from 2 that lηm∈ ERA(Q,d). 3 Generating Constraints In our tool, initially the user chooses a view V and the number of rows of the initial symbolic database instance (that is the number of rows of the tables involved in the computation of V ). Each attribute value in each row corresponds to a fresh logic variable with its associated domain integrity constraints. The tool tries to obtain a positive test-case by binding the logic variables in the symbolic database. Notice that the number of rows directly affects the result since for some queries does not exists positive test-cases with an specific size. In the future we plan to decide automatically the more convenient size by inspecting the relations definition. Observe also that currently only integer domains are supported. Extending the proposal to other domains such as strings does not seem difficult to achieve, but is beyond the scope of this work. For instance, in Example 1, suppose that the user decides to look for a positive test-case with two rows in each table. Then, initially the prototype builds the following symbolic instance: player id player.id.0 player.id.1 board id x y board.id.0 board.x.0 board.y.0 board.id.1 board.x.1 board.y.1 7 / 15 Volume 55 (2012) Test-Case Generation for SQL Notice that player.id.0, player.id.1, board.id.0, board.id.1, board.x.0, board.x.1, board.y.0, board.y.1 represent logic variables. The next definition is employed by the tool to establish the constraints on these variables. Definition 4 Let D be a database schema and d a database instance. We define θ (R,d) for every relation R in D as a multiset of pairs (ψ,u) with ψ a first order formula, and u a row. This multiset is defined as follows: 1. For every table T in D such that d(T ) ={|µ1,..., µn|}: where µi = (T.C1 7→Xi1,...,T.Cm 7→ Xim) then: • If the definition of T has neither primary key nor foreign key constraints: θ (T,d) = {|(true, µ1),...,(true, µn)|}. • If the definition of T contains primary or foreign key constraints: – if T has a primary key constraint for the columns C1,...,Cp: Let T ′ be the table T without that primary key constraint. Assume that θ (T ′,d) = {|(ψ1, µ1), ..., (ψn, µn)|}, then: θ (T,d) ={|((ψi ∧( n∧ j=1, j 6=i ( p∨ k=1 µi(T.Ck) 6= µ j(T.Ck)))), µi)|i ∈ 1,...,n|} – if T has a foreign key constraint for the columns T.C1,...,T.C f referring the columns T 2.C′1,...T 2.C ′ f from table T 2: Let T ′ be the T without that foreign key constraint. Assume that θ (T ′,d) = {|(ψ1, µ1),...,(ψn, µn)|}, and that d(T2) = {|ν1,...,νn′|}. Then: θ (T,d) ={|((ψi ∧( n′∨ j=1, j 6=i ( f∧ k=1 µi(T.Ck) = ν j(T 2.C ′ k)))), µi)|i ∈ 1,...,n|} 2. For every view V = create view V(E1, . . . , En) as Q, θ (V,d) = θ (Q,d){V.E1 7→ Q.A1,...,V.En 7→ An} 3. If Q is a basic query of the form: select e1,...,en from R1 B1,...,Rm Bm where Cw; Then: θ (Q,d) = {|(ψ1 ∧···∧ψm ∧ϕ(Cw µ,d), sQ(µ)) | (ψ1,ν1)∈ θ (R1,d),...,(ψm,νm)∈ θ (Rm,d), µ = ν1B1 �···�νmBm|} where • sQ(µ) = {Q.A1 7→ (e1 µ),...,Q.An 7→ (en µ)}, • The first order formula ϕ(C,d) is defined as Proc. PROLE 2012 8 / 15 ECEASST – If C ≡ false then ϕ(C,d) =⊥ – If C ≡ true then ϕ(C,d) => – If C ≡ e, with e an arithmetic expression involving constants then ϕ(C,d) = e – If C ≡ e1 �e2, with � a relational operator, then ϕ(C,d) = (ϕ(e1,d)�ϕ(e2,d)) . – If C ≡C1 and C2 then ϕ(C,d) = ϕ(C1,d)∧ϕ(C2,d) – If C ≡C1 or C2 then ϕ(C,d) = ϕ(C1,d)∨ϕ(C2,d) – If C ≡not C1 then ϕ(C,d) =¬ϕ(C,d) – If C ≡ exists Q then suppose that θ (Q,d) = {|(ψ1, µ1),...(ψp, µp)|}. Then ϕ(C,d) = (∨pj=1ψ j). 4. For set queries: • θ (V1 union V2,d) = θ (V1,d) ∪ θ (V2,d) with ∪ the multiset union. • (ψ, µ)∈ θ (V1 intersection V2,d) with cardinality k iff (ψ1, µ)∈ θ (V1,d) with cardi- nality k1, (ψ2, µ)∈ θ (V2,d) with cardinality k2, k = min(k1,k2) and ψ = ψ1 ∧ψ2. Observe that the notation sQ(x) with Q a query is a shorthand for the row µ with domain {E1,...,En} such that (Ei)x = (ei)x, with i = 1...n, with select e1 E1, . . . , en En the select clause of Q. If Ei’s are omitted in the query, it is assumed that Ei = ei. The following result and its corollary represent the main result of this paper, stating the sound- ness and completeness of our proposal: Theorem 2 Let D be a database schema and d a valid database instance. Let R be either a relation in D or a query defined using relations in D. Then η ∈〈R,d〉 with cardinality k iff (true,η)∈ θ (R,d) with cardinality k. Proof. The result is proven by using complete induction on the depth of the syntactic dependence tree defining R. - If R is a table. Suppose that d(R) = {|µ1,..., µn|}. We distinguish cases depending on the database constraints in the definition of R. we prove the result using induction on the number of constraints m in the definition of table R. - If m = 0, that is, R is defined with neither primary key nor foreign key constraints. In SOS by Definition 3, 〈T,d〉= d(T ) ={|µ1,..., µn|}. From Definition 4, θ (T,d) ={|(true, µ1),...,(true, µn)|}. Then every element η occurs in 〈R,d〉 with the same cardinality a (true,η) in θ (R,d). - If m > 0 R is defined using at least one primary or foreign key constraints. Select any constraint. Suppose that the selected constraint is a primary key defined by columns C1,...,Cp. Then if µi ∈〈R,d〉, it appears only once due to the primary key constraints. Let us see that µi ∈〈R,d〉 iff (true, µi)∈ θ (R,d) (with cardinality 1). From Definition 2.1, µi ∈〈R,d〉, with d a valid instance implies that n∧ j=1, j 6=i ( p∨ k=1 µi(R.Ck) 6= µ j(R.Ck)) (4) Defined T ′ as R without this constraint, as explained in Definition 4.1. Then d is also a valid instance for T ′, and thus d(T ′) = d(R) which means 〈T ′,d〉 = 〈(R,d)〉. Assume that θ (T ′,d) = 9 / 15 Volume 55 (2012) Test-Case Generation for SQL {|(ψ1, µ1), ..., (ψn, µn)|}, and from Definition 4.1 θ (T,d) ={|(ψi ∧( n∧ j=1, j 6=i ( p∨ k=1 µi(R.Ck) 6= µ j(R.Ck)))), µi)|i ∈ 1,...,n|} (5) µi ∈ 〈R,d〉 iff (from 〈T ′,d〉 = 〈(R,d)〉) µi ∈ 〈T ′,d〉 iff (true, µi) ∈ θ (T ′,d) ( by induction hypothesis) iff ψi = true. Moreover, µi ∈ 〈R,d〉 iff 4 is > (because d is valid) iff (true,µi) ∈ θ (R,d) (replacing ψi and 4 by > in 5). The result is analogous if the selected constraint is a foreign key. - If R is a view then θ (V,d) = θ (Q,d){V.E1 7→ Q.A1,...,V.En 7→ Q.An} with E1,...,En the attribute names defined by the view (Def. 4), and 〈V,d〉 = 〈Q,d〉{V.E1 7→ Q.A1,...,V.En 7→ Q.An} (Def. 3) and the result is straightforward from the induction hypothesis applied to Q. - If R is a simple query of the form: select e1, . . . , en from R1 B1 , . . . , Rm Bm where C; According to Definition 3 〈Q,d〉={sQ(µ) | ν1 ∈〈R1,d〉,...,νm ∈〈Rm,d〉, µ = ν1B1 �···�νmBm,〈Cµ,d〉} (6) and by Definition 4 θ (Q,d) = {|(ψ1 ∧···∧ψm ∧ϕ(Cµ,d), sQ(µ)) | (ψ1,ν1)∈ θ (R1,d),...,(ψm,νm)∈ θ (Rm,d), µ = ν1B1 �···�νmBm|} (7) The first step is to check that ϕ(Cµ,d) iff 〈Cµ,d〉 (8) The result is straightforward from the respective definitions (3 and 4) except in the case of existential subqueries. In this case Definition 3 indicates that an existential subquery Q′ is transformed into 〈Q′,d〉 6= /0, which is true iff there is some η ∈〈Q′,d〉. By the inductive hy- pothesis this means that (true,η) ∈ θ (Q′,d). Thus, in the definition 4 the multiset θ (Q′,d) = {|(ψ1, µ1),...(ψp, µp)|} contains at least one tuple (ψi, µi) with ψi = true for some 1 ≤ i ≤ p, which implies that ϕ(Cµ,d) = (∨pj=1ψ j) is true, proving 8. Then (true,η)∈θ (Q,d) with cardinality k iff there are exactly k values such that for i = 1...k µi = ν i 1 B1 �···�ν im Bm , for some (ψ i1,ν i 1)∈ θ (R1,d),...,(ψ i m,ν i m)∈ θ (Rm,d) (9) η = sQ(µi) (10) ψ i 1 = true,...,ψ i m = true (11) ϕ(Cµ,d) = true and thus by Def. 3 〈Cµ,d〉= true (12) From 11 and considering 9, applying the induction hypothesis: ν i 1 ∈〈R1,d〉,...,ν i m ∈〈Rm,d〉,i = 1...k (13) Proc. PROLE 2012 10 / 15 ECEASST and thus sQ(µi)∈〈Q,d〉 in 6 for i = 1...m. That is (considering 10), η ∈〈Q,d〉 with multiplicity k. - If R is a query of the form: Q1 union Q2 then 〈Q,d〉=〈Q1,d〉∪〈Q2,d〉 (Def. 3), θ (V1 union V2,d) = θ (V1,d) ∪ θ (V2,d) (Def. 4), and the result is an easy consequence of the induction hypothesis. - If R is a query of the form: Q1 intersection Q2 the result is similar to the previous case and it is omitted for the sake of space. Observe that in [CGS10] the proof was restricted to queries without subqueries due to the limitations of the Extended Relational Algebra [GUW08] used as operational semantics. The following corollary contains the idea for generating constraints that will yield the PTCs: Corollary 1 Let D be a database schema and ds a symbolic database instance. Let R be a relation in D such that θ (R) ={|(ψ1, µ1),...(ψn, µn)|}, and η a substitution satisfying ds. Then dsη is a PTC for R iff ( ∨n i=1 ψi)η = true. Proof. Straightforward from Theorem 2: ( ∨n i=1 ψi)η = true iff there is some ψi with 1 ≤ i ≤ n such that ψiη = true iff (µiη)∈〈R〉 iff 〈R〉 6= /0. 4 Prototype In this section, we comment on some aspects of our implementation and show a system ses- sion for our running example. The SQL Test Case Generator (STCG) is implemented in the programming language C++. The input of STCG consists of a table or view name R, and a SQL file containing the definition of R and of all the relations in its syntactic dependence tree. The goal is to obtain a positive test- case for R. STCG can be seen as a pipeline, which helps to maintain and extend the implementation. In a first phase the prototype generates a representation in C++ of the CSP. This phase consists of two steps: 1. SQL parser: STCG includes a SQL parser to pre-process the input file. The result is a C++ representation of the database schema. This parser has been produced using GNU Bison [GNU], a general-purpose parser generator, and Flex [CF04], a tool for generating scanners. 2. Formula Generator: The core of STCG. Takes as input the output of the previous step, and following Definition 4 described in section 3, generates the first order logic formula representing the constraints associated to the table or view specified. Figure 1 shows a diagram of this first phase. Next, we want to satisfy this formula by binding the logic variables, thus obtaining a positive test case if possible. In the second phase (see Figure 2) the logic formula is translated into a format accepted by some external constraint solver. We have used G12/CPX as solver, which is distributed with G12 MiniZinc distribution, however, any solver that implements FlatZinc can be used for this purpose. MiniZinc [NSB+07] is a medium-level constraint modeling language. We translate our 11 / 15 Volume 55 (2012) Test-Case Generation for SQL Figure 1: Pipeline of STCG. Figure 2: Pipeline of STCG extended with an input builder for a solver. logic formula into a MiniZinc model, so it can be solved, returning, if possible, the positive test- cases. Other constraint programing languages can be added by implementing new translators. Below we present a system session for the example at hand. Consider that the input SQL file corresponds to Example 1. Following the pipeline, the SQL parser takes this file as input and generates an internal C++ representation of the database schema (it can be obtained at the command line using the ”- v” option of STCG). Suppose that the user intends to generate a test case for view checked indicating that tables player and board must have two rows at most. Thus, the Formula Generator module takes the database schema as input and generates the following formula (where symbolic variables are of the form table name.column name.row): #(checked)= {(((((board.id.0 == player.id.0) OR (board.id.1 == player.id.0)) AND ... (not (((((board.x.0 - board.x.1) * (board.y.0 - board.y.1)) == 0) AND (board.id.1 != board.id.0)) OR ...))))), {checked.id -> player.id.0}), ...} This corresponds to θ (checked,board) without considering primary and foreign keys (dis- played below). For the sake of space we display only part of the first formula and the first row {checked.id → player.id.0}. This row indicates that all the pieces of player.id.0 are checked if the formula holds. Firstly, the formula indicates that player.id.0 must have a piece on either the position corresponding to board.id.0 or to board.id.1. The other part of the formula dis- played indicates that there is a check between positions contained in board.x.0,board.y.0, and board.x.1,board.y.1 for two different players. Proc. PROLE 2012 12 / 15 ECEASST Primary key constraint of table player is shown below. For simplicity, only logic formula for the first row {player.id → player.id.0} is displayed. That formula indicates that this table cannot have two players with the same id attribute, that is player.id.0 != player.id.1. #(player)={(player.id.0 != player.id.1), {player.id -> player.id.0}), ...} Next, we display primary key and foreign key constraints of board for the first row {board.x → board.x.0, board.y → board.y.0, board.id → board.id.0}. That formula indicates that id attribute of each row of this table must reference one of id attributes of the table player (foreign key), that is board.id.0 == player.id.0 ∧ board.id.1 == player.id.0, and two pieces cannot be in the same game board position, that is (board.x.0 != board.x.1) ∨ (board.y.0 != board.y.1) (primary key). #(board)= {(((board.id.0 == player.id.0) AND (board.id.1 == player.id.0)) AND ((board.x.0 != board.x.1) OR (board.y.0 != board.y.1)), {board.x-> board.x.0,board.y-> board.y.0,board.id-> board.id.0}),...} Actually, STCG generates only one formula, with primary key and foreign key constraints included, but we show them separated for simplicity. The next step of the pipeline is to solve this constraints if possible. As mentioned above, we use the MiniZinc constraint modeling language for this purpose. We introduce a new step to the pipeline, that will translate our Formula to a MiniZinc model. This model consists of four sections: variable declarations, constraint specification, solver invocation and output formatting. The model is saved in a file that can be later executed by MiniZinc, returning the positive test- case if possible. To obtain meaningful results we restrict the domain of symbolic variables to 1..5 in this session. In the future, our tool will support domain constraints of symbolic variables. The result returned by MiniZinc is: INSERT INTO board(id,x,y) VALUES (1,1,1); INSERT INTO board(id,x,y) VALUES (2,1,5); INSERT INTO player(id) VALUES(1); INSERT INTO player(id) VALUES(2); This result is a positive test-case, written as a set of SQL insert statements that populates the tables of the database with the following data: player id 1 2 board id x y 1 1 1 2 1 5 Thus, the board has player 1 in position (1,1) and player 2 in position (1,5) and both players are checked as defined in the example 1. We can check that this instance is a positive test-case by executing the SQL query select * from checked, which returns the following non empty result: id 1 2 13 / 15 Volume 55 (2012) Test-Case Generation for SQL 5 Conclusions and Future Work We have presented a technique for generating positive test-cases for sets of correlated SQL views. Our setting considers that a database instance constitutes a test-case for a view if the view com- putes at least one tuple with respect to the instance. Our proposal is to reduce the problem of obtaining a test-case to a Constraint Satisfaction Problem over finite domains. The generated constraints take into account primary and foreign key database constraints if they are defined in an SQL table definition statement. The work improves the proposal of [CGS10] by introducing a complete treatment of existential subqueries. In order to prove the correctness of the technique, a new operational semantics for SQL views without aggregates is introduced. The theoretical ideas have been implemented in a working prototype called STCG, available at: https://gpd.sip.ucm.es/trac/gpd/wiki/GpdSystems/STCG The prototype takes a set of views and the name of a particular view V , and generates a positive test-case for V . It uses MiniZinc [NSB+07] as constraint modeling language. The output is given in the form of a list of SQL insert commands that create the database instance. Although the framework presented is useful for many types of queries, including those defined by set operations and those queries including existential subqueries, it is only a first step since there are many useful SQL features still not supported by our tool. As immediate future work, we plan the integration of aggregate subqueries. The theoretical problem was already addressed in [CGS10], and the implementation seems straightforward. It would also be interesting to extend the SQL operational semantics to aggregates in order to prove the soundness of the proposal. Another limitation that must be solved is the inclusion of null values, which is currently not supported. However, the main limitation of the tool, and thus the main goal for our future work, is to extend the data types allowed for SQL columns. Currently the type of table attributes is limited to integer, and at least varchar (that is, string types) are required in order to obtain a really applicable tool. From the theoretical point of view it would be interesting to define the introduction of exis- tential nested subqueries directly into ERA, thus replacing SOS by an enriched version of ERA. Another theoretical improvement would be to determine formally the minimum number of rows necessary for ensuring that a positive test-case exists. Currently, it is the user who indicates the number of rows for each table, but we think that the size could be determined by automatically examining the SQL code. Proc. PROLE 2012 14 / 15 https://gpd.sip.ucm.es/trac/gpd/wiki/GpdSystems/STCG ECEASST Bibliography [AO08] P. Ammann, J. Offutt. Introduction to Software Testing. Cambridge University Press, 2008. [CF04] J. Coelho, M. Florido. CLP(Flex): Constraint Logic Programming Applied to XML Processing. In Proceedings of the CoopIS/DOA/ODBASE. Pp. 1098–1112. Springer LNCS 3291, Heidelberg, Germany, 2004. [CGS10] R. Caballero, Y. Garcı́a-Ruiz, F. Sáenz-Pérez. Applying Constraint Logic Program- ming to SQL Test Case Generation. In Blume et al. (eds.), Functional and Logic Pro- gramming. Lecture Notes in Computer Science 6009, pp. 191–206. Springer Berlin / Heidelberg, 2010. [CT04] M. J. S. Cabal, J. Tuya. Using an SQL coverage measurement for testing database applications. In Taylor and Dwyer (eds.), SIGSOFT FSE. Pp. 253–262. ACM, 2004. [GNU] GNU. Bison - GNU parser generator. [GUW08] H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete Book. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2008. [NSB+07] N. Nethercote, P. J. Stuckey, R. Becket, S. Brand, G. J. Duck, G. Tack. MiniZinc: Towards a Standard CP Modelling Language. In Bessiere (ed.). Lecture Notes in Computer Science 4741, pp. 529–543. Springer, 2007. [SQL92] SQL, ISO/IEC 9075:1992, third edition. 1992. [ST09] M. Surez-Cabal, J. Tuya. Structural Coverage Criteria for Testing SQL Queries. Jour- nal of Universal Computer Science 15(3):584–619, 2009. [ZHM97] H. Zhu, P. A. V. Hall, J. H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys 29:366–427, 1997. 15 / 15 Volume 55 (2012) Introduction SQL operational semantics Generating Constraints Prototype Conclusions and Future Work