Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844
Vol. V (2010), No. 4, pp. 567-577

A New Rymon Tree Based Procedure for Mining Statistically Significant Frequent Itemsets

Predrag Stanišić, Savo Tomović
University of Montenegro
Department of Mathematics and Computer Science
Džordža Vašingtona bb, Podgorica, Montenegro
E-mail: pedjas@ac.me, savica@t-com.me

Abstract: In this paper we suggest a new method for frequent itemset mining which is more efficient than the well-known Apriori algorithm. The method is based on a special structure called the Rymon tree. For its implementation, we suggest a modified sort-merge-join algorithm. Finally, we explain how the support measure, which is used in the Apriori algorithm, yields statistically significant frequent itemsets.

Keywords: frequent itemset mining, association analysis, Apriori algorithm, Rymon tree

1 Introduction

Finding frequent itemsets in databases is the fundamental operation behind association rule mining. The problem of mining association rules over transactional databases was introduced in [1]. An example of such a rule might be that "85% of customers who bought milk also bought bread". Discovering all such rules is important for planning marketing campaigns, designing catalogues, managing prices and stocks, customer relationship management, etc.

A supermarket is interested in identifying associations between itemsets; for example, it may want to know how many of the customers who bought milk also bought bread. This knowledge is important because if it turns out that many of the customers who bought milk also bought bread, the supermarket will place bread physically close to milk in order to stimulate the sales of bread. Such a piece of knowledge is especially interesting when there is a substantial number of customers who buy the two items together and when a large fraction of those individuals who buy milk also buy bread. For example, the association rule

milk ⇒ bread [support = 20%, confidence = 85%]

represents the facts:

• 20% of all transactions under analysis contain milk and bread;
• 85% of the customers who purchased milk also purchased bread.

The result of association analysis is the set of strong association rules, i.e. rules satisfying a minimal support and a minimal confidence threshold. The minimal support and the minimal confidence are input parameters for association analysis.

The problem of association rule mining can be decomposed into two sub-problems [1]:

• Discovering frequent itemsets. Frequent itemsets have support higher than the minimal support;
• Generating rules. The aim of this step is to derive rules with high confidence (strong rules) from frequent itemsets. For each frequent itemset l, all nonempty subsets of l are found; for each a ⊂ l, a ≠ ∅, the rule a ⇒ l − a is generated if support(l)/support(a) > minimal confidence.

Overall performance of association rule mining is determined by the first step; we do not consider the second step in this paper. Efficient algorithms for solving the second sub-problem are presented in [12].
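For concreteness, both measures can be computed directly from a transaction list. The following minimal Python sketch is ours, not the authors' implementation; the toy dataset and function names are purely illustrative.

from functools import reduce  # not needed below, shown only if rules are chained

transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"milk"}, {"bread", "butter"}, {"milk", "bread"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(antecedent ∪ consequent) / support(antecedent)
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"milk", "bread"}, transactions))       # 0.6
print(confidence({"milk"}, {"bread"}, transactions))  # 0.75

On this toy data, the rule milk ⇒ bread has support 60% and confidence 75%; with minsup = 20% and minconf = 70% it would be reported as a strong rule.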
The paper is organized as follows. Section 2 provides a formalization of the frequent itemset mining problem. Section 3 describes the Apriori multiple_num algorithm, which is a modification of the well-known Apriori algorithm [1]. Section 4 presents a new candidate generation procedure which is part of Apriori multiple_num. In Section 5 we use hypothesis testing to validate the generated frequent itemsets.

2 Preliminaries

Suppose that I is a finite set; we refer to the elements of I as items. We primarily use notions from [10].

Definition 1. A transaction data set on I is a function T: {1,...,n} → P(I), where P(I) is the set of all subsets of I. The set T(k) is the k-th transaction of T. The numbers 1,...,n are the transaction identifiers (TIDs). [10]

Given a transaction data set T on the set I, we would like to determine those subsets of I that occur often enough as values of T. [10]

Definition 2. Let T: {1,...,n} → P(I) be a transaction data set on a set of items I, where P(I) is the set of all subsets of I. The support count of a subset K of the set of items I in T is the number suppcount_T(K) given by:

suppcount_T(K) = |{k | 1 ≤ k ≤ n ∧ K ⊆ T(k)}|.   (1)

The support of an itemset K (in the following text, instead of "item set K" we write "itemset K") is the number:

support_T(K) = suppcount_T(K) / n.   (2)

[10]

The following rather straightforward statement is fundamental for the study of frequent itemsets. It is known as the Apriori principle [1]. The proof is presented in order to introduce the anti-monotone property.

Theorem 3. Let T: {1,...,n} → P(I) be a transaction data set on a set of items I, where P(I) is the set of all subsets of I. If K and K′ are two itemsets, then K′ ⊆ K implies support_T(K′) ≥ support_T(K). [10]

Proof: The theorem states that support_T has the anti-monotone property: the support of an itemset never exceeds the support of its subsets. For the proof, it is sufficient to note that every transaction that contains K also contains K′. The statement of the theorem follows immediately. □

Definition 4. An itemset K is µ-frequent relative to the transaction data set T if support_T(K) ≥ µ. We denote by F^µ_T the collection of all µ-frequent itemsets relative to the transaction data set T, and by F^µ_{T,r} the collection of µ-frequent itemsets that contain r items, for r ≥ 1 (in the following text we use "r-itemset" to denote an itemset that contains r items). [10]

Note that F^µ_T = ∪_{r≥1} F^µ_{T,r}. If it is clear what µ and T are, we omit them.

In this paper we propose a new algorithm for frequent itemset mining which is based on a special structure, the Rymon tree. The Rymon tree was introduced in [8] in order to provide a unified search-based framework for several problems in artificial intelligence; it is also useful for data mining algorithms. Definitions 5 and 6 introduce the necessary concepts and Definition 7 defines the Rymon tree.

Definition 5. Let S be a set and let d: S → N be an injective function. The number d(x) is the index of x ∈ S. If P ⊆ S, the view of P is the subset view(d,P) = {s ∈ S | d(s) > max_{p∈P} d(p)}.

Definition 6. A collection of sets C is hereditary if U ∈ C and W ⊆ U implies W ∈ C.

Definition 7. Let C be a hereditary collection of subsets of a set S. The graph G = (C,E) is a Rymon tree for C and the indexing function d if:

• the root of G is ∅;
• the children of a node P are the sets of the form P ∪ {s}, where s ∈ view(d,P).

If S = {s1,...,sn} and d(si) = i for 1 ≤ i ≤ n, we omit the indexing function from the definition of the Rymon tree for P(S).

Let S = {i1,i2,i3,i4} and let C be P(S), which is clearly a hereditary collection of sets. Finally, let d be the injective mapping d(ik) = k for 1 ≤ k ≤ 4. The Rymon tree for C and d is shown in Fig. 1.

Figure 1: Example of Rymon tree
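To see Definition 7 at work, one can enumerate the tree level by level. The following Python sketch is our own code and naming (not from the paper); it builds the Rymon tree of P({1,...,n}) with the natural indexing d(i_k) = k.

def rymon_children(node, n):
    # node: tuple of item indices in increasing order; items are 1..n.
    # view(d, P) consists of the items indexed above max(P).
    start = node[-1] + 1 if node else 1
    return [node + (j,) for j in range(start, n + 1)]

def rymon_tree(n):
    # Breadth-first enumeration: every subset of {1..n} appears exactly
    # once (Theorem 8); level r holds all r-subsets (the collection S_r).
    levels, current = [[()]], [()]
    while current:
        nxt = [c for node in current for c in rymon_children(node, n)]
        if nxt:
            levels.append(nxt)
        current = nxt
    return levels

for r, level in enumerate(rymon_tree(4)):
    print(r, level)  # level 1: (1,),(2,),(3,),(4,); level 2: (1,2),(1,3),...

Running the sketch for n = 4 reproduces the tree of Fig. 1: 1 root, 4 singletons, 6 pairs, 4 triples and 1 four-element set, each occurring exactly once.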
A key property of a Rymon tree is stated next.

Theorem 8. Let G be a Rymon tree for a hereditary collection C of subsets of a set S and an indexing function d. Every set P of C occurs exactly once in the tree.

Note that in the Rymon tree of a collection P(S), the collection S_r, consisting of the sets located at distance r from the root, contains all subsets of S of size r.

3 Apriori multiple_num Algorithm

The Apriori multiple_num algorithm generates frequent itemsets starting with frequent 1-itemsets (itemsets consisting of just one item). Next, the algorithm iteratively generates frequent itemsets up to the maximal length of a frequent itemset. Each iteration of the algorithm consists of two phases: candidate generation and support counting.

In the candidate generation phase, potentially frequent itemsets, or candidate itemsets, are generated. The Apriori principle [1] is used in this phase. It is based on the anti-monotone property of itemset support (see Theorem 3) and allows the elimination, or pruning, of some candidate itemsets without calculating their support. According to the Apriori principle, if X is a frequent itemset, then all its subsets are also frequent. This fact is used in the candidate generation phase in such a way that any candidate containing at least one non-frequent subset is pruned immediately (before the support counting phase).

The support counting phase consists of calculating the support of all previously generated candidates (those not pruned by the Apriori principle in the candidate generation phase). Calculating candidate support requires one database scan and an efficient determination of whether the candidates are contained in a particular transaction t ∈ T. For candidates contained in t ∈ T, their support is incremented. On account of that, the candidates are organized in a hash tree. The candidates which have enough support are termed frequent itemsets.

The main difference between iterations in the original Apriori algorithm [1] and the Apriori multiple_num algorithm is that the iterations in the latter are "longer", which is determined by the multiple_num parameter. In the original Apriori algorithm, iteration k generates the set Fk (containing all frequent itemsets with k items), while the Apriori multiple_num algorithm in iteration k generates the sets F_{k+i}, 0 ≤ i ≤ multiple_num. If kMax < multiple_num, where kMax is the maximal length of a frequent itemset, the Apriori multiple_num algorithm terminates in just two iterations, i.e. just two database scans.

In what follows, all candidate k-itemsets (itemsets containing k items) are denoted by Ck, and all frequent k-itemsets by Fk. The pseudocode of the Apriori multiple_num algorithm is given below.

Apriori multiple_num Algorithm
Input: T - transactional database; µ - minimal support
Output: F - frequent itemsets in T
Method:
1. F1 = all_large_1itemsets(T, µ)
2. multiple_num = maximal_length_of_transactions
3. C2 = apriori_gen(F1, F1)
4. FOR i=3 TO multiple_num
     Ci = apriori_gen(C_{i-1}, C_{i-2})
   END FOR
5. FOR i=2 TO multiple_num
     createCandidateHashtree(Ci)
   END FOR
6. FOR EACH t ∈ T DO
     FOR i=2 TO multiple_num
       traverseHashtree(Ci, t)
     END FOR
   END FOR
7. FOR i=2 TO multiple_num
     Fi = {c ∈ Ci | support(c) ≥ µ}
   END FOR
8. F = ∪_k Fk

Let us explain the most important steps briefly. Generating frequent 1-itemsets is done in the same way as in the original Apriori algorithm [1]. This step requires one database scan; a sketch of it is given below.
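The following Python sketch of this first scan is ours; all_large_1itemsets is named after step 1 of the pseudocode above, and transactions are assumed to be represented as sets of items.

from collections import Counter

def all_large_1itemsets(transactions, minsup):
    # One pass over T: count every item's occurrences, then keep the
    # items whose relative support reaches the minimal support µ.
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item for item, c in counts.items() if c / n >= minsup}

# Example: all_large_1itemsets([{"a","b"}, {"a"}, {"b","c"}], 0.5) -> {'a', 'b'}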
Then, the parameter multiple_num is set to the length of the longest transaction in the database T, which ensures that the algorithm will need just one more database scan. Steps 3 and 4 are concerned with candidate generation: in step 3 the set C2 is generated by calling the apriori_gen function, and the loop in step 4 generates all other candidate sets Ci, 3 ≤ i ≤ multiple_num, by calling the apriori_gen function, but with the following difference. In the original Apriori algorithm [1], the candidate set C_{k+1} (candidate itemsets containing k+1 items) is formed from the set Fk (frequent itemsets containing k items) in iteration k+1. However, we want to generate all itemsets in a single loop in order to reduce the number of iterations (database scans) to two, and at that point we do not yet have the necessary frequent sets. As the solution, the arguments are the candidate sets C_{k-1} and C_{k-2}, which are known at this moment. The next section describes the modification of the apriori_gen function and its fast implementation.

The support counting phase comes next. All candidate sets Ci, 2 ≤ i ≤ multiple_num, are organized in separate hash trees in order to make the support counting process efficient. Then we scan the database and calculate the support of the candidates by traversing the corresponding hash trees. At the end of the support counting phase, the frequent itemsets Fi, 1 ≤ i ≤ multiple_num, are generated. As stated earlier, we do not further consider the support counting phase.

The Apriori algorithm [1] performs kmax + 1 iterations, where kmax is the maximal length of a frequent itemset, and in each iteration it scans the whole database. The Apriori multiple_num algorithm finishes after 2 iterations and performs 2 database scans.

4 New Procedure for Candidate Generation

We assume that any itemset K is kept sorted according to some relation <, where for all x,y ∈ K, x < y means that object x is in front of object y. Also, we assume that all transactions in the database T and all subsets of K are kept sorted in lexicographic order according to the relation <.

For candidate generation we suggest an original method by which the set C_{T,k} is calculated by joining C^µ_{T,k-1} with C^µ_{T,k-2}, for k ≥ 3. A candidate k-itemset is created from one candidate (k-1)-itemset and one candidate (k-2)-itemset in the following way. Let X = {x1,...,x_{k-1}} ∈ C^µ_{T,k-1} and Y = {y1,...,y_{k-2}} ∈ C^µ_{T,k-2}. The itemsets X and Y are joined if and only if the following condition is satisfied:

x_i = y_i (1 ≤ i ≤ k-3) ∧ x_{k-1} < y_{k-2},   (3)

producing the candidate k-itemset {x1,...,x_{k-2},x_{k-1},y_{k-2}}. We will prove the correctness of the suggested method. In the following text we denote this method by C_{T,k} = C^µ_{T,k-1} × C^µ_{T,k-2}.

Let I = {i1,...,in} be a set of items that contains n elements. Denote by G_I = (P(I),E) the Rymon tree of P(I). The root of the tree is ∅. A vertex K = {i_{p1},...,i_{pk}} with p1 < p2 < ... < pk has n − pk children K ∪ {i_j}, where pk < j ≤ n. Let S_r be the collection of itemsets that have r elements.
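Before turning to the correctness proof, here is a small illustration of join condition (3) on concrete itemsets. The code and helper names (joinable, join) are ours; itemsets are represented as sorted tuples of item indices.

def joinable(x, y):
    # x is a (k-1)-itemset, y a (k-2)-itemset. Condition (3): the first
    # k-3 items agree and the last item of x precedes the last item of y.
    k = len(x) + 1
    return x[:k - 3] == y[:k - 3] and x[-1] < y[-1]

def join(x, y):
    # The resulting candidate k-itemset {x1,...,x_{k-1}, y_{k-2}}.
    return x + (y[-1],)

x, y = (1, 2, 3), (1, 4)            # k = 4: prefix (1,) matches and 3 < 4
print(joinable(x, y), join(x, y))   # True (1, 2, 3, 4)
print(joinable((1, 2, 4), (1, 3)))  # False: 4 < 3 fails condition (3)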
The next theorem suggests a technique for generating S_r starting from S_{r-1} and S_{r-2}. It is a modification of Theorem 7.8 from [10].

Theorem 9. Let G_I be the Rymon tree of P(I), where I = {i1,...,in}. If W ∈ S_r, where r ≥ 3, then there exists a unique pair of distinct sets U ∈ S_{r-1} and V ∈ S_{r-2} that have a common ancestor Z ∈ S_{r-3} in G_I such that U ∩ V = Z and W = U ∪ V.

Proof: Let u, v and p be the three elements of W that have the largest, the second-largest and the third-largest subscripts, respectively. Consider the sets U = W − {u} and V = W − {v,p}. Note that U ∈ S_{r-1} and V ∈ S_{r-2}. Moreover, Z = U ∩ V belongs to S_{r-3} because it consists of the first r−3 elements of W. Note that both U and V are descendants of Z and that U ∪ V = W (for r = 3 we have Z = ∅).

The pair (U,V) is unique. Indeed, suppose that W can be obtained in the same manner from another pair of distinct sets U1 ∈ S_{r-1} and V1 ∈ S_{r-2} such that U1 and V1 are descendants of a set Z1 ∈ S_{r-3}. The definition of the Rymon tree G_I implies that U1 = Z1 ∪ {im,iq} and V1 = Z1 ∪ {iy}, where the elements of Z1 have indices smaller than min{m,q,y}. Then Z1 consists of the first r−3 elements of W, so Z1 = Z. If m < q < y, then m is the third-highest index of an element in W, q is the second-highest index and y is the highest index, so U1 = U and V1 = V. □

The following theorem, together with the obvious fact that F^µ_{T,k} ⊆ C_{T,k} for all k, directly proves the correctness of our method C_{T,k} = C^µ_{T,k-1} × C^µ_{T,k-2}. It is a modification of Theorem 7.10 from [10].

Theorem 10. Let T be a transaction data set on a set of items I and let k ∈ N be such that k > 2. If W is a µ-frequent itemset and |W| = k, then there exist a µ-frequent itemset Z and two itemsets {im,iq} and {iy} such that |Z| = k−3, Z ⊆ W, W = Z ∪ {im,iq,iy}, and both Z ∪ {im,iq} and Z ∪ {iy} are µ-frequent itemsets.

Proof: If W is an itemset such that |W| = k, then we already know that W is the union of two subsets U and V of I such that |U| = k−1, |V| = k−2 and Z = U ∩ V has k−3 elements (this follows from Theorem 9). Since W is a µ-frequent itemset and Z, U, V are subsets of W, each of these sets is also a µ-frequent itemset (this follows from Theorem 3). □

The Apriori algorithm [2] generates a candidate k-itemset by joining two large (k-1)-itemsets if and only if they have the first k−2 items in common. Because of that, each join operation requires k−2 equality comparisons. If a candidate k-itemset is generated by the method C_{T,k} = C^µ_{T,k-1} × C^µ_{T,k-2} for k ≥ 3, it is enough to perform k−3 equality comparisons.

The method C_{T,k} = C^µ_{T,k-1} × C^µ_{T,k-2} can be represented by the following SQL query:

INSERT INTO C_{T,k}
SELECT R1.item_1, ..., R1.item_{k-1}, R2.item_{k-2}
FROM C^µ_{T,k-1} AS R1, C^µ_{T,k-2} AS R2
WHERE R1.item_1 = R2.item_1 ∧ ... ∧ R1.item_{k-3} = R2.item_{k-3} ∧ R1.item_{k-1} < R2.item_{k-2}

For the implementation of the join C_{T,k} = C^µ_{T,k-1} × C^µ_{T,k-2} we suggest a modification of the sort-merge-join algorithm (note that C^µ_{T,k-1} and C^µ_{T,k-2} are sorted because of the way they are constructed and the lexicographic order of itemsets).

The original sort-merge-join algorithm [9] computes natural joins and equi-joins. Let r(R) and s(S) be the relations and let R ∩ S denote their common attributes. The algorithm keeps one pointer on the current position in relation r(R) and another on the current position in relation s(S). As the algorithm proceeds, the pointers move through the relations. The relations are assumed to be sorted on the joining attributes, so tuples with the same values on the joining attributes are in consecutive order. Thereby, each tuple needs to be read only once and, as a result, each relation is also read only once. The number of block transfers is equal to the sum of the numbers of blocks in the two sets C^µ_{T,k-1} and C^µ_{T,k-2}, n_{b1} + n_{b2}.
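For reference, here is a compact Python sketch of this classical merge phase, written by us for illustration; key_r and key_s are assumed key-extraction functions, not part of the paper's implementation.

def sort_merge_join(r, s, key_r, key_s):
    # Both inputs are sorted on the join key; two cursors advance
    # monotonically and each equal-key group is combined exactly once.
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        kr, ks = key_r(r[i]), key_s(s[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # collect the equal-key group in s, then pair it with every
            # r-tuple carrying the same key
            j0 = j
            while j < len(s) and key_s(s[j]) == kr:
                j += 1
            while i < len(r) and key_r(r[i]) == kr:
                out.extend((r[i], t) for t in s[j0:j])
                i += 1
    return out

r = [(1, "a"), (2, "b"), (2, "c")]
s = [(2, "x"), (3, "y")]
print(sort_merge_join(r, s, key_r=lambda t: t[0], key_s=lambda t: t[0]))
# [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]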
The modification of the sort-merge-join algorithm that we suggest refers to the elimination of the restriction that the join must be a natural join or an equi-join. First, we separate condition (3) into:

x_i = y_i, 1 ≤ i ≤ k-3,   (4)

x_{k-1} < y_{k-2}.   (5)

The join C_{T,k} = C^µ_{T,k-1} × C^µ_{T,k-2} is calculated according to condition (4); in other words, we compute a natural join. For this, the described sort-merge-join algorithm is used, and our modification is the following: before X = {x1,...,x_{k-1}} and Y = {y1,...,y_{k-2}}, for which X ∈ C^µ_{T,k-1}, Y ∈ C^µ_{T,k-2} and x_i = y_i, 1 ≤ i ≤ k-3, are joined, we check whether condition (5) is satisfied, and only then do we generate the candidate k-itemset {x1,...,x_{k-2},x_{k-1},y_{k-2}}. The pseudocode of the apriori_gen function comes next.

FUNCTION apriori_gen(C^µ_{T,k-1}, C^µ_{T,k-2})
1. i = 1
2. j = 1
3. WHILE i ≤ |C^µ_{T,k-1}| ∧ j ≤ |C^µ_{T,k-2}|
     // collect the group S of consecutive itemsets of C^µ_{T,k-1}
     // sharing the (k-3)-item prefix of the current itemset
     iset1 = C^µ_{T,k-1}[i]; S = {iset1}; i = i + 1
     WHILE i ≤ |C^µ_{T,k-1}| ∧ C^µ_{T,k-1}[i][w] = iset1[w] for 1 ≤ w ≤ k-3
       S = S ∪ {C^µ_{T,k-1}[i]}; i = i + 1
     END WHILE
     // advance in C^µ_{T,k-2} past the itemsets with a smaller prefix
     WHILE j ≤ |C^µ_{T,k-2}| ∧ C^µ_{T,k-2}[j][1,...,k-3] ≺ iset1[1,...,k-3]
       j = j + 1
     END WHILE
     // join every member of S with every prefix-matching itemset of C^µ_{T,k-2}
     WHILE j ≤ |C^µ_{T,k-2}| ∧ C^µ_{T,k-2}[j][w] = iset1[w] for 1 ≤ w ≤ k-3
       iset2 = C^µ_{T,k-2}[j]
       FOR EACH s ∈ S
         IF s[k-1] < iset2[k-2] THEN                    // condition (5)
           c = {s[1],...,s[k-1],iset2[k-2]}
           IF c contains a non-frequent subset THEN
             DELETE c
           ELSE
             C_{T,k} = C_{T,k} ∪ {c}
           END IF
         END IF
       END FOR
       j = j + 1
     END WHILE
   END WHILE
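Assuming candidates are kept as lexicographically sorted lists of sorted tuples, the pseudocode might be rendered in Python as follows. This is a sketch of ours, not the authors' C implementation; has_infrequent_subset stands for the Apriori pruning step and checks subsets against the previously generated candidate level.

from itertools import combinations

def has_infrequent_subset(candidate, prior_level):
    # Apriori pruning: every (k-1)-subset of the candidate must itself
    # appear among the (k-1)-item candidates; otherwise discard it.
    k = len(candidate)
    return any(sub not in prior_level for sub in combinations(candidate, k - 1))

def apriori_gen(c_km1, c_km2):
    # c_km1: sorted list of (k-1)-item tuples; c_km2: sorted list of
    # (k-2)-item tuples. Returns the candidate k-itemsets C_{T,k}.
    if not c_km1 or not c_km2:
        return []
    k = len(c_km1[0]) + 1
    km1_set = set(c_km1)
    out, i, j = [], 0, 0
    while i < len(c_km1) and j < len(c_km2):
        prefix = c_km1[i][:k - 3]
        group = []                            # the set S from the pseudocode
        while i < len(c_km1) and c_km1[i][:k - 3] == prefix:
            group.append(c_km1[i]); i += 1
        while j < len(c_km2) and c_km2[j][:k - 3] < prefix:
            j += 1                            # skip smaller prefixes
        while j < len(c_km2) and c_km2[j][:k - 3] == prefix:
            y = c_km2[j]
            for x in group:
                if x[k - 2] < y[k - 3]:       # condition (5): x_{k-1} < y_{k-2}
                    c = x + (y[k - 3],)
                    if not has_infrequent_subset(c, km1_set):
                        out.append(c)
            j += 1
    return out

# Example: joining C2 with C1 yields the single 3-candidate (1, 2, 3).
print(apriori_gen([(1, 2), (1, 3), (2, 3)], [(1,), (2,), (3,)]))

By Theorem 9, each candidate is produced by exactly one (X,Y) pair, so the merge emits no duplicates.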
5 Statistical Test for Validating Frequent Itemsets

Frequent itemset mining algorithms have the potential to generate a large number of patterns. For example, even if we assume that no customer has more than five items in his shopping cart and that there are 10,000 items, there are Σ_{i=1}^{5} C(10000, i) possible contents of this cart, which corresponds to the subsets having no more than five items of a set that has 10,000 items, and this is indeed a large number. As the size and dimensionality of real commercial databases can be very large, we could easily end up with thousands or even millions of patterns, many of which might not be interesting. It is therefore important to establish a set of well-accepted criteria for evaluating the quality of patterns.

In the Apriori multiple_num algorithm, the support measure is used to determine whether an itemset is frequent: an itemset X is considered frequent in the data set T if supp_T(X) > minsup, where minsup is a user-specified threshold. The support measure is a kind of objective interestingness measure, a data-driven and domain-independent approach that uses statistics derived from data for evaluating the quality of association patterns [12]. We now explain how statistical hypothesis testing can be applied to validate frequent itemsets generated with the support measure.

Hypothesis testing is a statistical inference procedure to determine whether a hypothesis should be accepted or rejected based on the evidence gathered from data. Examples of hypothesis tests include verifying the quality of patterns extracted by many data mining algorithms and validating the significance of the performance difference between two classification models.

In hypothesis testing we are usually presented with two opposite hypotheses, known respectively as the null hypothesis and the alternative hypothesis. The general procedure for hypothesis testing consists of the following four steps [12]:

• Formulate the null and alternative hypotheses to be tested.
• Define a test statistic θ that determines whether the null hypothesis should be accepted or rejected. The probability distribution associated with the test statistic should be known.
• Compute the value of θ from the observed data. Use the knowledge of the probability distribution to determine a quantity known as the p-value.
• Define a significance level α which controls the range of θ values in which the null hypothesis should be rejected. This range of values for θ is known as the rejection region.

The frequent itemset mining problem can be formulated in the hypothesis testing framework in the following way. To validate whether the itemset X is frequent in the data set T, we need to decide whether to accept the null hypothesis, H0: supp_T(X) = minsup, or the alternative hypothesis, H1: supp_T(X) > minsup. If the null hypothesis is rejected, then X is considered a frequent itemset. To perform the test, the probability distribution of supp_T(X) must also be known.

Theorem 11. The measure supp_T(X) for the itemset X in a transaction data set T has a (scaled) binomial distribution with mean supp_T(X) and variance supp_T(X)(1 − supp_T(X))/n, where n is the number of transactions in T.

Proof: We use the measure suppcount_T(X), calculate its mean and variance, and then derive the mean and variance of the measure supp_T(X). The measure suppcount_T(X) = X_n represents the number of transactions in T that contain the itemset X, and supp_T(X) = suppcount_T(X)/n (Definition 2). The measure X_n is analogous to the number of heads that show up when tossing n coins.

Let us calculate E(X_n) and D(X_n). The mean is E(X_n) = np, where p is the probability of success, which in our case means that the itemset X appears in one transaction. According to Bernoulli's law (the law of large numbers), the following holds:

∀ε > 0, lim_{n→∞} P{|X_n/n − p| ≤ ε} = 1.   (6)

Loosely speaking, for large n (we work with large databases, so n can be considered large), we can use the relative frequency instead of the probability. So we now have:

E(X_n) = np ≈ n(X_n/n) = X_n.   (7)

For the variance we compute:

D(X_n) = np(1 − p) ≈ n(X_n/n)(1 − X_n/n) = X_n(n − X_n)/n.   (8)

Now we compute E(supp_T(X)) and D(supp_T(X)). Recall that supp_T(X) = X_n/n. We have:

E(supp_T(X)) = E(X_n/n) = (1/n)E(X_n) = X_n/n = supp_T(X).   (9)

D(supp_T(X)) = D(X_n/n) = (1/n²)D(X_n) = (1/n²) · X_n(n − X_n)/n = (1/n) supp_T(X)(1 − supp_T(X)).   (10)

□

The binomial distribution can be further approximated by the normal distribution if n is sufficiently large, which is typically the case in association analysis. By the previous paragraph and Theorem 11, under the null hypothesis supp_T(X) is assumed to be normally distributed with mean minsup and variance minsup(1 − minsup)/n. To test whether the null hypothesis should be accepted or rejected, the following statistic can be used:

W_n = (supp_T(X) − minsup) / √(minsup(1 − minsup)/n).   (11)

By the Central Limit Theorem, this statistic has the N(0,1) distribution. The statistic essentially measures the difference between the observed support supp_T(X) and the minsup threshold in units of the standard deviation.

Let n = 10000, supp_T(X) = 0.11, minsup = 0.1 and α = 0.001. The last parameter is the desired significance level; it controls the Type 1 error, which is rejecting the null hypothesis even though the hypothesis is true. In the Apriori algorithm we compare supp_T(X) = 0.11 > 0.1 = minsup and declare X a frequent itemset. Is this validation procedure statistically correct? Under the hypothesis H1 the statistic W_{10000} is positive, so for the rejection region we choose R = {w_{10000} > k}, k > 0. Let us find k:

0.001 = P_{H0}{W_{10000} > k} = 1 − Φ(k),

where Φ is the standard normal distribution function, so Φ(k) = 0.999 and k = 3.09. Now we compute

w_{10000} = (0.11 − 0.1) / √(0.1(1 − 0.1)/10000) = 3.33...

We can see that w_{10000} > k, so we are in the rejection region and H1 is accepted, which means the itemset X is considered statistically significant.
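The whole test can be packaged in a few lines. The following Python sketch is ours (support_significance is a hypothetical helper, not from the paper's implementation) and reproduces the worked example above.

from math import sqrt, erfc

def support_significance(supp, minsup, n, z_alpha=3.09):
    # One-sided test of H0: supp = minsup against H1: supp > minsup.
    # z_alpha is the N(0,1) critical value for alpha = 0.001 (upper tail).
    w = (supp - minsup) / sqrt(minsup * (1 - minsup) / n)
    p_value = 0.5 * erfc(w / sqrt(2))   # P{N(0,1) > w}
    return w, p_value, w > z_alpha

# Worked example from the text: n = 10000, supp_T(X) = 0.11, minsup = 0.1
w, p, significant = support_significance(0.11, 0.1, 10000)
print(w, p, significant)   # w ≈ 3.33 > 3.09, so X is statistically significant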
6 Conclusion

In this section we compare the proposed method with the original Apriori [1] and with Apriori Multiple, which we introduced in [11].

Sections 3 and 4 of the paper contain the comparison with the original Apriori algorithm [1]. The main advantages of the new algorithm are that it finishes in just two database scans and that it uses a more efficient candidate generation procedure.

The algorithm from [11] also finishes in two database scans and uses a candidate generation procedure similar to that of the Apriori multiple_num algorithm proposed here. However, the Apriori multiple_num algorithm is more efficient. Its main advantage over the algorithm from [11] is the following: Apriori multiple_num uses the Rymon tree structure for the definition of the candidate join procedure, as explained in Section 4. Because of that, candidate sets are stored in a Rymon tree structure before joining, instead of being stored in an array as is done in [11]. The following experiment confirms that the Rymon tree based implementation is more efficient.

We implemented the Apriori multiple_num, the original Apriori [1] and the Apriori Multiple [11] algorithms in C in order to evaluate their performance. The experiments were performed on a PC with an Intel(R) Core(TM)2 CPU with a clock rate of 2.66 GHz and 2 GB of RAM. Run time here means the total execution time, i.e., the period between input and output, rather than the CPU time measured in the experiments in some literature. The experiments use a dataset which can be found at www.cs.uregina.ca; it contains 10,000 binary transactions, and the average length of the transactions is 8. We did not compare the number of I/O operations, because the algorithm proposed here finishes in just two database scans, while the original Apriori requires at least kmax + 1 scans, where kmax is the length of the longest frequent itemset (as explained in Section 3).

Figure 2 shows that the original Apriori algorithm from [1] is outperformed by both the Apriori Multiple [11] and the Apriori multiple_num presented here. It can also be seen that Apriori multiple_num with the Rymon tree based implementation is significantly better than the algorithm from [11].

Figure 2: Execution times for different algorithms

Bibliography

[1] Agrawal, R., Srikant, R., Fast Algorithms for Mining Association Rules, Proceedings of VLDB-94, 487-499, Santiago, Chile (1994)
[2] Coenen, F.P., Leng, P., Ahmed, S., T-Trees, Vertical Partitioning and Distributed Association Rule Mining, Proceedings of ICDM-2003, 513-516 (2003)
[3] Coenen, F.P., Leng, P., Ahmed, S., Data Structures for Association Rule Mining: T-trees and P-trees, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 6, 774-778 (2004)
[4] Coenen, F.P., Leng, P., Goulbourne, G., Tree Structures for Mining Association Rules, Journal of Data Mining and Knowledge Discovery, Vol. 8, No. 1, 25-51 (2004)
[5] Goulbourne, G., Coenen, F., Leng, P., Algorithms for Computing Association Rules Using a Partial-Support Tree, Journal of Knowledge-Based Systems, Vol. 13, 141-149 (1999)
[6] Grahne, G., Zhu, J., Efficiently Using Prefix-trees in Mining Frequent Itemsets, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (2003)
[7] Han, J., Pei, J., Yu, P.S., Mining Frequent Patterns without Candidate Generation, Proceedings of the ACM SIGMOD Conference on Management of Data, 1-12 (2000)
[8] Rymon, R., Search Through Systematic Set Enumeration, Proceedings of the 3rd International Conference on Principles of Knowledge Representation and Reasoning, 539-550 (1992)
[9] Silberschatz, A., Korth, H.F., Sudarshan, S., Database System Concepts, McGraw-Hill, New York (2006)
[10] Simovici, D.A., Djeraba, C., Mathematical Tools for Data Mining (Set Theory, Partial Orders, Combinatorics), Springer-Verlag London Limited (2008)
[11] Stanisic, P., Tomovic, S., Apriori Multiple Algorithm for Mining Association Rules, Information Technology and Control, Vol. 37, No. 4, 311-320 (2008)
[12] Tan, P.N., Steinbach, M., Kumar, V., Introduction to Data Mining, Addison-Wesley (2006)

Dr. Predrag Stanišić is a professor in the Faculty of Science - Department of Mathematics and Computer Science at the University of Montenegro. He received his B.Sc. degree in Mathematics and Computer Science from the University of Montenegro in 1996, his M.Sc. degree in Computer Science from the University of Belgrade, Serbia, in 1998, and his Ph.D. degree in Computer Science from the Moscow State University M.V. Lomonosov in 1999. He is currently the dean of the Faculty of Science at the University of Montenegro and teaches a wide variety of undergraduate and graduate courses in several computer science disciplines, especially database systems, operating systems and programming.

M.Sc. Savo Tomović is a teaching assistant in the Faculty of Science - Department of Mathematics and Computer Science at the University of Montenegro. He received his B.Sc. degree in Mathematics and Computer Science from the University of Montenegro in 2006 and his M.Sc. degree in Computer Science from the University of Montenegro in 2007. He is currently a Ph.D. student in Computer Science at the University of Montenegro.