INT J COMPUT COMMUN, ISSN 1841-9836 Vol.7 (2012), No. 3 (September), pp. 494-508 Mining Temporal Sequential Patterns Based on Multi-granularities N. Li, X. Yao, D. Tian Naiqian Li, Xinhui Yao, Dongpin Tian Department of Computer Science Baoji University of Arts and Sciences Baoji 721016, Shaanxi, China E-mail: xalnq@hotmail.com, baojiyxh@163.com, tdp211@163.com Abstract: Sequential pattern mining is an important data mining problem that can extract fre- quent subsequences from sequences. However, the times between successive items in a sequence is typically used as user-specified constraints to pre-process the in- put data or to prune the pattern search space. In either cases, the times cannot be used to identify item intervals of sequential patterns. In this paper, we introduce a form of multi-granularity sequence patterns, which is a sequential pattern where each transition time is annotated with multi-granularity boundary interval and aver- age time derived from the source data rather than the user-predetermined time interval or only a typical time. Then we present a novel algorithm, MG-PrefixSpan, of mul- tiple granularity sequential patterns based on PrefixSpan[, which discovers all such patterns. Empirical evaluation shows that MG-PrefixSpan scales up linearly as the size of database, and has a good scalability with respect to the length of sequence and the size of transaction. Keywords: Data Mining Algorithm, Sequential Pattern Mining, Sequential Data, Time Granularity, Temporal Patterns. 1 Introduction Among various types of data mining applications [1, 2], the sequential pattern mining, which discov- eries interesting sequential patterns hidden in sequence of events, is an important data mining problem with broad applications, including market analysis, decision support, the prediction of occurrences of recurrent illnesses, system performance analysis and telecommunication network analysis etc. The problem of mining sequential patterns was first proposed by Agrawal and Strikant [3]: Given a data set of sequences, each sequence is a list of transactions, where each transaction is a set of items. The sequential pattern mining is to find all subsequences that is more frequent than a user-specified minimum support threshold while maintaining their item occurrence order. For example, in the database of a book-club, a sequential pattern might be “5 percent of customers bought ‘Foundation’, then ‘Foundation and Empire’, and then ‘Second Foundation”’ [4]. Although the discovered sequential patterns reveal what items are frequently bought together and in what order, they cannot reveal how long time the items will be bought after the preceding items. Unfortunately, not knowing the time means that we cannot exactly predict when the next purchase will happen. In addition, some sequential patterns could occur in different periods with different time granularities. For example, “HP stock could rise within 5 days after IBM stock rose" and “HP stock could fall within 6 month after IBM stock rose", which reveal HP stock rise or fall with respect to IBM stock rise at different time granularities (day and month), are two useful different patterns. However, these traditional sequence patterns can only tell us “HP stock could rise after IBM stock rose" and “HP stock could fall after IBM stock rose". This situation means that the two patterns are useless. Another situation may be “HP stock Copyright c⃝ 2006-2012 by CCC Publications Mining Temporal Sequential Patterns Based on Multi-granularities 495 could rise 5 days later after IBM stock rose" and “HP stock could rise 6 months later after IBM stock rose". Although these two patterns are completely different with regard to different time granularities, these traditional sequence patterns could treat the two patterns as the same pattern “HP stock could rise after IBM stock rose", which make the extracted pattern less precise and some useful information lost. Given the above reasons, in this paper we generalize the problem definition given in [1, 8, 9, 13] to in- corporate the maximum, minimum and average time between successive transactions, which are derived from the source data, and different time granularities in traditional sequential patterns. We present a novel algorithm of multiple granularity sequence patterns based on PrefixSpan [6] , called MG-PrefixSpan, which discovers all such patterns. Empirical evaluation shows that MG-PrefixSpan scales up linearly as the size of database, the length of sequence and the size of transaction. The rest of this paper is organized as follows. Related work is discussed in section 2. We give a formal description of the problem of mining temporal sequence patterns based multiple granularities in section 3. In section 4, we describe MG-PrefixSpan, an algorithm for finding such patterns, and then Section 5 provides empirical evaluation of the performance of MG-PrefixSpan. Finally, we conclude the paper in section 6. 2 Related Work Sequential pattern mining, in general, can be grouped into two categories. One category, called un-temporal sequence pattern mining or traditional sequence pattern mining, considers only the item occurrence order in a sequential pattern, but does not deal with time-related data [1, 3–6]. The other category, called temporal sequence pattern mining, consider not only the item occurrence order in a sequence pattern, but also the time between successive items in a sequential pattern, such as [8, 9, 11–13]. 2.1 Un-temporal Sequence Pattern Mining Agrawal and Strikant [3] introduced the notion of sequential pattern mining, and based on the prop- erty that any sub-pattern of a frequent pattern must be frequent, three Apriori-based algorithms were proposed: AprioriSome, DynamicSome and AprioriAll. Two of these algorithms were designed only to find maximal sequential patterns. The third algorithm, AprioriAll, finds all patterns. Briefly, AprioriAll is decomposed into two phases: (1) generating candidate sequences; (2) scanning the sequential database to check the support of each candidate to determine frequent sequence patterns according to minimal sup- port threshold. Although AprioriAll is not efficient, it is the basis of many efficient algorithms developed later. SPADS [5] is an algorithm proposed to find frequent patterns using efficient lattice search technol- ogy and simple joins. It decomposes the original search space (lattice) into smaller pieces (sub-lattices), which can be processed independently in the main memory. Due to adopting a vertical id-list database format to count the number of frequent patterns, all the sequential patterns are discovered with only a few passes over the database. SPADS outperforms ApriorAll [2,3]. PrefixSpan[6] is another more effi- cient algorithm for mining sequential patterns comparing with the aprior-based algorithm ApriorAll and SPADS, especially in dealing with very large databases. It mainly adopts a projection-based, sequential pattern-growth method to make the database for next pass much smaller and consequently make the algo- rithm more speedy. Also in PrefixSpan there is no need to generate candidates, only recursive projection of database according to their prefix. Our method is based on the PrefixSpan algorithm. Some have tried to exert constraints on the mining of sequential patterns so that only those sequential patterns interesting to users are discovered rather than the whole possible sequential patterns[2,5,6,7]. Strikant and Agrawal [4] generalized their definition of sequential patterns in [3] to integrate with time constraints, sliding time window, and user-defined taxonomy, and proposed algorithm GSP. Mannila et al. [7] proposed a method of mining frequent episodes in a sequence of events, in which episodes are essentially constraints on events in form of acyclic graphs. Garofalakis et al. [8] proposed a family of 496 N. Li, X. Yao, D. Tian SPIRIT algorithms of mining user-specified sequential patterns by using regular expression constraints. Pei et al. [9] developed an extended framework based on a sequential pattern growth for constraint-based sequential pattern mining. Although time between successive items is typically used as a user-specified constraint to shrink the pattern search space to make the computation more efficient, it is not used in the output frequent sequence patterns. 2.2 Temporal Sequence Pattern Mining Yoshida et al.[10] proposed a notation of delta pattern, which is a temporal sequence pattern with temporal constraints in the form A [0,7] → B [3,5] → C of bounding intervals. An example of delta pattern has the form denoting a sequential pattern A → B → C that frequently appears in the database with transition times from A to B and from B to C that contained in [0, 7] and in [3, 5] respectively. However, Yoshida et al. only provided a heuristics for finding some frequent delta patterns, and did not investigate the problem of finding all of them. Along the same direction, Chen et al. [11] introduced a form of temporal sequence pattern by inserting pseudo items into the original sequential pattern. Pseudo items are user- defined time interval segmentations in advance. When counting the support of a sequential pattern, only the sequence, in which the pseudo item between successive items is same, supports the sequential pattern. Two algorithms, I-Apriori and I-PrefixSpan, were proposed. I-Apriori is based on Apriori algorithm [12], and I-PrefixSpan is based on PrefixSpan[6]. Hirate et al. [13] proposed generalized sequential pattern mining with item intervals. They extended sequences which are defined by inserting pseudo items based on the interval itemization function and exerting four interval constraints on items. However, as they adopted some user-predefined pseudo items or constraints on the time intervals between successive items, it is difficult for a user to specify optimal constraints related to item interval and cannot reveal the time interval between successive items in the patterns precisely, it may result in some useful sequential patterns not being found. Giannotti et al. [14] introduced another form of sequential pattern, called Temporally-Annotated Sequence, TAS in short, where each transition is annotated with temporal information representing a typical time derived from the source data. For instance, TAS A t1→ B t2→C denotes the fact that a sequential pattern A → B → C frequently appears in the database and that the time for getting from event A to B is close to t1 and from event B to C is close to t2. Although this method using typical time to annotate transition between two successive events can reveal how much time the event will occur after the preceding event, they cannot distinguish completely patterns with regard to different time granularities. For example, from event A to B is close to t1 days, other from event A to B may be close to t2 months, these two different patterns are treated as the same. It is useful to be able to distinguish the patterns to understand not only what event will follow, but also when these events will occur. Finally, we mention the work in [15], where event structures that have temporal constraints with multiple granularities are introduced. Event structures essentially are used as a flexible user-defined constraint specification to define the pattern discovery problem with these structures that enable users to focus on their interested sequential patterns. This method can only find the patterns that satisfy the event structures. Furthermore, an event structure consists of a number of variables representing events and temporal constraints among these variables, an efficient event structure is difficult to define beforehand even if you are a domain expert. In our work, we introduce a new form of sequential pattern with multi-granularities, which is a se- quential pattern where each transition is annotated with multi-granularity boundary interval and average time derived from the source data rather than user-predetermined time intervals [8,9] or only a typical time [14]. We also define the pattern discovery problem involving multi-granularities for these pattern and study efficient algorithm based on PrefixSpan to solve it. Mining Temporal Sequential Patterns Based on Multi-granularities 497 3 Problem Formulation 3.1 Time granularity In order to formally define temporal sequence pattern that involves time granularities, we first review the notion of a time granularity [15]. Definition 1. A granularity is a mapping µ from the set of the positive integers ( the time ticks) R to R2 (the set of absolute time sets) such that for all positive integer i and j with i ≤ j, the following two condition are satisfied: (1) µ(i) , ∅∧µ( j) , ∅ implies that each number in µ(i) is less than all the numbers in µ( j), and (2) µ(i) , ∅ implies µ( j) , ∅. Each set µ(i), if non-empty, is called a granule of the µ. Property (1) says that granules do not overlap and the order on time ticks follow the order on the corresponding granules. Property (2) says that the subset of the time ticks corresponding to the granules forms a set of contiguous integers. The set µ(i) of reals is said to be the ith tick of µ, or tick i of µ . For example, hour, day, week, month, and year, satisfy the above definition. We can also define more complex granularities like business-week, weekend and so on. When dealing with temporal types, it is needed to determine the tick (if any) of a temporal type µ that covers a given tick z of anther temporal type v. Formally, for each positive integer z and temporal types µ and v, if ∃ z’ (necessarily unique) such that v(z) ⊆µ(z’) then ⌈z⌉µv = z′, otherwise ⌈z⌉ µ v is undefined[15]. In this paper, all timestamps in a sequence are assumed to be in terms of a fixed granularity g0, and abbreviate ⌈z⌉µv as ⌈z⌉µ if v= g0. Definition 2. A multi-granularity schema is a tuple of the form Gm = (gm,gm−1, · · · ,g2,g1), where each gi (1 ≤ i ≤ m) is a granularity. For simplicity, we require that in Gm and g0, each unit of gi is contained in a unit gi+1 (0 ≤ i ≤ m−1). 3.2 Temporal sequence and multi-granularity sequence pattern Let I = {i1, i2, · · · , in} be a set of items. An itemset si is a non-empty subset of items (without loss of generality, we assume that items of an itemset are sorted alphabetically) denoted as i1, · · · , ik, where i j (1 ≤ j ≤ k) is an item. A traditional data sequence is an ordered list of itemsets, which is sorted by the order of priority of the transaction time and denoted as (s1 → s2 →···→ sn), where si (1 ≤ i ≤ n) is an itemset [1,3]. In practice, a data sequence of a customer is also formed as an ordered list of itemsets and time stamps [11], we call it a temporal sequence. Definition 3. A temporal sequence s is represented as ⟨(s1, t1) → (s2, t2) → ··· → (sn, tn)⟩, where si is an itemset and ti stands for the time at which si occurs, ti < t j for 1 ≤ i < j ≤ n. When adding the itemset time information in the sequence, the time interval value between any two elements in the sequence can be computed as follows: Ti j = T j − Ti,where 1 ≤ i < j ≤ n We now formally define multi-granularity schema, multi-granularity sequence patterns and its’ sup- port, then formalize our novel mining problem as the discovery of all frequent multi-granularity sequence patterns in the database D. Definition 4. Given a multi-granularity schema Gm = (gm,gm−1, · · · ,g2,g1), a multi-granularity se- quence pattern is represented as follows: α=α1 [L1,M1,U1]µ1→ α2 [L2,M2,U2]µ2→ ··· [Ln−1,Mn−1,Un−1]µn−1→ αn 498 N. Li, X. Yao, D. Tian where (1) αi (1 ≤ i ≤ n) is an itemset; (2) µi ∈ {g j| j = 1, · · · ,m} (1 ≤ i ≤ n − 1) a granularity; (3) Li, Mi, and Ui (1 ≤ i ≤ n − 1) are respectively the lower bound, average time and upper bound of the time that αi+1 occurs after α1, α2, · · · and αi with the granularity µi. [Li, Mi,Ui]µi is called temporal annotation of αi+1, and the tuple ([Li, Mi,Ui]µi,αi+1) also called a temporal item (1 ≤ i ≤ n − 1). Example α = α1 [1,25,5]day → α2 [2,33,4]week → α3 is a multi-granularity sequential pattern, where α1,α2,α3 are itemsets or events, day and week are time granularities. The pattern indicates that α2 occurs in 1 day to 5 days or average 2.5 days after α1, then 2 weeks to 4 weeks or average 2.5 weeks later, α3 occurs. The total number of items in a multi-granularity sequence pattern is referred as the length of the pattern. A multi-granularity sequence pattern whose length is k is also represented as k-multi-granularity sequence pattern. Definition 5. Given a temporal sequence s = (s1, t1) → (s2, t2) →···→ (sn, tn) and a multi-granularity schema Gm = (gm,gm−1, · · · ,g2,g1). Let α be a multi-granularity sequence pattern α = α1 [L1,M1,U1]µ1→ α2 [L2,M2,U2]µ2→ ··· [Lk−1,Mk−1,Uk−1]µk−1→ αk, where µi ∈ g j| j = 1, · · · ,m (1 ≤ i ≤ k − 1). s is said to support( or contain ) α if and only if there exists integers 1 ≤ i1 < i2 < · · ·< ik ≤ n such that (1) ∀ j1 ≤ j ≤ k, αi ⊆ si j (2) ∀ j1 ≤ j ≤ k−1, if µ j = gh (1 ≤ h ≤ m−1) then gh(1) ≤ Ti ji j+1 22.89 < (80) [1,3,6]day → (80) > 1.71 < (80) [1,2,5]day → (80) > 1.97 < (80) [1,3,6]day → (80) [1,2,5]day → (80) > 0.54 < (80) [2,3,6]day → (569) > 0.62 < (80) [1,2,4]day → (569) > 0.84 < (80,432) > 1.19 < (80,432) [1,2,3]day → (944) > 0.60 Table 4: PART OF THE EXACTED PATTERN Extracted patterns Support(%) < (80) > 22.89 < (80) → (80) > 3.03 < (80) → (80) → (80) > 0.71 < (80) → (569) > 1.39 < (80,432) > 1.19 < (80,432) → (944) > 0.62 5.3 Time Granularity Selection In the experiment, we use multi-granularity schema: G3 = (Week, Day, Hour), and all the timestamps in the synthetic data sets are Hour, i.e. g0 = Hour. In order to compare the performance of algorithm I-PrefixSpan to that of MG-PrefixSpan, all the intervals between successive items are partitioned into five intervals: {l0, l1, l2, l3, l4}, where l0 : t = 0 : 0 < t < 1, l2 : 1 ≤ t < 24, l3 : 24 ≤ t < 168, and l4 : 168 ≤ t <∞. 5.4 Comparison of Extracted Sequential Pattern Quality The first test is a comparison of the quality of extracted patterns by algorithm PrefixSpan[6], I- PrefixSpan[11] and MG-PrefixSpan. The test is designed using the data set C10T4S5I2 and all minimum support thresholds is set to 0.005(0.5%). Table 3 shows part of the extracted frequent multi-granularity sequence patterns using MG-PrefixSpan algorithm, and Tables 4 and 5 show part of the extracted frequent sequential patterns using the algorithm PrefixSpan and I-PrefixSpan, respectively. As shown in Table 3, we can made the following observation: (1) once item 80 occurs, then item 80 will occur again within 1 day to 6 days, or with average time 3 days, with probability (1.71/22.89) × 100% = 7.4%; within 1 week to 5 weeks, or with average time 2 weeks, with probability (1.97/22.89) × 100% = 8.6%. (2) once item 80 occurs and item 80 occurs within 1day to 6 days, or with average time 3 days, then item 80 will occur again within 1 week to 5 weeks, or with average time 2 weeks, with probability (0.54/1.19)100% = 45.4%. 504 N. Li, X. Yao, D. Tian 0 500 1000 1500 2000 2500 3000 0.005 0.01 0.015 0.02 Su pport thre shol d R u n ti m e (i n s ec o n d s) MG-PrefixSpan PrefixSpan I-PrefixSpan Figure 1: Performance of the of the algorithms on data set C10T2S4I1. 0 500 1000 1500 2000 2500 3000 0.005 0.01 0.015 0.02 Su pport thre s hol d R u n ti m e (i n s ec o n d s) MG-PrefixSpan PrefixSpan I-P refixSpan Figure 2: Performance of the algorithms on data set C10T2S4I2. (3) once item 80 occurs, then item 569 will occur within 2 days to 6 days, or with average time 3 days, with probability (0.62/22.89) × 100% = 2.7%; within 1 week to 4 weeks, or with average time 2 weeks, with probability (0.84/22.89) × 100% = 3.7%. (4) once item 80 and 432 occur, then item 94 will occur within 1 days to 3 days, or with average time 2 days, with probability (0.60/1.19) × 100% = 50.4%; As shown in Table 4, although the patterns are similar to those in Table 3, the time intervals between successive itemsets are not tighter than those in table 3. Thus users cannot precisely predict when items 569 and 944 occur and 80 occurs again. On the other hand, as shown in Table 5, there is no item annotation information based on multi-granularities in extracted sequences. Thus, users are not able to predict how long time item 569, 944 will occur and item 80 will occur again. Furthermore, the pattern < (80) → (569) > also makes users unable to distinguish periods of item 80 and 569 occurrence. These results indicate that the multi-granularity sequence patterns mined by algorithm, MG-PrefixSpan, are more useful than those mined by algorithms, I-PrefixSpan, or PrefixSpan. 5.5 Comparison of the Execution Time The second test of the three algorithms would compare the run times for different minimum supports. The comparison is on the five data sets shown in the table 2, where the minimum support threshold is varied from 0.5% to 2%. Fig.1 to Fig. 5 summarizes the results. It is clearly show that how the performance of the three algorithms changes as varying of the parameters |C|, |T|, |S| and |I|, and the differences among the three algorithms. Mining Temporal Sequential Patterns Based on Multi-granularities 505 0500 1000 1500 2000 2500 3000 0.005 0.01 0.015 0.02 S u pport th re sh ol d R u n ti m e (i n s ec o n d s) MG-PrefixSpan PrefixSpan I-PrefixSpan Figure 3: Performance of the algorithms on data set C10T2S8I1. 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0.005 0.01 0.015 0.02 S upport th re sh ol d R u n ti m e (i n s ec o n d s) MG-P refixSpan PrefixSpan I-P refixSpan Figure 4: Performance of the algorithms on data set C10T4S4I1. The results in the Fig. 1 Fig. 3 indicate that the processing time of algorithm MG-PrefixSpan is at different support threshold is no significant difference. The efficiency of MG-PrefixSpan is little less than that of the I-PrefixSpan. This result matches our expectation, because the MG-PrefixSpan algorithm does some more complicated computes for the bounds and average interval time in different granularities than that of the I-PrefixSpan. On the other hand, the results in the Fig. 1, Fig. 2 and Fig. 3 show that the run time of three algorithms at different support threshold is also no significant difference. The speed of MG-PrefixSpan and I-PrefixSpan are little less than that of PrefixSpan. It is correct, because the algorithm PrefixSpan does not some complicated computes for interval time. When we see the Fig. 3, Fig. 4 and Fig. 5 we can find that the run time of the three algorithms is increase rapidly as the minimum support threshold values vary from small to large. It is correct, because larger number of frequent sequential patterns could be found as the minimum support threshold value became smaller. However, it is worth note that the algorithm MG-PrefixSpan and I-PrefixSpan take less time than the PrefixSpan algorithm and the time difference of processing become larger as the minimum support threshold declines although the MG-PrefixSpan and I-PrefixSpan are more complicated than the PrefixSpan. In order to find the reason, Let us note the data sets. Fig. 3 test is performed on the data set C10T2S8I1, where the parameter average length of maximal potentially large sequences S increased from 4(Fig.1) to 8. On average, a potentially frequent sequential pattern consists of 8 transactions, which mean that although the average number of transactions in a potentially frequent sequential pattern is set to 8, the number of transactions in a sequence is still set to 10. This situation can not cause the number and length of frequent sequential pattern increase considerably, because the data set C10T2S8I1 (1.746MB) is as sparse as C10T2S4I1 (1.733MB); Fig.4 on the data set C10T4S4I1, where the parameter average number 506 N. Li, X. Yao, D. Tian 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 0.005 0.01 0.015 0.02 Support threshold R u n ti m e (i n s ec on d s) MG-PrefixSpan PrefixSpan I-PrefixSpan Figure 5: Performance of the algorithms on data set C20T2S4I1. 0 1000 2000 3000 4000 5000 6000 10 15 20 25 30 Database si z e (i n K se qu e n ce s) R u n ti m e (i n s ec o n d s) Threshold=0.01 Threshold=0.015 Figure 6: Scalability test of the algorithm MG-PrefixSpan on data set T2S4I1. of items per transaction T increased from 2 (Fig.1) to 4, which mean, on average, a sequence consists of 10 transactions, each transaction composed of 4 items. It may result in the number of frequent sequential pattern increasing significantly; Fig.5 on the data set C20T2S4I1, where the parameter average number of transactions of per customer C increased from 10 (Fig.1) to 20, which mean, on average, a sequence consists of 20 transactions, each transaction composed of 2 items. It may result in the length of frequent sequential pattern increasing greatly. This both situations may cause the number and length of frequent sequential pattern increase considerably, because the data sets C10T4S4I1 and C20T2S4I1 are denser than the C10T2S8I1 and C10T2S4I1. When we use PrefixSpan to find all frequent sequential patterns, much more time would be spent in dealing with these more frequent sequential patterns although many of frequent sequential patterns may be useless. On the contrary, to the MG-PrefixSpan and I-PrefixSpan, although adding multiple time granularities or pseudo items to the traditional sequential pattern may make each of them to produce the several multiple granularities or time-interval patterns and need spend some time to does some complicated computation, many of them may not be frequent and some patterns found frequent by PrefixSpan may be infrequent by MG-PrefixSpan and I-PrefixSpan too. This reason reducing the numbers of frequent sequential patterns is why the MG-PrefixSpan and I-PrefixSpan are faster than the PrefixSpan. 5.6 Related global performances The final test is the scalabilities of the MG-PrefixSpan algorithm. Four tests are designed using the data set C10T2S4I1 and all minimum support thresholds are set to 0.01 and 0.015. In each test, test how the runtime of the MG-PrefixSpan algorithm scales as one parameter is increased. Fig. 6 shows the result Mining Temporal Sequential Patterns Based on Multi-granularities 507 010002000 3000 4000 5000 6000 5 10 15 20 Average number of transaction per customer R u n ti m e( in s ec on d s) Threshold=0.01 Threshold=0.015 Figure 7: Performance of the algorithms on data set T2S10I1. 01000 2000 3000 4000 5000 6000 2 3 4 5 Ave rage n u mbe r of tran sacti on pe r cu stome r R u n ti m e( in s ec o n d s) Threshold=0.01 Threshold=0.015 Figure 8: Performance of the algorithms on data set C10S10I1. of scalability of the algorithm as the data set size grows from 10000 to 30000; Fig.7 shows the result as the average number of transactions of per customer varies from 5 to 20; Fig. 8 shows the result as the average number of items per transaction varies from 2 to 5. The results indicate that the MG-PrefixSpan algorithm has good scalabilities for its runtime increases linearly as each parameter varies from small to large respectively. 6 Conclusions In this paper, we have proposed a novel approach for mining multi-granularities sequence patterns based on PrefixSpan [6], called MG-PrefixSpan. The multi-granularities sequence pattern not only re- veals what items occur frequently together and in what order, but also the time interval that the next items occur after the preceding items. We use boundary interval and average time with multi-granularities, which are derived from the source data, rather than user-predetermined the time interval or only a typical time to annotate time interval between successive items in the patterns. The performance analysis shows that MG-PrefixSpan scales up linearly as the size of database, and has a good scalability with respect to length of sequence and the size of transaction. The future work along this line of research includes several aspects as follows: (1) validation on large, real databases; (2) low-level optimizing mechanism of the algorithm. Acknowledgment This work was supported by the Department of Education of Shaanxi Province of China under Grant 05JK137 and the Natural Science Foundation of Shaanxi Province of China under Grant 2005F11. 508 N. Li, X. Yao, D. Tian Bibliography [1] Constantinescu, Z, Marinoiu, C, Vladoiu, M, “Driving Style Analysis Using Data Mining Tech- niques,” International Journal Of Computers Communications and Control, Vol.5, No.5, pp. 654- 663, Dec 2010. [2] Andonie, R, “Extreme Data Mining: Inference from Small Datasets,” International Journal Of Computers Communications and Control, Vol.5, No.3, pp. 280-291, Sep 2010. [3] R. Agrawal and R. Srikant, “Mining sequential patterns,” Proc. of the 7th International Conference on Data Engineering (ICDE’95), pp. 3-14, March, 1995. [4] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance im- provements,” Proc. of the 5th International Conference on Extending Database Technology, pp. 3-17, March, 1996. [5] M. Zaki, “An Efficient Algorithm for Mining Frequent Sequences,” Machine Learning, Vol. 40, pp. 31-60, 2000. [6] J. Pei, J. Han and H. Pinto et al, “Mining Sequential Pattern-Growth: The PrefixSpan Approach,” IEEE Transactions on Knowledge and Engineering, Vol.16, No.11, pp. 1424-1440, 2004. [7] H. Manila, H. Toivonen, and A.I. Verkamo, “Discovery of frequent episodes in event sequences,” Data Mining and Knowledge Discovery, Vol. 1, No. 3, pp. 256-289, 1997. [8] M.N. Garofalakis, R. Rastogi and K. Shim, “SPIRIT: Sequential Pattern Mining with Regular Expression Constraints,” Proc. of the 25th International Conference on Very Large Data Bases (VLDB’99), pp. 223-234, September, 1999. [9] J. Pei, J. Han and W. Wang, “Constraint-based Sequential Pattern Mining: The Pattern-growth Methods,” Journal of Intelligent Information Systems, Volume 28, Issue 2, pp. 133-160, April, 2007. [10] M. Yoshida et al. “Mining sequential patterns including time intervals”, Proc. of SPIE Conf.- DMKD, pp. 213-220, April, 2000. [11] Y.-L. Chen, M.-C. Chiang and M.-T. Ko, “Discovering time-interval sequential patterns in sequence databases,” Expert System with Applications, Volume 25, Issue3, Pp. 343-354, October, 2003. [12] R. Algawal and R. Srikant, “Fast algorithm for mining association rules in Large Databases.,” Proc. of the 20th International Conference on Very Large Data bases (VLDB’94), pp. 487-499, September 1994. [13] Y. Hirate, H. Yamana, “Generalized Pattern Mining with Item Intervals,” Journal of Computers, Vol.1, No3, pp. 51-60, June, 2006. [14] F. Giannotti, M. Nanni, and D. Pedreschi. “Efficient Mining of Temporally Annotated Sequences,” Proc. of the 6th SIAM International Conference on Data Mining, pp. 346-357, April, 2006. [15] C. Bettini, X.S. Wang and S. Jajodia et al, “Discovering Temporal Relationships with Multiple Granularities in Time Sequences,” IEEE Transactions on Knowledge and Data Engineering, Vol. 10 (2), pp. 222-237, 1998.