INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
ISSN 1841-9836, 10(3):428-440, June, 2015.

Group Pattern Mining Algorithm of Moving Objects' Uncertain Trajectories

S. Wang, L. Wu, F. Zhou, C. Zheng, H. Wang

Shuang Wang, Lina Wu, Fuchai Zhou, Cuicui Zheng
Software College, Northeastern University, Shenyang, China
wangsh@mail.neu.edu.cn

Haibo Wang
H. John Heinz III College, Carnegie Mellon University, Pittsburgh, USA
haibowang@cmu.edu

Abstract: Uncertainty is inherent in moving object trajectories due to measurement errors or time-discretized sampling. Unfortunately, most previous research on trajectory pattern mining did not consider the uncertainty of trajectory data. This paper focuses on uncertain group pattern mining, which is to find the moving objects that travel together. A novel concept, the uncertain group pattern, is proposed, and a two-step approach is introduced to discover it. In the first step, the uncertain objects' similarities are computed according to their expected distances at each timestamp, and the objects are then clustered according to their spatial proximity. In the second step, a new algorithm is designed to efficiently mine uncertain group patterns, capturing the moving objects that move within the same clusters for a certain number of (possibly nonconsecutive) timestamps. Because the search space of group patterns is huge, several pruning strategies are proposed to greatly reduce it and improve mining efficiency. Finally, the effectiveness of the proposed concepts and the efficiency of the approaches are validated by extensive experiments on both real and synthetic trajectory datasets.

Keywords: probabilistic frequent group pattern, uncertain data, trajectory pattern mining, moving object.
1 Introduction

In recent years, with the development of various location-aware devices, such as RFID tags, cell phones, GPS navigation systems, and point-of-sale terminals, trajectory data has become ubiquitous in various domains [1]. Such data provides the opportunity to discover usable knowledge about movement behavior, which fosters a range of novel applications in human mobility understanding [2], smart transportation, urban planning [3], biological studies [4], and environmental and sustainability studies [5]. Ideally, a trajectory is represented as a sequence of locations, each associated with a corresponding timestamp. However, this is a simplification that does not take into account the inherent uncertainty in such trajectories. For example, a position reported in a GPS signal usually implies a location point with an error range rather than an exact location. Moreover, as location privacy has increasingly become a concern, many locations are blurred when they are made public. Though the importance of mining uncertain data has been recognized [6-8], to the best of our knowledge there is little work in the trajectory pattern mining literature that studies its effect in the knowledge discovery process.

A useful data analysis task on movement data is to find a group of moving objects that are traveling together for a certain time period. This concept, which we refer to as group patterns, can be defined in both spatial and temporal dimensions: (1) a group of moving objects should be geometrically close to each other; (2) they should be together for at least some minimum time duration.

Copyright © 2006-2015 by CCC Publications
Although mining group patterns has been extensively studied on certain (exact) trajectory databases, covering flocks [9, 10], convoys [11, 12], swarms [1], traveling companions [13, 14], and gatherings [15], none of these works deals with the inherent uncertainty in trajectory databases. In this paper, we consider the problem of mining group patterns in the context of an uncertain trajectory database. In contrast to previous work, which did not model trajectories with uncertainty, we represent the trajectory of a moving object as a sequence of reported locations at corresponding timestamps together with a probability distribution function, which describes the possible locations of the trajectory at a given time instant and can be discrete or continuous.

Discovering group patterns from uncertain trajectories is not an easy task, because a new dimension, probability, is added. The challenges are two-fold: (1) The first challenge is how to design an appropriate distance function to measure the (dis)similarity between two uncertain trajectories. In particular, an effective similarity metric should be able to conduct measurements over different probability distributions, taking into account spatial distances, temporal intervals, and the relevant probabilities. (2) The second challenge is how to find group patterns efficiently. Because of the added probability in an uncertain trajectory database, judging frequent group patterns requires computing the frequentness probabilities, which is not trivial. Additionally, group pattern mining may produce an exponential number of results due to the downward closure property. It is therefore important to propose effective pruning rules that reduce the massive search space and the number of probability computations.
Given the aforementioned challenges, existing approaches to group pattern mining on exact trajectory data do not handle data uncertainty well. Once probability is added, their concepts and similarity distance metrics are no longer suitable for the uncertain environment, and the traditional mining algorithms cannot be applied directly to uncertain group pattern mining. Our research makes the following contributions: (1) We propose a new problem, mining group patterns over uncertain trajectory data. (2) We introduce a novel and adaptive metric to measure the dissimilarity between two uncertain trajectories. (3) We design an efficient algorithm to discover all uncertain group patterns, along with several pruning techniques that reduce the search space and avoid redundant computation. (4) Extensive experiments demonstrate the effectiveness and efficiency of our algorithm.

The remainder of the paper is organized as follows. Section 2 discusses related work. We introduce the distance similarity function for two uncertain trajectories in Section 3. In Section 4, we define uncertain group patterns and introduce our efficient algorithm based on a breadth-first search strategy. Experiments testing effectiveness and efficiency are presented in Section 5. Finally, our study is concluded in Section 6.

2 Related work

Group pattern mining over exact trajectory data. Group pattern mining, which discovers a group of objects that move together for a certain time period, is an important data analysis task for moving object trajectories. The research mainly includes flock, convoy, and swarm pattern mining. These concepts can be distinguished by how the group is defined and whether the time periods are required to be consecutive. Specifically, a flock [9, 10] is a group of objects that travel together within a disc of some user-specified size for at least k consecutive timestamps.
A major drawback is that a circular shape may not reflect a natural group in reality, which may result in the so-called lossy-flock problem [11]. To avoid rigid restrictions on the sizes and shapes of group patterns, the convoy was proposed to capture generic trajectory patterns of any shape and extent by employing density-based clustering. Instead of using a disc, a convoy requires a group of objects to be density-connected to each other during k consecutive time points. Both flock and convoy have a strict requirement of a consecutive time period, and this rigid definition sometimes makes it impractical to find potentially interesting patterns. In contrast to flock and convoy, Li et al. [1] proposed a more general type of trajectory pattern, called swarm, which is a cluster of objects lasting for at least k (possibly non-consecutive) timestamps. This is more realistic, as different people may temporarily leave the cluster at some snapshots; our uncertain group pattern definition is therefore based on the swarm, extended to the uncertain data model. Although there is much work on mining group patterns, all of these algorithms are designed for exact trajectory data and cannot be extended directly to uncertain trajectory data.

Pattern mining on uncertain data. Another line of research related to our work is pattern mining over uncertain data. Existing work on mining frequent itemsets from uncertain databases falls into two categories based on the definition of a frequent itemset: expected support-based frequent itemsets and probabilistic frequent itemsets. Both definitions treat the frequency (support) of an itemset as a discrete random variable. The former employs the expectation of the support as the measurement: an itemset is frequent only if its expected support is no less than a specified minimum expected support.
The latter uses the frequentness probability as the measurement, which is the probability that an itemset appears no less than a specified minimum number of times. An itemset is then frequent only if its frequentness probability is no less than a specified minimum probability threshold. The use of expected support may lead to the loss of important patterns, so the probabilistic frequentness measure has become more popular recently. A recent survey comparing these two measures and analyzing their relationships is given in [16]. For the problem of uncertain sequence pattern mining, some initial research has been undertaken. For example, expected support-based frequent sequential pattern mining has been studied in [17]. In contrast, Zhao et al. [18] proposed to mine probabilistic frequent sequential patterns according to the frequentness probability. However, all of these works consider only simple value-based data types and are not suitable for the complex data type of trajectories, which contain both spatial and temporal information. Our paper is the first work to solve the group pattern mining problem on uncertain trajectory data.

3 Similarity metric of two uncertain locations

In this section, we present our method of computing the spatial proximity of objects with uncertain locations. We formalize the model of uncertain trajectories in Section 3.1 and give the similarity distance function in Section 3.2. Frequently used symbols are summarized in Table 1.

3.1 Uncertain trajectory model

Let TDB = {t1, t2, ..., tn} be a linearly ordered list of n timestamps, and let ODB = {o1, o2, ..., om} be a collection of m moving objects that appear in TDB. The locations of the objects observed at the timestamps in TDB are uncertain. We first give the definition of a certain trajectory.

Definition 1. (Certain Trajectory).
A certain trajectory Tr is represented as a sequence of points {(x1, y1, t1), (x2, y2, t2), ..., (xn, yn, tn)} (t1 < t2 < ... < tn), where n is the number of points in the trajectory and (xi, yi) are the coordinates of the ith point at timestamp ti.

Some studies of uncertain trajectories [6, 7] use a probability density function (pdf) to represent the uncertainty. Unfortunately, the exact probability distribution is not easily computed. In this paper, we therefore adopt the expectation and variance to model an uncertain trajectory. We can use the evolving density estimator of [8] to compute the mean value u and variance v of an object's position at each timestamp. Based on the definition of certain trajectories, we define an uncertain trajectory as follows:

Definition 2. (Uncertain Trajectory). An uncertain trajectory UTr is a sequence of random variables, where the random variables at different timestamps are assumed to be independent. A UTr is represented as {(x1, y1, u1, v1, t1), (x2, y2, u2, v2, t2), ..., (xn, yn, un, vn, tn)} (t1 < t2 < ... < tn), where ui and vi are the expectation and variance of UTr at timestamp ti.
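As an illustration, Definition 2 can be encoded as follows. This is a minimal sketch in Python, assuming a per-axis mean and variance (one plausible coordinate-wise reading of the scalar u and v in the text); the distance helper anticipates equation (2) of Section 3.2, and all names are ours, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class UPoint:
    """One sample of an uncertain trajectory (Definition 2): the reported
    coordinates plus the mean and variance of the true position at time t.
    We store a per-axis mean/variance, one plausible reading of u, v."""
    x: float
    y: float
    ux: float   # expectation of the x coordinate
    uy: float   # expectation of the y coordinate
    vx: float   # variance of the x coordinate
    vy: float   # variance of the y coordinate
    t: int

def expected_sq_distance(a: UPoint, b: UPoint) -> float:
    """Expected squared Euclidean distance of two independent uncertain
    locations: E[(A-B)^2] = (E[A]-E[B])^2 + Var(A) + Var(B), per axis."""
    return ((a.ux - b.ux) ** 2 + a.vx + b.vx +
            (a.uy - b.uy) ** 2 + a.vy + b.vy)

p = UPoint(0.0, 0.0, 0.0, 0.0, 0.1, 0.1, t=1)
q = UPoint(3.0, 4.0, 3.0, 4.0, 0.2, 0.2, t=1)
assert abs(expected_sq_distance(p, q) - 25.6) < 1e-9  # 9 + 16 + 0.6
```

Only the means and variances enter the distance; no probability density function is needed, which is exactly the point of the expectation/variance model.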
Table 1: Summary of notations

  Symbol                       Illustration
  ODB                          Moving object set, ODB = {o1, o2, o3, ..., om}
  oj                           The jth object
  O                            Object subset, O ⊆ ODB
  TDB                          Timestamp set, TDB = {t1, t2, ..., tn}
  ti                           Timestamp ti
  T                            Time subset, T ⊆ TDB
  UTri                         The ith object's uncertain trajectory
  UTri(j)                      The jth sample point of the ith uncertain trajectory
  disti(UTra, UTrb)            The distance between two objects at time ti
  CDB                          The database of clusters produced by the FCM algorithm
  Cti                          The clusters at time ti
  ctij                         The jth cluster of Cti
  Cti(oj)                      The set of clusters that object oj is in at timestamp ti
  p(oj ∈ ctij)                 The probability of object oj belonging to cluster ctij
  p(O ∈ ctij)                  The probability of objectset O belonging to cluster ctij
  p(O ∈ Cti)                   The probability of objectset O belonging to the clusters Cti
  mino                         The minimum objects threshold
  mint                         The minimum timestamps threshold
  minprob                      The minimum probability threshold
  Pr(support(O, T) ≥ mint)     The probability that the objects of O are in the same cluster for at least mint timestamps

3.2 Expected distance function

Before defining the uncertain group pattern, we introduce a novel and adaptive metric, the EE-distance (expected Euclidean distance), for measuring the similarity between two uncertain trajectories. Trajectory similarity is commonly estimated using trajectory distance measures such as the Euclidean distance, the dynamic time warping (DTW) distance, the principal component analysis (PCA) distance, the edit distance with real penalty (ERP), and the longest common subsequence (LCSS) distance. However, as discussed in [19], no trajectory similarity measure beats all the others in every circumstance. In this paper we adopt the Euclidean distance due to its simplicity of implementation and low computational complexity. Next, we show how to compute the uncertain instant distance between two trajectories based on the expected Euclidean distance.

Definition 3.
(Uncertain Instant Distance). We treat the distance between two uncertain trajectories at timestamp ti as the squared difference of their sample points, as shown in equation (1), where UTra(i) is the sample point of uncertain trajectory UTra at timestamp ti, and likewise for UTrb(i):

disti(UTra, UTrb) = (UTra(i) − UTrb(i))²    (1)

Since UTra(i) and UTrb(i) are independent random variables, disti(UTra, UTrb) is also a random variable. Its expectation can be computed by equation (2):

E((UTra(i) − UTrb(i))²) = E(UTra(i))² + var(UTra(i)) + E(UTrb(i))² + var(UTrb(i)) − 2E(UTra(i)) · E(UTrb(i))    (2)

According to equation (2), the expected distance of two uncertain locations can be computed in O(1) time. Unlike other works that require the exact probability distribution function, our distance formulation is statistically sound and only requires knowledge of the general characteristics of the data distribution, namely its mean and variance.

4 Mining Probabilistic Frequent Group Patterns

In this section, we first define probabilistic frequent group patterns, and then show how to find all such patterns in an uncertain trajectory database.

4.1 Probabilistic frequent group patterns definition

Definition 4. (Frequent group pattern) [1]. A pair (O, T) is said to be a group pattern if all objects in O are in the same cluster at every timestamp in T. Specifically, given two minimum thresholds mino and mint, for (O, T) to be a frequent group pattern, where O = {oi1, oi2, ..., oil} ⊆ ODB and T ⊆ TDB, it must satisfy two requirements: (1) there should be at least mino objects; (2) the objects in O are in the same cluster for at least mint timestamps.
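In the certain setting, Definition 4 can be checked directly from the per-timestamp clusters. A minimal sketch follows; the snapshot data below is hypothetical, in the spirit of the example figures, and the function name is ours.

```python
def is_frequent_group_pattern(O, snapshots, mino, mint):
    """Definition 4: (O, T) is a frequent group pattern iff |O| >= mino and
    the objects of O share one cluster at >= mint timestamps."""
    if len(O) < mino:
        return False
    together = sum(1 for clusters in snapshots
                   if any(O <= c for c in clusters))  # O inside one cluster
    return together >= mint

# Hypothetical cluster snapshots for three timestamps t1, t2, t3:
snapshots = [
    [{"o1", "o2"}, {"o3"}],      # t1
    [{"o1", "o2", "o3"}],        # t2
    [{"o1", "o2"}, {"o3"}],      # t3
]
assert is_frequent_group_pattern({"o1", "o2"}, snapshots, mino=2, mint=2)
assert not is_frequent_group_pattern({"o1", "o3"}, snapshots, mino=2, mint=2)
```

Note that the timestamps at which O is together need not be consecutive, matching the swarm-style definition adopted here.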
Figure 1: Object clusters at each timestamp in a certain database

Figure 2: Object clusters at each timestamp in an uncertain database

Fig. 1 shows an example with 3 objects and 3 timestamps, ODB = {o1, o2, o3}, TDB = {t1, t2, t3}. Each sub-figure is a snapshot of the object clusters at one timestamp. It is easy to see that o1 and o2 travel together for most of the time. Given mino = 2 and mint = 2, there is only one frequent group pattern: ({o1, o2}, {t1, t2, t3}).

As shown in Fig. 2, in the uncertain scenario an object's location is not an exact position but an uncertain range. Thus, it is more realistic to assign an object to several clusters than to a single cluster. Based on this observation, we apply a fuzzy clustering algorithm (e.g., FCM [20, 21]) to create the clusters at each timestamp ti. Fuzzy clustering assigns each object a "degree of belongingness" (belongingness probability) for each cluster. A simple example of clusters is given in Table 2.

Table 2: An example of uncertain clusters

  Time  Uncertain clusters
  1     c11 = {o1: 0.3, o2: 0.5, o3: 1.0}, c12 = {o1: 0.5, o2: 0.5}, c13 = {o1: 0.2}
  2     c21 = {o1: 0.3, o2: 0.8}, c22 = {o1: 0.7, o3: 0.2}, c23 = {o2: 0.2, o3: 0.8}
  3     c31 = {o1: 1.0, o2: 1.0, o3: 1.0}

An object oj ∈ ODB with an uncertain location at timestamp t ∈ TDB has a belongingness probability for a cluster c ∈ CDB, denoted p(oj ∈ c), where p(oj ∈ c) ∈ [0, 1]. At a timestamp ti, an object may belong to more than one cluster; we use Cti(oj) to denote the set of clusters that object oj is in. The total belongingness probability of object oj over its clusters is one: p(Cti(oj)) = Σ_{c ∈ Cti} p(oj ∈ c) = 1.
In addition, for a given objectset O, we write Cti(O) = ∩_{oj ∈ O} Cti(oj) for the clusters in which all of O can occur together at timestamp ti. We assume that different objects and different timestamps are mutually independent, i.e., the belongingness probability of an object has no effect on those of other objects. The probability of the objectset O belonging to a cluster ctij is given by equation (3), and the probability of O being in the same cluster at timestamp ti by equation (4). To make our framework more general, we take clustering as a preprocessing step. The details are given in Example 1.

p(O ∈ ctij) = Π_{oj ∈ O} p(oj ∈ ctij)    (3)

p(Cti(O)) = Σ_{ctij ∈ Cti} p(O ∈ ctij)    (4)

Example 1: For TDB = {t1, t2, t3} and ODB = {o1, o2, o3} in Table 2, at timestamp t1 we have c11 = {o1: 0.3, o2: 0.5, o3: 1.0}, c12 = {o1: 0.5, o2: 0.5}, c13 = {o1: 0.2}. For O = {o1, o2}: p(o1 ∈ c11) = 0.3, p(o2 ∈ c11) = 0.5, p(O ∈ c11) = 0.3 · 0.5 = 0.15, and p(Ct1(O)) = p(O ∈ c11) + p(O ∈ c12) = 0.15 + 0.25 = 0.4. In the same way, at timestamps t2 and t3, p(Ct2(O)) = p(O ∈ c21) = 0.24 and p(Ct3(O)) = 1.0.

In the uncertain scenario, the number of timestamps in T at which the objects O are in the same cluster, denoted support(O, T), is no longer certain. Instead, the co-occurrence is described by a discrete probability distribution. As shown in Example 1, the probabilities of the objects O = {o1, o2} occurring in the same cluster at timestamps t1, t2, and t3 are 0.4, 0.24, and 1.0, respectively. The resulting frequency distribution is given in Table 3. For example, the probability that O occurs in the same cluster for at least two timestamps is 0.544. The definition of the probabilistic frequent group pattern should take this uncertainty into account.

Table 3: Frequency distribution of O = {o1, o2} occurring at timestamps T = {t1, t2, t3}

  Frequency of timestamps  ≥ 0    ≥ 1    ≥ 2     ≥ 3
  Probability              1.0    1.0    0.544   0.096

Definition 5. (Probabilistic frequent group pattern).
A pair (O, T) is a probabilistic frequent group pattern iff (O, T) is a frequent group pattern and Pr(support(O, T) ≥ mint) ≥ minprob, where minprob is the probability threshold.

For a given group pattern (O, T), the frequentness probability Pr(support(O, T) ≥ mint) is interpreted as the probability that the objects of O are in the same cluster for at least mint timestamps. Under this definition, it is critical to compute the frequentness probability of a group pattern efficiently. Writing P(O,T)≥i,j for the probability that the objects O appear together in at least i of the first j timestamps of the given time set T, the frequentness probability can be computed by dynamic programming [22]:

P(O,T)≥i,j = P(O,T)≥i−1,j−1 · p(O ∈ Ctj) + P(O,T)≥i,j−1 · (1 − p(O ∈ Ctj)),  if O can occur in a common cluster at tj;
P(O,T)≥i,j = P(O,T)≥i,j−1,  otherwise.    (5)

The dynamic programming approach splits the problem of computing P(O,T)≥i,j over the first j timestamps into sub-problems of computing the frequentness probabilities over the first j − 1 timestamps.

Our goal is to find all probabilistic frequent group patterns. Note that even though our problem has a form similar to uncertain frequent pattern (FP) mining [22], none of the previous work in that area can solve our problem exactly: FP mining takes transactions as input, whereas group pattern discovery takes the clusters at each timestamp as input. If we treat each timestamp as one transaction, each "transaction" is a collection of itemsets rather than a single itemset. Therefore, there is no trivial transformation of the FP mining problem to the group pattern mining problem, and the difference demands new techniques to specifically solve our problem.
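Equations (3)-(5) can be combined into a short computation that reproduces the running example. This is an illustrative sketch: the cluster table follows Table 2, and the function names are ours.

```python
from math import prod

# Fuzzy clusters per timestamp: {cluster id: {object: belongingness prob.}}
CDB = {
    "t1": {"c11": {"o1": 0.3, "o2": 0.5, "o3": 1.0},
           "c12": {"o1": 0.5, "o2": 0.5},
           "c13": {"o1": 0.2}},
    "t2": {"c21": {"o1": 0.3, "o2": 0.8},
           "c22": {"o1": 0.7, "o3": 0.2},
           "c23": {"o2": 0.2, "o3": 0.8}},
    "t3": {"c31": {"o1": 1.0, "o2": 1.0, "o3": 1.0}},
}

def p_cooccur(O, clusters):
    """Equations (3)-(4): multiply memberships within a cluster
    (independent objects), then sum over the timestamp's clusters."""
    return sum(prod(c.get(o, 0.0) for o in O) for c in clusters.values())

def frequentness_probability(O, T, mint):
    """Equation (5): Pr(support(O, T) >= mint) by dynamic programming;
    dp[i] = Pr(O co-occurs at >= i of the timestamps processed so far)."""
    dp = [1.0] + [0.0] * mint
    for t in T:
        p = p_cooccur(O, CDB[t])
        for i in range(mint, 0, -1):   # update higher counts first, in place
            dp[i] = dp[i - 1] * p + dp[i] * (1.0 - p)
    return dp[mint]

O, T = {"o1", "o2"}, ["t1", "t2", "t3"]
assert abs(p_cooccur(O, CDB["t1"]) - 0.40) < 1e-9   # Example 1: 0.15 + 0.25
assert abs(frequentness_probability(O, T, 3) - 0.096) < 1e-9
assert abs(frequentness_probability(O, T, 2) - 0.544) < 1e-9
```

The in-place dp update processes each timestamp once, so the cost is O(mint · |T|), matching the quadratic bound discussed in Section 4.3.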
4.2 Uncertain trajectory data preprocessing

When mining probabilistic frequent group patterns, we assume that each moving object has a reported location at each timestamp. However, in most real cases, the raw data collected is not as ideal as we would expect: the sampling timestamps for different moving objects are usually not synchronized. Although many sophisticated interpolation methods could be used to fill in the missing data with higher precision, any interpolation is only a guess at the real positions. In this paper, we use linear interpolation to obtain the possible position and statistical values at an arbitrary time between two consecutive sample times. We define o(t) = {xt, yt, ut, vt} as the uncertain object at a time t between two consecutive sample timestamps ti and ti+1; the value of o(t) is computed by equation (6):

xt = xi + (xi+1 − xi) · (t − ti)/(ti+1 − ti),    yt = yi + (yi+1 − yi) · (t − ti)/(ti+1 − ti)
ut = ui + (ui+1 − ui) · (t − ti)/(ti+1 − ti),    vt = vi + (vi+1 − vi) · (t − ti)/(ti+1 − ti)    (6)

To make our approach extensible, we treat the computation of spatial proximity and the construction of clusters of objects with uncertain locations as a preprocessing step. In this way, the spatial proximity measure and the clustering method can be flexible and application-dependent. We only require, for each timestamp, a set of clusters in which each object is associated with a belongingness probability specifying the confidence that the object is in that cluster at that timestamp. The preprocessing algorithm is as follows.

Algorithm Preprocessing. Preprocessing algorithm
input: uncertain trajectory database UTD
output: uncertain clusters database CDB
1. For each object oi
2.   For each timestamp tj
3.     if (no sample value)
4.       interpolate the value;
5.
     compute the expected distance distj(UTri, UTrk) for each object ok in ODB;
6. use the FCM clustering algorithm based on the expected similarity;
7. output the uncertain clusters database CDB;

For each object oi, we compute the expected distance to every object ok at each time tj (lines 1-5); we then use the FCM algorithm to cluster the uncertain objects at each timestamp (line 6). Finally, we obtain the uncertain clusters database CDB (line 7).

4.3 Pruning Techniques

Although we use dynamic programming to compute the frequentness probability of a group pattern, the cost of computation is still high: the time complexity is O(n²), where n is the number of timestamps in T. Fortunately, we identify two efficient pruning rules based on properties of the frequentness probability.

Pruning rule 1 (count prune): Given a group pattern (O, T), if cnt(O, T) < mint, then (O, T) is not a probabilistic frequent group pattern, where cnt(O, T) is the number of timestamps in T at which O can be in the same cluster.

Pruning rule 2 (expected prune): Given a group pattern (O, T), its expected support e_sup(O, T) is defined as the sum, over the timestamps in T, of the probabilities that the objects O occur in the same cluster: e_sup(O, T) = Σ_{tj ∈ T} p(O ∈ Ctj). If λ⁻ < minprob, then (O, T) is not a probabilistic frequent group pattern, where λ⁻ is the following upper bound on the frequentness probability:

λ⁻ = exp(−(mint − 1 − e_sup(O, T))² / (4 · e_sup(O, T))),  if 0 < (mint − 1 − e_sup(O, T))/e_sup(O, T) < 2e − 1;
λ⁻ = 2^(1 + e_sup(O, T) − mint),  if (mint − 1 − e_sup(O, T))/e_sup(O, T) ≥ 2e − 1;
λ⁻ = e_sup(O, T)/mint,  otherwise.    (7)

The pruning rules can be used to discard infrequent group patterns before computing the exact frequentness probability. The running time of computing cnt(O, T) and e_sup(O, T) is O(n), versus O(n²) for the exact probability Pr(support(O, T) ≥ mint), so the pruning rules can significantly improve the running time of the mining algorithm.
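The two rules can be sketched as follows. This assumes our reading of equation (7) as Chernoff-style tail bounds with a Markov-inequality fallback; the helper names are ours.

```python
from math import e, exp

def lambda_upper_bound(e_sup, mint):
    """Pruning rule 2: an O(n)-computable upper bound on
    Pr(support >= mint) from the expected support e_sup (equation (7))."""
    if e_sup >= mint - 1:                 # small deviation: Markov bound
        return min(1.0, e_sup / mint)
    delta = (mint - 1 - e_sup) / e_sup    # relative deviation above the mean
    if delta < 2 * e - 1:
        return exp(-((mint - 1 - e_sup) ** 2) / (4 * e_sup))
    return 2.0 ** (1 + e_sup - mint)

def may_be_frequent(probs, mint, minprob):
    """Rule 1 (count prune), then rule 2 (expected prune); only survivors
    need the exact O(n^2) dynamic programming of equation (5)."""
    cnt = sum(1 for p in probs if p > 0)  # timestamps where O can co-occur
    if cnt < mint:
        return False                      # rule 1
    return lambda_upper_bound(sum(probs), mint) >= minprob  # rule 2

probs = [0.4, 0.24, 1.0]                  # running example
assert not may_be_frequent(probs, mint=4, minprob=0.3)  # pruned by rule 1
assert may_be_frequent(probs, mint=2, minprob=0.3)      # survives: run the DP
```

Both checks are a single pass over the per-timestamp probabilities, which is why they pay off before the quadratic exact computation.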
4.4 Mining algorithm for probabilistic frequent group patterns

Traditional frequent itemset mining is based on support pruning, exploiting the anti-monotonic property of support. In uncertain databases, recall that support is described by a probability distribution and that we mine group patterns according to their frequentness probability. It turns out that the frequentness probability is also anti-monotonic.

Theorem 6. For all O ⊆ O′, Pr(support(O, T) ≥ mint) ≥ Pr(support(O′, T) ≥ mint); that is, all subsets of a probabilistic frequent group pattern are also probabilistic frequent group patterns.

We can use Theorem 6 to prune the search space of probabilistic frequent group patterns: if a group pattern (O, T) is not a probabilistic frequent group pattern, i.e., Pr(support(O, T) ≥ mint) < minprob, then no objectset O′ ⊃ O can form a probabilistic frequent group pattern.

We propose a probabilistic frequent group pattern mining approach based on the Apriori algorithm. Like Apriori, our method iteratively generates the probabilistic frequent group patterns using a bottom-up strategy. Each iteration is performed in two steps: a join step that generates new candidates, and a pruning step that calculates the frequentness probabilities and extracts the probabilistic frequent group patterns from the candidates. The surviving candidates are, in turn, used to generate candidates in the next iteration. Theorem 6 is exploited in the join step to limit the candidates generated and in the pruning step to remove group patterns that need not be expanded. In the join step, our algorithm adopts a breadth-first implementation, because a depth-first strategy does not fully use the downward closure of the probabilistic support: a depth-first implementation does not know all frequent k-objectsets before considering the (k + 1)-objectsets, which may lead to a bigger search space.
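For illustration, the Apriori-style join step with subset-based pruning (the anti-monotonicity of Theorem 6) can be sketched as follows; objectsets are kept as sorted tuples, and the names below are ours.

```python
from itertools import combinations

def join_step(frequent_k):
    """Generate candidate (k+1)-objectsets from frequent k-objectsets:
    join two sets sharing a (k-1)-prefix, then drop any candidate with an
    infrequent k-subset (anti-monotonicity, Theorem 6)."""
    freq = set(frequent_k)
    candidates = []
    for a, b in combinations(sorted(frequent_k), 2):
        if a[:-1] == b[:-1]:                       # common prefix: join
            cand = a + (b[-1],)
            if all(sub in freq                     # every k-subset frequent?
                   for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

level2 = [("o1", "o2"), ("o1", "o3"), ("o2", "o3")]
assert join_step(level2) == [("o1", "o2", "o3")]

# If ("o2", "o3") were infrequent, the join would produce no candidate:
assert join_step([("o1", "o2"), ("o1", "o3")]) == []
```

Each surviving candidate still goes through pruning rules 1-2 and, if needed, the exact frequentness probability computation of equation (5).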
The detailed steps of our algorithm for computing the probabilistic frequent group patterns are listed in Algorithm PFGPM. We first select the frequent 1-objectsets (line 1), and then recursively generate candidate (k + 1)-objectsets from the k-objectsets (lines 2-9). At each iteration, only the frequent k-objectsets are extended (Apriori property, lines 3-4). We first scan the database to calculate the expected support of each candidate and apply the pruning rules (line 5); if a candidate is not pruned, we compute its frequentness probability (lines 6-8). Finally, we output the patterns that satisfy |O| ≥ mino and the probabilistic support requirement (line 9).

Algorithm PFGPM. Probabilistic frequent group pattern mining algorithm
input: uncertain clusters database CDB
output: probabilistic frequent group patterns
1. Apply the pruning rules, then calculate the frequentness probability of each 1-objectset; collect all frequent 1-objectsets into Cand, sorted in alphabetic order;
2. K = 2;
3. For each O in Cand
4.   extend O, using a breadth-first-search-like strategy, to its supersets with O as prefix, denoted O′;
5.   apply pruning rules 1 and 2 to O′;
6.   if (O′ is not pruned)
7.     compute the frequentness probability (fgp) as shown in equation (5);
8.     if (fgp ≥ minprob)
9.       output (O′, fgp) if |O′| ≥ mino;
10. K = K + 1;
11. go to 3;

5 Experiments

5.1 Datasets

Real data. A truck dataset (http://www.chorochronos.org/) records 50 trucks delivering concrete to construction sites around Athens over 33 days, consisting of 276 trajectories. To increase the number of moving objects, we treat each distinct trajectory as the ID of an object, yielding 276 trucks with 2449 timestamps. In our experiments, we consider only the first 128 positions of each trajectory and normalize the positions into a unit space.
The probabilities of each trajectory were assigned according to two different distributions: (1) each certain position p was assigned a probability drawn uniformly from the range (0.5, 1.0]; (2) each position was assigned a probability drawn from a normal distribution N(0.5, 0.2), restricted to the range [0, 1.0].

Synthetic data. We first generate two certain trajectory datasets with the GSTD data generator [23]. Each dataset contains 10,000 trajectories of the same length, 128. We then convert these certain trajectories to uncertain trajectories in the way described above. The fuzzy clusters at each timestamp are obtained by the fuzzy c-means clustering algorithm [20] with m = 2 and EPS = 0.01, where m is the weighting exponent and EPS is the termination criterion. Each object is assigned a belongingness probability by the fuzzy clustering algorithm. The default parameter values are mino = 10, mint = 0.5 (half of the overlapping time span), and minprob = 0.3.

5.2 Performance evaluation

No previous technique addresses probabilistic frequent group pattern mining for uncertain trajectories. We therefore compare our PFGPM approach with an alternative, PFGPM-NP, that does not use any pruning rules. All experiments are run on a desktop PC with a 2.66 GHz CPU and 4 GB RAM.

Effect of pruning rules: We use two datasets with varying parameter settings to test the performance of the pruning rules: (1) the first 274 trajectories of the synthetic dataset; (2) the truck dataset. Each position p in both datasets was assigned a probability using the uniform distribution approach. As shown in Fig. 3, the pruning works well for the skewed dataset (the truck dataset). The reason is straightforward: the more skewed the data, the more objectsets are infrequent and can thus be pruned. Regarding the effect of the parameters, the larger mint and minprob are, the more objectsets are pruned.
However, mint has a more significant influence than minprob; that is, pruning rule 1 has stronger pruning power than pruning rule 2.

Figure 3: Pruning power ((a) varying mint; (b) varying minprob)

Effect of |ODB| and |TDB|. We ran our algorithm on the synthetic dataset. Fig. 4 and Fig. 5 depict the running time when varying |ODB| and |TDB|, respectively. In both figures, PFGPM-NP is much slower than PFGPM, usually by a factor of about 5. Comparing Fig. 4 and Fig. 5, we can see that PFGPM is more sensitive to changes in |ODB|. This is because its search space is enlarged by a larger |ODB|, whereas a larger |TDB| only increases the computing time of the frequentness probability and does not directly enlarge the search space.

Figure 4: Running time with varying |ODB|

Figure 5: Running time with varying |TDB|

Effect of mino. Fig. 6 shows the running time w.r.t. mino. As mino increases, the running time decreases because fewer group patterns meet the requirement. Besides, it is obvious that PFGPM-NP takes much longer than PFGPM. The reason is that PFGPM-NP does not use any pruning rules, which leads to more redundant computations of the frequentness probability and a larger search space.

Figure 6: Running time with varying mino ((a) synthetic dataset; (b) truck dataset)

Effect of mint. Fig. 7 shows the influence of mint on the runtime for the different datasets.
When mint decreases, the running time of both algorithms goes up because the number of probabilistic frequent group patterns increases. However, the running time of PFGPM grows more slowly, owing to the pruning methods it employs.

Figure 7: Running time with varying mint. (a) Synthetic dataset; (b) Truck dataset (x-axis: mint × 10).

Effect of minprob. Finally, we test the running time of the two algorithms while varying the probabilistic frequent threshold minprob on the two datasets. In Fig. 8, we can observe that PFGPM is always faster than PFGPM-NP. As minprob changes, the running time of both algorithms remains approximately the same. Thus, the probabilistic frequent threshold has a smaller influence on the total running time than mint.

Figure 8: Running time with varying minprob. (a) Synthetic dataset; (b) Truck dataset.

6 Conclusion

In this paper, we have formulated and studied the problem of mining probabilistic frequent group patterns in uncertain trajectory databases. We introduced a novel notion, the expected distance, to measure the dissimilarity between uncertain locations. To mine such group patterns efficiently, we proposed several pruning techniques that reduce the search space and avoid much complicated computation. We further designed an Apriori-based algorithm with a breadth-first implementation for efficiently enumerating all probabilistic frequent group patterns from uncertain data. Extensive experimental results show the effectiveness and efficiency of the mining algorithm.
In our future work, we aim to extend our current approach to handle more complex patterns in trajectory data.

Acknowledgement

This research is partially supported by the Fundamental Research Funds for the Central Universities (Grant No. 130317003), the National Natural Science Foundation of China (Grant No. 61440014, Grant No. 61300196), the Natural Science and Technology Major Project (Grant No. 2103ZX03002006), and the Liaoning Province Science and Technology Project (Grant No. 2013217004).

Bibliography

[1] Z. Li, B. Ding, et al, Swarm: Mining relaxed temporal moving object clusters, the VLDB Endowment, 3(1-2):723-734.

[2] D. Wegener, D. Hecker, et al, Parallelization of R-programs with GridR in a GPS-trajectory mining application, 1st Ubiquitous Knowledge Discovery Workshop on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, (2008).

[3] X. Li, J. Han, et al, Traffic density-based discovery of hot routes in road networks, the 10th International Symposium on Spatial and Temporal Databases, 441-459, (2007).

[4] Z. Li, J.G. Lee, et al, Incremental Clustering for Trajectories, the 15th Database Systems for Advanced Applications, 32-46, (2010).

[5] X. Li, J. Han, et al, Motion-alert: automatic anomaly detection in massive moving objects, the 4th IEEE International Conference on Intelligence and Security Informatics, 166-177.

[6] N. Pelekis, I. Kopanakis, et al, Clustering uncertain trajectories, Knowledge and Information Systems, 28(1): 117-147.

[7] C. Ma, H. Lu, et al, KSQ: Top-k Similarity Query on Uncertain Trajectories, IEEE Transactions on Knowledge and Data Engineering, 25(9): 2049-2062.

[8] H. Jeung, Managing Evolving Uncertainty in Trajectory Databases, IEEE Transactions on Knowledge and Data Engineering, 26(7): 1692-1705.

[9] J. Gudmundsson, M. V.
Kreveld, Computing longest duration flocks in trajectory data, the 14th Annual ACM International Symposium on Advances in Geographic Information Systems, 35-42, (2006).

[10] J. Gudmundsson, M. V. Kreveld, et al, Efficient detection of motion patterns in spatio-temporal data sets, the 12th Annual ACM International Symposium on Advances in Geographic Information Systems, 250-257, (2004).

[11] H. Jeung, M. Yiu, et al, Discovery of convoys in trajectory databases, the VLDB Endowment, 1(1):1068-1080.

[12] H. Jeung, H. Shen, et al, Convoy queries in spatio-temporal databases, the 24th International Conference on Data Engineering, 1457-1459, (2008).

[13] L.A. Tang, Y. Zheng, et al, A Framework of Traveling Companion Discovery on Trajectory Data Streams, ACM Transactions on Intelligent Systems and Technology, 5(1):3.

[14] L.A. Tang, Y. Zheng, et al, Discovery of Traveling Companions from Streaming Trajectories, the 28th IEEE International Conference on Data Engineering, 186-197, (2012).

[15] K. Zheng, Y. Zheng, et al, Online Discovery of Gathering Patterns over Trajectories, IEEE Transactions on Knowledge and Data Engineering, 26(8): 1974-1988.

[16] Y. Tong, L. Chen, et al, Mining frequent itemsets over uncertain databases, the VLDB Endowment, 5(11): 1650-1661.

[17] M. Muzammal, R. Raman, Mining sequential patterns from probabilistic databases, the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 210-221, (2011).

[18] Z. Zhao, D. Yan, et al, Mining probabilistically frequent sequential patterns in uncertain databases, the 15th International Conference on Extending Database Technology, 74-85, (2012).

[19] H. Wang, H. Su, et al, An Effectiveness Study on Trajectory Similarity Measures, the 24th Australasian Database Conference, 13-22, (2013).

[20] J. Bezdek, R. Ehrlich, et al, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences, 10(2):191-203.

[21] C. Hwang, F. C.-H. Rhee, Uncertain fuzzy clustering: interval type-2 fuzzy approach to c-means, IEEE Transactions on Fuzzy Systems, 15(1):107-120.
[22] T. Bernecker, H.-P. Kriegel, et al, Probabilistic frequent itemset mining in uncertain databases, the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 119-128, (2009).

[23] Y. Theodoridis, J. R. O. Silva, et al, On the generation of spatiotemporal datasets, the 6th International Symposium on Advances in Spatial Databases, 147-164, (1999).