Microsoft Word - 1_R.doc HUNGARIAN JOURNAL OF INDUSTRIAL CHEMISTRY VESZPRÉM Vol. 37(1). pp. 59-67 (2009) PCA DRIVEN SIMILARITY FOR SEGMENTED UNIVARIATE TIME SERIES Z. BANKÓ, J. ABONYI University of Pannonia, Department of Process Engineering, P.O. Box 158. Veszprém H-8200, HUNGARY E-mail: abonyij@fmt.uni-pannon.hu Selection of the proper similarity measure is the cornerstone of all time series data mining task. In the recent years, many similarity measures have been introduced to fulfill the needs of chemical process engineering. These measures have been guided by data reduction methods due to the large amount of data. This data reduction can be done explicitly (by segmentation) as well as implicitly (by utilizing the latent variable space). Usually, the original multivariate data is projected into a single dimension with Principal Component Analysis (PCA) and segmentation is executed. However, the similarity measures which have been used to compare univariate, segmented representations of the original processes do not consider that the main information carried by the univariate representations is the correlation of the original variables. This paper introduces a PCA inspired similarity measure for these univariate segments. Finally, it is shown that the presented method can be considered as the logical extension of the Correlation Based Dynamic Time Warping (CBDTW) to univariate time series. Keywords: segmentation, Dynamic Time Warping, Principal Component Analysis, Piecewise Linear Approximation Introduction A time series is a sequence of values measured as a function of time. These kinds of data are widely used in the field of chemical process engineering, namely for process control, fault detection and analysis of process transitions. The increasing popularity of knowledge discovery and data mining tasks for discrete data has indicated the growing need for similarly efficient methods for time series data. These tasks share a common requirement: a similarity measure has to be defined between the elements of a given database. Moreover, the results of the data mining methods from simple clustering (partitioning the data into coherent but not predefined subsets) and classification (placing the data into predefined, labelled groups) to complex decision-making systems used for process control are highly dependent on the applied similarity measure. Related work The similarity of multivariate time series can be approached from two different perspectives. The first way is the application of metrics based warping measures such as Dynamic Time Warping (DTW) and Longest Common SubSequence (LCSS). These techniques are perfectly suitable for univariate tasks like speech recognition, where the analyzed process is represented by only one variable. In most cases, these methods can be easily generalized for the needs of the multivariate time series where the process depends on two or more variables. However, their application for correlated multivariate time series is often not as effective as it is expected. The direct comparison of the variables used by these approaches ignores the hidden process, i.e. the correlation between the process variables and this hidden process carries the real information in most process control task [1]. Hence, Principal Component Analysis (PCA) based similarity measures are used to overcome this problem. Krzanowski [2] defined the PCA similarity factor to measure the similarity between different data by comparing the hyperplanes (the dimensionality reduced latent variable spaces): p UUUUtr YXs pX T pYpY T pX nnPCA nnnn )( ),( ,,,,= where: Xn – the first n-variable multivariate time series Yn – the second n-variable multivariate time series UXn,p and UYn,p – the matrices of eigenvectors which belong to the most important p ≤ n eigenvalues of covariance matrices of Xn and Yn, i.e. the two hyperplanes. The similarity factor has a geometrical explanation, because it measures the similarity between the two hyperplanes by computing the squared cosine values between all the combinations of the first p principal components from Xn and Yn: ∑∑ = = = p i p j jinnPCA p YXs 1 1 , 2cos 1 ),( Θ 60 where: Θi,j – the angle between the i th principal component of Xn and the j th principal component of Yn. The main advantage of Krzanowski's similarity factor is its optimal variable reduction property (from variance point of view) which makes it ideal for tasks with high number of variables. PCA similarity factor has also gained in popularity because of its outstanding feature which makes possible to recognize the direction of the changes in the distance between the variables, i.e. the rotation of the hyperplanes. On the other hand, PCA similarity factor weights all principal components equally, hence it may not capture well enough the degree of similarity between the sequences when only a few principal components explain most of the variance. Thus, it was natural to define a modified PCA similarity factor that weights each principal component by its explained variance. M. C. Johannesmeyer [3] defined this modified PCA similarity factor by weighting each principal component with its eigenvalue: ∑ ∑∑ = = == p i Y i X i ji p i p j Y j X i nnPCA nn nn YXs 1 , 2 1 1 )( Θcos)( ),( λλ λλ λ where: λi Xn and λi Yn – the corresponding eigenvalues of the ith and jth principal component of Xn and Yn. This principle was developed by K. Yang [4] who presented the logical extension of PCA similarity factor and SλPCA called Eros (Extended Frobenius Norm): ∑∑ == == n i ii n i T YXinnEros wiUiUwYXs nn 11 Θcos)()(),( where: Θi – the angle between the two corresponding principal components wi – the weighting vector based on the eigenvalues of the sequences in the data set. Figure 1: Two corresponding principal components Generally speaking, the Eros measures the similarity of two multivariate time series by comparing the angle between the corresponding principal components and using the aggregated eigenvalues as weights, hence it takes into account the variance of each principal component. It has to be noted that Eros always computes the acute angle between the two principal components (eigenvectors). Therefore, as it is illustrated in Fig. 1, when the angle (α) between the two corresponding eigenvectors is not acute, the absolute value of it is taken and the similarity between the two corresponding eigenvectors is computed by using the acute angle (β). More specifically, the inner product of two normal vectors, a and b in Fig. 1, yields cos(α), while cos(β) = cos(π-α) = –cos(α) is needed. Therefore, the absolute value of cosine of the angle between the eigenvectors is taken, so that cos(α) is computed when α ≤ π/2, while -cos(α) is computed when α > π/2. The previously mentioned methods greatly improved the simple PCA similarity factor both in speed and in accuracy; however, they have not dealt with the biggest problem of every PCA related technique, i.e. PCA considers the time series as a whole but does not take into account the alterations of the correlation structure. This alternation affects the hyperplanes, therefore segmentation is required in most real-life applications to create homogeneous segments from correlation structure point of view. However, the segmentation raises another problem: Although in many real-life applications a lot of variables must be simultaneously tracked and monitored, most of the segmentation algorithms are used for the analysis of only one time-variant variable [5]. Hence, dimensionality reduction techniques are used, most likely PCA, to project the multivariate time series into the one dimensional space where any suitable univariate segmentation method such as Piecewise Aggregate Approximation (PAA) or Piecewise Linear Approximation (PLA) is executed. Finally, the segments are compared with a suitable similarity measure. Finding this suitable similarity measure is a difficult task. Besides the obvious Euclidean distance other, more advanced measures can be used such as DTW. It proved its adaptability and superiority over other similarity measures in wide range of time series applications from speech recognition [6] to fingerprint verification [7] and as final proof of this Yanikoglu won the Signature Verification Conference with a DTW based algorithm in 2004 [8]. Keogh and Pazzani presented modifications on DTW algorithm to handle PAA [9] and PLA [10] representations of univariate time series. Although these new algorithms can provide noticeably better results than Euclidean distance and speeds up the computationally expensive general DTW algorithm, they do not take the drift of the segments into account. Thus, the segmentation has to be framed in another way. Instead of compressing the multivariate data to a univariate time series applying PLA or PAA, the segmentation can be done in the multivariate space while the correlation is still considered. The authors introduced [11] two homogeneity measures as cost function for segmentation which are corresponding to the two typical applications of PCA models. The Q reconstruction error can be used to segment the time series according to the direct change of the correlation among the variables, while the Hotelling's T2 statistics can be utilized to segment the time series based on the drift of the center of the operating region. a α β b 61 Based on these new segmentation methods a novel similarity measure, called Correlation Based Dynamic Time Warping was created [12]. It was proven that it outperforms all of the previously introduced correlation based similarity measures when the multivariate time series are highly correlated. This paper introduces the univariate version of CBDTW which can be used for PCA projected univariate time series. The rest of the paper is organized as follows. Section Background details the theoretical background of the proposed similarity measure, i.e. segmentation and DTW. In the next section the problems of the current DTW algorithms used for segmented representations are pointed out and the new similarity measure is presented. It is also discussed how it was derived from CBDTW. Section 4 conducts a detailed empirical comparison of the introduced PCA based method with currently used measures on verification databases widely used by the time series data mining community. Finally, validation is performed by clustering of temperature data from a sophisticated model of an industrial catalytic fixed bed tube reactor. Background An n-variable, m-element time series, Xn = [x1,x2,…,xn], is an m-by-n element matrix, where xi = [xi(1),xi(2),…,xi(n)] T is the ith variable and xi(j) its j th element. Xn(j) = [x1(j),x2(j),…,xn(j)] is the j th sample of Xn. The similarity between Xn and Yn is denoted by s(Xn, Yn), where 0 ≤ s(Xn, Yn) ≤ 1, s(Xn, Yn) = s(Yn, Xn), and s(Xn, Xn) = 1. Obviously, the similarity is nothing more than a real number between zero and one which expresses the tightness of connection between the processes behind the time series. The closer the number is to one, the processes are treated more similar. In practice, the term distance or dissimilarity (d) is used instead of similarity. The value of the distance is given by a number ranged from zero to one. This can be associated with similarity: d(Xn, Yn) = 1 – s(Xn, Yn). Segmentation The ith segment of Xn is a set of consecutive time points, Si(a,b) = [Xn(a), Xn(a+1),…, Xn(b)]. The c-segmentation of time series Xn is a partition of Xn to c non-overlapping segments, ScXn = [S1(1,a),S2(a+1,b),…,Sc(k+1,m)]. In other words, a c-segmentation splits Xn to c disjoint time intervals, where 1 ≤ a and k ≤ m. The simplest but yet powerful segmentation technique for univariate time series is PAA. In this case, to reduce the m-length data from N, the time series are simply divided into N similar sized frames and each frame is represented by its mean value. Assuming that N is a factor of m, we get: ∑ +−= = i N m i N m j jx m N ix 1)1( 11 )()( Besides filtering noise, PAA can compensate phase shifts of time axis and difference sampling rates of time series and the distance between two PAA representations can be chosen almost freely; however, it cannot handle “locally elastic” shifts of the time axis and it is not enough tight representation for most time series as it is shown in Fig. 2. Figure 2: The original signal (top) and its PAA (middle) and PLA representation (bottom) using 10 segments To correct these faults, more sophisticated methods are applied such as PLA which, however, arise the „segmentation problem”, i.e. how to segment a times series quickly and how to represent each segment tight enough. Segmentation can be framed in several ways [13], but its main goal is always the same: finding homogenous segments by the definition of a cost function, cost(Si(a,b)). This function can be any arbitrary function which projects from the space of the time series to the space of the non-negative real numbers. Usually, cost(Si(a,b)) is based on the distances between the actual values of the time series and the values given by a simple function f (constant or linear function, a polynomial of a higher but limited degree) fitted to the data of each segment: ∑ =+− = b al nni lXflXdab bas )))((),((( 1 1 )),((cost Thus, the segmentation algorithms simultaneously determine the parameters of the models and the borders of the segments by minimizing the sum of the costs of the individual segments: ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ = ∑ = c i i c X baSs N 1 ))),((cost(min)(cost This segmentation cost of a time series can be minimized by dynamic programming, which is 62 computationally intractable for many real datasets. Consequently, heuristic optimization techniques such as greedy top-down or bottom-up techniques are frequently used to find good but suboptimal c-segmentations: - Bottom-Up: Every element of Xn is handled as a segment. The costs of the adjacent elements are calculated and two elements with the minimum cost are merged. The merging cost calculation of adjacent elements and the merging are continued until some goal is reached. - Top-Down: The whole Xn is handled as a segment. The costs of every possible split are calculated and the one with the minimum cost is executed. The splitting cost calculation and splitting is continued recursively until some goal is reached. - Sliding Window: The first segment is started with the first element of Xn. This segment is grown until its cost exceeds a predefined value. The next segment is started at the next element. The process is repeated until the whole time series is segmented. All of these segmentation methods have their own specific advantages and drawbacks. Accordingly, the sliding window method is not able to divide up a sequence into predefined number of segments but this is the fastest method. The applied method depends on the acutal task. These heuristic optimization techniques were examined in detail through the application of Piecewise Linear Approximation [13]. It can be stated if real-time (on-line) segmentation is not required, the best result will be reached by Bottom-Up segmentation. While PAA represents an equal sized segment with only one value, PLA replaces the original data with not equally sized segments of straight lines, i.e. a PLA segment of x1 is a 4-taple: [ ]iiii ylxylxxrxxlxix )(,)(,)(,)()( 11111 = where: x1(xl)i and x1(xr)i – left and right time coordinates of the ith segment of x1 x1(yl)i and x1(yr)i – the values of x1 in x1(xl)i and x1(xr)i. Finding the optimal PLA a of time series and a suitable distance for this representation is a difficult task, that usually depends on the application. However, precison of the mostly used, traditional Euclidean distance (or any other Lp norm) can be significantly increased with the application of DTW. Dynamic Time Warping The traditional comparision approaches are rarely precise enough for the most applications. This is caused by the brittleness of the conventional similarity measures such as Euclidean distance. They are unable to handle the distortions in time axis, so these distortions almost randomly affect the distance between time series. The solution for this problem is the application of DTW which can “warp” the original time series (nonlinearly dilate or compress their time axes) to be similar in shape to the query series as much as possible. To align two univariate sequences (x1 and y1) with DTW, firstly a grid D have to be defined with size of the two time series (mx x my). Each cell of this matrix represents the actual distance between the appropriate indices of the two time series. In this step, any application-dependent distance like L1 and L∞ norms can be used but generally Euclidean distance is suggested because it allows of the efficient indexing of DTW. Considering this we have: 2 2 11 ))()((),( jyixjiD −= Using grid D, many arbitrary mappings – called warping paths – can be created between x1 and y1. However, the construction of a warping path [p(1),p(2),…,p(l)] has to be restricted with the following constraints: - Boundary condition: The path has to start in D(1,1) and end in D(mx,my). - Monotonicity: The path has to be monotonous, i.e. always heading from D(1,1) to D(mx,my). If p(k) = D(i,j) and p(k+1) = D(i’,j’) then i’–i ≥ 0 and j’–j ≥ 0 . - Continuity: The path has to be continuous. If p(k) = D(i,j) and p(k+1) = D(i’,j’) then i’–i ≤ 1 and j’–j ≤ 1. To find the optimal warping path (the DTW distance of the two time series), every warping path has an assigned cost which is the sum of values of the affected cells divided by normalization constant K: ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ ⎛ = ∑ = K ip yxd l i DTW 1 11 )( min),( The value of K depends on the application and in most cases this is the length of the path, but it can also be omitted. More information about the method of defining K and its significance can be found in [14]. Note that the Euclidean distance is a special case of DTW, i.e. we choose the path that is located on the diagonal of grid D and K = 1. Figure 3: Cumulative distance matrix D and the optimal warping path on it 63 Obviously, the number of paths exponentially grows by the size of time series. Fortunately, the optimal path can be found in O(mxmy) time with the help of dynamic programming using cumulative distance matrix D. A cell of the cumulative matrix contains the sum of the appropriate cell value in matrix D and the minimum of the three cells from where the cell can be reached: ⎥ ⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎢ ⎣ ⎡ +−− +− +− = ),(D)1,1(D ),(D)1,(D ),(D),1(D min),(D jiji jiji jiji ji The DTW distance between the two time series can be found in D(mx,my). Warping the representations DTW can be applied easily on the PAA representations of univariate time series. Each segment is represented by its mean, thus the DTW algorithm is the same as for the original time series but the computational time is lowered to O((m/N)2). Warping the PLA representation is much more difficult. Vullings et al. [15] were the first to use PLA based DTW on ECG data while Keogh and Pazzani [10] gave a generalized distance between PLA segments which can be used to fill matrix D: 2 11 2 )()( 2 )()( ))(),(( ⎟ ⎠ ⎞ ⎜ ⎝ ⎛ + − + = iiii yryylyyrxylx jyixd As it can be seen this distance arises from PAA: it compares the means of the corresponding segments but it utilizes the tighter PLA representation. However, this tighter representation, i.e. the different length of the segments, introduces a new problem. The normalization constant K usually based on the length of the warping path but PLA, contrary to PAA, does not generate equidistant segments. Thus, Keogh and Pazzani suggested to recursively sum an additional variable on the warping path which stores the lengths of the visited segments. PCA driven similarity of PLA representation As it was mentioned before many problems arise when one would like to use data mining alogrithms on highly correlated mutivariate time series data. The high amount of data requires some reduction techniques. In chemical process engineering, usually PCA is used to create univariate time series from the multivariate data because PCA preserves the correlation of the original time series which is definitely an important factor to consider. The generated univariate time series make it possible to use traditional segmentation techniques which are often required for two reasons: first, the highly appreciated DTW algorithm computational extensive thus further data compression is advised and second, process transitions and frauds can be revealed by segmentation. Considering the previously mentioned, PAA is not suitable for our purposes because the segments are equidistant and they are represented by their means, hence PAA representation lost the direction of the latent variable in each segment. PLA generates a much tighter representation and it can preserve the direction information. In addition, it can be seen as the one dimensional version of PCA based segmentation presented in [12,13]. The authors used the Q reconstruction error, i.e. the goal of the segmentation was to minimize the sum of the squared Euclidean distances between the original and the reconstructed variables in each segment. PLA does the same, it minimizes the squared Euclidean distance between the original data point and their reconstructed pair (the closest point on the PLA segment). This reconstruction error is shown in case of PLA segmentation in Fig. 4. X 1 (t) time Q Figure 4: Time series x1 (black dots) and its reconstruction using the PLA representation (gray dots) Unfortunately, the currently used DTW based measures do not take into consideration that the PLA created straight line is the latent variable of each segment while the mean of each segment are used. In [12], it is suggested to use any of the PCA based similarity measures to compare the hyperplanes of two segments when the segmentation was driven by PCA. This leads to represent each PLA segment with its angle of slope: ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − − = ii ii xlxxrx ylxyrx ix )()( )()( atan)(1 Please note, none of the segments can be perpendicular to the time axis, hence no further restrictions are required. Now, any of the above presented PCA based similarity measures can be used to compare the segments based on the slope information. The interested reader might notice that the only difference between the reviewed PCA based similarity measures how they weights the angles between the hyperplanes. In the one dimensional space all of them can be reduced to the same equation: ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ + − = )()(1 )()( atancos))(),(( 11 11 11 iyix iyix iyixd where |cos(…)| – can be replaced by cos2(…). 64 This would be a perfect measure for hyperplanes of a PCA projection, where the hyperplanes are the coordinates of the projected space, thus their signs are not only meaningless but considering them would destroy the precision. However, the PLA representation of a segment of a univariate time series (i.e., its latent variable) is also represents the original time series in the same space, thus the sign information cannot be neglected. For example, if one compares x1(i) = 57 (~89°) and y1(i) = –57 (~91°) as hyperplanes they are almost the same (only 2° is the difference); however, they are completely different from the time series point of view (increasing and decreasing trends). So, the basis distance of DTW should be a function which can measure the difference between the signed slopes. For this paper, the simple squared Euclidean distance was chosen: 2 1111 ))()(())(),(( jyixjyixd −= The last thing to handle is the K constant. If it can be treated that preprocessing steps are properly executed (i.e. data is filtered and it is proved that no additional noise is added to it), the role of the K constant is changed. The proper preprocessing ensures that all of the segments are detected due to the underlying process and they are not detected due to the noise; moreover, the minimum number of elements of a segment can also be defined. Thus, an additional warping is achieved if the weights of the segments are not considered and the length of the warping path can be used as the value of K or it can be omitted as it is done in this paper. Validation According to [16], the presented similarity measure was compared to other methods using the free and widely used datasets and classification algorithm of the UCR Time Series Classification/Clustering Homepage [17]. The datasets were kindly provided by Mr. Keogh on March 7, 2007. Please note, for easier reconstruction of the results, the algorihms were not executed on the four biggest dataset: Face (all), Two Patterns, Wafer, Yoga. In this test, the suggested 1-NN classification algorithm was executed on the databases and the error rate (proportion of the faulty classified time series) of each database for every measure was recorded in Table 1 and Table 2. As a reference, the non-segmented time series was also compared with Euclidean distance and DTW, their results can be found in the third and fourth columns. The first four colored columns contain the results of the currently used similarity measures. The time series were segmented with PAA and PLA to 30 pieces using Bottom-Up technique and Euclidean or the reviewed DTW distance was applied. Please note, this means that the mean of the segments were used to compare two segments irrespectively of the applied segmentation method. A ng le 3 0 Se gm en ts D T W 0. 37 0. 64 0. 63 0. 59 0. 11 0. 19 0. 39 0. 15 A ng le 3 0 Se gm en ts E uc lid ea n 0. 46 0. 71 0. 67 0. 64 0. 32 0. 23 0. 72 0. 46 PL A 3 0 Se gm en ts D T W 0. 40 0. 66 0. 50 0. 04 0. 29 0. 20 0. 30 0. 38 PL A 3 0 Se gm en ts E uc lid ea n 0. 35 0. 76 0. 53 0. 13 0. 18 0. 20 0. 39 0. 51 PA A 3 0 Se gm en ts D T W 0. 36 0. 43 0. 53 0. 02 0. 25 0. 21 0. 23 0. 22 PA A 3 0 Se gm en ts E uc lid ea n 0. 36 0. 41 0. 50 0. 07 0. 25 0. 13 0. 17 0. 21 N o W ar pi ng W in do w D T W 0. 31 0. 40 0. 50 0. 00 0. 18 0. 23 0. 17 0. 17 E uc lid ea n D is ta nc e 0. 37 0. 39 0. 47 0. 15 0. 25 0. 12 0. 22 0. 22 T im e Se ri es L en gt h 27 0 17 6 47 0 12 8 28 6 96 35 0 46 3 N um be r o f cl as se s 5 0 37 5 3 2 2 4 7 N am e 50 W or ds A di ac B ee f C B F C of fe e E C G Fa ce (f ou r) Fi sh Ta bl e 1: R es ul ts o f th e U C R t im e se ri es c la ss if ic at io n al go ri th m o n th e fi rs t ei gh t 65 A ng le 3 0 Se gm en ts D T W 0. 05 0. 44 0. 67 0. 37 0. 29 0. 22 0. 67 0. 01 A ng le 3 0 Se gm en ts E uc lid ea n 0. 05 0. 39 0. 78 0. 40 0. 70 0. 51 0. 67 0. 17 PL A 3 0 Se gm en ts D T W 0. 07 0. 20 0. 44 0. 30 0. 44 0. 28 0. 01 0. 03 PL A 3 0 Se gm en ts E uc lid ea n 0. 07 0. 31 0. 47 0. 33 0. 52 0. 44 0. 01 0. 19 PA A 3 0 Se gm en ts D T W 0. 07 0. 26 0. 30 0. 13 0. 42 0. 21 0. 04 0. 10 PA A 3 0 Se gm en ts E uc lid ea n 0. 09 0. 25 0. 34 0. 13 0. 47 0. 21 0. 01 0. 29 N o W ar pi ng W in do w D T W 0. 09 0. 13 0. 27 0. 13 0. 41 0. 21 0. 01 0. 00 E uc lid ea n D is ta nc e 0. 09 0. 25 0. 43 0. 13 0. 48 0. 21 0. 12 0. 24 T im e Se ri es L en gt h 15 0 63 7 31 9 57 0 42 7 12 8 60 27 5 N um be r o f cl as se s 2 2 7 4 6 15 6 4 N am e G un -P oi nt L ig ht ni ng -2 L ig ht ni ng -7 O liv eO il O su L ea f Sw ed is h L ea f Sy nt he tic C on tr ol T ra ce ‘Angle’ denotes the presented similarity measure. The time series were segmented to 30 pieces again but now the slope of each segment was considered. The provided representations were also compared with Euclidean distance and DTW. The best results are marked with dark grey for each database. As it can be seen the newly presented method kept up with the expectations. However, some notes have to be made on the results. One can realize that the reduction obtained by the segmentation was only about one order of magnitude which ensures the tight representation even with PAA and 30 segments is too much for most databases when PLA is used. Moreover, all of the databases require different number of segments to get the best results. The aim of this test was to show that it really makes sense to use the slope of a segment instead of its mean when the segment is provided by PLA. Obviously, the best PAA and PLA representation can be found for all distance measure but the idea behind the creation of such a test is to provide a unified test environment for all measures and to prevent the researcher from “over optimization”. Validation on an industrial fixed bed tube reactor The proposed PCA driven similaity measure has also been applied for clustering of temperature data from a sophisticated model of an industrial catalytic fixed bed tube reactor. coolant in coolant out reagents product x = 0 x = L ( )Wout,W B,T ( )Win,W B,T ( )out,Gout,Giout,Gout,G p,c,B,T ( )in,Gin,Giin,Gin,G p,c,B,T Figure 5: The simplified scheme of the studied reactor The studied vertically built up reactor contains a great number of tubes with catalyst. Highly exothermic reaction occurs as the reactants rising up the tubes pass the fixed bed of catalyst particles and the heat generated Ta bl e 2: R es ul ts o f t he U C R ti m e se ri es c la ss if ic at io n al go ri th m o n th e se co nd e ig ht d at as et s 66 by the reaction escapes through the tube walls into the cooling water. Due to this highly exothermic reaction which takes place in the catalyst bed the reactor very sensitive for the development of reactor runaway. Reactor runaway means a sudden and considerable change in the process variables. The development of runaway is in very close relationship with the stability of reactor/model. Runaway has two main important aspects. In one hand runaway forecast has a safety aspect, since it is important for avoiding the damage the constructional material or in the worst case scenario the explosion of reactor. On the other hand, runaway has a technology aspect, since the forecast of the runaway can be used for avoiding the development of hot spots in catalytic bed. The selection of operation conditions is important to avoid the development of reactor runaway and to increase the lifetime of catalyst at same time. The worked out mathematical model of the studied reactor has been presented in [5]. The model has been implemented in MATLAB and solved with a low order Runge-Kutta method. 0 1 2 3 4 250 300 350 400 Reactor length [m] T e m p e ra tu re [ K ] 0 1 2 3 4 200 400 600 800 Reactor length [m] T e m p e ra tu re [ K ] Figure 6: Normal (upper graph) and runaway (lower graph) profiles of the studied reactor The obtained simulator was applied to calculate two kinds of profiles. The inlet conditions were set to provide five profiles which correspond to the normal working conditions and other five profiles which describe the development of reactor runaway. The presented similarity measure has been used to classify these profiles. The result of clustering can be seen in the dendogram of Fig. 7. 1 3 5 2 4 6 7 10 9 8 D is ta n ce Normal Runaway Figure 7: Clustering result of the five Normal (number 1-5) and five Runaway (number 6-10) profiles Conclusion We presented a new similarity measure for segmented univariate time series by considering the PLA representation of a segment as the latent variable. The proposed similarity was validated on many real world dataset used by data mining community and on the temperature profiles generated by a sophisticated model of an industrial reactor. Both evaluations show that it is worth to consider representing the segments by their slopes instead of their means and using this feature for the comparison. However, there is no decided difference between the two methods, thus we intend to combine these approaches in the future. ACKNOWLEDGEMENT János Abonyi is grateful for the support of the Bolyai Research Fellowship of the Hungarian Academy of Sciences. The financial support of the TÁMOP-4.2.2- 08/1/2008-0018 project is gratefully acknowledged. REFERENCES 1. KOURTI T.: Application of Latent Variable Methods to Process Control and Multivariate Statistical Process Control in Industry, International Journal of Adaptive Control and Signal Processing 19 (2005), 213-246 2. KRZANOWSKI W.: Between-groups comparison of principal components, Journal of the American Statistical Society 74 (1979), 703-707 3. JOHANNESMEYER M. C.: Abnormal Situation Analysis Using Pattern Recognition Techniques and Historical Data, M.Sc. Thesis, University of California, Santa Barbara, CA (1999) 67 4. YANG K., SHAHABI C.: A PCA-based Similarity Measure for Multivariate Time Series, Proceedings of the 2nd ACM International Workshop on Multimedia Databases (2004), 65-74 5. KIVIKUNNAS S.: Overview of Process Trend Analysis Methods and Applications, ERUDIT Workshop on Applications in Pulp and Paper Industry (1998), ERUDIT 6. SAKOE H., CHIBA S.: Dynamic Programming Algorithm Optimization for Spoken Word Recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing (1978) 7. VAJNA Z. M. K.: A Fingerprint Verification System Based on Triangular Matching and Dynamic Time Warping, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000), 1266-1276 8. KHOLMATOV A., YANIKOGLU B. A.: Biometric Authentication Using Online Signatures, ISCIS (2004) 9. KEOGH E. J., PAZZANI M. J.: Scaling up Dynamic Time Warping for Datamining Applications, Knowledge Discovery and Data Mining (2000) 10. KEOGH E. J., PAZZANI M.: Scaling up Dynamic Time Warping to Massive Datasets, 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (1999) vol. 1704 11. ABONYI J., FEIL B., NÉMETH S., ÁRVA P.: Principal Component Analysis Based Time Series Segmentation, IEEE International Conference on Computational Cybernetics (2005) 12. BANKÓ Z., ABONYI J.: Correlation Based Dynamic Time Warping of Multivariate Time Series, Computational Statistics & Data Analysis (In press) 13. KEOGH E. J., CHU S., HART D., PAZZANI M. J.: An Online Algorithm for Segmenting Time Series, ICDM (2001) 14. SAKOE H., CHIBA S.: Dynamic Programming Algorithm Optimization for Spoken Word Recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing (1978) 15. VULLINGS H. J. L. M., VERHAEGEN M. H. G., VERBRUGGEN H. B.: ECG Segmentation Using Time-Warping, Proceedings of the Second International Symposium on Advances in Intelligent Data Analysis, Reasoning about Data (1997) 16. KEOGH E. J., KASETTY S.: On the Need for Time Series Data Mining Benchmarks: a Survey and Empirical Demonstration, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002) 17. KEOGH E. J., XI X., WEI L., RATANAMAHATANA C. A.: The UCR Time Series Classification/Clustering Homepage, http://www.cs.ucr.edu/~eamonn/time_series_data/, Riverside CA. University of California - Computer Science and Engineering Department (2006) 18. VARGA T., SZEIFERT F., RÉTI J., ABONYI J.: Analysis of the runaway in an industrial heterocatalytic reactor, Computer-Aided Chemical Engineering 24, (2007), 751-756