INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
ISSN 1841-9836, 10(1):8-21, February, 2015.

Method for Visual Detection of Similarities in Medical Streaming Data

Jolita Bernatavičienė*, Gintautas Dzemyda, Gediminas Bazilevičius, Viktor Medvedev, Virginijus Marcinkevičius, and Povilas Treigys

Vilnius University Institute of Mathematics and Informatics
Lithuania, LT-08663 Vilnius, Akademijos, 4
Jolita.Bernataviciene@mii.vu.lt, Gintautas.Dzemyda@mii.vu.lt, Gediminas.Bazilevicius@mii.vu.lt, Viktor.Medvedev@mii.vu.lt, Virginijus.Marcinkevicius@mii.vu.lt, Povilas.Treigys@mii.vu.lt

*Corresponding author: Jolita.Bernataviciene@mii.vu.lt

Abstract: The analysis of medical streaming data is difficult when the problem is to assess health-state situations in real-time streaming data against previously collected and evaluated streaming data of various patients. This paper deals with multivariate time series analysis that seeks to compare the current situation (the sample) with chronologically collected historical data and to find the subsequences of the multivariate time series most similar to the sample. A visual method for finding the subsequences best matching the sample is proposed. Using this method, an investigator can consider the results of comparing the sample with a subsequence of the series from the standpoint of several measures that may complement or contradict one another. The advantage of the visual analysis of the data, presented on a plane, is that we can see not only the subsequence best matching the sample (such a subsequence can be found automatically), but also the distribution of subsequences similar to the sample according to different similarity measures.
This allows us to evaluate differences among the subsequences and among the measures.

Keywords: streaming data, similarity measures, multivariate time series, visualization, multidimensional scaling.

1 Introduction

Time series data are widely available in different fields, including medicine, finance, and science. A time series is a collection of chronologically performed observations of the values of a feature that characterizes the behaviour of a particular object. There are many topics in time series data mining, e.g., similarity search, clustering, classification, anomaly detection, and motif discovery. The similarity problem can be defined as the comparison of two time series to determine whether they are similar or not. The choice of a similarity measure can affect the result of data mining tasks. By a similarity measure we mean a method that compares two time series and returns the value of their similarity. If the object is characterized by several features, we have a multivariate time series (MTS) [1]. In this paper, we investigate the similarity search in multivariate physiological time series. A physiological time series is a series of medical observations over a period of time. Such data can be collected using devices (or sensors) that record personal medical features, such as heart rate, blood pressure, etc. An example of such data is the intensive care multivariate online-monitoring time series [2]. A sensor is an instrument that detects or measures a physical or environmental characteristic or state and transmits and/or records the reading in some form (e.g., a visual display, audio signal, digital transmission, etc.). A sensor converts a physical quantity to electric output; for example, a pressure sensor converts pressure to electric output.

Figure 1: Example of the multivariate time series
Remote monitoring of health parameters such as the pulse rate, oxygen level in blood, or blood pressure can be very helpful for early detection of diseases, resulting in a reduction of treatment time. Most methods used to analyse medical data focus primarily on univariate time series. However, because the parameters depend on one another and vary over time, examining all medical data together as a multivariate time series can provide more information about the data and the patient, leading to a better diagnosis and treatment [3]. Therefore, this paper deals with multivariate time series analysis with a view to comparing the current situation with that in chronologically collected historical data, and to finding subsequences of the multivariate time series most similar to the sample, corresponding, e.g., to the current situation. An example of an MTS of four features (heart rate HR, non-invasive systolic arterial blood pressure SYS, non-invasive diastolic arterial blood pressure DIAS, temperature TEMP) is presented in Figure 1.

Let us have a multivariate time series of n features and T_a observations:

$$X^a = \begin{pmatrix} x^a_{11} & \cdots & x^a_{1T_a} \\ \vdots & \ddots & \vdots \\ x^a_{n1} & \cdots & x^a_{nT_a} \end{pmatrix}.$$

Denote the sample of n features and T_b observations as

$$X^b = \begin{pmatrix} x^b_{11} & \cdots & x^b_{1T_b} \\ \vdots & \ddots & \vdots \\ x^b_{n1} & \cdots & x^b_{nT_b} \end{pmatrix}.$$

Here T_a > T_b. In fact, X^a and X^b are matrices. As a result, we need to find the optimal place of X^b in X^a. The place is defined by some time moment T* : 1 ≤ T* ≤ T_a − T_b + 1. Our procedure analyses the multivariate time series X^a by using a moving time window whose width is adapted to the current situation X^b (the width is equal to T_b) and by comparing the content of this window with the sample in the sense of several similarity measures at the same time. A visual method for finding the subsequences best matching the sample is proposed in this paper.
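The moving-window search described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sliding_windows` and `best_match` are hypothetical names, and the distance function is a placeholder supplied by the caller.

```python
import numpy as np

def sliding_windows(Xa: np.ndarray, Tb: int):
    """Yield every subsequence (window) of width Tb from the n x Ta series Xa.

    Xa has one row per feature and one column per observation, as in the paper.
    """
    n, Ta = Xa.shape
    for t in range(Ta - Tb + 1):          # candidate start times T* = 1, ..., Ta - Tb + 1
        yield t, Xa[:, t:t + Tb]          # an n x Tb matrix X^c

def best_match(Xa, Xb, dist):
    """Start time of the window of Xa closest to the sample Xb under `dist`."""
    Tb = Xb.shape[1]
    return min(sliding_windows(Xa, Tb), key=lambda tw: dist(Xb, tw[1]))[0]
```

Note that the paper compares each window against the sample under several measures at once; the single `dist` argument here stands in for any one of them.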
As indicated in [4], the goal of visual analytics research is to turn the information overload into an opportunity, i.e., decision-makers should be enabled to examine massive, multidimensional, multisource, time-varying information streams to make effective decisions in time-critical situations.

There are attempts to apply visual analysis to streaming data. An example is the Visual Content Analysis of Real-Time Data Streams project [5] at the Pacific Northwest National Laboratory. Its goal was to allow users to quickly grasp dynamic data in forms that are intuitive and natural, without requiring intensive training in the use of specific visualization or analysis tools and methods. The project prototyped five different visualizations that represent and convey dynamic data through human-recognizable contexts and paradigms such as hierarchies, relationships, time, and geography. In this paper, we suggest using specific visualization tools and methods (multidimensional data visualization [6]) that are effective and also do not require intensive training of the users. Moreover, we show the possibility of making a decision based on five criteria of similarity between the sample and a subsequence of the real-time data stream by representing the similarity as a point on a plane. Dimensionality reduction and visual analysis of multidimensional data [6] have been applied when comparing the best found subsequences in X^a.

The proposed method is described in Section 2. Similarity measures for multivariate time series are presented in Section 3. Multidimensional data visualization is reviewed in Section 4, where the emphasis is put on multidimensional scaling. A comparative analysis of similarity measures for multivariate time series is presented in Section 5.
An example illustrating the proposed method is presented in Section 6.

2 Visual Method for Finding the Best Subsequences Matching the Sample

In our method, the multivariate time series X^a is analysed by using a moving time window. The width of this window is adapted to the current situation (sample) X^b and is equal to T_b. The content of this window is compared with the sample in the sense of several similarity measures at the same time. The method includes a dimensionality reduction procedure that allows us to observe multidimensional data visually.

The visual method for finding the best subsequences matching the sample can be generalized as follows:

1. Let us have:
- a multivariate time series X^a of n features and T_a observations;
- a sample X^b of n features and T_b observations;
- m similarity measures S_i, i = 1, ..., m.

2. The sample X^b is compared with all subsequences of X^a by using the m similarity measures S_i, i = 1, ..., m. The subsequences are obtained by moving the time window in X^a from the beginning to the end. The content of such a window is a matrix of n rows; denote it by X^c. The width of the window (the number of columns of X^c) is adapted to the current situation (sample) X^b, i.e., it is equal to T_b. For each measure, the k subsequences most similar to the sample are chosen. Therefore, the total number of subsequences for further analysis is equal to km.

3. Each comparison of the sample with a subsequence, chosen in the way defined in the step above, produces an m-dimensional point S_q = (S_{q1}, S_{q2}, ..., S_{qm}), where, in our case, q = 1, ..., km. Let us derive two additional points:
- S_0 = (S_{01}, S_{02}, ..., S_{0m}) is the array of the values of all similarity measures computed for a subsequence that ideally coincides with the sample (the array of the best values of the m similarity measures);
- S_C = (S_{C1}, S_{C2}, ..., S_{Cm}) is the weight centre of the points S_q = (S_{q1}, S_{q2}, ..., S_{qm}), q = 1, ..., km.
Therefore, the total number of m-dimensional points for discovering the subsequences most similar to the sample is equal to km + 2. Afterwards, the components of these points are normalized by z-score. Denote the obtained matrix of normalized points by Z; it consists of km + 2 rows and m columns.

4. The points from the matrix Z are mapped onto the plane using multidimensional scaling [6] (or another algorithm for the nonlinear projection of multidimensional points onto the plane). Denote the resulting matrix by Y; it contains km + 2 rows, corresponding to the different comparisons of the sample with other subsequences, and 2 columns. Each row holds the coordinates of a point on the plane.

5. The investigator analyses the information presented graphically, where all m-dimensional points are represented as points on a plane, and makes decisions. In general, the subsequence most similar to the sample is the one whose corresponding point on the plane is closest to the projection of S_0. However, more subsequences may be considered similar to the sample: the points on the plane corresponding to such subsequences must be closer to the projection of S_0 than to the projection of S_C. These rules can be checked automatically in a program realization of this method; however, the participation of the investigator is valuable, because it gives him the possibility to cognize the data more deeply.

The advantage of this method is that the investigator can consider the results of comparison from the standpoint of several measures that may complement or contradict one another. Therefore, the similarity of subsequences to the sample is evaluated from different standpoints.
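Steps 3 and 5 above can be sketched as follows. This is a minimal illustration under stated assumptions: the km + 2 points are stacked as rows of a NumPy array, and the row indices of S_0 and S_C in the 2-D projection Y are passed explicitly.

```python
import numpy as np

def zscore(P):
    """Column-wise z-score normalization of the (km + 2) x m matrix of points."""
    return (P - P.mean(axis=0)) / P.std(axis=0)

def similar_to_sample(Y, i0, ic):
    """Decision rule of step 5 on the 2-D projection Y: a subsequence counts as
    similar if its point lies closer to the projection of S0 (row i0) than to
    the projection of the weight centre SC (row ic)."""
    d0 = np.linalg.norm(Y - Y[i0], axis=1)
    dc = np.linalg.norm(Y - Y[ic], axis=1)
    keep = d0 < dc
    keep[[i0, ic]] = False   # S0 and SC themselves are not candidate subsequences
    return np.flatnonzero(keep)
```

As the text notes, this automatic rule is a baseline; the investigator can still inspect the plot and override it.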
The method is universal, because different sets of similarity measures can be chosen, depending on the problem, while the decision scheme remains the same. Moreover, involving dimensionality reduction and visual analysis of multidimensional data in the proposed method gives the investigator the opportunity to participate in the final decision when comparing the best found subsequences of the multivariate time series with the sample. However, the decision on their similarity can also be made automatically.

3 Similarity Measures for Multivariate Time Series

To detect events in real multivariate time series, it is necessary to compare time series using an appropriate similarity measure [7]. Different techniques and similarity measures have been introduced and used for the comparison of multivariate time series of different nature [8], [9]. A multivariate time series can be reduced to univariate time series whose similarity can be measured using a univariate time series approach [10]. That may lead to a great loss of information; therefore, we concentrate on the multivariate time series approach here. The five similarity measures S_i, i = 1, ..., 5, used in this paper for multivariate time series are presented below. Let us compare two multivariate time series:

$$X^a = \begin{pmatrix} x^a_{11} & \cdots & x^a_{1T_a} \\ \vdots & \ddots & \vdots \\ x^a_{n1} & \cdots & x^a_{nT_a} \end{pmatrix} \quad \text{and} \quad X^b = \begin{pmatrix} x^b_{11} & \cdots & x^b_{1T_b} \\ \vdots & \ddots & \vdots \\ x^b_{n1} & \cdots & x^b_{nT_b} \end{pmatrix}.$$

The Frobenius norm is often used in matrix analysis [11]. This similarity measure is based on the Euclidean distance. The Frobenius norm of a matrix X^b is defined by the formula

$$\left\| X^b \right\|_F = \sqrt{\sum_{p=1}^{n} \sum_{q=1}^{T_b} (x^b_{pq})^2} = \sqrt{\operatorname{tr}((X^b)' X^b)}, \quad (1)$$

where tr is the sum of the elements on the diagonal of a square matrix. The Frobenius norm is used to compare the similarity of two matrices.
The similarity of X^b and X^c is defined by the formula Frob = ‖X^b − X^c‖_F. The best possible value of the Frobenius norm is 0.

The correlation coefficient between two matrices of the same size (matrix correlation coefficient) can also be used as a similarity measure [12]:

$$r = \frac{\sum_{p=1}^{n} \sum_{q=1}^{T_b} (x^b_{pq} - \bar{X}^b)(x^c_{pq} - \bar{X}^c)}{\sqrt{\sum_{p=1}^{n} \sum_{q=1}^{T_b} (x^b_{pq} - \bar{X}^b)^2 \sum_{p=1}^{n} \sum_{q=1}^{T_b} (x^c_{pq} - \bar{X}^c)^2}}, \quad (2)$$

where X̄^b and X̄^c are the means of X^b and X^c, respectively. This measure is the Pearson correlation coefficient adapted to matrices and calculated using the MATLAB corr2 function [12]. The best possible value of the matrix correlation coefficient is 1. The correlation coefficient between two matrices has found wide application in image analysis, molecular biology, etc.

The third similarity measure for multivariate time series is the Principal Component Analysis (PCA) similarity factor [9], [13]. PCA is a well-known and widely used technique for the dimensionality reduction of data. It is a linear transformation that projects the original data to a new coordinate system with a minimal loss of information. In the multivariate case, the information is the structure of the original data, i.e., the correlation between the features and the alteration of the correlation structure among them. To create a projection, PCA selects the coordinate axes of the new coordinate system one by one according to the greatest variance of any projection. The PCA similarity factor is defined by the following formula:

$$S_{PCA}(X^b, X^c) = \operatorname{tr}(L' M M' L), \quad (3)$$

where L and M are matrices that contain the first l principal components of X^b and X^c, respectively. This means that the principal components are computed by the standard algorithm using the matrices (X^b)'X^b and (X^c)'X^c, and then the l principal components with the highest eigenvalues are selected. The best possible value of the PCA similarity factor is l. In our experiments, l = 1.
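The first three measures can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: `frob`, `corr2`, and `pca_similarity` are hypothetical names, and the principal directions are taken, per the text, as eigenvectors of (X^b)'X^b and (X^c)'X^c with the largest eigenvalues.

```python
import numpy as np

def frob(Xb, Xc):
    """Frobenius-norm dissimilarity, eq. (1) applied to Xb - Xc; 0 means identical."""
    return np.linalg.norm(Xb - Xc, ord="fro")

def corr2(Xb, Xc):
    """Matrix correlation coefficient of eq. (2); mirrors MATLAB's corr2."""
    a = Xb - Xb.mean()
    b = Xc - Xc.mean()
    return (a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum())

def pca_similarity(Xb, Xc, l=1):
    """PCA similarity factor of eq. (3): tr(L' M M' L), where the columns of
    L and M are the first l principal directions of Xb and Xc."""
    def components(X, l):
        # eigenvectors of X'X sorted by decreasing eigenvalue
        vals, vecs = np.linalg.eigh(X.T @ X)
        return vecs[:, np.argsort(vals)[::-1][:l]]
    L, M = components(Xb, l), components(Xc, l)
    return np.trace(L.T @ M @ M.T @ L)
```

With l = 1, as in the experiments, `pca_similarity` returns 1 when the dominant directions coincide (up to sign) and approaches 0 when they are orthogonal.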
Dynamic time warping (DTW) [14] is the most widely used technique for the comparison of time series data when extensive a priori knowledge is not available. The Euclidean distance reflects the similarity in time, while DTW reflects the similarity in shape. DTW searches for the best alignment between two time series, attempting to minimize the distance between them. The advantage of DTW is that it can handle series of unequal length and distortions. Multidimensional dynamic time warping (MDTW) is presented in [15]. A distance matrix is defined: d(p, q) = Σ_{k=1}^{n} (x^b_{kp} − x^c_{kq})^2, p, q = 1, ..., T_b. Then the matrix D of cumulative distances is calculated as in the traditional DTW algorithm [15]:

$$D(p, q) = \begin{cases} d(1, 1), & \text{if } p = 1,\ q = 1, \\ d(p, q) + D(p - 1, q), & \text{if } p = 2, \ldots, T_b,\ q = 1, \\ d(p, q) + D(p, q - 1), & \text{if } p = 1,\ q = 2, \ldots, T_b, \\ d(p, q) + \min\{D(p - 1, q),\ D(p, q - 1),\ D(p - 1, q - 1)\}, & \text{otherwise.} \end{cases} \quad (4)$$

Here (p, q) defines the pair of the pth observation in X^b and the qth observation in X^c. Finally, the minimal path and the distance along the minimal path are obtained using the matrix D. The path must start at the beginning of each time series, at (1, 1), and finish at the end of both time series, at (T_b, T_b). See [13] for details. The best possible value of MDTW is 0. On the other hand, DTW can lead to unintuitive alignments, where a single point of one time series maps onto a large subsection of another time series [16], [17]. DTW can also fail to find the obvious and natural alignment of two time series because of a single feature (i.e., a peak, valley, inflection point, plateau, etc.). One cause is a great difference between the lengths of the compared series.

The fifth similarity measure for multivariate time series used in this paper is Eros (Extended Frobenius norm) [9].
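The recursion (4) translates directly into code. A minimal sketch follows (0-based indices; `mdtw` returns the cumulative distance D(T_b, T_b) rather than the warping path itself):

```python
import numpy as np

def mdtw(Xb, Xc):
    """Multidimensional DTW distance of eq. (4).  The local cost d(p, q) is the
    squared Euclidean distance between observation p of Xb and observation q of Xc."""
    Tb, Tc = Xb.shape[1], Xc.shape[1]
    d = np.array([[np.sum((Xb[:, p] - Xc[:, q]) ** 2) for q in range(Tc)]
                  for p in range(Tb)])
    D = np.empty((Tb, Tc))
    D[0, 0] = d[0, 0]
    for p in range(1, Tb):                       # first column of eq. (4)
        D[p, 0] = d[p, 0] + D[p - 1, 0]
    for q in range(1, Tc):                       # first row of eq. (4)
        D[0, q] = d[0, q] + D[0, q - 1]
    for p in range(1, Tb):                       # general case
        for q in range(1, Tc):
            D[p, q] = d[p, q] + min(D[p - 1, q], D[p, q - 1], D[p - 1, q - 1])
    return D[-1, -1]
```

In the paper both series have T_b observations, but the sketch accepts unequal lengths, which plain DTW also supports.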
Eros is based on principal component analysis and computes the similarity between two MTS items by measuring how close the corresponding principal components are, using the eigenvalues as weights. In our case, X^b and X^c are two multivariate time series items of n features and T_b observations. V^b = [v^b_1, ..., v^b_n] and V^c = [v^c_1, ..., v^c_n] are two right eigenvector matrices obtained by applying a singular value decomposition (SVD) to the covariance matrices M^b and M^c of the features in X^b and X^c, respectively. The Eros similarity of X^b and X^c is defined as follows:

$$Eros(X^b, X^c, w) = \sum_{i=1}^{n} w_i \left| \left\langle v^b_i, v^c_i \right\rangle \right|, \quad (5)$$

where ⟨v^b_i, v^c_i⟩ is the inner product of v^b_i and v^c_i, and w is a weight vector based on the eigenvalues of the MTS data set (see [9] for details), with Σ_{i=1}^{n} w_i = 1.

4 Multidimensional Data Visualization

The method proposed in Section 2 for finding the best subsequences matching the sample is based on the visual presentation and analysis of multidimensional points whose coordinates are the values of the similarity measures computed for a pair of subsequences. The visualization technology is introduced below. For effective data analysis, it is important to include a human in the data exploration process and to combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and computational power of today's computers. Visual data mining aims at integrating the human into the data analysis process, applying the human's perceptual abilities to the analysis of the large data sets available in today's computer systems. Visualization finds wide application in medical data analysis, too [18], [19]. The goal of a projection method is to represent the input data items in a lower-dimensional space so that certain properties of the structure of the data set are preserved as faithfully as possible.
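Formula (5) can be sketched as follows. Assumptions worth noting: the feature covariance matrices are estimated with `np.cov` (rows as features), and the weight vector w is supplied by the caller (e.g., uniform weights), whereas [9] derives w from aggregated eigenvalues of the whole MTS data set.

```python
import numpy as np

def eros(Xb, Xc, w):
    """Eros similarity of eq. (5).  The v_i are right eigenvectors obtained by
    SVD of the feature covariance matrices; w are weights summing to 1
    (supplied by the caller in this sketch)."""
    Mb = np.cov(Xb)                     # n x n covariance of the features of Xb
    Mc = np.cov(Xc)
    _, _, Vb = np.linalg.svd(Mb)        # rows of Vb are the right eigenvectors
    _, _, Vc = np.linalg.svd(Mc)
    return sum(wi * abs(np.dot(vb, vc)) for wi, vb, vc in zip(w, Vb, Vc))
```

Because each |⟨v^b_i, v^c_i⟩| lies in [0, 1] for unit eigenvectors and the weights sum to 1, Eros always lies in [0, 1], with 1 for coinciding eigenvector sets.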
The projection can be used to visualize a data set if a rather small output dimensionality is chosen. One of these methods is the well-known principal component analysis (PCA) [6], which displays the data as a linear projection onto the subspace of the original data space that best preserves the variance in the data. PCA cannot embrace nonlinear structures, consisting of arbitrarily shaped clusters or curved manifolds, since it describes the data in terms of a linear subspace. Therefore, several methods have been proposed for reproducing nonlinear higher-dimensional structures on a lower-dimensional display: multidimensional scaling and its modifications [6], [20], [21], [22], Isomap [23], locally linear embedding [24], etc. Various neural network approaches are used for this aim as well (see, e.g., [25], [26], [27]).

Multidimensional scaling (MDS) is a group of methods that project multidimensional data to a low-dimensional (usually two-dimensional) space and preserve the interpoint distances of the data as much as possible. Let us have the m-dimensional points S_q = (S_{q1}, S_{q2}, ..., S_{qm}), q = 1, ..., t (S_q ∈ R^m). The problem is to obtain the projection of these points onto the plane R^2. The two-dimensional points Y_1, Y_2, ..., Y_t ∈ R^2 correspond to them; here Y_q = (y_{q1}, y_{q2}), q = 1, ..., t. Denote the distance between the points S_q and S_p by d*_{qp}, and the distance between the corresponding points Y_q and Y_p in the projected space by d_{qp}. In our case, the initial dimensionality is m, and the resulting one is 2 (2-D). Naturally, 1-D and 3-D projections could be considered, too. However, in the 1-D case, we lose knowledge that can be obtained from 2-D or 3-D views, and the advantages of 3-D can be achieved only when special means of presenting such data on the screen are applied.
Therefore, 2-D projections of multidimensional data are commonly used. There exists a multitude of variants of MDS with slightly different so-called stress functions. In our experiments, the raw stress is minimized:

$$E_{MDS} = \sum_{q=1}^{t} \sum_{p=q+1}^{t} \left( d^{*}_{qp} - d_{qp} \right)^2.$$
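Minimizing the raw stress can be sketched with plain gradient descent. This is illustrative only, not the optimizer used in the paper: the learning rate, iteration count, and random initialization are arbitrary choices, and production MDS implementations use more robust schemes such as SMACOF.

```python
import numpy as np

def raw_stress(Dstar, Y):
    """Raw stress: sum over pairs q < p of (d*_qp - d_qp)^2."""
    d = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)
    return np.sum(np.triu(d - Dstar, k=1) ** 2)

def mds_raw_stress(Z, n_iter=300, lr=0.01, seed=0):
    """Project the rows of Z to 2-D by gradient descent on the raw stress."""
    rng = np.random.default_rng(seed)
    t = Z.shape[0]
    Dstar = np.linalg.norm(Z[:, None] - Z[None, :], axis=2)  # target distances d*_qp
    Y = rng.standard_normal((t, 2))                          # random 2-D start
    for _ in range(n_iter):
        diff = Y[:, None] - Y[None, :]                       # pairwise vectors Y_q - Y_p
        d = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(d, 1.0)                             # avoid division by zero
        # gradient of the raw stress w.r.t. Y_q (constant factor absorbed into lr)
        grad = (((d - Dstar) / d)[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```

Applied to the matrix Z of Section 2, the returned Y plays the role of the km + 2 points displayed on the plane.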