Mathematical Problems of Computer Science 44, 145--153, 2015.

Application of Multivariate Statistical Analysis in Process Control

Tigran Z. Khachikyan and Sahak M. Narimanyan

Yerevan State University
Department of Probability Theory and Mathematical Statistics
e-mail: tkhach@inbox.ru, sahakn@yandex.ru

Abstract

Owing to the significant growth of information systems and their intensive use in everyday life, several problems need to be addressed: automatic identification of system faults, finding the times of drastic change in stochastic characteristics, and locating the characteristics that “went out of control”. To solve these problems, we propose an algorithm based on multivariate statistical analysis. The algorithm is implemented in the R software environment and tested on custom metrics of the Vesta server and on other groups of random metrics.

Keywords: Principal component analysis, level of significance, $T^2$ statistic, $Q$ statistic, loading matrix, score matrix, eigenvalues, covariance matrix, upper critical value.

1. Introduction

Owing to the significant growth of information systems and their intensive use in everyday life, several problems need to be addressed: automatic identification of system faults, finding the times of drastic change in stochastic characteristics, and locating the characteristics that “went out of control”. Normal operation of a process is usually described by several characteristics, which may be correlated with each other. In that case, analyzing the characteristics individually may lead to significant errors, because each characteristic has its own confidence interval and the joint level of significance cannot be determined.
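To illustrate why separate per-characteristic tests do not control the joint level of significance, the following minimal sketch computes the probability of at least one false alarm when $m$ characteristics are each tested at level $\alpha$. It assumes the characteristics are independent, which is a simplification made here for illustration only; for correlated characteristics even this simple formula is unavailable, which is what motivates a single multivariate test. The function name and the values 16 and 0.05 are hypothetical choices.

```python
def joint_false_alarm_rate(m: int, alpha: float) -> float:
    """Probability that at least one of m tests, each run at significance
    level alpha, raises a false alarm.

    Illustrative assumption: the m characteristics are independent.
    """
    if m < 1 or not 0.0 < alpha < 1.0:
        raise ValueError("need m >= 1 and 0 < alpha < 1")
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    # Sixteen metrics monitored one by one, each at the 5% level.
    print(f"{joint_false_alarm_rate(16, 0.05):.3f}")
```

Under the independence assumption, sixteen separate tests at the 5% level already give a joint false alarm probability above one half, far from the nominal 5%.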
To solve this problem, we use principal component analysis (PCA) together with Hotelling’s criterion based on the $T^2$ statistic, which is known to be the uniformly most powerful test of the null hypothesis on the mean vector in the class of all randomized tests invariant under similarity transformations. We also use the $Q$ statistic for the residual matrix of the dataset after PCA prediction.

2. Method

Let $X = (X_1, X_2, \ldots, X_m)$ be a multivariate random variable whose components characterize the state of a system and have the joint normal probability density function

$$f(x) = (2\pi)^{-m/2}\, |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right\}, \qquad x = (x_1, x_2, \ldots, x_m),$$

where $\mu$ is the vector of mean values and $\Sigma$ is the covariance matrix. To test the hypothesis $H_0: \mu = \mu_0$ in the multivariate case, we use Hotelling’s $T^2$ statistic ([1])

$$T^2 = n\, (\bar{X} - \mu_0)^{\top} S^{-1} (\bar{X} - \mu_0),$$

where $n$, $n > m$, is the sample size, $S$ is the sample estimate of the covariance matrix $\Sigma$, and $\bar{X}$ is the sample mean vector of $X$.

We compute the value of the $T^2$ statistic at each time instant $t$ ($t = 1, 2, \ldots$). To do so, we take the historical data of fixed sample size $n$ preceding time $t$ and shift $t$ forward while keeping the sample size unchanged. We denote these values by $T_t^2$.

If the covariance matrix $\Sigma$ is known, the $T^2$ statistic has a $\chi^2$ distribution and

$$T_{cr}^2 = q_{\chi_m^2}(1 - \alpha),$$

where $q_{\chi_m^2}(1 - \alpha)$ is the $(1 - \alpha)$ quantile of the $\chi_m^2$ distribution and $\alpha$ is the significance level. When the covariance matrix $\Sigma$ is unknown, the upper critical value of the $T^2$ statistic is calculated as

$$T_{cr}^2 = \frac{m (n - 1)}{n - m}\, q_f(1 - \alpha, m, n - m),$$

where $q_f$ is the quantile of Fisher’s distribution with parameters $(m, n - m)$ and $\alpha$ is the significance level. In this case, under $H_0$, the $T^2$ statistic has the probability density function

$$f(x) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n - m}{2}\right)}\, (n - 1)^{-m/2}\, x^{\frac{m}{2} - 1} \left(1 + \frac{x}{n - 1}\right)^{-\frac{n}{2}}, \qquad x > 0.$$

We consider the studied system to operate normally when $T_t^2 < T_{cr}^2$.

The method of principal components (PC) ([2]) is one way to reduce the dimension of a dataset while losing a minimal amount of information. For a multivariate random variable, we construct an orthogonal coordinate system that transforms the correlated variables into new, uncorrelated variables. The principal components are ordered by decreasing sample variance. We keep as many principal components as are needed for their cumulative sample variance to reach 95% of the total variance. This method yields a loading matrix and a score matrix. The values of the new variables form the factor scores, which can be interpreted geometrically as the projections of the observations onto the principal components. The loadings are simply the coordinates of the original variables in the principal component space.

The loading matrix $P$ has dimensions $m \times k$, where $m$ is the dimension of the space and $k$ is the number of principal components. The score matrix $T$ has dimensions $n \times k$, where $n$ is the sample size. The residual matrix is $R = X - T P^{\top}$, where $X$ is the dataset matrix, $X = (X_1, X_2, \ldots, X_m)$.

By using the method of principal components we reduce the dimension of the dataset space and hence lose some information. To estimate the influence of the other parameters on the system, we consider the $Q$ statistic for the residual matrix, introduced by Jackson ([2]). The $Q$ statistic is

$$Q = r^{\top} r,$$

where $r$ is the residual vector of an observation, that is, a row of the residual matrix $R$ written as a column vector.
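The decomposition into scores, loadings and residuals described above, and the resulting $Q$ value per observation, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' R implementation: the function names (`covariance`, `jacobi_eigh`, `pca_q_statistics`) and the toy data are hypothetical, the data are only centered (not scaled), and a cyclic Jacobi eigen-solver is used solely to keep the example free of third-party dependencies; in practice a library routine for symmetric eigenproblems would normally be used instead.

```python
import math


def covariance(X):
    """Return (mean vector, sample covariance matrix) of the rows of X."""
    n, m = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(m)]
    S = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in X) / (n - 1)
          for j in range(m)] for i in range(m)]
    return mean, S


def jacobi_eigh(S, sweeps=100, tol=1e-24):
    """Eigenvalues and eigenvectors (columns) of a symmetric matrix via cyclic
    Jacobi rotations, sorted by decreasing eigenvalue."""
    m = len(S)
    A = [row[:] for row in S]
    V = [[float(i == j) for j in range(m)] for i in range(m)]
    scale = sum(a * a for row in A for a in row)
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(m) for j in range(m) if i != j)
        if off <= tol * scale:
            break
        for p in range(m - 1):
            for q in range(p + 1, m):
                if A[p][q] == 0.0:
                    continue
                d = A[q][q] - A[p][p]
                theta = math.pi / 4 if d == 0.0 else 0.5 * math.atan(2.0 * A[p][q] / d)
                c, s = math.cos(theta), math.sin(theta)
                for i in range(m):  # A <- A J (columns p and q)
                    aip, aiq = A[i][p], A[i][q]
                    A[i][p], A[i][q] = c * aip - s * aiq, s * aip + c * aiq
                for i in range(m):  # A <- J^T A (rows p and q)
                    api, aqi = A[p][i], A[q][i]
                    A[p][i], A[q][i] = c * api - s * aqi, s * api + c * aqi
                A[p][q] = A[q][p] = 0.0  # annihilated by construction
                for i in range(m):  # V <- V J accumulates the eigenvectors
                    vip, viq = V[i][p], V[i][q]
                    V[i][p], V[i][q] = c * vip - s * viq, s * vip + c * viq
    order = sorted(range(m), key=lambda j: A[j][j], reverse=True)
    return [A[j][j] for j in order], [[V[i][j] for j in order] for i in range(m)]


def pca_q_statistics(X, variance_share=0.95):
    """Keep the leading components whose cumulative variance reaches
    variance_share; return (k, P, Q), where P is the m x k loading matrix and
    Q[i] is the squared length of the residual of observation i."""
    mean, S = covariance(X)
    lam, V = jacobi_eigh(S)
    total, running, k = sum(lam), 0.0, 0
    while k < len(lam) and running < variance_share * total:
        running += lam[k]
        k += 1
    P = [row[:k] for row in V]
    m = len(mean)
    Q = []
    for row in X:
        x = [a - mu for a, mu in zip(row, mean)]                      # centered observation
        t = [sum(x[i] * P[i][j] for i in range(m)) for j in range(k)]  # scores, t = x P
        r = [x[i] - sum(t[j] * P[i][j] for j in range(k)) for i in range(m)]  # residual
        Q.append(sum(v * v for v in r))
    return k, P, Q


if __name__ == "__main__":
    # Toy data (hypothetical): the third metric is roughly the sum of the first two.
    data = [[1.0, 2.0, 3.1], [2.0, 1.0, 2.9], [3.0, 4.0, 7.0],
            [4.0, 3.0, 7.1], [5.0, 6.0, 10.9], [6.0, 5.0, 11.0]]
    k, P, Q = pca_q_statistics(data)
    print("components kept:", k)
    print("Q per observation:", [round(q, 4) for q in Q])
```

A useful sanity check on such a sketch is that the sum of the $Q$ values over the sample equals $(n - 1)$ times the sum of the discarded eigenvalues, since the residuals carry exactly the variance of the components that were dropped.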
The upper critical value of the $Q$ statistic is

$$Q_{cr} = \theta_1 \left[ \frac{c_\alpha \sqrt{2 \theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0},$$

where

$$\theta_i = \sum_{j = k + 1}^{m} \lambda_j^i, \quad i = 1, 2, 3, \qquad h_0 = 1 - \frac{2 \theta_1 \theta_3}{3 \theta_2^2}.$$

Here $c_\alpha$ is the $(1 - \alpha)$ quantile of the standard normal distribution and $\lambda_j$ are the eigenvalues of the sample covariance matrix $S$.

3. Implementation

1) Let $t_0, t_1, t_2, \ldots$ be the arrival times of the data, with $t_i - t_{i-1} = \mathrm{const}$. Let $\Delta$ be the length of the interval of historical data. We investigate the data matrix on the time interval $[t_k - \Delta, t_k]$, $k = 0, 1, \ldots$

2) At time $t_0$, we remove from the data matrix those variables that are constant (do not change over the time interval). The resulting matrix represents a multivariate data sample at time $t_0$. This matrix can then be normalized.

3) We apply the PC method and obtain the components that provide 95% of the total variance, together with the score matrix, the loading matrix, and the residual matrix.

4) We calculate the upper critical values $T_{cr}^2$ and $Q_{cr}$ at time $t_0$. Then we calculate both statistics $T^2$ and $Q$ at time $t_0$. If $T^2 < T_{cr}^2$ and $Q < Q_{cr}$, the system functions normally. Otherwise, if $T^2 \geq T_{cr}^2$ or $Q \geq Q_{cr}$, the null hypothesis is rejected and it is assumed that the system “went out of control”, i.e., a malfunction occurred. Automatic alert messages can be sent to system administrators so that they can take measures.

5) We calculate the weights of the individual parameters in the $T^2$ and $Q$ statistics. Parameters with significant weight can be considered the cause of the $T^2$ or $Q$ statistic going out of control.

For the next time $t_1$, we return to step 1) and repeat the procedure.

4. Computerized and Real-life Example

In the real-life example below, the system monitors 16 different parameters of the VESTA system working on real multivariate data. Applying the suggested algorithm with the PC method, the following results are obtained.

Fig. 1. The $Q$ statistic transformed to unity, $Q / Q_{cr}$. At time 14:12 the value of the $Q$ statistic is greater than 1, so the system “went out of control”.

Fig. 2. At time 14:12 the graph depicts the metrics with the highest weight, which cause the value of the $Q$ statistic to be greater than 1.

Fig. 3. The graph of $T^2$ transformed to unity, $T^2 / T_{cr}^2$. At time 16:32 the value of the $T^2$ statistic is greater than 1, so the system “went out of control”.

Fig. 4. At time 16:32 the graph depicts the metrics with the highest weight, which cause the value of the $T^2$ statistic to be greater than 1.

5. Conclusion

Principal component analysis (PCA) is one of the most widely used multivariate statistical techniques and is applied in many scientific disciplines. The PCA method can be successfully applied to many IT-related problems, such as automatic identification of system faults, finding the times of drastic change in stochastic characteristics, and locating the characteristics that “went out of control”. Monitoring custom metrics is a particular problem for system administrators, because relevant thresholds are absent and the number of such metrics can be large. To solve this problem, the suggested algorithm uses principal component analysis (PCA) together with Hotelling’s criterion based on the $T^2$ statistic, which is known to be the uniformly most powerful test of the null hypothesis on the mean vector in the class of all randomized tests invariant under similarity transformations.
The $Q$ statistic is also used for the residual matrix of the dataset after PCA prediction. The algorithm is applied to a real-life situation in which the VESTA system monitors 16 different parameters on real multivariate data. The algorithm enables system administrators to identify the times at which the system “went out of control” and to locate the “problematic” parameters causing such events. Automatic alert messaging and a control mechanism can be organized to warn system administrators to take measures.

References

[1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd ed., Wiley Series in Probability and Mathematical Statistics, 2003.
[2] J. E. Jackson, A User’s Guide to Principal Components, Wiley Series in Probability and Mathematical Statistics, 1991.

Submitted 23.07.2015, accepted 27.11.2015

Application of Multivariate Statistical Analysis in the Field of Process Control

T. Khachikyan and S. Narimanyan

Abstract

Because of the significant growth of information systems and their intensive use in our everyday life, it becomes necessary to deal with numerous problems, such as automatic detection of system faults, determining the moments of sharp change in stochastic characteristics, and identifying the characteristics that “went out of control”. To solve such problems, we propose an algorithm based on multivariate statistical analysis. The algorithm is implemented in the R programming language and tested on random characteristics of the Vesta server, as well as on other groups of random metrics.

Application of Multivariate Statistical Analysis for Process Control

T. Khachikyan and S. Narimanyan

Abstract

In connection with the significant growth of information systems and their intensive use in our everyday life, problems arise such as automatic identification of system faults, finding the times of abrupt change in stochastic characteristics, and determining the characteristics that “went out of control”. To solve such problems, we propose an algorithm based on multivariate statistical analysis. The algorithm is implemented in the R programming environment and tested on random metrics of the Vesta server and on other groups of random metrics.