Начиная с начала 2000 года осуществляется внедрение GHIS в здравоохранении, в рамках принятого проекта о реформирование информ Mathematical Problems of Computer Science 44, 138--144, 2015. The Optimal Approach for Kolmogorov-Smirnov Test Calculation in High Dimensional Space Norayr Z. Akopov and Narek H. Martirosyan Yerevan Physics Institute e-mail: -narek@mail.yerphi.am Abstract Numerical estimation of the Kolmogorov-Smirnov discrepancy DN in high dimensional space is an extremely time and memory consuming problem. New approach with the minimal bin number, which essentially reduces the time and memory requirements, to perform the DN tests in two and more dimensional space is discussed. Keywords: Statistics, Kolmogorov–Smirnov test, Goodness-of-Fit tests, PRNG. 1. Introduction One of the most important tasks nowadays is the numerical experiments on super computers with the modeling of the sophisticated physics system. The Monte Carlo method is used for solving the problems where the high dimensional integration is involved. To use the Monte Carlo method for analysis of the high energy physics experimental and theoretical problems we have to solve the problem of the quality of the pseudo-random generators, which should have a strong statistical feature, also a large period of sequences and a high speed of generating of the pseudo- random number. The Kolmogorov-Smirnov (KS) test is used to compare how well the empirical cumulative distribution function (ECDF) of a sample fits the cumulative distribution function (CDF) of the reference distribution by computing the maximum distance 𝐷𝐷𝑚𝑚𝑚𝑚𝑚𝑚 between these two functions. The KS test is applied to exactly continuous data, and for dimensions 𝑑𝑑 ≥ 2 we have to compute the distance on infinite points then take a maximum, because when 𝑑𝑑 ≥ 2 for CDF and ECDF we have d-dimensional surfaces. The KS test is widely used as a powerful statistical test to check the quality of different pseudo-random number generators (PRNGs). 138 https://en.wikipedia.org/wiki/Empirical_distribution_function https://en.wikipedia.org/wiki/Empirical_distribution_function https://en.wikipedia.org/wiki/Cumulative_distribution_function N. Akopov and N. Martirosyan 139 Several algorithms for computing the two-dimensional KS test were proposed [1-3]. We propose another approach for 2 and higher dimensions. Our method is based on discretization of the space introducing a binning technique applied to continuous multidimensional data. This technique does not correspond to the usual KS test principle, but we show that it is possible to compute 𝐷𝐷𝑚𝑚𝑚𝑚𝑚𝑚 very precisely with the quite computationally efficient algorithm taking minimal bin number. Here our KS test results are obtained using the well-tested uniform Mersenne Twister PRNG [4], which is included in CERN library [5]. In the proposed paper we describe a new technique, which allows to reduce essentially the number of the bins and correspondingly decrease the needed time and memory used. 2. Formalism In one-dimensional (1D) case with N random numbers to check if they come from uniform distribution in the (0, 1) range, for which the CDF is equal to: 𝐹𝐹(𝑥𝑥) = 𝑥𝑥, we need to compute 𝐷𝐷𝑁𝑁. Usual (unbinned) approach [6- 8] in 1D case looks as follows:  N numbers should be sorted in increasing order {𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑁𝑁}, 𝑥𝑥𝑖𝑖 ≤ 𝑥𝑥𝑖𝑖+1  then 𝐷𝐷𝑁𝑁 is computed in the following way defining ECDF as: 𝐹𝐹𝑁𝑁(𝑥𝑥) = 𝑘𝑘 𝑁𝑁 , 𝑥𝑥 = 𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑁𝑁, where k is the number of random points out of N with 𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑘𝑘 ≤ 𝑥𝑥 𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚≤1 |(𝐹𝐹𝑁𝑁(𝑥𝑥) − 𝑥𝑥)| = max1≤𝑖𝑖≤𝑁𝑁 �� 𝑖𝑖 𝑁𝑁 − 𝑥𝑥𝑖𝑖� , � 𝑖𝑖 − 1 𝑁𝑁 − 𝑥𝑥𝑖𝑖��. The unbinned approach can be also applied for 2D case following the algorithm described in [1], which has been used for the performed studies. In the 1D binned approach the (0, 1) interval is divided into 𝑛𝑛 bins, then N random numbers are distributed in 𝑛𝑛 bins according to their values. The ECDF is again defined as: 𝐹𝐹𝑁𝑁(𝑥𝑥) = 𝑘𝑘 𝑁𝑁 . The difference is that here 𝑥𝑥 is already the bin edge: 𝑥𝑥 = 𝑖𝑖 𝑛𝑛 , 𝑖𝑖 = 1,2, … , 𝑛𝑛. For example, if 𝑛𝑛 = 10, we have the bins edges like: {0.1, 0.2, … , 0.9,1} . Therefore, the advantage of the method is that if N is quite large then ECDF can be defined with a relatively small number of points (𝑛𝑛 < 𝑁𝑁). 𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚≤1 |(𝐹𝐹𝑁𝑁(𝑥𝑥) − 𝑥𝑥)| = max1≤𝑖𝑖≤𝑛𝑛 |𝐹𝐹𝑁𝑁(𝑥𝑥𝑖𝑖) − 𝑥𝑥𝑖𝑖| = max1≤𝑖𝑖≤𝑛𝑛 � ∑ 𝑌𝑌𝑘𝑘𝑖𝑖𝑘𝑘=1 𝑁𝑁 − 𝑖𝑖 𝑛𝑛 �. The Optimal Approach for Kolmogorov-Smirnov Test Calculation in High Dimensional Space 140 In 1D case there exists the theoretical distribution function for the value of 𝐾𝐾𝑁𝑁 = √𝑁𝑁 𝐷𝐷𝑁𝑁 (also the mean value and dispersion for 𝐾𝐾𝑁𝑁), which is used to make a decision to accept or reject the hypothesis concerning the uniform distribution of the tested sample of random numbers. This binned method can be generalized for dimensions 2, 3 and higher. In two-dimensional case for 𝑥𝑥1 ≤ 𝑋𝑋1, 𝑥𝑥2 ≤ 𝑋𝑋2 the corresponding expression for 𝐷𝐷𝑁𝑁 is: 𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚1≤1 0≤𝑚𝑚2≤1 |(𝐹𝐹𝑁𝑁(𝑥𝑥1, 𝑥𝑥2) − 𝑥𝑥1𝑥𝑥2)| = max1≤𝑖𝑖≤𝑛𝑛 1≤𝑗𝑗≤𝑛𝑛 � ∑ ∑ 𝑌𝑌𝑘𝑘𝑚𝑚 𝑗𝑗 𝑚𝑚=1 𝑖𝑖 𝑘𝑘=1 𝑁𝑁 − 𝑖𝑖𝑖𝑖 𝑛𝑛2 �. where 𝑌𝑌𝑘𝑘𝑚𝑚 is the number of observations that actually do fall into category {𝑘𝑘, 𝑚𝑚}, 𝑁𝑁 is the count of all observations, n is the number of bins used along the x and y. In d-dimension case of 𝑥𝑥1 ≤ 𝑋𝑋1, 𝑥𝑥2 ≤ 𝑋𝑋2, … … . , 𝑥𝑥𝑑𝑑 ≤ 𝑋𝑋𝑑𝑑 the expression for KS test is as follows: 𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚1≤1 0≤𝑚𝑚2≤1… 0≤𝑚𝑚𝑑𝑑≤1 |(𝐹𝐹𝑁𝑁(𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑑𝑑) − 𝑥𝑥1𝑥𝑥2…𝑥𝑥𝑑𝑑)| = max1≤𝑖𝑖1≤𝑛𝑛 1≤𝑖𝑖2≤𝑛𝑛… 1≤𝑖𝑖𝑑𝑑≤𝑛𝑛 � ∑ ∑ … ∑ 𝑌𝑌𝑘𝑘1𝑘𝑘2…𝑘𝑘𝑑𝑑 𝑖𝑖𝑑𝑑 𝑘𝑘𝑑𝑑 𝑖𝑖2 𝑘𝑘2=1 𝑖𝑖1 𝑘𝑘1=1 𝑁𝑁 − 𝑖𝑖1𝑖𝑖2 … 𝑖𝑖𝑑𝑑 𝑛𝑛𝑑𝑑 �. With any of schemes to calculate the average 𝐾𝐾𝑁𝑁 described above (binned and unbinned) we can estimate also the statistical uncertainties Δ𝐾𝐾𝑁𝑁 making sampling for calculated values of 𝐾𝐾𝑁𝑁 and 𝐾𝐾𝑁𝑁2 with the certain number M of the used samples. The mean value for 𝐾𝐾𝑁𝑁 averaging over M samples is given by: < 𝐾𝐾𝑁𝑁 >= 1 𝑀𝑀 �𝐾𝐾𝑁𝑁 𝑖𝑖 , 𝑀𝑀 𝑖𝑖=1 and mean squared 𝐾𝐾𝑁𝑁 is defined as: < 𝐾𝐾𝑁𝑁2 >= 1 𝑀𝑀 �(𝐾𝐾𝑁𝑁 𝑖𝑖 𝑀𝑀 𝑖𝑖=1 )2, then we can calculate 𝜎𝜎𝐾𝐾𝑁𝑁 2 as: 𝜎𝜎𝐾𝐾𝑁𝑁 2 =< 𝐾𝐾𝑁𝑁2 > −< 𝐾𝐾𝑁𝑁 >2. And finally, based on the central limiting theorem we can calculate the statistical uncertainty Δ𝐾𝐾𝑁𝑁 as: Δ𝐾𝐾𝑁𝑁 = 1 √𝑀𝑀 �𝜎𝜎𝐾𝐾𝑁𝑁 2 . 3. Results and Discussion The first set of the obtained results is related to the unbinned approach, where in 1D case we can compare not only the obtained average < 𝐾𝐾𝑁𝑁 > with the predicted theoretical value of 0.8687311605.. [7], but also the observed and theoretical (TH) distributions for 𝐾𝐾𝑁𝑁 (see Fig. 1). N. Akopov and N. Martirosyan 141 Fig 1. TH (curve) and observed (red points) for KN distributions. On Fig. 2 one can see the observed distribution obtained for 2D case, which is systematically shifted in respect to 1D TH distribution. Fig 2. TH distribution (curve) for 1D and observed distribution (red points) for 2D cases. Then the next set of results, which is related to the binned method, is shown in Figs. 3-4. In Fig.3 one can see that for 𝑁𝑁 = 105 the number of 𝑛𝑛 = 104 bins is enough to get the exact distribution which was the goal of this paper. The Optimal Approach for Kolmogorov-Smirnov Test Calculation in High Dimensional Space 142 Fig 3. 1D case: TH distribution (curve) and binned distributions (colored points). In Fig. 4 the < 𝐾𝐾𝑁𝑁 > dependence on the number of bins is presented for 1, 2 and 3D cases. Here one can recognize the clear dependence of the obtained values for < 𝐾𝐾𝑁𝑁 > on the number of the bins (n) used. Taking into account the obtained saturating shape on the n for such dependences we performed the fit with the following fitting function: 𝑓𝑓𝑓𝑓𝑖𝑖𝑓𝑓 𝑝𝑝 → (𝑥𝑥) = 𝑝𝑝1 + 𝑝𝑝2 exp(−𝑝𝑝3𝑥𝑥). The meaning of the parameter 𝑝𝑝1 corresponds to the saturation level, which can be achieved with quite large value of n. Although, using very high values for the bins number it is impossible in practice to calculate the 𝐾𝐾𝑁𝑁 in the sense of the needed CPU time and memory, especially in case of higher (>4) dimensional space. That is why the idea to use a limited number of points over the n, then estimate the saturation parameter (𝑝𝑝1), and then make a correction for the average value of𝐾𝐾𝑁𝑁, estimated with e.g., n=10, introducing the correction factor: 𝐶𝐶10 = 𝑝𝑝1 − < 𝐾𝐾𝑁𝑁(𝑛𝑛 = 10) >, seems to be very interesting and effective in order to realize the procedure of the < 𝐾𝐾𝑁𝑁 > estimation in case of high dimensional space. Also the exponential coefficient (𝑝𝑝3), which regulates the speed of convergence to the saturation level, can be used to check the quality of different PRNGs. It should be noted, that the used fitting function is quite stable in respect to the variation of the fitting points used. In 3D case with the total number of points (ntotal=11), this variation was 11, 9, 7, 5. In 2D case with the ntotal=19: 19, 15, 11, 7, and in 1D case with ntotal=75: 75, 55, 35, 15. This is an important feature of the functional form used for a fit, because in case of essentially higher dimensions one can compute with the binned method a limited number of points, perform the fit and estimate the saturation level. In further studies the authors are going to provide the dependence as a function of the space dimension, which is very interesting due to the lack of knowledge on TH distribution for high dimension space. N. Akopov and N. Martirosyan 143 Fig. 4. Average KN dependence on the number of bins (n). The 1, 2 and 3D cases are presented with the fit parameters for each case. Different colors along the curves correspond to the variation of the number of fitting points used (see explanation in the text). 4. Conclusion The optimization of the Kolmogorov-Smirnov discrepancy DN calculation with the binned method for high dimensional spaces by means of reducing the used bin number is performed. Developed approach allows to extend the estimation of DN to very high dimensional spaces with reasonable requirements to the CPU time and memory, and, thus, to provide very important tests of the PRNGs, which are used to apply the Monte Carlo method for solving of essentially high dimensional physics problems. References [1] A. Justel, D. Pena and R. Zamar, “A multivariate Kolmogorov-Smirnov test of goodness of Fit”, Statistics & Probability Letters 35, pp. 251-259, 1997. [2] J. A. Peacock, “Two-dimensional goodness-of-fit testing in astronomy”, Monthly Notices Royal Astronomy Society, vol. 202, pp. 615-627, 1983. [3] G. Fasano and A. Franceschini, “A multidimensional version of the Kolmogorov-Smirnov Test”, Monthly Notices Royal Astronomy Society, vol. 225, pp. 155-170, 1987. The Optimal Approach for Kolmogorov-Smirnov Test Calculation in High Dimensional Space 144 [4] M. Matsumoto and T. Nishimura, “Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator”, ACM Transactions on Modeling and Computer Simulation, vol. 8, pp. 3-30, 1998. [5] [Online]. Available: https://cern.ch/cernlib [6] M. A. Stephens, “EDF statistics for goodness of fit and some comparisons”, Journal of the American Statistical Association (American Statistical Association), vol. 69, pp. 730–737, 1974. [7] G. Marsaglia, W. W. Tsang and J. Wang, “Evaluating Kolmogorov’s distribution”, Journal of Statistical Software, vol. 8, pp. 1-4, 2003. [8] D. Knuth, The Art of Computer Programming, V.2, Addison-Wesley Publishing Company, 1981. Submitted 30.07.2015, accepted 19.11.2015 Օպտիմալ մոտեցում Կոլմոգորով-Սմիրնով թեստի հաշվարկի համար բարձր չափողականությամբ տարածությունում Ն. Ակոպով և Ն. Մարտիրոսյան Ամփոփում Կոլմոգորով-Սմիրնով թեստի 𝐷𝐷𝑁𝑁 թվային գնահատականը բարձր չափողակա- նությամբ տարածությունում բավականին ժամանակատար և հիշողություն պահանջող խնդիր է: Առաջարկվող հոդվածում քննարկվում է նվազագույն բինների քանակով նոր մոտեցումը, որը էապես նվազեցնում է ժամանակի և հիշողության պահանջները 𝐷𝐷𝑁𝑁 մեծությունը 2 և ավելի բարձր չափողականությամբ տարածու- թյունում հաշվելու համար: Оптимальный подход для численных реализаций теста Колмогорова- Смирнова в пространствах высокой размерности Н. Акопов и Н. Мартиросян Аннотация Численные оценки дискрепанса Колмогорова-Смирнова 𝐷𝐷𝑁𝑁 в пространствах высокой размерности являются существенной проблемой в плане необходимых ресурсов пaмяти и времени. В статье обсуждается новый подход для вычисления величины 𝐷𝐷𝑁𝑁 в пространствах двух и более высоких размерностей с минимальным числом разбиений, который существенно снижает требования к памяти и времени. http://www.jstatsoft.org/v08/i18/paper