Начиная с начала 2000 года осуществляется внедрение GHIS в здравоохранении, в рамках принятого проекта о реформирование информ


Mathematical Problems of Computer Science  44,  138--144, 2015. 

 
The Optimal Approach for Kolmogorov-Smirnov Test 
Calculation in High Dimensional Space 

 
Norayr Z. Akopov and Narek H. Martirosyan  

 
Yerevan Physics Institute 

e-mail: -narek@mail.yerphi.am 
 
 
Abstract 
 

Numerical estimation of the Kolmogorov-Smirnov discrepancy DN in high 
dimensional space is an extremely time and memory consuming problem. New 
approach with the minimal bin number, which essentially reduces the time and 
memory requirements, to perform the DN tests in two and more dimensional space 
is discussed. 
  Keywords: Statistics, Kolmogorov–Smirnov test, Goodness-of-Fit tests, 

PRNG. 
 
 
1. Introduction  
 
One of the most important tasks nowadays is the numerical experiments on super computers with 
the modeling of the sophisticated physics system. The Monte Carlo method is used for solving 
the problems where the high dimensional integration is involved. To use the Monte Carlo 
method for analysis of the high energy physics experimental and theoretical problems we have to 
solve the problem of the quality of the pseudo-random generators, which should have a strong 
statistical feature, also a large period of sequences and a high speed of generating of the pseudo-
random number. 

The Kolmogorov-Smirnov (KS) test is used to compare how well the empirical cumulative 
distribution function (ECDF) of a sample fits the cumulative distribution function (CDF) of the 
reference distribution by computing the maximum distance 𝐷𝐷𝑚𝑚𝑚𝑚𝑚𝑚 between these two functions. 
The KS test is applied to exactly continuous data, and for dimensions 𝑑𝑑 ≥ 2 we have to compute 
the distance on infinite points then take a maximum, because when 𝑑𝑑 ≥ 2 for CDF and ECDF 
we have d-dimensional surfaces. The KS test is widely used as a powerful statistical test to check 
the quality of different pseudo-random number generators (PRNGs). 

 
138 

 
https://en.wikipedia.org/wiki/Empirical_distribution_function
https://en.wikipedia.org/wiki/Empirical_distribution_function
https://en.wikipedia.org/wiki/Cumulative_distribution_function


N. Akopov and N. Martirosyan 
 

139 

Several algorithms for computing the two-dimensional KS test were proposed [1-3].  We 
propose another approach for 2 and higher dimensions. Our method is based on discretization of 
the space introducing a binning technique applied to continuous multidimensional data. This 
technique does not correspond to the usual KS test principle, but we show that it is possible to 
compute 𝐷𝐷𝑚𝑚𝑚𝑚𝑚𝑚 very precisely with the quite computationally efficient algorithm taking minimal 
bin number.  

Here our KS test results are obtained using the well-tested uniform Mersenne Twister PRNG 
[4], which is included in CERN library [5].  
In the proposed paper we describe a new technique, which allows to reduce essentially the 
number of the bins and correspondingly decrease the needed time and memory used.  
 
 
2. Formalism 
 
In one-dimensional (1D) case with N random numbers to check if they come from uniform 
distribution in the (0, 1) range, for which the CDF is equal to: 
 
                                                                           𝐹𝐹(𝑥𝑥) = 𝑥𝑥,  
 
we need to compute 𝐷𝐷𝑁𝑁. Usual (unbinned) approach [6- 8] in 1D case looks as follows:  
 
 N numbers should be sorted in increasing order  {𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑁𝑁}, 𝑥𝑥𝑖𝑖 ≤ 𝑥𝑥𝑖𝑖+1 

 then 𝐷𝐷𝑁𝑁  is computed in the following way defining ECDF as:  

𝐹𝐹𝑁𝑁(𝑥𝑥) =
𝑘𝑘
𝑁𝑁

, 𝑥𝑥 = 𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑁𝑁, 
 

where k is the number of random points out of N with 𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑘𝑘   ≤ 𝑥𝑥 
 

𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚≤1
|(𝐹𝐹𝑁𝑁(𝑥𝑥) − 𝑥𝑥)| = max1≤𝑖𝑖≤𝑁𝑁

��
𝑖𝑖
𝑁𝑁
− 𝑥𝑥𝑖𝑖� , �

𝑖𝑖 − 1
𝑁𝑁

− 𝑥𝑥𝑖𝑖��. 
 
The unbinned approach can be also applied for 2D case following the algorithm described in [1], 
which has been used for the performed studies.    
In the 1D binned approach the (0, 1) interval is divided into 𝑛𝑛 bins, then N random numbers are 
distributed in 𝑛𝑛 bins according to their values. The ECDF is again defined as: 
 

𝐹𝐹𝑁𝑁(𝑥𝑥) =
𝑘𝑘
𝑁𝑁

. 
 
The difference is that here 𝑥𝑥 is already the bin edge: 𝑥𝑥 = 𝑖𝑖

𝑛𝑛
, 𝑖𝑖 = 1,2, … , 𝑛𝑛. For example, if 𝑛𝑛 =

10, we have the bins edges like: {0.1, 0.2, … , 0.9,1} . Therefore, the advantage of the method is 
that if N is quite large then ECDF can be defined with a relatively small number of points (𝑛𝑛 <
𝑁𝑁). 
 

𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚≤1
|(𝐹𝐹𝑁𝑁(𝑥𝑥) − 𝑥𝑥)| = max1≤𝑖𝑖≤𝑛𝑛

|𝐹𝐹𝑁𝑁(𝑥𝑥𝑖𝑖) − 𝑥𝑥𝑖𝑖| = max1≤𝑖𝑖≤𝑛𝑛
�
∑ 𝑌𝑌𝑘𝑘𝑖𝑖𝑘𝑘=1
𝑁𝑁

−
𝑖𝑖
𝑛𝑛
�. 

 
The Optimal Approach for Kolmogorov-Smirnov Test Calculation in High Dimensional Space 
 
140 

In 1D case there exists the theoretical distribution function for the value of 𝐾𝐾𝑁𝑁 = √𝑁𝑁 𝐷𝐷𝑁𝑁 (also 
the mean value and dispersion for 𝐾𝐾𝑁𝑁), which is used to make a decision to accept or reject the 
hypothesis concerning the uniform distribution of the tested sample of random numbers.  
This binned method can be generalized for dimensions 2, 3 and higher. In two-dimensional case 
for 𝑥𝑥1 ≤ 𝑋𝑋1, 𝑥𝑥2 ≤ 𝑋𝑋2 the corresponding expression for  𝐷𝐷𝑁𝑁 is: 
 

𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚1≤1
0≤𝑚𝑚2≤1

|(𝐹𝐹𝑁𝑁(𝑥𝑥1, 𝑥𝑥2) − 𝑥𝑥1𝑥𝑥2)| = max1≤𝑖𝑖≤𝑛𝑛
1≤𝑗𝑗≤𝑛𝑛

�
∑ ∑ 𝑌𝑌𝑘𝑘𝑚𝑚

𝑗𝑗
𝑚𝑚=1

𝑖𝑖
𝑘𝑘=1

𝑁𝑁
−
𝑖𝑖𝑖𝑖
𝑛𝑛2
�. 

 
where 𝑌𝑌𝑘𝑘𝑚𝑚 is the number of observations that actually do fall into category {𝑘𝑘, 𝑚𝑚}, 𝑁𝑁 is the count 
of all observations, n is the number of bins used along the x and y. 
In d-dimension case of 𝑥𝑥1 ≤ 𝑋𝑋1, 𝑥𝑥2 ≤ 𝑋𝑋2, … … . , 𝑥𝑥𝑑𝑑 ≤  𝑋𝑋𝑑𝑑  the expression for KS test is as 
follows: 

𝐷𝐷𝑁𝑁 = max0≤𝑚𝑚1≤1
0≤𝑚𝑚2≤1…
0≤𝑚𝑚𝑑𝑑≤1

|(𝐹𝐹𝑁𝑁(𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑑𝑑) − 𝑥𝑥1𝑥𝑥2…𝑥𝑥𝑑𝑑)| = max1≤𝑖𝑖1≤𝑛𝑛
1≤𝑖𝑖2≤𝑛𝑛…
1≤𝑖𝑖𝑑𝑑≤𝑛𝑛

�
∑ ∑ … ∑ 𝑌𝑌𝑘𝑘1𝑘𝑘2…𝑘𝑘𝑑𝑑

𝑖𝑖𝑑𝑑
𝑘𝑘𝑑𝑑

𝑖𝑖2
𝑘𝑘2=1

𝑖𝑖1
𝑘𝑘1=1

𝑁𝑁
−
𝑖𝑖1𝑖𝑖2 … 𝑖𝑖𝑑𝑑
𝑛𝑛𝑑𝑑

�. 

 
With any of schemes to calculate the average 𝐾𝐾𝑁𝑁 described above (binned and unbinned) we can 
estimate also the statistical uncertainties Δ𝐾𝐾𝑁𝑁 making sampling for calculated values of 𝐾𝐾𝑁𝑁 and 
𝐾𝐾𝑁𝑁2 with the certain number M of the used samples. The mean value for 𝐾𝐾𝑁𝑁 averaging over M 
samples is given by: 

< 𝐾𝐾𝑁𝑁 >=
1
𝑀𝑀
�𝐾𝐾𝑁𝑁

𝑖𝑖 ,
𝑀𝑀

𝑖𝑖=1

 
and mean squared  𝐾𝐾𝑁𝑁 is defined as: 

< 𝐾𝐾𝑁𝑁2 >=
1
𝑀𝑀
�(𝐾𝐾𝑁𝑁

𝑖𝑖
𝑀𝑀

𝑖𝑖=1

)2, 

then we can calculate 𝜎𝜎𝐾𝐾𝑁𝑁
2  as: 

𝜎𝜎𝐾𝐾𝑁𝑁
2 =< 𝐾𝐾𝑁𝑁2 > −< 𝐾𝐾𝑁𝑁 >2. 

And finally, based on the central limiting theorem we can calculate the statistical uncertainty 
Δ𝐾𝐾𝑁𝑁 as: 
 

                                                            Δ𝐾𝐾𝑁𝑁 =
1
√𝑀𝑀

�𝜎𝜎𝐾𝐾𝑁𝑁
2 . 

 
3. Results and Discussion 
 
The first set of the obtained results is related to the unbinned approach, where in 1D case we can 
compare not only the obtained average < 𝐾𝐾𝑁𝑁 > with the predicted theoretical value of 
0.8687311605.. [7], but also the observed and theoretical (TH) distributions for 𝐾𝐾𝑁𝑁 (see Fig. 1). 
 

N. Akopov and N. Martirosyan 
 

141 

 
Fig 1. TH (curve) and observed (red points) for KN distributions. 

 
On Fig. 2 one can see the observed distribution obtained for 2D case, which is systematically 
shifted in respect to 1D TH distribution.   

 
Fig 2. TH distribution (curve) for 1D and observed distribution (red points) for 2D cases. 
 

Then the next set of results, which is related to the binned method, is shown in Figs. 3-4. In Fig.3 
one can see that for 𝑁𝑁 = 105 the number of 𝑛𝑛 = 104 bins is enough to get the exact distribution 
which was the goal of this paper. 

 
The Optimal Approach for Kolmogorov-Smirnov Test Calculation in High Dimensional Space 
 
142 

 
Fig 3. 1D case: TH distribution (curve) and binned distributions (colored points). 
 

In Fig. 4 the < 𝐾𝐾𝑁𝑁 >  dependence on the number of bins is presented for 1, 2 and 3D cases. Here 
one can recognize the clear dependence of the obtained values for < 𝐾𝐾𝑁𝑁 > on the number of the 
bins (n) used. Taking into account the obtained saturating shape on the n for such dependences 
we performed the fit with the following fitting function: 
 

𝑓𝑓𝑓𝑓𝑖𝑖𝑓𝑓
𝑝𝑝
→

(𝑥𝑥) = 𝑝𝑝1 + 𝑝𝑝2 exp(−𝑝𝑝3𝑥𝑥). 
 
The meaning of the parameter 𝑝𝑝1 corresponds to the saturation level, which can be achieved with 
quite large value of n. Although, using very high values for the bins number it is impossible in 
practice to calculate the 𝐾𝐾𝑁𝑁 in the sense of the needed CPU time and memory, especially in case 
of higher (>4) dimensional space. That is why the idea to use a limited number of points over the 
n, then estimate the saturation parameter (𝑝𝑝1), and then make a correction for the average value 
of𝐾𝐾𝑁𝑁, estimated with e.g., n=10, introducing the correction factor: 𝐶𝐶10 = 𝑝𝑝1 − < 𝐾𝐾𝑁𝑁(𝑛𝑛 = 10) >, 
seems to be very interesting and effective in order to realize the procedure of the < 𝐾𝐾𝑁𝑁 > 
estimation in case of high dimensional space. Also the exponential coefficient (𝑝𝑝3), which 
regulates the speed of convergence to the saturation level, can be used to check the quality of 
different PRNGs. 

It should be noted, that the used fitting function is quite stable in respect to the variation of 
the fitting points used. In 3D case with the total number of points (ntotal=11), this variation was 
11, 9, 7, 5. In 2D case with the ntotal=19: 19, 15, 11, 7, and in 1D case with ntotal=75: 75, 55, 35, 
15. This is an important feature of the functional form used for a fit, because in case of 
essentially higher dimensions one can compute with the binned method a limited number of 
points, perform the fit and estimate the saturation level. In further studies the authors are going to 
provide the <KN> dependence as a function of the space dimension, which is very interesting due 
to the lack of knowledge on TH distribution for high dimension space.  
 
 
N. Akopov and N. Martirosyan 
 

143 

 
Fig. 4. Average KN dependence on the number of bins (n). The 1, 2 and 3D cases are presented with the fit 

parameters for each case. Different colors along the curves correspond to the variation of the number of fitting 
points used (see explanation in the text). 

 
4. Conclusion 
 
The optimization of the Kolmogorov-Smirnov discrepancy DN calculation with the binned 
method for high dimensional spaces by means of reducing the used bin number is performed. 
Developed approach allows to extend the estimation of DN  to very high dimensional spaces with 
reasonable requirements to the CPU time and memory, and, thus, to provide very important tests 
of the PRNGs, which are used to apply the Monte Carlo method for solving of essentially high 
dimensional physics problems. 
 
 
References  
 
[1] A. Justel, D. Pena and R. Zamar, “A multivariate Kolmogorov-Smirnov test of goodness of 

Fit”, Statistics & Probability Letters 35, pp. 251-259, 1997. 
[2] J. A. Peacock, “Two-dimensional goodness-of-fit testing in astronomy”, Monthly Notices 

Royal Astronomy Society, vol. 202, pp. 615-627, 1983. 
[3] G. Fasano and A. Franceschini, “A multidimensional version of the Kolmogorov-Smirnov 

Test”, Monthly Notices Royal Astronomy Society, vol. 225, pp. 155-170, 1987. 


The Optimal Approach for Kolmogorov-Smirnov Test Calculation in High Dimensional Space 
 
144 

[4] M. Matsumoto and T. Nishimura, “Mersenne twister: a 623-dimensionally equidistributed 
uniform pseudo-random number generator”, ACM Transactions on Modeling and 
Computer Simulation, vol. 8, pp. 3-30, 1998. 

[5] [Online]. Available: https://cern.ch/cernlib 
[6] M. A. Stephens, “EDF statistics for goodness of fit and some comparisons”, Journal of the 

American Statistical Association (American Statistical Association), vol. 69, pp. 730–737, 
1974. 

[7] G. Marsaglia, W. W. Tsang and J. Wang, “Evaluating Kolmogorov’s distribution”, Journal 
of Statistical Software, vol. 8, pp. 1-4, 2003. 

[8] D. Knuth, The Art of Computer Programming, V.2, Addison-Wesley Publishing Company, 
1981. 

 
Submitted 30.07.2015, accepted 19.11.2015 
 
 
Օպտիմալ մոտեցում Կոլմոգորով-Սմիրնով թեստի հաշվարկի համար 

բարձր չափողականությամբ տարածությունում 
 

Ն. Ակոպով և Ն. Մարտիրոսյան 
 

Ամփոփում 
 
 Կոլմոգորով-Սմիրնով թեստի 𝐷𝐷𝑁𝑁 թվային գնահատականը բարձր չափողակա-
նությամբ տարածությունում բավականին ժամանակատար և հիշողություն 
պահանջող խնդիր է: Առաջարկվող հոդվածում քննարկվում է նվազագույն բինների 
քանակով նոր մոտեցումը, որը էապես նվազեցնում է ժամանակի և հիշողության 
պահանջները 𝐷𝐷𝑁𝑁 մեծությունը 2 և ավելի բարձր չափողականությամբ տարածու-
թյունում հաշվելու համար: 
 
 
Оптимальный подход для численных реализаций теста Колмогорова-
Смирнова в пространствах высокой размерности 

 
Н. Акопов и Н. Мартиросян 

 
Аннотация 

 
Численные оценки дискрепанса Колмогорова-Смирнова 𝐷𝐷𝑁𝑁 в пространствах высокой 
размерности являются существенной проблемой в плане необходимых ресурсов пaмяти и 
времени. В статье обсуждается новый подход для вычисления величины 𝐷𝐷𝑁𝑁 в 
пространствах двух и более высоких размерностей с минимальным числом разбиений, 
который существенно снижает требования к памяти и времени. 

http://www.jstatsoft.org/v08/i18/paper