INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
ISSN 1841-9836, 12(4), 475-491, August 2017.

Asymptotically Unbiased Estimation of A Nonsymmetric Dependence Measure Applied to Sensor Data Analytics and Financial Time Series

A. Caţaron, R. Andonie, Y. Chueh

Angel Caţaron
Department of Electronics and Computers
Transilvania University of Braşov, Romania
cataron@unitbv.ro

Razvan Andonie
1. Department of Computer Science, Central Washington University, USA
2. Department of Electronics and Computers, Transilvania University of Braşov, Romania
*Corresponding author: andonie@cwu.edu

Yvonne Chueh
Department of Mathematics
Central Washington University, USA
chueh@cwu.edu

Abstract: A fundamental concept frequently applied to statistical machine learning is the detection of dependencies between unknown random variables found from data samples. In previous work, we introduced a nonparametric unilateral dependence measure based on Onicescu's information energy and a kNN method for estimating this measure from an available sample set of discrete or continuous variables. This paper provides the formal proofs that the estimator is asymptotically unbiased and has asymptotic zero variance as the sample size increases, which implies that the estimator has good statistical qualities. We investigate the performance of the estimator in data analysis applications for sensor data and financial time series.

Keywords: machine learning, sensor data analytics, financial time series, statistical inference, information energy, nonsymmetric dependence measure, big data analytics.

1 Introduction

Statistical machine learning rests on the strong assumption that we use a representative training set of samples to infer a model. In this case, we select a random sample of the population, perform a statistical analysis on this sample, and use the results as an estimate of the desired statistical characteristics of the population as a whole. The accuracy of the estimation depends on the representativeness of the data sample. We gauge the representativeness of a sample by how well its statistical characteristics reflect the probabilistic characteristics of the entire population. Many standard techniques may be used to select a representative sample set [16]. However, without expert knowledge, selecting the most representative training set from a given dataset has been proved to be computationally difficult (NP-hard) [10]. The problem is actually more complex, since in most applications the complete dataset is either unknown or too large to be analyzed. Therefore, we have to rely on a more or less representative training set.

For example, a common statistical machine learning problem is to estimate information theory measures (such as entropy) from available training sets. This can be reduced to the construction of an estimate of a density function from the observed data [21]. We refer here only to nonparametric estimation, where less rigid assumptions are made about the distribution of the observed data. Although it will be assumed that the distribution has a probability density f, the data will be allowed to speak for themselves in determining the estimate of f, more than would be the case if f were constrained to fall in a given parametric family.
The estimation of information theory measures has an important application area: the detection of dependency relationships between unknown random variables represented by data samples. There are two information theory strategies one can adopt when studying the relationship between two random variables: the first is to measure their interdependence, thought of as a mutual attribute, and the second is to measure how much one system depends on the other. In the first case we have symmetric (bilateral) measures of dependence, whereas in the second we have nonsymmetric (unilateral) measures. The literature review summarizes these two strategies as follows.

Strategy I. Several symmetric dependence measures have been proposed (see [20]). Among them is the Shannon entropy based mutual information (MI),

\[
MI(X,Y) = H(X) + H(Y) - H(X,Y) = MI(Y,X),
\]

which measures the information interdependence between two random variables X and Y. Estimating MI is known to be a non-trivial task [3]. Naïve estimations (which attempt to construct a histogram where every point is the center of a sampling interval) are plagued with both systematic (bias) and statistical errors. Other, more sophisticated estimation methods have been proposed [19], [23], [14]: kernel density estimators, B-spline estimators, k-th nearest neighbor (kNN) estimators, and wavelet density estimators. An ideal estimator does not exist; instead, the choice of the estimator depends on the structure of the data to be analyzed. It is not possible to design an estimator that minimizes both the bias and the variance to arbitrarily small values. The existing studies have shown that there is always a delicate trade-off between the two types of errors [3].

Strategy II. Several continuous entropy-like nonsymmetric dependence measures have been proposed [15]. But information measures can also refer to certainty (not only to uncertainty, like Shannon's entropy), and probability can be considered a measure of certainty. For instance, Onicescu's information energy (IE) was interpreted by several authors as a measure of expected commonness, a measure of average certainty, or as a measure of concentration. The second strategy can be illustrated by a unilateral dependence measure defined for discrete random variables by Andonie et al. [2]: $o(X,Y) = IE(X|Y) - IE(X)$, where $IE(X) = \sum_{k=1}^{n} p_k^2$ is Onicescu's information energy of a discrete random variable X with probabilities $p_k$, and $IE(X|Y)$ is the conditional information energy between two variables, as defined in [18].

For a continuous random variable X with probability density function f(x), the IE is [11,18]:

\[
IE(X) = E(f(X)) = \int_{-\infty}^{+\infty} f(x) f(x)\,dx = \int_{-\infty}^{+\infty} f^2(x)\,dx \qquad (1)
\]

In other words, IE(X) is the expectation of the values of the density function f. For two continuous random variables X and Y, with joint probability density function f(x,y), we introduced the continuous version of the o(X,Y) measure in [1,8]:

\[
o(X,Y) = IE(X|Y) - IE(X) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y) f(x|y)\,dy\,dx - \int_{-\infty}^{+\infty} f^2(x)\,dx \qquad (2)
\]

Both the discrete and the continuous o(X,Y) measure the "additional" average certainty (or information) of X occurring under the condition that Y has already or simultaneously occurred (or is certain), "over" the average certainty of X when the certainty (or information) of Y is not available.
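As a minimal illustration of the discrete measure, the sketch below computes $o(X,Y) = IE(X|Y) - IE(X)$ from a joint probability table. Since the conditional information energy of [18] is not restated here, the code assumes the natural discrete analogue of Eq. (2), $IE(X|Y) = \sum_j p(y_j)\sum_k p(x_k|y_j)^2$; the joint table and function names are illustrative only.

```python
import numpy as np

def information_energy(p):
    """Onicescu's information energy of a discrete distribution: IE = sum(p_k^2)."""
    p = np.asarray(p, dtype=float)
    return np.sum(p ** 2)

def unilateral_dependence(p_xy):
    """o(X,Y) = IE(X|Y) - IE(X) for a joint table p_xy[i, j] = P(X=x_i, Y=y_j).

    IE(X|Y) is taken here as sum_j P(y_j) * sum_i P(x_i|y_j)^2, the discrete
    analogue of Eq. (2) (an assumption, since the definition of [18] is not restated here).
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)            # marginal of X
    p_y = p_xy.sum(axis=0)            # marginal of Y
    ie_x = information_energy(p_x)
    p_x_given_y = p_xy / p_y          # conditional distributions P(X|Y=y_j), one column per y_j
    ie_x_given_y = np.sum(p_y * np.sum(p_x_given_y ** 2, axis=0))
    return ie_x_given_y - ie_x

# Toy joint distribution: rows index X, columns index Y.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
print(unilateral_dependence(p_xy))    # positive: knowing Y concentrates X
print(unilateral_dependence(p_xy.T))  # generally different: the measure is nonsymmetric
```

Swapping the roles of X and Y generally changes the value, which is exactly the unilateral character of the measure.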
Thus, o(X,Y) can be regarded as an indicator of the unilateral dependence of X upon Y. As a unilateral measure, o(X,Y) gives a different insight than the MI into the dependencies between pairs of random variables.

The o(X,Y) measure can easily be computed if the data sample is extracted from known parametric probability distributions. When the underlying distribution of the data sample is unknown, o(X,Y) has to be estimated. More formally, we have to estimate o(X,Y) from a random sample $x_1, x_2, \ldots, x_n$ of n d-dimensional observations from a distribution with the unknown probability density f(x). This problem is even more difficult if the number of available points is small.

Contributions. In previous work [4,5], we introduced a kNN method for estimating IE and o(X,Y) from data samples of discrete or continuous variables. In the present paper, we give a more detailed description of this method. For the first time, we provide the formal proofs showing that the estimator is asymptotically unbiased and has asymptotic zero variance as the sample size increases. This implies that the estimator has good statistical qualities. The potential application of our results in statistical machine learning for multivariate data motivates the present work, and we refer to two real-world applications. The first is a study of the unilateral interactions between temperature sensor data in a refrigerator room. The second is an analysis of the dynamics of the interaction between assets and liabilities, evidenced by the historical events that match the time series formed by the unilateral dependence measures observed from the financial time series reported by Kodak and Apple. This evolution is captured by our unilateral dependence measure over time.

The rest of the paper is structured as follows. In Section 2 we summarize our previous work, review the kNN estimation method for the continuous o(X,Y) measure, and derive our IE estimator in a d-dimensional space. In Section 3 we analyze and prove the asymptotically unbiased behavior of this IE estimator. In Section 4 we derive the o(X,Y) estimator for the purpose of measuring the unilateral dependence relation between a pair of random variables. Section 5 discusses applications and Section 6 contains our final remarks.

2 Background

Fitting multi-dimensional data to joint density functions is challenging. We summarize the kNN estimation technique, which can be used for this problem, and refer to our previous results on IE and o(X,Y) kNN estimation.

2.1 kNN estimation of the unilateral dependence measure in a unidimensional space

Our goal is to approximate IE and o(X,Y) from the available dataset, for random variables X and Y with unknown densities, using the kNN method. The kNN estimators represent an attempt to adapt the amount of smoothing to the "local" density of the data. The degree of smoothing is controlled by an integer k, chosen to be considerably smaller than the sample size. Out of the n available points, we define the distance between two points $x_i$ and $x_j$ on the line as $|x_i - x_j|$, and for each $x_i$ we denote by $R_{i,1,n} \le R_{i,2,n} \le \ldots \le R_{i,n,n}$ the distances from $x_i$ to the points of the sample, arranged in ascending order. The kNN density estimate $\hat{f}(x_i)$ is defined by [21]:

\[
\hat{f}(x_i) = \frac{k}{2 n R_{i,k,n}}
\]

The kNN approach was used for the nonparametric estimation of the entropy, based on the k-th nearest neighbor distance between n points in a sample, where k is a fixed parameter and $k \le n - 1$.
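As a concrete illustration of the one-dimensional estimate above, the following sketch (not from the paper; the function name and the choice of k are ours) computes $\hat{f}(x_i) = k/(2nR_{i,k,n})$ at every sample point with a brute-force distance matrix.

```python
import numpy as np

def knn_density_1d(sample, k):
    """kNN density estimate f_hat(x_i) = k / (2 * n * R_{i,k,n}) at each sample point."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    dist = np.abs(x[:, None] - x[None, :])   # pairwise distances |x_i - x_j|
    dist.sort(axis=1)
    r_k = dist[:, k]                         # k-th nearest neighbor distance (column 0 is the point itself)
    return k / (2.0 * n * r_k)

rng = np.random.default_rng(0)
sample = rng.normal(size=500)
f_hat = knn_density_1d(sample, k=10)
print(f_hat[:5])                             # estimated density at the five smallest sample points
```

For large samples, a k-d tree query would avoid the quadratic-memory distance matrix; a d-dimensional variant along these lines is sketched at the end of Section 2.2.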
Based on the first nearest neighbor distances, Leonenko et al. [13] introduced an asymptotically unbiased and consistent estimator $H_n$ of the entropy H(f) in a multidimensional space. When the sample points are very close to each other, small fluctuations in their distances produce high fluctuations of $H_n$. In order to overcome this problem, Singh et al. [22] defined an entropy estimator based on the k-th nearest neighbor distances. A kNN estimator of the Kullback-Leibler divergence was introduced by Wang et al. [24]. Faivishevsky et al. used a mean of several kNN estimators, corresponding to different values of k, to increase the smoothness of the estimation [9].

The kNN method works well if the value of k is optimally chosen, and it outperforms histogram methods [23]. However, there is no model selection method for determining the number of nearest neighbors k, and this is a known limitation.

2.2 kNN estimation of the IE and the o(X,Y) measure

In [5], we introduced a kNN IE estimator, and we stated in [4], without mathematical proofs, that this estimator is consistent. By definition [12], a statistic whose mathematical expectation is equal to a parameter is called an unbiased estimator of that parameter; otherwise, the statistic is said to be biased. A statistical estimator that converges asymptotically to a parameter is called a consistent estimator of that parameter. In the following, we review this result and introduce the basic mathematical notations used in Section 3.

The IE is the average of f(x), therefore we have to estimate f(x). The n observations from our samples have the same probability 1/n. A convenient estimator of the IE is:

\[
\hat{IE}_k^{(n)}(X) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}(x_i). \qquad (3)
\]

The probability mass at value $x_i$, determined by its k nearest surrounding points and standardized by the volume $V_k(x_i)$ that these k nearest observations occupy, measures the probability density at $x_i$. The k-th nearest neighbor estimate is defined by [4], [21]:

\[
\hat{f}(x_i) = \frac{k}{n V_k(x_i)}, \quad i = 1, \ldots, n \qquad (4)
\]

where $V_k(x_i)$ is the volume of the d-dimensional sphere centered at $x_i$, with radius $R_{i,k,n}$, the Euclidean distance from $x_i$ to its k-th nearest data point. Given $V_k(x_i) = V_d R_{i,k,n}^d$, where $V_d$ is the volume of the unit sphere in d dimensions (that is, $V_1 = 2$, $V_2 = \pi$, $V_3 = \frac{4\pi}{3}$, etc.), the estimate (4) can be rewritten as:

\[
\hat{f}(x_i) = \frac{k}{n V_d R_{i,k,n}^d}, \quad i = 1, \ldots, n. \qquad (5)
\]

By substituting $\hat{f}(x_i)$ in (3), we finally obtain the following IE approximation:

\[
\hat{IE}_k^{(n)}(f) = \frac{1}{n}\sum_{i=1}^{n}\frac{k}{n V_d R_{i,k,n}^d}. \qquad (6)
\]

In [6], we showed how to estimate the o(X,Y) dependence measure from an available sample set of discrete or continuous variables, and we experimentally compared the results of the kNN estimator with those of the naïve histogram estimator.
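A minimal sketch of the d-dimensional estimator in Eq. (6) is given below, assuming a SciPy k-d tree for the neighbor distances and the closed-form unit-ball volume $V_d = \pi^{d/2}/\Gamma(d/2+1)$; the function names and parameter values are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def unit_ball_volume(d):
    """Volume of the unit ball in d dimensions: V_1 = 2, V_2 = pi, V_3 = 4*pi/3, ..."""
    return np.pi ** (d / 2) / gamma(d / 2 + 1)

def ie_knn_estimate(sample, k):
    """IE estimate from Eq. (6): (1/n) * sum_i k / (n * V_d * R_{i,k,n}^d)."""
    x = np.asarray(sample, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a flat array as n points in 1-D
    n, d = x.shape
    # distance from each point to its k-th nearest neighbor (the query returns the point itself first)
    dist, _ = cKDTree(x).query(x, k=k + 1)
    r_k = dist[:, k]
    f_hat = k / (n * unit_ball_volume(d) * r_k ** d)
    return f_hat.mean()

rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 2))         # 2-D standard normal sample
print(ie_knn_estimate(sample, k=10))
```

For a 2-D standard normal density the closed-form value is $IE(f) = 1/(4\pi) \approx 0.0796$, against which the printed estimate can be compared.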
3 Asymptotic behavior of the IE approximator

Consistency of an estimator means that, as the sample size gets large, the estimate gets closer and closer to the true value of the parameter. Unbiasedness is a statement about the expected value of the sampling distribution of the estimator. The ideal situation, of course, is to have an unbiased consistent estimator. This may be very difficult to achieve, yet unbiasedness is not essential: a little bias is permitted as long as the estimator is consistent. Therefore, an asymptotically unbiased consistent estimator may be acceptable.

In [4], we stated (without proofs) that the IE approximator is asymptotically unbiased and consistent. In this section, we provide the formal proofs of this statement.

Lemma 1. Let us consider the identically distributed random variables $T_1^{(n)}, T_2^{(n)}, \ldots, T_n^{(n)}$ defined by

\[
T_i^{(n)} = \frac{k}{n V_d R_{i,k,n}^d}. \qquad (7)
\]

For a given x, the random variable $T_x$ has the following probability density function:

\[
h_{T_x}(y) = \frac{f(x)\left[\frac{k}{y}f(x)\right]^k}{y^2 (k-1)!}\, e^{-\frac{k}{y}f(x)}, \quad 0 < y < \infty.
\]

Proof: See the Appendix. □

Theorem 2 (Asymptotic Unbiasedness). The information energy estimator $\hat{IE}_k^{(n)}(f)$ is asymptotically unbiased.

Proof: We will find the asymptotic value of the average $\lim_{n\to\infty} E\left[\hat{IE}_k^{(n)}(f)\right]$. From Lemma 1, for a given x, the random variable $T_x$ has the pdf

\[
h_{T_x}(y) = \frac{f(x)\left[\frac{k}{y}f(x)\right]^k}{y^2 (k-1)!}\, e^{-\frac{k}{y}f(x)}, \quad 0 < y < \infty.
\]

The conditional expectation of $T_1^{(n)}|X_1 = x$, given the fixed center $X_1 = x$, is:

\[
\lim_{n\to\infty} E\left[T_1^{(n)}|X_1 = x\right] = \int_0^{\infty} y\, h(y)\, dy = \int_0^{\infty} \frac{f(x)(kf(x))^k}{(k-1)!}\, y^{-1-k} e^{-\frac{kf(x)}{y}}\, dy = \frac{f(x)(kf(x))^k}{(k-1)!}\int_0^{\infty} y^{-1-k} e^{-\frac{kf(x)}{y}}\, dy.
\]

We make the change of variable $u = \frac{kf(x)}{y}$, with $y \in (0, +\infty)$; the corresponding $u \in (+\infty, 0)$ is decreasing. We have

\[
du = -\frac{1}{y^2}\, k f(x)\, dy \quad \text{and} \quad y^{-1-k} = \left(\frac{u}{kf(x)}\right)^{k+1}.
\]

We obtain:

\[
\int_0^{\infty} y\, h(y)\, dy = \frac{f(x)(kf(x))^k}{(k-1)!}\int_{\infty}^{0} -u^{k+1} e^{-u}\, \frac{1}{(kf(x))^{k+1}}\left(\frac{kf(x)}{u}\right)^2 \frac{1}{kf(x)}\, du.
\]

The negative sign before $u^{k+1}$ allows us to change the limits of the integral:

\[
\int_0^{\infty} y\, h(y)\, dy = \frac{f(x)(kf(x))^k}{(k-1)!}\int_0^{\infty} u^{k-1} e^{-u}\, du \cdot \frac{1}{(kf(x))^k} = \frac{f(x)}{(k-1)!}\int_0^{\infty} u^{k-1} e^{-u}\, du = \frac{f(x)}{(k-1)!}\,\Gamma(k) = f(x).
\]

Taking the complete expectation of the above, we have:

\[
\lim_{n\to\infty} E\left[E\left[T_1^{(n)}|X_1 = x\right]\right] = \lim_{n\to\infty} E\left[T_1^{(n)}\right] = E[f(x)].
\]

Let us define $\hat{G}^{(n)}(f) = \frac{1}{n}\sum_{i=1}^{n} T_i^{(n)}$, an estimator of $E[f(x)]$ (as in [22]). We obtain:

\[
E\left[\hat{G}^{(n)}(f)\right] = \frac{1}{n}\sum_{i=1}^{n} E\left[T_i^{(n)}\right] = \frac{1}{n}\left[E[f(x)] + \cdots + E[f(x)]\right] = E[f(x)],
\]

\[
\lim_{n\to\infty} E\left[\hat{G}^{(n)}(f)\right] = E[f(x)] = \int f(x)\cdot f(x)\, dx = IE(f).
\]

This proves that our IE estimator is asymptotically unbiased. □

Next, we aim to prove that it has asymptotic zero variance.

Lemma 3. Given the identically distributed random variables $T_1^{(n)}, T_2^{(n)}, \ldots, T_n^{(n)}$ defined by (7), we have:

\[
\lim_{n\to\infty}\frac{1}{n}E\left[\left(T_1^{(n)}\right)^2\right] - \lim_{n\to\infty}\frac{1}{n}\left(E\left[T_1^{(n)}\right]\right)^2 = 0.
\]

Proof: See the Appendix. □

Lemma 4. Given the identically distributed random variables $T_1^{(n)}, T_2^{(n)}, \ldots, T_n^{(n)}$ defined by (7), the following relation is true:

\[
\lim_{n\to\infty}\frac{n(n-1)}{n^2}\left(E\left[T_1^{(n)} T_2^{(n)}\right] - E\left[T_1^{(n)}\right]E\left[T_2^{(n)}\right]\right) = 0.
\]

Proof: See the Appendix. □

Theorem 5 (Asymptotic Zero Variance).

\[
\lim_{n\to\infty} Var\left[\hat{G}_k^{(n)}(f)\right] = 0.
\]

Proof: $T_1^{(n)}, T_2^{(n)}, \ldots, T_n^{(n)}$ are identically distributed, so

\[
Var\left[\hat{G}_k^{(n)}(f)\right] = \frac{1}{n^2}\left[\sum_{i=1}^{n} Var\left[T_i^{(n)}\right] + \sum_{i \ne j} Cov\left[T_i^{(n)}, T_j^{(n)}\right]\right] = \frac{1}{n} Var\left[T_1^{(n)}\right] + \frac{n(n-1)}{n^2}\left(E\left[T_1^{(n)} T_2^{(n)}\right] - E\left[T_1^{(n)}\right]E\left[T_2^{(n)}\right]\right).
\]

By Lemma 3 the first term tends to zero, and by Lemma 4 the second term tends to zero as $n \to \infty$, which proves the theorem. □
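The asymptotic behavior established above can also be examined empirically. The sketch below (an illustration of ours, not part of the paper's experiments) evaluates the Eq. (6) estimator on repeated standard normal samples of increasing size, for which $IE(f) = 1/(2\sqrt{\pi}) \approx 0.2821$ in closed form, and reports the empirical mean and variance of the estimates; the sample sizes, k = 10, and 50 replications are arbitrary illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def ie_knn_estimate(sample, k):
    """Eq. (6): (1/n) * sum_i k / (n * V_d * R_{i,k,n}^d), here for 1-D samples (d = 1, V_1 = 2)."""
    x = np.asarray(sample, dtype=float)[:, None]
    n = len(x)
    dist, _ = cKDTree(x).query(x, k=k + 1)          # column 0 is the point itself
    return np.mean(k / (n * 2.0 * dist[:, k]))

true_ie = 1.0 / (2.0 * np.sqrt(np.pi))              # IE of the standard normal density
rng = np.random.default_rng(1)

for n in (200, 1000, 5000):
    # empirical mean and spread of the estimator across 50 independent samples of size n
    est = [ie_knn_estimate(rng.normal(size=n), k=10) for _ in range(50)]
    print(f"n={n:5d}  mean={np.mean(est):.4f}  var={np.var(est):.2e}  closed-form IE={true_ie:.4f}")
```

The printed variances illustrate the shrinking spread addressed by Theorem 5, while the means can be compared against the closed-form value addressed by Theorem 2.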
Appendix

Proof of Lemma 1

For a given $x$ and $r > 0$,

\[
P\left[T_1^{(n)} > r \,\middle|\, X_1 = x\right] = P\left[\rho_{r,n} > R_{1,k,n} \,\middle|\, X_1 = x\right], \qquad (18)
\]

because

\[
T_1^{(n)} > r \;\Leftrightarrow\; \frac{k}{n V_d R_{1,k,n}^d} > r \;\Rightarrow\; \frac{k}{n V_d r} > R_{1,k,n}^d \;\Rightarrow\; \left(\frac{k}{n V_d r}\right)^{\frac{1}{d}} > R_{1,k,n}.
\]

We write the probability $P\left[T_1^{(n)} > r \,|\, X_1 = x\right]$ as a binomial distribution and approximate it with the Poisson probability function. The binomial distribution is given by $f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$, and its cumulative distribution function is $P(X \le x) = \sum_{x_i \le x} P(X = x_i) = \sum_{x_i \le x} f(x_i)$. The probability expressed by the Poisson formula is $P(x) \approx \frac{e^{-np}(np)^x}{x!}$, where $p$ is the success probability out of $n$ samples.

Using the Poisson approximation for the binomial distribution and letting the sample size $n \to \infty$, we get:

\[
P\left[\rho_{r,n} > R_{1,k,n} \,|\, X_1 = x\right] = 1 - P\left[R_{1,k,n} \ge \rho_{r,n}\right] = 1 - \sum_{i=1}^{k}\binom{n-1}{i}\left[P(S_{\rho_{r,n}})\right]^i\left[1 - P(S_{\rho_{r,n}})\right]^{n-i-1} = P\left[T_x > r\right].
\]

We note that:

\[
P(S_{\rho_{r,n}}) = \int_{S_{\rho_{r,n}}} f(t)\, dt, \qquad \rho_{r,n} = \left(\frac{k}{n V_d r}\right)^{\frac{1}{d}}, \qquad V_\rho = V_d \rho^d = V_d \frac{k}{n V_d r} = \frac{k}{nr} \;\Rightarrow\; n = \frac{k}{r V_\rho}.
\]

Let $n \to \infty$; then:

\[
\lim_{n\to\infty} n P(S_{\rho_{r,n}}) = \frac{k}{r}\lim_{n\to\infty}\frac{P(S_\rho)}{V_\rho} = \frac{k}{r} f(x).
\]

Thus, from the Poisson approximation we obtain:

\[
P\left[T_x > r\right] = 1 - \sum_{i=0}^{k}\frac{\left(\frac{k}{r}f(x)\right)^i}{i!}\, e^{-\frac{k}{r}f(x)}.
\]

To calculate the pdf of the random variable $T_x$, we differentiate its cumulative distribution function with respect to $r$:

\[
\begin{aligned}
\frac{d}{dr} P\left[T_x \le r\right] &= \frac{d}{dr}\left[1 - P\left[T_x > r\right]\right] = \frac{d}{dr}\left[1 - \left(1 - \sum_{i=0}^{k}\frac{\left[\frac{k}{r}f(x)\right]^i}{i!}\, e^{-\frac{k}{r}f(x)}\right)\right] = \frac{d}{dr}\sum_{i=0}^{k}\frac{\left[\frac{k}{r}f(x)\right]^i}{i!}\, e^{-\frac{k}{r}f(x)} \\
&= \frac{d}{dr}\left[e^{-\frac{k}{r}f(x)} + \sum_{i=1}^{k}\frac{\left[\frac{k}{r}f(x)\right]^i}{i!}\, e^{-\frac{k}{r}f(x)}\right] \\
&= \frac{kf(x)}{r^2}\, e^{-\frac{k}{r}f(x)} + \sum_{i=1}^{k}\frac{\left[\frac{k}{r}f(x)\right]^{i-1}}{(i-1)!}\left(-\frac{kf(x)}{r^2}\right) e^{-\frac{k}{r}f(x)} + \sum_{i=1}^{k}\frac{\left[\frac{k}{r}f(x)\right]^i}{i!}\cdot\frac{kf(x)}{r^2}\, e^{-\frac{k}{r}f(x)} \\
&= \frac{kf(x)}{r^2}\, e^{-\frac{k}{r}f(x)} - \sum_{i=1}^{k}\frac{kf(x)}{r^2}\cdot\frac{\left[\frac{k}{r}f(x)\right]^{i-1}}{(i-1)!}\, e^{-\frac{k}{r}f(x)} + \sum_{i=1}^{k}\frac{kf(x)}{r^2}\cdot\frac{\left[\frac{k}{r}f(x)\right]^i}{i!}\, e^{-\frac{k}{r}f(x)} \\
&= \frac{kf(x)}{r^2}\, e^{-\frac{k}{r}f(x)} - \frac{kf(x)}{r^2}\, e^{-\frac{k}{r}f(x)} + \frac{kf(x)}{r^2}\cdot\frac{\left[\frac{k}{r}f(x)\right]^k}{k!}\, e^{-\frac{k}{r}f(x)} = \frac{f(x)\left[\frac{k}{r}f(x)\right]^k}{r^2 (k-1)!}\, e^{-\frac{k}{r}f(x)}.
\end{aligned}
\]

Then, for a given $x$, the random variable $T_x$ has the pdf:

\[
h_{T_x}(y) = \frac{f(x)\left[\frac{k}{y}f(x)\right]^k}{y^2 (k-1)!}\, e^{-\frac{k}{y}f(x)}, \quad 0 < y < \infty. \qquad \square
\]

Proof of Lemma 3

We have:

\[
\lim_{n\to\infty} E\left[\left(T_1^{(n)}\right)^2 \,\middle|\, X_1 = x\right] = \int_0^{\infty} y^2 h(y)\, dy = \int_0^{\infty} \frac{f(x)(kf(x))^k}{(k-1)!}\, y^{-k} e^{-\frac{kf(x)}{y}}\, dy = \frac{f(x)(kf(x))^k}{(k-1)!}\int_0^{\infty} y^{-k} e^{-\frac{kf(x)}{y}}\, dy.
\]

We make the substitution $u = \frac{kf(x)}{y}$, with $y \in (0, +\infty)$, $u \in (+\infty, 0)$ decreasing, and $y = \frac{kf(x)}{u}$. Then

\[
du = -\frac{1}{y^2}\, k f(x)\, dy \quad \text{and} \quad y^{-k} = \left(\frac{u}{kf(x)}\right)^{k}.
\]

We obtain:

\[
\lim_{n\to\infty} E\left[\left(T_1^{(n)}\right)^2 \,\middle|\, X_1 = x\right] = \frac{f(x)(kf(x))^k}{(k-1)!}\int_{\infty}^{0} -\left(\frac{u}{kf(x)}\right)^{k} e^{-u}\, y^2\, \frac{1}{kf(x)}\, du = \frac{f(x)(kf(x))^k}{(k-1)!}\int_{\infty}^{0} -u^{k-2} e^{-u}\left(\frac{1}{kf(x)}\right)^{k-1} du = \frac{kf(x)^2}{(k-1)!}\int_0^{\infty} u^{k-2} e^{-u}\, du = \frac{kf(x)^2}{(k-1)!}\,\Gamma(k-1) = \frac{k}{k-1} f(x)^2.
\]

\[
\lim_{n\to\infty} E\left[\left(T_1^{(n)}\right)^2\right] = \lim_{n\to\infty} E\left[E\left[\left(T_1^{(n)}\right)^2 \,\middle|\, X_1 = x\right]\right] = \lim_{n\to\infty} E\left[\frac{k}{k-1} f(x)^2\right] = \frac{k}{k-1} E\left[f(x)^2\right] = \frac{k}{k-1}\left[Var[f(x)] + \left(E[f(x)]\right)^2\right] = \frac{k}{k-1}\left[Var[f(x)] + IE(f)^2\right].
\]

\[
\lim_{n\to\infty} Var\left[T_1^{(n)}\right] = \lim_{n\to\infty} E\left[\left(T_1^{(n)}\right)^2\right] - \lim_{n\to\infty}\left(E\left[T_1^{(n)}\right]\right)^2 = \frac{k}{k-1} Var[f(x)] + \frac{k}{k-1} IE(f)^2 - IE(f)^2 = \frac{k}{k-1} Var[f(x)] + \frac{1}{k-1} IE(f)^2.
\]

\[
\lim_{n\to\infty}\frac{1}{n} Var\left[T_1^{(n)}\right] = \lim_{n\to\infty}\frac{1}{n}\left[\frac{k}{k-1} Var[f(x)] + \frac{1}{k-1} IE(f)^2\right] = 0. \qquad \square
\]

Proof of Lemma 4

The relation in Lemma 4 can be proved by showing that the limiting covariance between $T_1^{(n)}$ and $T_2^{(n)}$ is zero, which is achieved by showing that the limiting probability distributions of $T_1^{(n)}$ and $T_2^{(n)}$ are independent.

Given real numbers $r, s > 0$:

\[
P\left[T_1^{(n)} > r, T_2^{(n)} > s \,\middle|\, X_1 = x, X_2 = y\right] = P\left[R_{1,k,n} < \rho_{r,n},\, R_{2,k,n} < \rho_{s,n} \,\middle|\, X_1 = x, X_2 = y\right] = P\left[\text{at least } k \text{ of } X_3, X_4, \ldots, X_n \in S_{\rho_{r,n};x} \text{ and at least } k \text{ of } X_3, X_4, \ldots, X_n \in S_{\rho_{s,n};y}\right].
\]

Note that for $x \ne y$, $\rho_{r,n} \to 0$ and $\rho_{s,n} \to 0$ as $n \to +\infty$, so for large enough $n$ we have $S_{\rho_{r,n};x} \cap S_{\rho_{s,n};y} = \emptyset$. We obtain:

\[
P\left[T_1^{(n)} > r, T_2^{(n)} > s \,\middle|\, X_1 = x, X_2 = y\right] = \sum_{\substack{k \le i,j \\ i+j \le n-2}}\frac{(n-2)!}{i!\, j!\, (n-2-i-j)!}\left[P(S_{\rho_{r,n};x})\right]^i\left[P(S_{\rho_{s,n};y})\right]^j\left[1 - P(S_{\rho_{r,n};x}) - P(S_{\rho_{s,n};y})\right]^{n-2-i-j},
\]

where $P(S_{\rho_{r,n};x}) = \int_{S_{\rho_{r,n};x}} f(t)\, dt$. Recall:

\[
\lim_{n\to\infty} n P(S_{\rho_{r,n};x}) = \frac{k}{r} f(x) = \lim_{n\to\infty}\frac{k}{r}\cdot\frac{P(S_{\rho_{r,n};x})}{V_\rho}.
\]

Then, we have:
\[
P\left[T_1^{(n)} > r, T_2^{(n)} > s \,\middle|\, X_1 = x, X_2 = y\right] = \sum_{\substack{k \le i,j \\ i+j \le n-2}}\frac{(n-2)!}{i!\, j!\, (n-2-i-j)!}\left(\frac{k}{nr}\right)^i\left(\frac{k}{ns}\right)^j\cdot\left(\frac{P(S_{\rho_{r,n};x})}{V_{\rho_{r,n}}}\right)^i\left(\frac{P(S_{\rho_{s,n};y})}{V_{\rho_{s,n}}}\right)^j\cdot\left[1 - \frac{1}{n}\left(\frac{k}{r}\cdot\frac{P(S_{\rho_{r,n};x})}{V_{\rho_{r,n}}} + \frac{k}{s}\cdot\frac{P(S_{\rho_{s,n};y})}{V_{\rho_{s,n}}}\right)\right]^{n-2-i-j}.
\]

Since

\[
\frac{(n-2)!}{(n-2-i-j)!\, n^i n^j} \to 1 \quad \text{as } n \to \infty,
\]

\[
\left[1 - \frac{1}{n}\left(\frac{k}{r}\cdot\frac{P(S_{\rho_r})}{V_{\rho_r}} + \frac{k}{s}\cdot\frac{P(S_{\rho_s})}{V_{\rho_s}}\right)\right]^{n-2-i-j} \to e^{-\left(\frac{k}{r}f(x) + \frac{k}{s}f(y)\right)} \quad \text{as } n \to \infty,
\]

\[
\lim_{n\to\infty}\left(1 - \frac{1}{n}\right)^n = e^{-1},
\]

we obtain:

\[
\lim_{n\to\infty} P\left[T_1^{(n)} > r, T_2^{(n)} > s \,\middle|\, X_1 = x, X_2 = y\right] = \sum_{i=k}^{n-2-k}\sum_{j=k}^{n-2-k}\frac{k^i}{i!\, r^i}\left[f(x)\right]^i\cdot\frac{k^j}{j!\, s^j}\left[f(y)\right]^j\cdot e^{-\left(\frac{k}{r}f(x) + \frac{k}{s}f(y)\right)} = \left[\sum_{i=k}^{n-2-k}\frac{k^i}{i!\, r^i}\left[f(x)\right]^i e^{-\frac{k}{r}f(x)}\right]\cdot\left[\sum_{j=k}^{n-2-k}\frac{k^j}{j!\, s^j}\left[f(y)\right]^j e^{-\frac{k}{s}f(y)}\right] = P\left[T_x > r\right]\cdot P\left[T_y > s\right],
\]

where, for a given $z$, the random variable $T_z$ has the pdf:

\[
h_{T_z}(y) = \frac{f(z)\left(\frac{k}{y}f(z)\right)^k}{(k-1)!\, y^2}\, e^{-\frac{kf(z)}{y}} = \frac{k^k f(z)^{k+1}}{(k-1)!\, y^{k+2}}\, e^{-\frac{kf(z)}{y}}.
\]

Therefore, since the limiting variables are independent:

\[
\lim_{n\to\infty} E\left[T_1^{(n)} T_2^{(n)}\right] = \lim_{n\to\infty}\left[E\left[T_1^{(n)}\right]\cdot E\left[T_2^{(n)}\right]\right] \;\Rightarrow\; \lim_{n\to\infty} Cov\left[T_1^{(n)}, T_2^{(n)}\right] = 0. \qquad \square
\]
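As a numerical companion to the proof of Lemma 4 (an illustration of ours, not part of the paper), one can estimate the covariance of $T_1^{(n)}$ and $T_2^{(n)}$ across independent replications for increasing n; the helper below and its parameter choices are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

def t_pair(x, k, v_d=2.0):
    """T_1^(n) and T_2^(n) from Eq. (7) for the first two points of a sample x; v_d=2.0 is V_1 for d=1."""
    n, d = x.shape
    dist, _ = cKDTree(x).query(x[:2], k=k + 1)   # column 0 is the query point itself
    return k / (n * v_d * dist[:, k] ** d)

k = 10
rng = np.random.default_rng(2)

for n in (100, 500, 2500):
    pairs = np.array([t_pair(rng.normal(size=(n, 1)), k) for _ in range(1000)])
    # Lemma 4 predicts that Cov[T_1^(n), T_2^(n)] vanishes as n grows;
    # the printed value is its Monte Carlo estimate over 1000 replications.
    print(f"n={n:5d}  empirical Cov = {np.cov(pairs[:, 0], pairs[:, 1])[0, 1]:+.3e}")
```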