RATIO MATHEMATICA
ISSUE N. 30 (2016) pp. 3-21

ISSN (print): 1592-7415
ISSN (online): 2282-8214

A geometric view on Pearson’s correlation
coefficient and a generalization of it to

non-linear dependencies

Priyantha Wijayatunga
Department of Statistics, Umeå School of Business and Economics,

Umeå University, Umeå 901 87, Sweden

priyantha.wijayatunga@umu.se

Abstract

Measuring strength or degree of statistical dependence between two ran-
dom variables is a common problem in many domains. Pearson’s correlation
coefficient ρ is an accurate measure of linear dependence. We show that ρ is
a normalized, Euclidean type distance between joint probability distribution
of the two random variables and that when their independence is assumed
while keeping their marginal distributions. And the normalizing constant
is the geometric mean of two maximal distances; each between the joint
probability distribution when the full linear dependence is assumed while
preserving respective marginal distribution and that when the independence
is assumed. Usage of it is restricted to linear dependence because it is based
on Euclidean type distances that are generally not metrics and considered
full dependence is linear. Therefore, we argue that if a suitable distance
metric is used while considering all possible maximal dependences then it
can measure any non-linear dependence. But then, one must define all the
full dependences. Hellinger distance that is a metric can be used as the dis-
tance measure between probability distributions and obtain a generalization
of ρ for the discrete case.

Keywords: metric/distance; probability simplex; normalization.

2010 AMS subject classifications: 62H20

doi: 10.23755/rm.v30i1.5

3


Priyantha Wijayatunga

1 Introduction

Measuring association between two random quantities is of interest in many
types statistical analyses and applications in various disciplines. Pearson’s product
moment correlation coefficient is the standard in statistical textbooks and appli-
cations for measuring linear association. And Spearman’s rank correlation coef-
ficient is capable of measuring any monotonic dependence between two random
variables. For two ordinal variables Cramér’s V-statistic is widely used whereas
Tchuprow’s T-statistic is less-known and therefore less often used (see [14] and
references therein). Furthermore, there are many other kinds of dependence mea-
sures used in statistical literature, especially in applied statistical analyses. In sta-
tistical genetics for evaluation of linkage disequilibrium between genetic markers,
authors of [2] use volume tests that are discussed in [10] as a measures of depen-
dence between ordinal variables with fixed margins. For massive datasets in [8]
it is used mutual information dimension that is defined in terms of information
dimension descried in [1].

In [9] it is said that “although it is customary in bivariate data analysis to com-
pute a correlation measure of some sort, one number (or index) alone can never
fully reveal the nature of dependence; hence a variety of measures are needed”.
It is also stated therein that “if (two quantities are) not totally dependent, then it
may be helpful to find some quantities that can measure the strength or degree of
dependence between them”. In this article we try to develop a measure that can in-
dicate ‘the’ degree or strength of association between two discrete variables. Our
measure can be seen as a generalization of the Pearson’s correlation coefficient ρ
using a suitable distance metric between joint probability distributions, instead of
simple Euclidean type distances that are used in ρ (see below). Given the joint
probability distribution (jpd) of two discrete variables, say, X and Y , the degree
of dependence (also called association) between them is expressed as the normal-
ized distance between the jpd of them and that of when the independence of them
is assumed. The associated normalizing constant is geometric mean of distances
between the latter and all possible jpds where full dependence between X and
Y is assumed while retaining each marginal distribution at a time. These latter
distances are in fact the maximal distances since we obtain them by assuming full
dependence. In the following we show that the Pearson’s correlation coefficient
is measure of this nature based on some Euclidean type distances. That is, it is
the ratio of the distance between dependence and independence, and the geomet-
ric mean of the distances that are between full linear dependences and indepen-
dence. Therefore, our measure can be regarded as a generalization of ρ using
a suitable distance between probability distributions and considering non-linear
dependencies. One thing that ρ shows us is that if we need to define a strength
of a dependence then we must find or hypothesize the full dependence(s) corre-

4


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

sponding to the given dependence. This aspect can make numerical evaluation of
the measure algorithmic or computational since sometimes it may not be possible
to obtain the full dependences easily. However, here we do not deal with such
computational issues but our consideration is on defining a measure following the
structure of ρ. For a given dependence (in terms of a jpd) finding efficiently re-
lated jpds representing the full dependences that preserve either of the marginal is
an open problem.

First we show that, in the simple case of binary X and Y , the ρ measures the
degree of dependence with a certain type of Euclidean distance, but for multi-
nary case (and also for continuous variables) a distance in terms another type of
Euclidean area is used. But these Euclidean type distances are appropriate for
measuring only linear dependences. Since we are interested in measuring any
non-linear dependence we propose to use Hellinger distance between joint prob-
ability distributions, that is called as Matsusita distance in the discrete (see [6]).
The Hellinger distance is a metric and it possesses the so-called linear invariance
properties, so it is more suitable for measuring distances between the probability
distributions. Therefore, it can be used to measure any type of dependence.

2 Pearson’s correlation coefficient ρ

For random variables X and Y, the Pearson’s correlation coefficient ρ(X,Y )
is such that |ρ(X,Y )| ≤ 1. The equality holds if and only if X and Y are fully
linearly dependent and ρ(X,Y ) = 0 if they are linearly independent. And the
converse of the latter is not always true unless X and Y are binary. Note that
the full dependence is linear in the binary (also called 2 × 2) case where then the
ρ(X,Y ) is often called φ-coefficient.

2.1 2 × 2 case: φ-coefficient

Let X and Y be two binary variables with a common state space {0, 1} where
their jpds and marginal probability distributions are written as pxy = p(X =
x,Y = y), px = p(X = x) and qy = p(Y = y) for x,y = 0, 1. Let P =(
p00 p01
p10 p11

)
for short. As shown in [12], any such P can be represneted as a

point in the probability simplex shown in the Figure 1. The jpd of X and Y under
the assumption that they are independent while keeping the marginal distributions

fixed is PI =
(
p0q0 p0q1
p1q0 p1q1

)
and the set of such probability distributions for all

P makes a surface (shown by lines) in the probability simplex. The φ-coefficient

5


Priyantha Wijayatunga

of X and Y is defined by

φ =
p11 −p1q1√

p1(1 −p1)q1(1 − q1)
,

which is a measure of degree of association between X and Y . Now let X
and Y be positively correlated, then there are two jpds under the assumption

that the two variables are fully dependent. They are PX =
(
p0 0
0 p1

)
and

PY =

(
q0 0
0 q1

)
, where PX is when the marginal distribution of X is pre-

served and PY is when the marginal distribution of Y is preserved. Note that each
full dependence is obtained from P while preserving respective marginal distri-
bution, then the marginal distribution of the other variable should be assumed by
it. Therefore in these cases, the full dependence is essentially linear.

For a generalization of ρ to measure ‘any’ type of dependence we need to look
at its structure and construction. First we consider the case of two binary vari-
ables by examining the φ-coefficient. Let DP I,P be p11 −p1q1 that is the (2, 2)th
component Euclidean distance between the two probability distributions PI and
P . It is a measure of how far the dependence (under P ) from the independence
(under PI ) when marginals of X and Y are fixed. Note that in the 2 × 2 case it is
sufficient to consider a single component difference (between the two probability
matrices) since all the components have same absolute difference. Similarly, we
have DP I,P X = p1(1 − q1) and DP I,P Y = q1(1 − p1). Since PX and PY are
the two full dependences that we can obtain from P while preserving respective
marginal in each case, we have that DP I,P ≤ DP I,P X and DP I,P ≤ DP I,P Y . In
fact DP I,P = p11−p1q1 = p1(p11/p1−q1) ≤ p1(1−q1) = DP I,P X since p1 ≥ p11
and similarly the other inequality. It is easy to see that the denominator of the
φ-coefficient is the geometric mean of DP I,P X and DP I,P Y (the two maximal dis-
tances) and the numerator is DP I,P . Therefore, the φ-coefficient can be thought
of as the normalized distance between P and PI where the normalizing constant
is the geometric mean of the two maximal distances. Hence the φ-coefficient is 1
if and only if P = PX = PY (full dependence) and it is 0 if and only if PI = P
(independence).

2.2 n×m case
Let X and Y be two multinary random variables where their state spaces are

{0, 1, ..,n − 1} and {0, 1, ..,m − 1} respectively for n,m > 2. For any given
jpd of X and Y, P = (p00, ...,p0(m−1); p10, ...,p1(m−1); ...; p(n−1)1, ...,p(n−1)(m−1))
where pij = p(X = i,Y = j) for i = 0, ..,n − 1 and j = 1, ...m − 1, we de-
fine the probability simplex, ∆ = {P = (pij)n×m :

∑
ij pij = 1,pij ≥ 0; i =

6


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

(0, 1, 0, 0)

(1, 0, 0, 0)

(0, 0, 1, 0)

(0, 0, 0, 1)

Figure 1: Probability simplex for binary X and Y where their jpd P =
(p00,p10,p01,p11) is a point in it. Any jpd on surface shown by lines represents
independence of X and Y.

0, 1, ..,n − 1; j = 0, 1, ...,m − 1} similar to the case of two binary random
variables. But here visualization of it is more difficult. Recall that ρ(X,Y ) =
cov(X,Y )/

√
var(X)V ar(Y ), where

cov(X,Y ) =
∑
x,y

xyp(x,y) −
∑
x

xp(x)
∑
y

yp(y)

and
var(X) =

∑
x

x2p(x) −{
∑
x

xp(x)}2.

In the following we try to visualize the ρ and its structure for understanding how
it measures the dependence.

Let us take the case where n = m, thus allowing us to have perfect (one-to-
one) dependence between X and Y, linear or non-linear. It can be seen that when
X and Y are assigned to two perpendicular axes, cov(X,Y ) is area difference
between two rectangular Euclidean areas, that is shown as the dark area in the
Figure 2. The first area (i.e.,

∑
x,y xyp(x,y)) is the weighted average area created

by the values of X and Y, where, for each component area that is being weighted
is with side lengths X = x and Y = y and its weight is the respective joint
probability of X = x and Y = y, i.e., p(X = x,Y = y). This area represents the

7


Priyantha Wijayatunga

X

Y

x1 x2 xn

y1

y2

yn

E{XY}

E{X}

E{Y}
cov(X,Y )

Figure 2: Covariance of X and Y is the weighted averaged Euclidean area differ-
ence.

dependence between X and Y . And the second area (i.e.,
∑

x xp(x)×
∑

y yp(y))
is the area created by the side lengths that are the weighted average of values of
X (i.e., E{X}) and that of Y (i.e., E{Y}) where the weights are the respective
marginal probabilities. Since the lengths or values E{X} and E{Y} are also on
same axes as X and Y are, respectively, we can see the difference of the two
areas. Note that it can be seen that the second area (i.e.,

∑
x,y xyp(x)p(y)) is also

calculated in the similar way as the first, but assuming the independence of X and
Y , i.e., it is the weighted average area created by the values of X and Y , where for
each component area that is being weighted is with side lengths X = x and Y = y
and the weight associated with it is the respective joint probability of X = x and
Y = y assuming independence p(X = x,Y = y) = p(X = x)p(Y = y). So the
second area represents the scenario of the independence of X and Y . Therefore
one can view that the two areas refer to those when a dependence between X and
Y is assumed and when their independence is assumed while keeping the marginal
distributions fixed, therefore cov(X,Y ) is a ‘distance’ in terms of a Euclidean area
difference between dependence and independence of the two variables.

Moreover var(X) can be interpreted in the same way. Now X is assumed to
be on both axes meaning that Y is replaced by X (taken as if Y were X). This is
a context of assuming a full dependence of X and Y when the marginal of X is

8


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

preserved. Assuming one variable by the other is ‘a way’ to consider a case of full
dependence between the two variables. Then we are assuming the marginal of Y
by that of X. This assumption is easily seen when both variables have same sizes
in their state spaces but it is hard to see when they are different. So the E{X2} is
indicated by the weighted average area that we obtain when Y is X where weight
for each component area x2 is p(x,y) = p(x), i.e., when the marginal of X is
preserved. This is a sensible area under full dependence. And E{X}2 is indicated
by the area when the respective weight is p(x)p(y) = p(x)2 where x = y. This is
a hypothetical case where it is taken as if Y were X, yet their joint probability is
taken as if they were independent. So, var(X) is deviation of the full dependence
from independence if Y were X. And the same interpretation applies for var(Y ).

Thus, ρ(X,Y ) is the normalized area difference referring to cov(X,Y ) with
the normalizing constant being the geometric mean of the two maximal area differ-
ences referring to cov(X,Y ) where they are such that, one is when Y is assumed
to be X (i.e., var(X)) and the other is when X is assumed to be Y (i.e., var(Y )).
That is, the normalizing constant is obtained by assuming the full dependence be-
tween X and Y. However the full dependence quantified in this way is appropriate
only for doing so for linear dependences. Since there are two such cases of full
linear dependence the geometric mean of these two maximal area differences is
taken. Note that the above interpretation is valid for the case of X and Y have
continuous state spaces.

One thing that we need to show is that cov(X,Y ) is maximal (or minimal)
when X and Y are strictly monotonically related, for example, linearly related
positively (negatively), among all cases of full ono-to-one dependencies between
X and Y for fixed maginals of X and Y . This indicates that ρ is not able to
identify non-monotonic relations since their covariance values can not be ordered.
To see that cov(X,Y ) is maximal when Y is strictly increasing with X, let X =
{a1 < ... < an} be the state space of X and Y = {b1 < ... < bn} be that of Y .
Then considering inequalities (ai−aj)(bi−bj) > 0 for i,j = 1, ...,n (i.e., we have
aibi + ajbj > aibj + ajbi) it can be shown that

∑
i aibi >

∑
i,j:j=f(i) aibj where f

is any one-to-one function from X to Y such that f(i) 6= i for at least two distinct
values of i (i.e., f is not a strictly increasing function of i). Now if the marginals of
X and that of Y are (p1, ...,pn) and (q1, ...,qn), where pi = qi for all i = 1, ...,n
when Y is monotonically increasing with X and otherwise pi = qj for some
appropriate i 6= j for i,j = 1, ....,n, then

∑
i aibipi >

∑
i,j:j=f(i) aibjpi meaning

that E{XYM} ≥ E{XY} where YM is Y when it is strictly increasing with
X. This implies that cov(X,YM ) ≥ cov(X,Y ) for fixed marginals of X and Y .
Therefore, for discrete X and Y , ρ(X,Y ) is maximal when Y is strictly increasing
in X, among all one-to-one relationships between them. So, if this is the case
ρ(X,Y ) = 1 (maximal) since cov(X,Y ) ≤ var(X) and cov(X,Y ) ≤ var(Y ).

9


Priyantha Wijayatunga

3 Some other popular measures of dependence
There are a few popular measures of dependence that have similar structure

in their definition. We review them briefly by giving some interpretations that
support our definition of dependence measure.

3.1 Spearman’s rank correlation coefficient ρs

In many statistical analyses, especially for non-normal data a popular measure
of dependence between two random variables, say, X and Y , is the Spearman’s
rank correlation coefficient.

ρs = 1 −
6
∑n

i=1 d
2
i

n(n2 − 1)
where di = x(i) − y(i) and x(i) is the ith smallest value in the data sample of X
and similarly for y(i). It is obvious that ρs = 1 if and only if two components
of data pair (xi,yi) has the same ranking, for all data pairs since then di = 0
for all i. And one can see that for a perfect negative dependence

∑n
i=1 d

2
i should

be its maximal value that is n(n2 − 1)/3 in order to get ρsX,Y = −1. Therefore
the normalizing constant is taken as n(n2 − 1)/6 but due to the structure of the
definition of the coefficient it is applied to the term

∑n
i=1 d

2
i . Therefore the ρ

s

is an accurate measure any monotonic dependence between the two variables.
However, when the two variables are not having a strictly monotonic relationship
the measure can not give a correct picture of the dependence.

3.2 Information theoretic measures
Another popular measure of dependence, especially in machine learning lit-

erature and applied statistics is so-called mutual information (see, for example,
[11]). For discrete random variables X and Y , it is defined as

I(X,Y ) =
∑
x,y

p(x,y)log
p(x,y)

p(x)p(y)

and furthermore, conditional mutual information between X and Y given another
variable Z is defined as

CI(X,Y,Z) =
∑
x,y,z

p(x,y,z)log
p(x,y|z)

p(x|z)p(y|z)
(1)

If X and Y are independent then the I(X,Y ) = 0 and if X and Y are condi-
tionally independent given Z then the CI(X,Y,Z) = 0. In fact, these depen-
dence measures are also based on so-called Kullback-Leibler (KL) distance or

10


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

rather divergance, [13]. It is easy to see that I(X,Y ) is the KL divergence be-
tween the joint probability distribution of X and Y , and that when independence
is assumed, therefore it measures the dependence in terms of ‘departure’ from
independence. In fact, I(X,Y ) is the weighted average of Euclidean distance be-
tween logarithmic of the joint probability p(x,y) and that when independence is
assumed, where weights are the respective joint probabilities. That is, it is the
expectation, under the joint probability, of the difference between the logarithmic
of the joint probability p(x,y) and that when independence is assumed. Note that
though 0 ≤ I(., .) ≤ 1, there is no normalization (with respect to any maximal
dependence) is involved.

Though these information measures are used to identify respective depen-
dences they are not metrics since KL-divergance is not a true distance (metric),
therefore they can not be used to measure the degree of dependence between
variables. For example, as shown in [7] let p(x,y) and q(x,y) define two de-

pendencies between X and Y where p(x,y) =
(

3/8 1/8
1/8 3/8

)
and q(x,y) =(

1/2 0
1/8 3/8

)
. Obviously probability distribution q shows a higher dependency

than that of p but its mutual information is lower than that of p, (MIp(X,Y ) >
MIq(X,Y )). Note that q is obtained from p without preserving the marginal
distributions of X and Y . Now let r(u,v) and s(u,v) define two dependencies

between random variables U and V where r(u,v) =


 0 1/7 1/71/7 1/7 1/7

1/7 1/7 0


 and

s(u,v) =


 0 0 2/71/7 2/7 0

1/7 1/7 0


 . Then we have that MIr(U,V ) < MIs(U,V ).

Note that s shows a higher dependency than that of r and it is obtained from r by
preserving the marginal distributions of U and V . Furthermore, all zeros in r are
also in s. If this is the case then higher dependency implies higher mutual infor-
mation. So mutual information is restricted measure of degree of dependence.

3.3 Chi squared test statistic χ2

We can see that well-known Chi squared test statistic χ2 that is used for testing
independence of two discrete random variables uses a certain dependence measure
in it for performing the test. Let X and Y take values i = 1, ...,α and j = 1, ...,β,
respectively and let us write the joint probability of X = i and Y = j as pij,
marginal probability of X = i as pi. and that of Y = i as p.j. So, the conditional
probability of X = i given Y = j is pi|j = pij/p.j and similarly pj|i is defined.

11


Priyantha Wijayatunga

Then,

χ2 =
∑
i,j

n
(pij −pi.p.j)2

pi.p.j
= n

{∑
i,j

p2ij
pi.p.j

− 1
}

= n
{∑

i,j

pij
pij −pi.p.j
pi.p.j

}
= n

{∑
i,j

pij
pi|j −pi.
pi.

}
= n

{∑
i,j

pij
pj|i −p.j
p.j

}
= nE{A}

where A is a random variable taking the value
pi|j−pi.

pi.
=

pj|i−p.j
p.j

with probability
pij, for i = 1, ...,α and j = 1, ...,β, and E denotes the expectation. That is, χ2 is
n-multiple of the expectation of a random variable whose (i,j)th value is a ‘nor-
malized’ distance between the probability value pi|j and pi. where the normalizing
constant is pi., for all i,j, and vice versa. Note that

pi|j−pi.
pi.

may be referred to as
the ‘degree’ of dependence between the two events X = i and Y = j. In fact,
it is the certainty factor for the case pi|j < pi., as described in [4] for measuring
the dependency between the two events and it is a symmetric measure. However,
here it is used without the condition. So, E{A} is the expectation of a degree of
dependence between the events X = x and Y = y for all x,y. Therefore, E{A}
can be thought of as measure of degree of dependence between X and Y. And the
term n in χ2 makes it a statistic. That is, a statistic for testing dependence between
two variables can be seen as a product of two factors; one is a quantity related the
degree of dependence between two variables and the other is that of total number
of data cases that are used to estimate the probabilities related to them (i.e., sample
information).

3.4 Test of two proportions
Sometimes one may be interested in testing equality of two proportions to see

if given two variables are independent, for example, when the outcome (Y ) of
interest is binary, such as voting, denoted by Y = 1 (or not, denoted by Y = 0),
for a political candidate in an election for two groups/populations (X) such as
men, denoted by X = 1, and women, denoted by X = 0. Then one can test if two
proportions are equal, i.e., p(Y = 1|X = 1) = p(Y = 1|X = 0) (let us write it as
p1 = q1) by the Z statistics

Z =
1√

1/a + 1/b

p1 − q1√
p(1 −p)

where a and b are the sizes of the two samples of Y when X = 1 and X = 0,
respectively, and p = p(Y = 1). Now we can interpret that the factor p1−q1√

p(1−p)
as a measure of degree of dependence between the two variables due to the term

12


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

(p1−q1) in it, where the term
√
p(1 −p) should be taken as the normalizing con-

stant. Note that the latter is constructed assuming full dependence between the two

variables where, then their joint probability distribution is P =
(

1 −p 0
0 p

)
or

similar. Instead of just using p which is the pooled proportion, the geometric mean
of p and (1 − p) should be used as the normalizing constant. This is necessary
to yield the same test statistic value for testing the same hypothesis with com-
plementary probabilities i.e., p(Y = 0|X = 1) and p(Y = 0|X = 0). And the
term 1√

1/a+1/b
which is a function of sample sizes (sample information) makes Z

a statistics. So, similar to χ2 statistic, Z has a measure of degree of dependence
between the two variables in it, in addition to information on the sample sizes.

4 Axioms of an ideal measure of dependence
Before we define our measure of strength/degree of dependence (or rather a

generalization of ρ) it is appropriate to mention axioms that an ideal measure
should possess as shown in [3]. However, it is hard to find dependence measures
satisfying all these axioms. Our generalization of ρ seems to have a bigger poten-
tial in satisfying them, but we omit the discussion here. Following are the axioms;

1. It is well-defined for both continuous and discrete case

2. It is normalized such that its value 0 implies the independence and value 1
implies the full dependence (one variable is a deterministic function of the
other), where all intermediate degrees of dependencies lie between 0 and 1

3. It is equal or has a simple relationship with the Pearson’s correlation coeffi-
cient in the case of a bivariate normal distribution

4. It is a metric, i.e., it is a true measure of distance (between the independence
and dependence of interest) not just a divergence

5. It is invariant under continuous and strictly increasing transformations.

These axioms are straightforward and require no further explanation.
In the following we define our measure following the structure and the con-

struction of ρ but using a true distance metric. We propose to use so-called
Hellinger distance but one may use another suitable distance metric. Since we
are keeping the structure of the ρ the same but replacing its distance measure with
a better one (a metric) when defining our dependence measure, we call it as a
generalization of the ρ. This means that for any given dependence we should be

13


Priyantha Wijayatunga

able to define the corresponding all possible full dependences, since the measure
should be a ratio between a distance from independence to the given dependence
and geometric average of distances from independence to the full dependences.

5 Defining a measure of degree of dependence
As we have seen earlier, in the two binary variables (2 × 2) case where only

the linear dependence exists the dependence can be measured by using a single
component Euclidean distance between joint probability distributions. However,
in the case of two multinary variables (n × n, where n > 2) we can have many
types of dependences, and therefore distances among probability distributions can
not be defined through only a single component or a weighted average area dif-
ference, that are Euclidean type distances and capable of measuring only linear
dependences. Therefore we need to use some other suitable distance to measure
any non-linear dependences. In the following we discuss a possible distance that
is a true metric.

5.1 A metric distance between two probability distributions
We propose to use Hellinger distance between probability distributions (also

called Matsushita distance for the discrete case) which is a metric in the proba-
bility simplex for our task of measuring dependence. Recall that our dependence
measure should be the normalized distance between the given joint probability
distribution of the two variables and that when their independence is assumed
while preserving the marginals, where the normalizing constant is obtained by
considering similar distances related to the all possible maximal dependences but
preserving only one of the marginals at each time. Let Φ and Ψ be two discrete
distribution functions (φ and ψ are probability distributions or mass functions)
then the Hellinger distance between Φ and Ψ is defined as

M(Φ, Ψ) =

{
1

2

∑
x

{√
φ(x) −

√
ψ(x)

}2}1/2
In addition to satisfying properties of a metric M(., .) also satisfies the following
properties: (1) 0 ≤ M(Φ, Ψ) ≤ 1, (2) M(Φ(T), Ψ(T)) = M(Φ(T + a), Ψ(T +
a)) for any constant a, and (3) M(Φ(T), Ψ(T)) = M(Φ(cT), Ψ(cT)) for any
constant c 6= 0 where the last two are called the linear invariance properties of the
probability metric. Note that

(
M(., .)

)2
is not a metric.

First we should have an idea about the furtherest jpd(s) for a given jpd that
may represent independence. In fact we can see that the furtherest probability

14


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

distribution to a distribution that represent independence is not useful but those
with fixed marginals, each at a time. For a given distribution function, say, Φ let
us find the maximally Hellinger-distanced distribution function Ψ. The following
proposition shows how to find it.

Proposition 5.1. For positive probability distribution φ maximally Hellinger-distanced
probability distribution ψ is given by

ψ(t) =

{
1, if t = argminu φ(u)
0, otherwise.

and then, M(Φ, Ψ) =
{

1 −
√
min{φ(t) : t ∈T}

}1/2
< 1.

Proof. Let |T | = n, φ(ti) = φi and ψ(ti) = ψi for i = 1, ...,n. Let re-index
all φi’s such that φ(1) ≥ φ(2) ≥ .... ≥ φ(n) and possibly some of the ψi’s can be
zeros. M(Φ, Ψ) is maximal when

∑
t∈T

√
φ(t)ψ(t) is minimal.

n∑
i=1

√
φiψi = (

√
ψ1 + ... +

√
ψn)
√
φ(n)

+(
√
ψ1 + ... +

√
ψn−1)(

√
φ(n−1) −

√
φ(n))

... +
√
ψ1(
√
φ(1) −

√
φ(2)) ≥

√
φ(n)

That is,
∑n

i=1

√
φiψi is minimal when ψ1 = ... = ψn−1 = 0 and ψn = 1. So we

obtain the maximally Hellinger-distanced distribution function Ψ and therefore
M(Φ, Ψ).2

But then T is deterministic variable with respect to Ψ! This theorem says
that for any given probability distribution, bivariate discrete in our case, the maxi-
mally Hellinger-distanced probability distribution is represented by a vertex of the
probability simplex. All its component are zeros except for one place that has 1
that is corresponding to the smallest probability value of the reference probability
distribution. This is a degenerate case as far as dependence of the two variables
are concerned since it represents that both variables are deterministic and hav-
ing full dependence. Therefore, such a full dependence can not be used for the
normalization since it does not generally preserve the marginals.

For a given jpd P of X and Y, the dependence of them that it represents
should be measured with a suitable normalized distance between P and PI . It
is clear from above that the normalizing constant should be the geometric mean
of distances from independence to all possible full dependences where each such

15


Priyantha Wijayatunga

full dependence should be preserving either of marginals. This rule is to follow
the correlation coefficient definition. Therefore, an essential step is to find the
two types of probability distributions PX (jpd(s) representing full dependence
when marginal of X is fixed) and PY (jpd(s) representing full dependence when
marginal of Y is fixed) in order to find the normalizing constant. As you will see
in some cases there may be multiple candidates for each of them. Therefore we
have the following definition. Note that there are some instances such as in [3]
and [5] where Hellinger distance between the jpd and that of when independence
is assumed is used for measuring the dependence, but in such work no normaliza-
tion is done. However, the above proposition implies that distance between any
non-deterministic jpd representing independence and that representing a full de-
pendence can be strictly less than 1 for two discrete random variables, therefore
normalization is necessary if one wants to have a measure that shows strength of
dependence.

Definition 5.1. When M is a metric in the probability simplex of two discrete
random variables X and Y, M-based measure of degree of dependence between
X and Y represented by their joint distribution function P is defined as

ρM (X,Y ) =
M(PI,P)∏

P X∈PXmax

∏
P Y ∈PYmax

{
M(PI,PX )M(PI,PY )

}1/|PYmax+PXmax|
where PI is the joint distribution function of X and Y when their independence is
assumed, PXmax denotes the set of all joint distribution functions, each represent-
ing a maximal dependence while preserving the marginal distribution of X and
similarly for PYmax, |A| is the cardinality of the set A, and M(P,Q) is the distance
metric between two probability distributions P and Q.

Note that the denominator is the geometric mean of the maximal distances be-
tween full dependences and the independence. And we use Hellinger distance as
the distance measure. Since ρM is defined following the structure of the Pearson’s
correlation coefficient it can be regarded as a generalization of it for the case of
discrete variables.

For linear relationships measuring the dependence is relatively easy since both
PX and PY represent perfect linear dependence. This is when they have all their
entries zero except for those, but may not be all, in each diagonal in respective
case. For example, for a positive linear relation, PX is obtained by assigning each
main diagonal entry with the sum of all entries in the respective row. This assures
that the marginal probability of X is preserved when obtaining full dependence,
and similarly for PY . Note that positive linear relationship is selected if main
diagonal entries are generally larger than the other entries in the joint probability
value matrix P . But when we allow non-linear relationships between X and Y

16


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

there are no pre-specified PX and PY , therefore multiple candidates may exist
for each of them. We argue that they should be induced from the jpd in a similar
way to the case of linear dependence. So we propose following simple rule for
obtaining PX and PY .

Definition 5.2. For each x, when there exists a single value y′ such that y′ =
argmaxy p(X = x,Y = y), then let p

X (X = x,Y = y′) = p(X = x) and
pX (X = x,Y 6= y′) = 0 to obtain PX . If there are multiple such y′ values then
obtain multiple PX , each refering to one of those y′ values, assuming that it is the
only value where maxima exists. And similarly PY is defined.

By this way, we get one or more jpds each representing a maximal dependence
that preserves respective marginal.

6 Examples of n×n case where n ≥ 2
Now we consider some different cases of P and demonstrate how we can

calculate our measure and compare its value to those of some trational measures.

Case 1 Suppose a simple case of each row and column of P having a single
maximal entry that is common to both its row and column. Then the other entries
in the row are summed onto the maximal entry in the row for each row to yield
PX and similarly PY is obtained. Therefore, PX and PY are on the boundary of
∆, so they are the furtherest probability distributions from PI while preserving
respective marginals. Then the degree of dependence between X and Y is defined
as (since |PXmax| = |PYmax| = 1 )

ρM (X,Y ) =
M(PI,P)√

M(PI,PX )M(PI,PY )

Example 6.1. For binary X and Y with P =
(

0.3 0.2
0.1 0.4

)
, φ = 0.4082 and

ρM = 0.2783 (Cramer’s V and Tschuprow’s T are 0.4082). And interchanging
off-main diagonal entries but keeping the main diagonal entries as they were, i.e.,

having P =
(

0.3 0.1
0.2 0.4

)
, gives the same results for all measures.

Example 6.2. Let state spaces of X and Y be {1, 2, 3} and their joint proba-

bility P =


 0.05 0.03 0.200.30 0.07 0.05

0.04 0.20 0.06


 that is a non-linear dependence and then

17


Priyantha Wijayatunga

PI =


 0.1092 0.084 0.08680.1638 0.126 0.1302

0.1170 0.090 0.0930


, PX =


 0.00 0.00 0.280.42 0.00 0.00

0.00 0.30 0.00


 and PY =


 0.00 0.00 0.310.39 0.00 0.00

0.00 0.30 0.00


. And then ρ = −0.2025 but ρM = 0.4113 (Cramer’s

V and Tschuprow’s T are 0.5472). But had that P ==


 0.05 0.03 0.200.04 0.20 0.05

0.30 0.07 0.06




which is a linear dependence then ρ = −0.5474 and ρM = 0.4075 (Cramer’s V
and Tschuprow’s T are 0.5467). Note the change in the degree of dependence is
small since linear dependence is obtained from nonlinear case by just interchang-
ing probability values in P .

Case 2 When each row and column of P has a single maximal entry that may
not be common to both its row and column we still can obtain a single PX and a
single PY . Therefore, we can apply the above definition.

Example 6.3. When P =


 0.30 0.03 0.200.05 0.07 0.05

0.04 0.20 0.06


 we have ρ = 0.1383 and

ρM = 0.450011. Note that here we have that Cramer’s V and Tschuprow’s T
are 0.4257843 that are lesser than our measure.

Case 3 When there are more than one maximal entry in a row or a column we
have multiple PX ’s and multiple PY ’s. Note that here we try to obtain a similar
situation in the above two cases. That is, each row of PX has only one non-zero
element (it is obtained by summing up all entries in the corresponding row of P ,
thereby preserving the marginal probability distribution of X). Assume that we
get a number of PX ’s, say, PX1, ...,PXa and b number of PY , say, PY1, ...,PYb .
Let us consider the following example.

Example 6.4. When P =




0.11 0.01 0.01 0.01 0.01
0.01 0.01 0.01 0.01 0.25
0.01 0.10 0.10 0.01 0.01
0.01 0.01 0.01 0.15 0.01
0.01 0.10 0.01 0.01 0.01


 then we make two

PX ’s;

18


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

PX1 =




0.15 0.000 0.000 0.00 0.00
0.00 0.000 0.000 0.00 0.29
0.00 0.230 0.000 0.00 0.00
0.00 0.000 0.000 0.19 0.00
0.00 0.140 0.000 0.00 0.00


 and

PX2 =




0.15 0.000 0.000 0.00 0.00
0.00 0.000 0.000 0.00 0.29
0.00 0.000 0.230 0.00 0.00
0.00 0.000 0.000 0.19 0.00
0.00 0.140 0.000 0.00 0.00


.

Therefore we have two maximal distances to these two full dependences. They are
M(PI,PX1 ) and M(PI,PX2 ) and similarly we obtain another two full depen-
dences when marginal of Y is preserved. Therefore,

ρM (X,Y ) =
M(P,PI )∏2

i=1

∏2
j=1

{
M(PXi,PI )M(PYj,PI )

}1
4

Then ρ = −0.0491 and ρM = 0.5731. Note that here we have that Cramer’s V
and Tschuprow’s T are 0.6652.

7 Conclusion
We have looked at the structure and the construction of the Pearson’s cor-

relation coefficient ρ in order to have a generalization of it for measuring any
non-linear dependence between two random variables. We have shown that it is
simple do it geometrically for discrete variables. It can be shown that ρ is a nor-
malized ‘Euclidean’ type distance between the joint probability distribution of the
two random variables and that when their independence is assumed in the prob-
ability simplex of the two variables where normalizing constant is the geometric
mean of two maximal such distances; each between full linear dependence of the
two variables and their independence while preserving the marginal distribution
of respective variable. So, we have shown that if we consider all possible full
dependences and use an appropriate distance such as Hellinger then we can have
a genaralization of ρ. But generally it is not easy to find all possible maximal
distances, which is an open problem that may need algorithmic or computational
solutions. However we have shown some examples after having defined a gener-
alization.

Acknowledgments: Financial support for this research is from Swedish Re-
search Council for Health, Working Life and Welfare (FORTE) and Swedish Ini-
tiative for Microdata Research in the Medical and Social Sciences (SIMSAM).

19


Priyantha Wijayatunga

References

[1] A. Rényi, Probability Theory North-Holland Publishing Company and
Akadémiai Kiadó, Publishing House of the Hungarian Academy of Sciences.
Republished Dover USA, 2007.

[2] C. Sabatti, Measuring dependency with volume tests, The American Statis-
tician 56 3 (2002), 191-195. DOI: 10.1198/000313002128.

[3] C. W. Granger, E. Maasoumi and J. Racine, A Dependence Metric for Possi-
bly Nonlinear Processes, The Journal of Time Series Analysis 25 5 (2004),
649-669.

[4] F. Berzal, I. Blanco, D. Sanchez and M. -A. Vila, Measuring the Accuracy
and Interest of Association Rules: A New Framework, Intelligent Data Anal-
ysis 6 3 (2002), 221-235.

[5] H. Skaug and D. Tjostheim, Testing for serial independence using measures
of distance between densities, P. M. Robinson and M. Rosenblatt (Eds):
Athens Conference on Applied Probability and Time Series, Volume II:
Time Series Analysis In Memory of E.J. Hannan, Springer Lecture Notes
in Statistics 115 (1996), 363-377.

[6] K. Matsusita, Decision rules, based on distance, for problems of fit, two sam-
ples, and estimation, Annals of Mathematical Statistics 26 4 (1955), 631-
640.

[7] M. Studeny and J. Vejnarova, The Multiinformation Function as a Tool for
Measuring Stochastic Dependence, M. I. Jordan (Eds): Learning in Graphi-
cal Models, Kluwer Academic Publishers (1998), 261-297.

[8] M. Sugiyama and K. M. Borgwardt, Measuring Statistical Dependence via
the Mutual Information Dimension, Proceedings of the Twenty-Third Inter-
national Joint Conference on Artificial Intelligence (IJCAI’13) AAAI Press
(2013), 1692-1698.

[9] N. Balakrishnan and C. -D. Lai, Continuous Bivariate Distributions,
Springer, 2009.

[10] P. Diaconis and B. Efron, Testing for independence in a two-way table: new
interpretations of Chi-square statistics, The Annals of Statistics 13 (1985),
845-874.

20


A geometric view on Pearson’s correlation coefficient and a generalization of it
to non-linear dependencies

[11] P. Wijayatunga, S. Mase and M. Nakamura, Appraisal of Companies with
Bayesian Networks, International Journal of Business Intelligence and Data
Mining 1 3 (2006), 326-346.

[12] S. E. Fienberg and J. P. Gilbert, The Geometry of a Two by Two Contingency
Table, Journal of the American Statistical Association 65 (1970), 694-701

[13] S. Kullback and R. A. Leibler, On information and sufficiency, The Annals
of Mathematical Statistics 22 1 (1951), 79-86

[14] W. Bergsma, A bias-correction for Cramér’s V and Tschuprow’s
T, Journal of the Korean Statistical Society 42 3 (2013), 323-328.
http://dx.doi.org/10.1016/j.jkss.2012.10.002.

21