Int. J. Anal. Appl. (2023), 21:57

A PDE Approach to the Problems of Optimality of Expectations

Mahir Hasanov*
Department of Mathematics, Istanbul Beykent University, Türkiye
*Corresponding author: hasanov61@yahoo.com

Abstract. Let $(X,Z)$ be a bivariate random vector. A predictor of $X$ based on $Z$ is just a Borel function $g(Z)$. The problem of "least squares prediction" of $X$ given the observation $Z$ is to find the global minimum point of the functional $E[(X-g(Z))^2]$ with respect to all random variables $g(Z)$, where $g$ is a Borel function. It is well known that the solution of this problem is the conditional expectation $E[X|Z]$. It is also known that if, for a nonnegative smooth function $F:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$, $\arg\min_{g(Z)}E[F(X,g(Z))]=E[X|Z]$ for all $X$ and $Z$, then $F(x,y)$ is a Bregman loss function. It is also of interest, for a fixed $\varphi$, to find $F(x,y)$ satisfying $\arg\min_{g(Z)}E[F(X,g(Z))]=\varphi(E[X|Z])$ for all $X$ and $Z$. In a more general setting, a stronger problem is to find $F(x,y)$ satisfying $\arg\min_{y\in\mathbb{R}}E[F(X,y)]=\varphi(E[X])$ for all $X$. We study this problem and develop a partial differential equation (PDE) approach to the solution of these problems.

1. Introduction and Preliminary Facts

Best approximation problems in mathematics have a long history of study. It is known that for every given $x$ in a Hilbert space $H$ and every given closed subspace $L$ of $H$ there is a unique best approximation to $x$ out of $L$ (namely, $y=Px$, where $P$ is the orthogonal projection of $H$ onto $L$) (see [8] and [11]). Theorem 1.1 below, regarding the optimality of conditional expectations with respect to the $L^2$ loss function $F(x,y)=(x-y)^2$, follows from this result.

Theorem 1.1. (see [1], [9], [13]) Let $(X,Z)$ be a bivariate random vector and $L_Z=\{g(Z)\mid g(Z)\in L^2(\Omega),\ g \text{ is a Borel function}\}$. Let $E[X^2]<\infty$. Then there exists a Borel function $g_0:\mathbb{R}\to\mathbb{R}$ with $E[(g_0(Z))^2]<\infty$ such that
\[ E[(X-g_0(Z))^2]=\inf\{E[(X-g(Z))^2]\mid g(Z)\in L_Z\}. \]
Moreover, $g_0(Z)=E[X|Z]$.

Received: Apr. 14, 2023.
2020 Mathematics Subject Classification. 60A05, 49K45.
Key words and phrases. expectation; conditional expectation; random variables; Bregman loss functions; partial differential equations.
https://doi.org/10.28924/2291-8639-21-2023-57
ISSN: 2291-8639
© 2023 the author(s).

This theorem means that the distance function $\|X-Y\|_2^2$ attains its minimum value over $L_Z$ at $Y=\psi(Z)=E[X|Z]$. Thus,
\[ \arg\min_{Y\in L_Z}\|X-Y\|_2^2=E[X|Z]. \tag{1.1} \]

We recall some basic notions and facts from probability theory in the form we use in this paper ([1], [9], [13]).

Expectation. Let $(\Omega,\mathcal{F},P)$ be a probability space and $X:\Omega\to\mathbb{R}$ be a random variable. By definition, a random variable is measurable, i.e., $X^{-1}(\sigma_B)\subset\mathcal{F}$, where $\sigma_B$ is the Borel $\sigma$-algebra, consisting of all Borel sets in $\mathbb{R}$. The expectation of a random variable $X$ is defined by the following integral, which is a Lebesgue integral with respect to the probability measure:
\[ E[X]=\int_\Omega X\,dP. \]
In particular, for a simple random variable $X(\omega)=\sum_{i=1}^n a_i\chi_{A_i}(\omega)$,
\[ E[X]=\sum_{i=1}^n a_iP(A_i). \tag{1.2} \]
We write $L^2(\Omega)=\{X\mid \int_\Omega|X|^2\,dP<\infty\}$. The norm in $L^2(\Omega)$ is defined by
\[ \|X\|_2=\Big(\int_\Omega|X|^2\,dP\Big)^{1/2}. \]

Conditional Expectation. Let $(X,Z)$ be a bivariate random vector. The conditional expectation of $X$ given $Z$ is denoted by $E[X|Z]$. It is a random variable, defined by
\[ \psi(Z)(\omega)=\psi(Z(\omega))=E[X|Z=Z(\omega)],\quad\forall\omega\in\Omega. \]

The following problem is a natural generalization of problem (1.1) and has very important applications (see [2] and references therein): find a loss function $F(x,y)$ satisfying the condition
\[ \arg\min_{y\in\mathbb{R}}E[F(X,y)]=\varphi(E[X]),\quad\forall X, \tag{1.3} \]
where $\varphi$ is a Borel function. In this paper our main concern will be problem (1.3). Such problems arise in different contexts of statistics and probability theory (see [4]). In the case $\varphi(x)=x$ with $F(x,y)=C(x-y)$ and $F(x,y)=(x-y)^2$, the optimality of conditional expectations has been studied by many authors (see [1], [9], [10], [13]).
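As a quick numerical illustration of problem (1.3) in the classical case $\varphi(x)=x$, $F(x,y)=(x-y)^2$, the following minimal Python sketch evaluates $E[F(X,y)]$ for a simple random variable via formula (1.2) and searches a grid for the minimizer. The two-point distribution, the grid, and the helper names `expectation` and `expected_loss` are assumptions made for illustration only; they do not come from the paper.

```python
# Minimal numerical sketch (illustrative, not part of the paper):
# for a two-point random variable X, check on a grid that y = E[X]
# minimizes E[(X - y)^2].

def expectation(values, probs):
    """Expectation of a simple random variable, as in formula (1.2)."""
    return sum(v * p for v, p in zip(values, probs))

def expected_loss(loss, xs, probs, y):
    """E[F(X, y)] for a simple random variable X with values xs."""
    return expectation([loss(x, y) for x in xs], probs)

def squared_loss(x, y):
    return (x - y) ** 2

# Hypothetical example data: P(X = 1) = 0.25, P(X = 5) = 0.75.
xs = [1.0, 5.0]
probs = [0.25, 0.75]
mean = expectation(xs, probs)  # 0.25 * 1 + 0.75 * 5

# Grid search over y in [0, 6] with step 0.001.
grid = [k / 1000 for k in range(0, 6001)]
best_y = min(grid, key=lambda y: expected_loss(squared_loss, xs, probs, y))

print(mean, best_y)
```

A grid search only approximates the minimizer up to the grid step, so this is a sanity check rather than a proof; replacing `squared_loss` with another candidate $F$ gives a quick way to probe condition (1.3) on examples.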
For $\varphi(x)=x$ and an arbitrary function $F(x,y)$, the Bregman loss functions play an important role ([5], [6], [7]). In particular, it was proved in [2] (see Theorem 1.2 below) that if, for a nonnegative smooth function $F:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$, $\arg\min_{g(Z)}E[F(X,g(Z))]=E[X|Z]$ for all $X$ and $Z$, then $F(x,y)$ is a Bregman loss function.

Definition 1.1. Let $f:\mathbb{R}\to\mathbb{R}$ be a strictly convex differentiable function. Then the Bregman Loss Function (BLF) $D_f:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is defined as
\[ D_f(x,y)=f(x)-f(y)-f'(y)(x-y). \]

In general, Bregman loss functions are defined by using strictly convex differentiable functions $f:\mathbb{R}^n\to\mathbb{R}$. In this paper, for convenience, we consider the case $n=1$. All results can easily be extended to the case $n>1$. For more information on Bregman loss functions see [3] and [12]. The following theorem contains the most general result regarding problem (1.3) in the case $\varphi(x)=x$.

Theorem 1.2. ([2]) Let $D_f:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ be a BLF. Then
\[ \arg\min_{Y\in L_Z}E[D_f(X,Y)]=E[X|Z]. \]
Moreover, if $F:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$, $F\ge 0$, $F(x,x)=0$, $F$ and $F_x$ are continuous functions, and for all $X$ and $Z$
\[ \arg\min_{Y\in L_Z}E[F(X,Y)]=E[X|Z], \]
then $F$ is a BLF.

The rest of this paper is organized as follows. In Section 2 we present a theorem about the optimality of expectations. Section 3 consists of two subsections. In Subsection 3.1 we develop a partial differential equation approach for critical points of $E[F(X,y)]$. The main problem studied in this subsection is: when is $y=\varphi(E[X])$ a critical point of the function $E[F(X,y)]$ for every $X\in L^1(\Omega)$? We present a partial differential equation approach for solving this problem and give a necessary and sufficient condition. In Subsection 3.2 we study extremum problems. Our main goal is to find the class of all $F$ such that $y=\varphi(E[X])$ is a unique extremum point of $E[F(X,y)]$ for all $X\in L^1(\Omega)$.

2. On the Optimality of Expectations

We start with a slightly stronger version of Theorem 1.2.

Theorem 2.1.
Let $F:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$, $F\ge 0$, $F(x,x)=0$, and let $F_x$ and $F_y$ be continuous. Suppose that there exists a function $\varphi:\mathbb{R}\to\mathbb{R}$ such that $\varphi(E[X])$ is the unique minimizer of $E[F(X,y)]$ in $\mathbb{R}$ for all $X\in L^1(\Omega)$, i.e.,
\[ \arg\min_{y\in\mathbb{R}}E[F(X,y)]=\varphi(E[X]),\quad\forall X\in L^1(\Omega), \]
provided that $F(X,y)\in L^1(\Omega)$. Then $F(x,y)$ is a BLF if and only if $\varphi(x)=x$.

Proof. Let $F(x,y)$ be a BLF. Then $F(x,y)=f(x)-f(y)-f'(y)(x-y)$. We can write
\[ F(X,y)=f(X)-f(y)-f'(y)(X-y) \]
and
\[ F(X,E[X])=f(X)-f(E[X])-f'(E[X])(X-E[X]). \]
Hence,
\[ F(X,y)-F(X,E[X])=f(E[X])-f(y)+f'(E[X])(X-E[X])-f'(y)(X-y). \]
Obviously,
\[ E\big[f'(E[X])(X-E[X])\big]=0 \quad\text{and}\quad E\big[f'(y)(X-y)\big]=f'(y)(E[X]-y). \]
Then
\[ E\big[F(X,y)-F(X,E[X])\big]=f(E[X])-f(y)-f'(y)(E[X]-y). \]
Consequently,
\[ E\big[F(X,y)-F(X,E[X])\big]=D_f(E[X],y)\ge 0. \tag{2.1} \]
Since $F(x,y)=D_f(x,y)$ is a BLF, $D_f(E[X],y)=0\Leftrightarrow y=E[X]$. Thus $y=E[X]$ is a minimum point of $E[F(X,y)]$. By assumption, $\varphi(E[X])$ is the unique minimizer. It follows immediately that $\varphi(x)=x$.

Now let $\varphi(x)=x$ and
\[ \arg\min_{y\in\mathbb{R}}E[F(X,y)]=E[X],\quad\forall X\in L^1(\Omega). \]
Then it follows from this condition that $F$ is a BLF. This case was proved in [2] (see Theorem 3). □

3. A PDE Approach to Optimality Problems

3.1. Critical Points. In this subsection we develop a partial differential equation (PDE) approach for critical points of $E[F(X,y)]$. More precisely, the main question is: when is $y=\varphi(E[X])$ a critical point of the function $E[F(X,y)]$ for every $X$? We give a necessary and sufficient condition that answers this question. The following assumption will be needed throughout this section: $F:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$, $F(x,x)=0$, and the function $F$ has first and second derivatives. Now we prove a critical point theorem.

Theorem 3.1. Let $\varphi:\mathbb{R}\to\mathbb{R}$ be an invertible function. Then $y=\varphi(E[X])$ is a critical point of the function $E[F(X,y)]$ for all $X\in L^1(\Omega)$ if and only if $F(x,y)$ is a solution of the following PDE:
\[ F_{xy}\,(\varphi^{-1}(y)-x)+F_y=0. \tag{3.1} \]

Proof.
Let $y=\varphi(E[X])$ be a critical point of the function $E[F(X,y)]$ for all $X\in L^1(\Omega)$. Consider a simple random variable $X$ such that $P(X=a)=p$, $P(X=b)=q$, $a\ne b$, and $p+q=1$. By (1.2),
\[ E[F(X,y)]=pF(a,y)+qF(b,y) \]
and $\varphi(E[X])=\varphi(pa+qb)$. Then
\[ pF_y(a,\varphi(pa+qb))+qF_y(b,\varphi(pa+qb))=0. \]
It means that
\[ \frac{F_y(a,\varphi(pa+qb))}{q}=-\frac{F_y(b,\varphi(pa+qb))}{p} \;\Leftrightarrow\; \frac{F_y(a,\varphi(pa+qb))}{q(b-a)}=-\frac{F_y(b,\varphi(pa+qb))}{p(b-a)}. \tag{3.2} \]
We have $y=\varphi(E[X])\Rightarrow y=\varphi(pa+qb)\Rightarrow pa+qb=\varphi^{-1}(y)$. Note that $pa+qb-a=q(b-a)$ and $pa+qb-b=-p(b-a)$. Hence,
\[ \varphi^{-1}(y)-a=q(b-a) \quad\text{and}\quad \varphi^{-1}(y)-b=-p(b-a). \]
It follows from equation (3.2) that
\[ \frac{F_y(a,y)}{\varphi^{-1}(y)-a}=\frac{F_y(b,y)}{\varphi^{-1}(y)-b}. \]
Therefore, the function $\dfrac{F_y(x,y)}{\varphi^{-1}(y)-x}$ does not depend on $x$. Then
\[ \frac{\partial}{\partial x}\left[\frac{F_y(x,y)}{\varphi^{-1}(y)-x}\right]=0 \quad\text{and}\quad \frac{F_{xy}\,(\varphi^{-1}(y)-x)+F_y}{(\varphi^{-1}(y)-x)^2}=0. \]
Consequently,
\[ F_{xy}\,(\varphi^{-1}(y)-x)+F_y=0. \]

To finish the proof of this theorem, we need to show that (3.1) implies that $y=\varphi(E[X])$ is a critical point of the function $E[F(X,y)]$ for all $X\in L^1(\Omega)$. By (3.1),
\[ F_{xy}\,(\varphi^{-1}(y)-x)+F_y=0. \]
Multiplying this equation by the integrating factor $\mu(x,y)=\dfrac{1}{(\varphi^{-1}(y)-x)^2}$ we get
\[ \frac{1}{\varphi^{-1}(y)-x}F_{xy}+\frac{1}{(\varphi^{-1}(y)-x)^2}F_y=0. \]
Then
\[ \left(\frac{1}{\varphi^{-1}(y)-x}F_y\right)_x=0 \quad\text{and}\quad \frac{F_y}{\varphi^{-1}(y)-x}=C(y)\;\Rightarrow\;F_y=(\varphi^{-1}(y)-x)\,C(y). \]
Setting $y=\varphi(E[X])$ we get
\[ F_y(X,\varphi(E[X]))=\big(E[X]-X\big)\,C(\varphi(E[X])) \]
and
\[ E\big[F_y(X,\varphi(E[X]))\big]=\big(E[X]-E[X]\big)\,C(\varphi(E[X]))=0. \]
□

We next give an application of this theorem.

Example 3.1. Let us find a general solution of the problem
\[ F_{xy}\,(\varphi^{-1}(y)-x)+F_y=0,\quad F(x,x)=0, \]
in the case $\varphi(y)=y$.

Solution. We can write the equation in the form
\[ F_{xy}+\frac{1}{y-x}F_y=0. \]
Multiplying this equation by the integrating factor $\mu(x,y)=\dfrac{1}{y-x}$ we get
\[ \frac{1}{y-x}F_{xy}+\frac{1}{(y-x)^2}F_y=0. \]
Then
\[ \left(\frac{1}{y-x}F_y\right)_x=0 \quad\text{and}\quad \frac{F_y}{y-x}=C(y). \]
Let $C(y)=f''(y)$. Using integration by parts we obtain
\[ \int_x^y F_y(x,t)\,dt=\int_x^y f''(t)(t-x)\,dt=\big[f'(t)(t-x)\big]_{t=x}^{t=y}-\int_x^y f'(t)\,dt. \]
Consequently, $F(x,y)=f(x)-f(y)-f'(y)(x-y)$.

The following corollary follows immediately from this example and Theorem 3.1.

Corollary 3.1. If $F(x,x)=0$ and $y=E[X]$ is a critical point of the function $E[F(X,y)]$ for all $X\in L^1(\Omega)$, then $F(x,y)$ can be written in the form $F(x,y)=f(x)-f(y)-f'(y)(x-y)$ for a differentiable function $f$.

Note. Under the additional conditions that $F(x,y)\ge 0$ and $E[X]$ is the unique minimizer, it was proved in [2] that $F$ is a BLF.

3.2. Extreme Points. Let $\varphi:\mathbb{R}\to\mathbb{R}$ be an invertible function. In this subsection the main problem is to find the class of all $F$ such that $y=\varphi(E[X])$ is a unique extremum point of $E[F(X,y)]$ for all $X\in L^1(\Omega)$. We first prove the following theorem.

Theorem 3.2. Let $\arg\min_{y\in\mathbb{R}}E[F(X,y)]=\varphi(E[X])$, $\forall X\in L^1(\Omega)$. Then
\[ F(x,y)=\big(\varphi^{-1}(y)-x\big)f'(y)-\big(\varphi^{-1}(x)-x\big)f'(x)-\int_x^y f'(t)\big(\varphi^{-1}(t)\big)'\,dt, \tag{3.3} \]
where $f$ is a differentiable function satisfying the condition
\[ \big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt>0,\quad\forall y\ne\varphi(x). \tag{3.4} \]

Proof. By Theorem 3.1,
\[ F_{xy}\,(\varphi^{-1}(y)-x)+F_y=0,\quad F(x,x)=0. \]
Then
\[ \left(\frac{1}{\varphi^{-1}(y)-x}F_y\right)_x=0 \quad\text{and}\quad \frac{F_y}{\varphi^{-1}(y)-x}=C(y). \]
Setting $C(y)=f''(y)$ we can write
\[ F_y=\big(\varphi^{-1}(y)-x\big)f''(y). \tag{3.5} \]
Using integration by parts in (3.5) we obtain
\[ \int_x^y F_y(x,t)\,dt=\int_x^y\big(\varphi^{-1}(t)-x\big)\,df'(t)=\big[f'(t)(\varphi^{-1}(t)-x)\big]_{t=x}^{t=y}-\int_x^y f'(t)\big(\varphi^{-1}(t)\big)'\,dt. \]
Consequently, since $F(x,x)=0$,
\[ F(x,y)=\big(\varphi^{-1}(y)-x\big)f'(y)-\big(\varphi^{-1}(x)-x\big)f'(x)-\int_x^y f'(t)\big(\varphi^{-1}(t)\big)'\,dt \]
and (3.3) holds. Now we use the condition $\arg\min_{y\in\mathbb{R}}E[F(X,y)]=\varphi(E[X])$. This condition means that
\[ E\big[F(X,y)-F(X,\varphi(E[X]))\big]>0, \]
provided that $y\ne\varphi(E[X])$. Using (3.3) we obtain
\[ E\big[F(X,y)-F(X,\varphi(E[X]))\big]=\big(\varphi^{-1}(y)-E[X]\big)f'(y)+\int_y^{\varphi(E[X])} f'(t)\big(\varphi^{-1}(t)\big)'\,dt>0. \]
Thus,
\[ \big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt>0,\quad\forall y\ne\varphi(x). \]
□

Note.
In the case $\varphi(x)=x$,
\[ \big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt>0,\ \forall y\ne\varphi(x) \;\Rightarrow\; f(x)-f(y)-f'(y)(x-y)>0,\ x\ne y, \]
and
\[ F(x,y)=\big(\varphi^{-1}(y)-x\big)f'(y)-\big(\varphi^{-1}(x)-x\big)f'(x)-\int_x^y f'(t)\big(\varphi^{-1}(t)\big)'\,dt \;\Rightarrow\; F(x,y)=f(x)-f(y)-f'(y)(x-y). \]
Therefore, in the case $\varphi(x)=x$, condition (3.4) means that $f$ is a strictly convex function and (3.3) means simply that $F(x,y)$ is a Bregman loss function.

Corollary 3.2. Let $\arg\max_{y\in\mathbb{R}}E[F(X,y)]=\varphi(E[X])$, $\forall X\in L^1(\Omega)$. Then
\[ F(x,y)=\big(\varphi^{-1}(y)-x\big)f'(y)-\big(\varphi^{-1}(x)-x\big)f'(x)-\int_x^y f'(t)\big(\varphi^{-1}(t)\big)'\,dt \]
and
\[ \big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt<0,\quad\forall y\ne\varphi(x). \]

Finally, we discuss condition (3.4), which is a generalization of the strict convexity condition. The main question is: are there functions satisfying the inequality
\[ \big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt>0,\quad\forall y\ne\varphi(x)? \]
Regarding this question, we prove the following theorem.

Theorem 3.3. Let $\varphi(x)$ be an increasing function and $f''(x)>0$, $\forall x\in\mathbb{R}$. Then
\[ \big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt>0,\quad\forall y\ne\varphi(x). \]

Proof. Let us define
\[ G(x,y)=\big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt. \]
Then
\[ G_y(x,y)=\big(\varphi^{-1}(y)\big)'f'(y)+\big(\varphi^{-1}(y)-x\big)f''(y)-\big(\varphi^{-1}(y)\big)'f'(y) \;\Rightarrow\; G_y(x,y)=\big(\varphi^{-1}(y)-x\big)f''(y). \]
We have
\[ y>\varphi(x)\Leftrightarrow\varphi^{-1}(y)-x>0\Leftrightarrow G_y(x,y)>0,\qquad y<\varphi(x)\Leftrightarrow\varphi^{-1}(y)-x<0\Leftrightarrow G_y(x,y)<0, \]
and $G_y(x,\varphi(x))=0$. Moreover, $G(x,\varphi(x))=0$, so for fixed $x$ the function $G(x,\cdot)$ decreases strictly to $0$ on $y<\varphi(x)$ and increases strictly from $0$ on $y>\varphi(x)$. Consequently,
\[ G(x,y)=\big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt>0,\quad\forall y\ne\varphi(x). \]
□

Corollary 3.3. Let $\varphi(x)$ be a decreasing function and $f''(x)>0$, $\forall x\in\mathbb{R}$. Then
\[ \big(\varphi^{-1}(y)-x\big)f'(y)+\int_y^{\varphi(x)} f'(t)\big(\varphi^{-1}(t)\big)'\,dt<0,\quad\forall y\ne\varphi(x). \]

Conflicts of Interest: The author declares that there are no conflicts of interest regarding the publication of this paper.

References

[1] K.B. Athreya, S.N. Lahiri, Measure Theory and Probability Theory, Springer Texts in Statistics, Springer, New York, 2006.
[2] A.
Banerjee, X. Guo, H. Wang, On the Optimality of Conditional Expectation as a Bregman Predictor, IEEE Trans. Inform. Theory. 51 (2005), 2664–2669. https://doi.org/10.1109/tit.2005.850145.
[3] H.H. Bauschke, M.S. Macklem, J.B. Sewell, X. Wang, Klee Sets and Chebyshev Centers for the Right Bregman Distance, J. Approx. Theory. 162 (2010), 1225–1244. https://doi.org/10.1016/j.jat.2010.01.001.
[4] A. Ben-Tal, A. Charnes, M. Teboulle, Entropic Means, J. Math. Anal. Appl. 139 (1989), 537–551. https://doi.org/10.1016/0022-247x(89)90128-5.
[5] Y. Censor, S. Zenios, Parallel Optimization: Theory, Algorithms, and Applications, Oxford University Press, London, 1998.
[6] I. Csiszar, Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems, Ann. Stat. 19 (1991), 2032–2066. https://doi.org/10.1214/aos/1176348385.
[7] I. Csiszar, Generalized Projections for Non-Negative Functions, in: Proceedings of 1995 IEEE International Symposium on Information Theory, IEEE, Whistler, BC, Canada, 1995: p. 6. https://doi.org/10.1109/ISIT.1995.531108.
[8] F. Deutsch, Best Approximation in Inner Product Spaces, Springer-Verlag, New York, 2021.
[9] G. Grimmett, D. Stirzaker, Probability and Random Processes, Oxford University Press, Oxford, 2004.
[10] S. Karlin, H.M. Taylor, A Second Course in Stochastic Processes, 2nd ed. Academic Press, San Diego, 1991.
[11] M. Hasanov, The Spectra of Two-Parameter Quadratic Operator Pencils, Math. Computer Model. 54 (2011), 742–755. https://doi.org/10.1016/j.mcm.2011.03.018.
[12] D. Reem, S. Reich, A. De Pierro, Re-Examination of Bregman Functions and New Properties of Their Divergences, Optimization. 68 (2018), 279–348. https://doi.org/10.1080/02331934.2018.1543295.
[13] D. Williams, Probability with Martingales, Cambridge Mathematical Textbooks, Cambridge University Press, Cambridge, 2001.