Microsoft Word - STAT991020.doc Science and Technology, 5 (2000) 115-123 © 2000 Sultan Qaboos University 115 A Generalization of the Hypergeometric Distribution E.K. Elsheikh* and A. Benmerzouga** *Salalah Teachers College of Education, Salalah, Sultanate Of Oman **Department of Mathematics and Statistics, Sultan Qaboos University, P. O. Box 36 Al-Khod, P. C. 123 Muscat, Sultanate of Oman تعميم للتوزيع فوق الهندسي على بن مرزوقة , الفضل خليفة الشيخ نقدم في هذا البحث تعديال للتوزيع فوق الهندسي يأخذ بعين االعتبار الحالة التي تكون فيها وسيلة المعاينة منحازة إلى :خالصـة ندرس .تكون تحته كل العينات متساوية االحتمال أحد النوعين الذين يجري منهما االختيار وذلك في مقابلة التوزيع الهندسي الذي خصـائص هـذا التوزيع الذي أسميناه التوزيع فوق الهندسي المعمم بما يشمل استخالص وتقييم نوعاً من التقريب الطبيعي لهذا .التوزيع ABSTRACT : In this paper we introduce a modification of the hypergeometric distribution that caters for the case when the sampling scheme favours the inclusion of units of one of the two types involved, as opposed to the hypergeometric distribution under which all samples are equally likely. The properties of the resulting distribution, termed the generalized hypergeometric, are studied, including the derivation and numerical assessment of a normal approximation of the distribution. KEYWORDS: Hypergeometric Distribution; Recurrence Relations; Normal Approximation. he celebrated capture re- capture scheme ( see , for example, Tuckwell 1995 ), as applied to the estimation of the number of fish in a lake, runs as follows. Catch a number of fish S, say, from the lake, mark them and then release them in the lake. Now, re-catch another sample of fish, of size N, say, from the lake. If X is the number of marked fish in this latter sample, then assuming X has a hypergeometric distribution, the total number of fish in the lake can be estimated. However, after the trauma of being caught and marked, the marked fish might, perhaps, be more likely to be caught again compared to the fish that escaped that experience. If that were the case, the hypergeometric model would not be applicable. The correct model should make samples with few marked fish less likely and samples with many marked fish more likely as compared to the hypergeometric model. One such model is the generalization of the hypergeometric distribution that we consider here. The distribution first arose when considering a stochastic model for two species competing on a fixed number of sites. This model will be described in section 2. The assumptions of the model and the resulting equilibrium distribution resemble their counterparts in the stochastic approach to the kinetics of chemical reactions (See, for example, McQuarrie 1967, Formosinho and Miguel 1979, and Hall 1983). We make full use of the techniques used in studying those distributions to investigate the properties of the generalized hypergeometric distribution. A Stochastic Model for Two Competing Species Consider two species A and B with total fixed sizes S and M, respectively, competing on N fixed sites. Initially some of the sites, r0 say, are occupied by species A and the rest by species B. We assume that all of the sites will continuously be occupied, but the number of sites occupied by either species may increase or decrease, respectively, by pushing out, or being pushed out by, members of the other E.K.ELSHEIKH AND A BENMERZOUGA 116 species. More specifically, denoting by X(t) the number of sites occupied by species A at time t, we assume that: (i) The probability that X(t) increases by one in (t, t+h) is )())())(((1 ho+htX-NtX-SK ; (ii) The probability that X(t) decreases by one in (t, t+h) is K X t M - N + X t h + o h2 ( )( ( )) ( ) ; and (c) The probability of more than one event in (t, t+h) is o(h), where K1 and K2 are positive constants, and o(h) is such that .0 )( lim 0 = → h ho h It might help the argument to think, tentatively, of the two species as political parties and the N sites as seats of a parliament that is being continuously updated according to the rules specified in (i), (ii), and (iii). The parameters K1 and K2 are then measures of competitiveness of the two parties. Differential Difference Equation and Equilibrium Solution Let pr(t) = P( X(t) = r X(0) = r0). It easily follows from the assumptions that pr satisfies the differential difference equation ).())())((( )()1)(1()()1)(1( 21 1211 tprNMrKrNrSK tp+r+N-M+r+Ktp+r-N+r-S K= dt (t)pd r +r-r r +−+−−− (2.1) The equilibrium solution is obtained by putting the derivative equal to zero and solving for pr = pr(∞). It is easy to see that it satisfies pr-Nr-SK=p+r+N-M+r r+r ))(()1)(1( 1 , (2.2) where K=K1/K2. (2.3) Putting r = 0,1,..., and making successive substitutions leads to the equilibrium solution r r 0p = K S!N!(M - N)! r!(S - r)!(N - r)!(M - N + r)! p . (2.4) The constant p0 can be obtained by noting that the summation of pr over all values of r is equal to 1. This distribution is a generalization of the hypergeometric distribution and can justifiably be called a generalized hypergeometric (gh) distribution. However, the term generalized distribution may mean a different thing in the literature. See for example, Johnson and Kotz (1969), pp. 158-60. The hypergeometric distribution corresponds to the special case K = 1. In fact, stressing the dependence on K, pr can be written as             −       = N M Kp rN M r S KKp rr )()( 0 (2.5) Putting K = 1 and summing over all r gives       +       = N MS N M p )1(0 . (2.6) A GENERALIZATION OF THE HYPERGEOMETRIC DISTRIBUTION 117 That is , the distribution reduces to the hypergeometric distribution. Thus, if K =1, the number of sites occupied by species A is effectively determined by taking a simple random sample of size N from the S+M units in the two species and counting the units that belong to species A. In this case all samples of size N are equally likely. The case K > 1 is more favorable to inclusion of units of type A than the case K = 1. One should expect that situations in which a few sites are occupied by A are less likely, and those in which many sites are occupied by A are more likely, compared to the hypergeometric case. The reverse should be expected when K<1. It is reasonable to interpret the number of sites r occupied by species A as few or many according as r is less than or greater than the average number of sites, µ, occupied by species A. One obvious result to expect is that µ(K) should be an increasing function of K. We now see how these expectations are met by the gh distribution. Let G x K( , ) and H x( ) denote, respectively, the probability generating functions (pgf) of the gh distribution and the hypergeometric distribution. Thus H x G x( ) ( , )= 1 . Now, from (2.5), we have (1)p(Kp(1)pK=(K)p 00r r r ) . (2.7) Summing over all r gives )(Kp(1)p=H(K) 00 (2.8) Likewise, multiplying both sides of (2.7) by r, summing over all r and using the above result gives H(K)(KHK=(K) )′µ . (2.9) Note also that H(K)(1)pK=(K)p r r r . (2.10) Multiplying both sides by xr and summing over all r expresses the pgf G in terms of H as )H(K(Kx)H=K)G(x, . (2.11) Using (2.11), or multiplying both sides of (2.2) by xr and summing over all r, leads to the following differential equation of G(x, K) KSNG - (K(S + N - 1)x + M - N + 1) dG dx + (Kx - 1)x d G dx = 0 2 2 . (2.12) Let )()1()( KppK rrr =ψ for r a non-negative integer in the support of rp . It follows from (2.10) that r r KKHK /)()( =ψ (2.13) We can now prove the following theorem. Theorem 1: Keeping the other parameters fixed: (a) )(Krψ is an increasing or decreasing function of K according as r < µ(K) or r > µ(K); (b) µ(K) is an increasing functions of K. Proof: The first assertion follows by differentiating (2.13) with respect to K, noting where the derivative is positive or negative, and expressing the condition in terms of )(Kµ , using (2.9). The assertion concerning µ(K) follows if we can establish that the derivative with respect to K is positive. It follows from (2.9) that ( )[ ]22 }][][{ KH(K)HK-(K)HK(K)HH(K)= dK (K)d ′′′+′ µ (2.14) Differentiating both sides of (2.11) twice with respect to x and putting x = 1 gives µµσ -+=1))-E(X(X=H(K(K)HK 22)2 ′′ . (2.15) E.K.ELSHEIKH AND A BENMERZOUGA 118 Substituting for µ(K) its expression in (2.9) leads to d (K) dK = (K) K > 0 2µ σ . (2.16) The essence of the comparison between the gh and the hypergeometric distributions is captured in the following corollary that easily follows from Theorem 1 . Corollary: (a) pr(1) > pr(K) if and only if r < µ(K); (b) µ(K) > µ(1) if and only if K >1. Moments of the gh Distribution It is difficult to express the moments of the gh distribution, including the mean and variance, in simple algebraic functions of the parameters. In this section we develop relations that provide convenient means of evaluating the mean and variance without the need to compute the probabilities in (2.4). Note that putting x = 1 in (2.12) leads to the relationship µ µ σ µ µ σ(M - N + )+ = K(S - )(N - )+ K2 2 . (3.1) We need to eliminate σ 2 in order to get a relation involving µ alone. This can be achieved by investigating the variation of the mean with each of the four parameters of the distribution keeping the others fixed. In the case of K this leads to a first order differential equation, while for M, S, and N it leads to recurrence relations. As for other moments, it can be shown, by differentiating (2.12) r times and putting x = 1, that the factorial moments satisfy the relations 0=1)m-+(K1)m+r+N-M+1)-2r-N+-(K(Sr)m-r)(N-K(S rrr 21 ++ , (3.2) r = 0, 1, 2,... Variation of the Mean with K Substituting (2.16) in (3.1) and noting that when K = 1 the mean reduces to that of the corresponding hypergeometric distribution, we get the following theorem. Theorem 2: The mean of the gh distribution varies with K according to the differential equation )-)(N-K(S-)+N-(M= dK d 1)-K(K µµµµ µ , (3.3) with initial value )( M+SNS=(1)µ . (3.4) It is difficult to solve this differential equation analytically but a numerical solution should be possible. Once the mean is computed, σ 2( )K can be evaluated from (3.1). Note also, from (2.16), that σ 2( )K can be obtained from the slope of the curve of µ( )K at K. Recurrence Relations for the Mean First note that, considering the dependence of pr on S, for fixed K, M, and N, we have from (2.4) )11 (Sp)-(Sp(S)pr)-(S=)-(SpS 00rr , (3.5) A GENERALIZATION OF THE HYPERGEOMETRIC DISTRIBUTION 119 true for all r . Summing over all r gives )1][ (Sp)-(Sp(S)-S=S 00µ . (3.6) Multiplying (3.5) by r and summing over all r leads to )1][1 2 (Sp)-(Sp(S)-(S)-(S)S=)-(SS 00 2 σµµµ . (3.7) Substituting (3.6) in (3.7) we get ]1].[[2 )-(S-(S)(S)-S=(S) µµµσ . (3.8) Following similar steps, it can be shown that ]11].[[2 )-N,-(M-N)(M,N)(M,-N=N)(M, µµµσ . (3.9) Finally, by interchanging M and S, and µ and µB, where µB is the average number of sites occupied by species B, it follows from (3.8) that ]1].[[2 )-(M-(M)(M)-M=(M) BBB µµµσ . (3.10) But B (M) = N - (M)µ µ . Thus ]1].[[2 (M)-)-(M(M)+N-M=(M) µµµσ . (3.11) Substituting (3.8), (3.9), and (3.11) in (3.1) and introducing suitable starting values for computational purposes leads to the following theorem. Theorem 3: The mean varies with each of the parameter(s) shown as argument, with the other parameters fixed, according to the following recurrence relations (i) ]11[]11[ )-(SK)-(+KN+S+N-M)-(SKS-)-(SS+KSN=(S) µµµµ , with µ(0) = 0. (3.12) (ii) ( )]11[]11[ -MK)-(+M)+K(S)-(MK)-N)(-(M-KSN=(M) µµµ , with µ(0) = N. (3.13) (iii) ]111[]111[ )-N,-(MK)-(+KS+M)-N,-(MK)-N(+KSN=N)(M, µµµ , (3.14) with µ(0, M) = 0 and µ(N, 0) = N. It is desirable to have a relation that describes the variation of the mean with N alone. This easily follows by suitably substituting (3.14) in either of (3.12) or (3.13). The resulting relation is given in the following corollary. Corollary: The mean varies with N according to the following relation ][][ 1)-(NK)-(1+1+N-M+KS1)-(N-SKN=(N) µµµ , with initial value µ(0) = 0. (3.15) Elsheikh (1997) has derived similar recurrence relations for the equilibrium means of some distributions arising in chemical reactions. These relations are best illustrated by an example. E.K.ELSHEIKH AND A BENMERZOUGA 120 Example Consider a gh distribution with parameters M = 6, S = 5, N = 4, and K = 2. The distribution can be worked out using (2.2). It turned out that it has mean and variance given by µ = 632/275, and 2σ = 51576/2752. Using relation (3.14), the most feasible in this case, with the given starting value, one can directly verify that µ(1) = 5/8, µ(2) = 28/23, µ(3) = 87/49, and finally, µ(4) = 632/275, in agreement with the correct value. The variance now can be obtained from (3.1). The other relations given by Theorem 3 give exactly the same values. Note that for the latter relations the variance can also be obtained from (3.8), or (3.9), or (3.11) according to the relation that was used to obtain the mean. Approximating the Mean and Variance Note that when the minimum of N, S, M is fairly large then µ(S) should not differ much from µ(S-1). Using this in (3.12), the mean can be expected to be well approximated by the positive root of the equation µ µ µ µ(M - N + ) = K(S - )(N - ) . (3.16) Differentiating (3.16) with respect to K and using (2.16) gives )-)(N-K(S=+)-K(N+)-K(S++N-(M 2 µµσµµµµ ][ . (3.17) Using (3.1) again we get µµµµσ +N-M + -S + -N += 2 11111 . (3.18) Numerical computation using the recurrence equations derived in section 3 indicates that the error in these approximations does not exceed 1, irrespective of the values of the parameters. Noting that the random variable is integer-valued, the approximations are thus very good. Maximum Likelihood Estimation First assume that all parameters of the distribution are known except K. This situation may arise when we suspect that the sampling scheme is favorable to one of the two types concerned and we want to quantify the extent of that. The likelihood function is conveniently expressed in (2.10). Differentiating with respect to K and equating the derivative to zero gives the maximum likelihood estimate (MLE) of K as the solution of the equation H(K)K=rH(K) ′ . (4.1) Using (2.9), we have the MLE of K as the solution of µ(K) = r (4.2) Using (4.2) with (3.16), we get an approximate expression of the MLE of K as ( )r-Nr)-(Sr)+N-r(M=K̂ . (4.3) The second situation to consider is when all parameters are known except M. We think of M here as the number of unmarked fish, and we assume that K is known from previous experiences. It is straightforward from (2.4) that (M))P-(Mr)P+N-(M=)-(M(M)PN)P-M(M rr 11 00 . (4.4) Summing over all r gives A GENERALIZATION OF THE HYPERGEOMETRIC DISTRIBUTION 121 )-(M)P+N-(M=(M)N)P-M(M 100 µ . (4.5) Using (4.5) in (4.4) we can express the likelihood ratio (LR) as r+N-M +N-M = )-(MP (M)P=LR(M) r r µ 1 . (4.6) The MLE of M is given as }LR(M):{M=M 1maxˆ ≥ . (4.7) This simplifies, in light of (4.6), to $ maxM = {M: (M) r}µ ≥ . (4.8) Since µ obviously increases with M (see (3.10)), it follows that $M is given by the integer part of the solution for M of the equation µ(M) = r (4.9) Using (4.9) with (3.16) gives an approximate expression of $M as     r r)+r)-r)(K(S-(N =M̂ . (4.10) where [ x ] is the integer part of x. It is interesting to note that when K = 1 the approximate MLE of M reduces to that given by the hypergeometric distribution. It is to be noticed from the preceding derivation that the same likelihood equation resulted in the estimation of K and M. If both of them are unknown, it is necessary to take more than one observation from the distribution. Binomial and Normal Approximations of the Distribution Assume K and N are finite, S → ∞, M → ∞, such that S/M = λ . It follows from (2.2) that pr)-(NK=p)+(r r1+r λ1 . (5.1) This corresponds to a binomial distribution with parameters N and p = Kλ /(1+Kλ). If N is large the normal approximation of the binomial distribution should be in effect. A normal approximation of the distribution, not necessarily through the binomial, is also possible. The derivation that follows is essentially due to Dunstan and Reynolds (1981). We have , from (2.2), r)+N-r(M )+r-)(N+r-K(S = p p 1-r r 11 . (5.2) If the minimum of N, S, and M is large, the mode m defined by }{ 1>pp:r=m rr 1max − , (5.3) approximately satisfies the equation )+m-)(N+m-K(S=m)+N-m(M 11 . (5.4) It follows from (5.2) that, for j > 0 p p p p Tm j m m i m ii j i i j + + + −= = = =∏ ∏ 11 1 (5.5) where i -1 -1 T (1 - i S - m + 1 )(1 - i N - m + 1 )(1+ i m ) (1+ i M - N + m )≅ . (5.6) Thus, E.K.ELSHEIKH AND A BENMERZOUGA 122 ]exp[ ) m+N-M 1 + m-S 1 + m-N 1 + m 1 -i(T i ≅ . (5.7) It follows that m+ j m 2p p (- j )≅ exp 2 2σ , (5.8) where 1 = 1 m + 1 N - m + 1 S - m + 1 M - N + m2σ . (5.9) Hence the distribution can be approximated by a normal distribution with mean µ and variance σ2 as given by (5.4) and (5.9). As the mean of the distribution is well approximated by the solution of (3.16), the approximate mean can be used in place of the mode. As noticed by Hall (1983), the derivation above is rather vague about the relative sizes of the parameters for which the approximation holds, apart from the requirement that the values of M, S, and N should be large, which is almost always met as far as chemical reactions are concerned. Hall has provided conditions on the parameters under which the distribution converges to normality. The approach we adopt here is to resort to numerical computations. Numerical Assessment of the Normal Approximation The analysis is restricted to N S M≤ , . We considered values of N from as small as 10. The cumulative distribution function corresponding to (2.4) was computed at each possible integral value and compared to the value given by the normal distribution function with mean and variance given by (3.16) and (3.18), respectively. A continuity correction was used. For each set of parameters the maximum difference was recorded. The approximation was taken as satisfactory so long as the maximum absolute difference does not exceed 0.001. Initially, extensive exploratory computations were carried out by varying the parameters N, M, S, and K. It was noticed that the quality of the approximation was closely related to the values of N and µ/N, where µ is the approximate mean given by the positive solution of (3.16). A program was then written to facilitate the investigation of the range of values of µ/N, for each N, under which the approximation is satisfactory. The calculations were run separately for K K< >1 1and . It was found that the approximation can be good for values of N as small as 30. The results can be summarized for the two cases of K as follows. Case of K > 1: (i) For N ≥ 30 the approximation is found to be satisfactory for µ/N ∈ [0.23, 0.59] and it gets better as K approaches 1. (ii) For N ≥ 50 also the approximation is found to be satisfactory provided µ/N ∈ [0.15, 0.69] and it gets better as K approaches 1. (iii) For N ≥ 100 also the approximation is found to be satisfactory provided µ/N ∈ [0.086, 1.0] and for all values of K. Case of K < 1: (i) For N ≥ 30 the approximation is found to be satisfactory for µ/N ∈ [0.399, 1.0) and it gets better as K approaches 1. (ii) For N ≥ 50 also the approximation is found to be satisfactory provided µ/N ∈ [0.282, 1.0) and it gets better as K approaches 1. (iii) For N ≥ 100 also the approximation is found to be satisfactory provided µ/N ∈ [0.092, 1.0) and for all values of K. Acknowledgements Thanks are due to anonymous referees for their helpful comments and suggestions. A GENERALIZATION OF THE HYPERGEOMETRIC DISTRIBUTION 123 References DUNSTAN, F. D. J. and REYNOLDS, J. F. 1981. Normal Approximations for Distributions Arising in the Stochastic Approach to Chemical Reactions Kinetics. J. Appl. Prob. 18: 263-267. ELSHEIKH, E. K. 1997. Recurrence Relations for the Equilibrium Means of Distributions Arising in Chemical Reactions. Sultan Qaboos University Journal for Scientific Research – Science and Technology. 2: 77-85. FORMOSINHO, S. J. and MIGUEL, M. DA G. M. 1979. Markov Chains for Plotting the Course of Chemical Reactions. J. Chem. Education 56: 582-585. HALL, A. 1983. On the Roles of the Bessel and Poisson Distributions in Chemical Reactions Kinetics. J. Appl. Prob. 20: 585- 595. JOHNSON, N. L. and KOTZ, S. 1969. Discrete distributions. John Wiley & Sons. New York. McQUARRIE, D. A. 1967. Stochastic Approach To Chemical Kinetics. J. Appl. Prob. 4: 413-478. TUCKWELL, H. C. 1995. Elementary Applications of Probability Theory. Chapman & Hall. New York. Received 20 January 1999 Accepted 25 June 2000