Dual Binomial EM Examples

R. Perry, 25 June 1998

______________________________________________________________________________
0. Introduction
---------------

A simple binomial example is presented to illustrate the EM algorithm.  EM,
standing for Expectation-Maximization, had been known to statisticians for a
long time, but was made better known to others by this paper:

    A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from
    Incomplete Data via the EM Algorithm", J. Roy. Stat. Soc., Ser. B,
    vol. 39, no. 1, 1977, pp. 1-38.

Since then, the EM algorithm has found wide applicability in various areas.
We are particularly interested in its application to communication systems
and radar tracking.

In section 1 we describe the basic problem for the example under
consideration, and derive the obvious direct maximum-likelihood solution.
Section 2 provides an informal derivation of an EM algorithm for the
problem.  In section 3 a formal derivation is given in terms of the Q()
function from Dempster et al.  Section 4 presents numerical results.  In
section 5 we consider a related problem: a Bernoulli process with additive
noise.  We derive the EM algorithm and present numerical results.

______________________________________________________________________________
1. Binomial ML
--------------

Let the observed data Y be a sequence of N 0's and 1's.  For example, with
N=4:

    Y = 1, 1, 1, 0

Assume that each value of Y is a Bernoulli random variable with probability
p.  We would like to obtain a maximum-likelihood estimate of p.  In this
case, the relevant statistic from the data is the count of the number of
1's.  Let n1 represent that count and let n0 represent the count of the
number of 0's, where n1 + n0 = N.  Then the probability mass function
g(n1;N,p) is binomial:

    (1.1)  g(n1;N,p) = c p^n1 (1-p)^n0    where c = N!/(n1! n0!)

The ML estimate of p which maximizes g() can easily be obtained by
differentiating log(g()) with respect to p, setting that derivative equal to
0, then solving for p:

    log(g()) = log(c) + n1 log(p) + n0 log(1-p)

    d(log(g()))/dp = 0 = n1/p - n0/(1-p)

Solve for p:

    (1.2)  p = n1/(n1+n0)

______________________________________________________________________________
2. EM approach
--------------

Although this problem is most easily solved as shown above, directly
obtaining the ML estimate of p by differentiating log(g()), we will apply
the EM algorithm to illustrate its use.

Assume that the observed data Y and the associated counts n1 and n0
represent incomplete data produced by the following process: each value of
y was generated by randomly selecting a data generator from a set of two
such generators, g1 and g2.  The probability of selecting g1 is s; the
probability of selecting g2 is (1-s).  The data points generated from g1 and
g2 are Bernoulli random variables with associated probabilities q1 and q2.
If we know s, q1, and q2, then we can derive p by averaging over the
generator selection:

    (2.1)  p = P(y=1) = s q1 + (1-s) q2

Let s be fixed at some value in the interval (0.0,1.0), i.e. 0 < s < 1.  (If
s were exactly 0 or 1, then one of the generators g1 or g2 would never be
selected and the problem would degenerate to a single binomial
distribution.)  Let q1 and q2 be unknown.  If we can use EM to produce ML
estimates of q1 and q2, then we can use (2.1) to determine p.
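As an illustration, the two-generator process just described is easy to
simulate.  A minimal Python sketch might look as follows (the function name
and parameter values are arbitrary); the observed fraction of 1's should
come out close to the value predicted by (2.1):

    import random

    def generate_y(N, s, q1, q2, seed=0):
        """Draw N samples from the two-generator process described above."""
        rng = random.Random(seed)
        y = []
        for _ in range(N):
            q = q1 if rng.random() < s else q2  # select g1 with probability s
            y.append(1 if rng.random() < q else 0)
        return y

    # Empirical check of Eq. (2.1): the fraction of 1's should be close to
    # s q1 + (1-s) q2 = 0.4*0.3 + 0.6*0.8 = 0.60 for these arbitrary values
    s, q1, q2 = 0.4, 0.3, 0.8
    y = generate_y(100000, s, q1, q2)
    print(sum(y) / len(y))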
Let N represent the total number of data samples, and let n1 and n0
represent the numbers of 1's and 0's, as in the previous section.  Let N1
represent the total number of data samples which are produced using g1, and
let n11 and n01 represent the numbers of 1's and 0's in these samples.  Let
N2 represent the total number of data samples which are produced using g2,
and let n12 and n02 represent the numbers of 1's and 0's in these samples.
Using this notation, g1() and g2() are binomial distributions:

    (2.2)  g1(n11;N1,q1) = c1 q1^n11 (1-q1)^n01
           g2(n12;N2,q2) = c2 q2^n12 (1-q2)^n02

           where c1 = N1!/(n11! n01!) and c2 = N2!/(n12! n02!)

N, N1, N2, n1, n0, n11, n12, n01, n02 are related as follows:

    (2.3)  N  = N1 + N2 = n1 + n0
           N1 = n11 + n01
           N2 = n12 + n02
           n1 = n11 + n12
           n0 = n01 + n02

Note that n1 and n0 represent the observed or incomplete data.  n11, n12,
n01, and n02 represent the complete data.  q1 and q2 represent the
parameters for which we desire maximum-likelihood estimates.

Given estimates of q1 and q2, we must estimate the expected values of the
complete data associated with the binomial densities g1() and g2().  To do
so, we estimate n11, n12, n01, and n02 as follows:

    E-step:  n11 = s q1 n1 / (s q1 + (1-s) q2)
             n12 = (1-s) q2 n1 / (s q1 + (1-s) q2)
             n01 = s (1-q1) n0 / (s (1-q1) + (1-s) (1-q2))
             n02 = (1-s) (1-q2) n0 / (s (1-q1) + (1-s) (1-q2))

This follows because the probability of selecting g1 and obtaining a 1 is
s q1, and the probability of selecting g2 and obtaining a 1 is (1-s) q2.
Considering just the generated 1's, these two events represent a reduced
sample space, so the probabilities are normalized by dividing by their sum.
n11 as shown in the E-step represents E[n11|n1,n0,q1,q2].  A similar
argument applies for each of the E-step equations.

There are two degenerate cases to consider.  In the case of q1=q2=0, set
n11=n12=0 rather than using the first two equations of the E-step.  In the
case of q1=q2=1, set n01=n02=0 rather than using the last two equations of
the E-step.

Given estimates of n11, n12, n01, and n02, the ML estimates for q1 and q2
are:

    M-step:  q1 = n11/(n11+n01)
             q2 = n12/(n12+n02)

This follows from the formulation in (1.1) and (1.2), since the numbers of
1's produced by g1 and g2 are binomial random variables.  Note that in this
M-step, the computations of the ML estimates of q1 and q2 are decoupled.

To apply EM, initialize q1 and q2 to any values in the range (0.0,1.0) and
apply the E-step then the M-step repeatedly until convergence is obtained.

If q1=0 at the beginning of the E-step, n11 will be 0 and the M-step will
produce q1=0 again.  If q1=1 at the beginning of the E-step, n01 will be 0
and the M-step will produce q1=1 again.  Similar observations may be made
about q2=0 and q2=1.  This means that 0 and 1 are fixed points of the
iterations.  If only one of q1 and q2 is "stuck" at a fixed point, the
remaining one will still converge to a correct ML solution.  If both q1 and
q2 are "stuck", the algorithm will have converged to a possibly wrong
solution.  I believe that, given non-fixed-point initial values of q1 and
q2, the algorithm will only converge to one of the fixed points if that is
indeed the correct ML solution.  However, one should certainly avoid using
these fixed points as initial estimates of q1 or q2.

Note that although n1 and n0 are integers, n11, n12, n01, and n02 do not
have to be integers, and the E-step generally produces non-integral values
for them during the EM iterations.  Dempster et al. make a similar note
about x1 and x2 in their Example 1.  This is ok because, for example, n11
here really is an expected value of n11, not an actual value of n11, and
expected values (being averages) do not have to lie in the space of possible
actual values.
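The E-step and M-step above translate directly into code.  A minimal Python
sketch might look as follows (the function names are arbitrary); it uses the
same s, n1, n0, and starting values as the second numerical example of
section 4, so the printed p values can be compared against the transcripts
there:

    def e_step(s, q1, q2, n1, n0):
        """Expected complete-data counts given the current q1, q2."""
        d1 = s * q1 + (1 - s) * q2              # P(a generated value is a 1)
        d0 = s * (1 - q1) + (1 - s) * (1 - q2)  # P(a generated value is a 0)
        if d1 > 0:
            n11 = s * q1 * n1 / d1
            n12 = (1 - s) * q2 * n1 / d1
        else:                                   # degenerate case q1 = q2 = 0
            n11 = n12 = 0.0
        if d0 > 0:
            n01 = s * (1 - q1) * n0 / d0
            n02 = (1 - s) * (1 - q2) * n0 / d0
        else:                                   # degenerate case q1 = q2 = 1
            n01 = n02 = 0.0
        return n11, n12, n01, n02

    def m_step(n11, n12, n01, n02):
        """ML estimates of q1, q2 from the expected complete-data counts."""
        return n11 / (n11 + n01), n12 / (n12 + n02)

    def dual_binomial_em(s, n1, n0, q1=0.1, q2=0.2, iters=10):
        for _ in range(iters):
            n11, n12, n01, n02 = e_step(s, q1, q2, n1, n0)
            q1, q2 = m_step(n11, n12, n01, n02)
            print("p =", s * q1 + (1 - s) * q2)   # Eq. (2.1)
        return q1, q2

    # Same data as the section 4 transcripts: s = 0.4, n1 = 9, n0 = 8
    dual_binomial_em(0.4, 9, 8)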
______________________________________________________________________________
3. EM theory
------------

In Dempster et al., equation (2.17) defines a function Q():

    (3.1)  Q(phi'|phi) = E[ log f(X|phi') | Y,phi ]

where phi and phi' are instances of the parameter vector for which we desire
an ML estimate, X is the complete data, and Y is the observed data.  The
E-step says to compute Q(), and the M-step says to find a parameter vector
phi' which maximizes Q().  Their paper proves that iterative application of
the E-step and M-step converges to an ML estimate of the parameter vector
under certain generally applicable conditions.  The fact that there may be
degenerate cases in which it does not converge to the correct solution was
discussed at the end of the previous section.

To prove that the algorithm presented in the previous section is an EM
algorithm, we must first produce the appropriate Q() function for the
example under consideration.  In our case, phi' is (q1,q2), X is
(n11,n12,n01,n02), and Y is (n1,n0).  For phi we will use (r1,r2),
representing the current estimates of (q1,q2):

    (3.2)  Q(q1,q2|r1,r2) = E[ log(g1(n11;N1,q1) g2(n12;N2,q2)) | n1,n0,r1,r2 ]

Recall the forms of g1() and g2() from Eq.(2.2):

    g1(n11;N1,q1) = c1 q1^n11 (1-q1)^n01
    g2(n12;N2,q2) = c2 q2^n12 (1-q2)^n02

Evaluating the log of the product of g1() with g2() yields:

    (3.3)  Q(q1,q2|r1,r2) = E[ log(c1) + n11 log(q1) + n01 log(1-q1)
                             + log(c2) + n12 log(q2) + n02 log(1-q2)
                             | n1,n0,r1,r2 ]

We can drop the log(c1) and log(c2) terms, since only terms involving q1 and
q2 have any effect in the M-step.  Continuing, and evaluating the expected
value, yields:

    (3.4)  Q(q1,q2|r1,r2) = E[n11|n1,n0,r1,r2] log(q1)
                          + E[n01|n1,n0,r1,r2] log(1-q1)
                          + E[n12|n1,n0,r1,r2] log(q2)
                          + E[n02|n1,n0,r1,r2] log(1-q2)

Formulas for the expected values of n11, n12, n01, and n02 needed here have
already been derived in the E-step of the previous section.  There, we used
(q1,q2) to represent the current estimates; the same estimates are being
used in (3.4), but we are calling them (r1,r2) because this equation is also
a function of the estimates to be evaluated in the next step, which are
called (q1,q2) here.  Applying the formulas from the E-step of the previous
section:

    (3.5)  Q(q1,q2|r1,r2) = n11 log(q1) + n01 log(1-q1)
                          + n12 log(q2) + n02 log(1-q2)

Note that going from (3.4) to (3.5) represents a computation, not a
substitution of notation.  For example, if we were to fully write out the
computation expressed by the n11 log(q1) term in (3.5), it would be:

    ((s r1 n1)/(s r1 + (1-s) r2)) log(q1)

This is why Dempster's E-step says to "compute" Q().  The general form of
Q() is fixed for a specific EM algorithm application, e.g. (3.4) here.  For
each iteration of the E-step, an instantiation of Q() must be computed,
given the current estimates.  For our example, this simply involves
computing the coefficients of the log() terms.  In other problems, computing
Q() may be more difficult.
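To make the idea of computing Q() concrete, a small Python sketch might
build the instantiated Q() of (3.5) as shown below (the helper name make_Q
is arbitrary); the M-step, derived next, then maximizes the returned
function over (q1,q2):

    import math

    def make_Q(s, r1, r2, n1, n0):
        """Build the instantiated Q(q1,q2|r1,r2) of Eq. (3.5): the E-step
        computes the coefficients of the log() terms from (r1, r2)."""
        d1 = s * r1 + (1 - s) * r2
        d0 = s * (1 - r1) + (1 - s) * (1 - r2)
        n11 = s * r1 * n1 / d1
        n12 = (1 - s) * r2 * n1 / d1
        n01 = s * (1 - r1) * n0 / d0
        n02 = (1 - s) * (1 - r2) * n0 / d0
        def Q(q1, q2):
            return (n11 * math.log(q1) + n01 * math.log(1 - q1)
                    + n12 * math.log(q2) + n02 * math.log(1 - q2))
        return Q

    # One E-step instantiation, using s = 0.4, n1 = 9, n0 = 8 and current
    # estimates (r1, r2) = (0.1, 0.2); Q can then be evaluated at any (q1, q2)
    Q = make_Q(0.4, 0.1, 0.2, 9, 8)
    print(Q(0.3, 0.5), Q(0.4, 0.6))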
Continuing, the M-step must produce values for q1 and q2 which maximize Q()
from (3.5).  In this case, we can simply set the derivatives of Q() with
respect to q1 and q2 to zero and solve:

    (3.6)  d(Q(q1,q2|r1,r2))/dq1 = 0 = n11/q1 - n01/(1-q1)
             ==>>  q1 = n11/(n11+n01)

           d(Q(q1,q2|r1,r2))/dq2 = 0 = n12/q2 - n02/(1-q2)
             ==>>  q2 = n12/(n12+n02)

The resulting expressions for the optimal values of q1 and q2 are exactly
the same as were specified in the M-step of the previous section.  Since g1
and g2 are independent, the computations of the ML estimates of q1 and q2
are decoupled.

We should be able to write g() in terms of g1() and g2(), corresponding to
Dempster's Eq.(1.1), where the density function of the observed data is
written as an integral over the density function of the complete data.
Given that n1 = n11 + n12 from Eq.(2.3), the probability mass function of
g() is the convolution of g1() with g2().  But recall that generator g1() is
selected with probability s for each trial, so the total number of 1's and
0's generated by g1() is a random variable, N1.  By conditioning on N1, we
can explicitly derive the result:

    P(n11+n12=n1|N1) = sum k=0 to n1 { P(n11=k|N1) P(n12=n1-k|N1) }

                     = sum k=0 to n1 { g1(k;N1,q1) g2(n1-k;N-N1,q2) }

    g(n1;N,p) = E[ P(n11+n12=n1|N1) ]

              = sum j=0 to N { P(n11+n12=n1|N1=j) P(N1=j) }

where P(N1=j) is binomial in N trials with probability of success s.  We do
not need to evaluate or implement the above equations, so the fact that it
appears very messy to do so is of no consequence.  But note that if q1 = q2,
then g() is binomial with the same probability parameter p = q1 = q2 as g1()
and g2().  The implication of this is that the EM algorithm converges in one
step if q1 = q2, since the ML estimates of q1 and q2 then correspond exactly
to the ML estimate of p and in fact use the same formula for their
computation (i.e. the M-step and Eq.(1.2)).  This is verified by a numerical
example in the following section.
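Although we do not need to implement these equations, they are easy to check
numerically for small N.  A short Python sketch along the following lines
(function names arbitrary) confirms that g() reduces to a plain binomial pmf
when q1 = q2:

    from math import comb

    def binom_pmf(k, n, q):
        """Binomial pmf; zero outside 0 <= k <= n."""
        if k < 0 or k > n:
            return 0.0
        return comb(n, k) * q**k * (1 - q)**(n - k)

    def g_mixture(n1, N, s, q1, q2):
        """g(n1;N,p) evaluated by conditioning on N1, as in the sums above."""
        total = 0.0
        for j in range(N + 1):                  # j plays the role of N1
            conv = sum(binom_pmf(k, j, q1) * binom_pmf(n1 - k, N - j, q2)
                       for k in range(n1 + 1))
            total += conv * binom_pmf(j, N, s)  # P(N1 = j), binomial in s
        return total

    # With q1 = q2 = 0.5, g() should match a plain binomial with p = 0.5
    N, s, q = 17, 0.4, 0.5
    for n1 in range(N + 1):
        assert abs(g_mixture(n1, N, s, q, q) - binom_pmf(n1, N, q)) < 1e-12
    print("g() matches the binomial pmf when q1 = q2")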
______________________________________________________________________________
4. Numerical Examples
---------------------

In the following examples, iter(i) is a function which performs i iterations
of the EM algorithm for the dual binomial problem under consideration.  When
invoked with no arguments it performs just one iteration.  For each
iteration it prints the p value derived from Eq.(2.1) using the estimated
values of q1 and q2.  After the last iteration it prints q1, q2, n11, n12,
n01, and n02.  p_ml() is a function which computes the ML estimate of p
directly from Eq.(1.2).  p from the EM algorithm should converge to this
value.

The first example demonstrates that, starting with q1 = q2, convergence is
obtained after just one iteration:

    s = 0.4
    q1 = 0.5
    q2 = 0.5
    n1 = 9
    n0 = 8

    m:71> iter
    p = 0.5294118
    q1 = 0.5294118
    q2 = 0.5294118
    n11 = 3.6
    n12 = 5.4
    n01 = 3.2
    n02 = 4.8

    m:72> p_ml
    ans = 0.5294118

    m:73> p-p_ml
    ans = 0

    m:74>

The next example, using the same s, n1, and n0 values as above, shows
iterative convergence starting with q1 != q2:

    m:66> q1=0.1; q2=0.2

    m:67> iter
    p = 0.5162193
    q1 = 0.3962264
    q2 = 0.5962145
    n11 = 2.25
    n12 = 6.75
    n01 = 3.428571
    n02 = 4.571429

    m:68> iter
    p = 0.5289048
    q1 = 0.4089445
    q2 = 0.6088784
    n11 = 2.763196
    n12 = 6.236804
    n01 = 3.993701
    n02 = 4.006299

    m:69> iter(8)
    p = 0.5293922
    p = 0.529411
    p = 0.5294117
    p = 0.5294118
    p = 0.5294118
    p = 0.5294118
    p = 0.5294118
    p = 0.5294118
    q1 = 0.4094561
    q2 = 0.6093822
    n11 = 2.784302
    n12 = 6.215698
    n01 = 4.015698
    n02 = 3.984302

    m:70> p-p_ml
    ans = -2.442491e-15

    m:71>

The following example, using the same parameters as the previous example
except for different starting values of q1 and q2, shows that the final
estimates of q1, q2, and the complete data (n11, n12, n01, n02) are not
unique, though the ML estimate of p is unique:

    m:66> q1=0.2; q2=0.1

    m:67> iter
    p = 0.5139612
    q1 = 0.6333879
    q2 = 0.4343434
    n11 = 5.142857
    n12 = 3.857143
    n01 = 2.976744
    n02 = 5.023256

    m:68> iter(8)
    p = 0.5288267
    p = 0.5293897
    p = 0.5294109
    p = 0.5294117
    p = 0.5294118
    p = 0.5294118
    p = 0.5294118
    p = 0.5294118
    q1 = 0.6482015
    q2 = 0.4502186
    n11 = 4.40777
    n12 = 4.59223
    n01 = 2.39223
    n02 = 5.60777

    m:69> p-p_ml
    ans = -6.405987e-14

    m:70>

______________________________________________________________________________
5. Binomial with Noise
----------------------

Let the observed data be a sequence of N 0's and 1's corrupted by additive
Gaussian noise.  For example, with N=4:

    Y = 1.1, 0.9, 0.4, -0.1

In this case we would like to simultaneously determine optimal estimates of
the underlying 0 or 1 value for each element of Y, as well as the ML
estimate of p.  Each value of Y is:

    y = x + z

where x is a Bernoulli random variable with probability p, z is zero-mean
Gaussian noise with variance v, and x and z are independent.  For this EM
formulation, Y is the observed data, X is the complete data, and p is the
parameter for which we desire a maximum-likelihood estimate.

Let G(z;m,v) represent the probability density function for a Gaussian
random variable z with mean m and variance v:

    (5.1)  G(z;m,v) = exp( -(z-m)^2 / (2v) ) / sqrt(2 pi v)

The Bernoulli probability description may be written as a list of possible
values, as a probability mass function, or as a probability density function
using delta functions:

    (5.2)  P(x=1) = p,  P(x=0) = 1-p

           P(x;p) = p^x (1-p)^(1-x)    for x = 0, 1

           b(x;p) = p delta(x-1) + (1-p) delta(x)

The density function for y is easily obtained from the convolution of b(x;p)
with G(z;0,v), producing:

    (5.3)  f(y;p,v) = p G(y;1,v) + (1-p) G(y;0,v)
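A minimal Python sketch of this observation model and the mixture density
(5.3) might look as follows (function names and parameter values are
arbitrary):

    import math
    import random

    def G(z, m, v):
        """Gaussian density, Eq. (5.1)."""
        return math.exp(-(z - m)**2 / (2*v)) / math.sqrt(2*math.pi*v)

    def f(y, p, v):
        """Density of y = x + z, Eq. (5.3): a two-component mixture."""
        return p * G(y, 1, v) + (1 - p) * G(y, 0, v)

    def simulate(N, p, v, seed=0):
        """Generate N samples y = x + z, x ~ Bernoulli(p), z ~ N(0, v)."""
        rng = random.Random(seed)
        x = [1 if rng.random() < p else 0 for _ in range(N)]
        y = [xi + rng.gauss(0, math.sqrt(v)) for xi in x]
        return x, y

    x, y = simulate(50, 0.4, 0.1)
    print(y[:4])              # noisy observations
    print(f(y[0], 0.4, 0.1))  # density of the first observation for p = 0.4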
Given an estimate of p, we must produce estimates of X.  To do so, we could
try to use maximum-likelihood estimates by comparing f(y;p,v|x=0) with
f(y;p,v|x=1) and choosing the larger one.  However, f(y;p,v|x) = G(y;x,v),
which is not a function of p, so this method would not be appropriate to use
in an EM algorithm which is supposed to be finding an ML estimate of p.
Note that checking G(y;1,v) > G(y;0,v) reduces to checking y > 0.5 and does
yield ML estimates of X.  But ML estimates of X do not help us in finding an
ML estimate of p.

So consider examining P(x=0|y) and P(x=1|y) and choosing whichever one is
larger, where:

    (5.4)  P(x=0|y) = P(y|x=0) P(x=0) / P(y)

           P(x=1|y) = P(y|x=1) P(x=1) / P(y)

Substituting f(y;p,v) for P(y) and G(y;x,v) for P(y|x) we obtain:

    (5.5)  P(x=0|y) = (1-p) G(y;0,v) / f(y;p,v)

           P(x=1|y) = p G(y;1,v) / f(y;p,v)

In the E-step of the EM algorithm, we could evaluate the probabilities from
(5.5) and choose the value of x corresponding to the larger probability.
However, since the actual values of the probabilities in (5.5) do not
matter, only their relative sizes, we can simplify this step as follows.  If
P(x=1|y) > P(x=0|y) we will choose x = 1; otherwise we will choose x = 0.
P(x=1|y) > P(x=0|y) can be reduced:

    (5.6)  P(x=1|y) > P(x=0|y)

           p G(y;1,v) > (1-p) G(y;0,v)

           p exp( -(y-1)^2 / (2v) ) > (1-p) exp( -y^2 / (2v) )

           log(p) - (y-1)^2/(2v) > log(1-p) - y^2/(2v)

Rearrange terms:

    (5.7)  y > 0.5 + v (log(1-p) - log(p))

Equation (5.7) expresses the decision step for choosing x = 0 or x = 1 from
(5.5) as a threshold decision on y.  If y is greater than the threshold,
then the estimate for x is 1; otherwise it is 0.

We will assume that v (the variance of the noise) is fixed and known.  If v
were not known, it could be included as a parameter to be determined by the
EM algorithm and then estimated in the E-step.  Since
var(Y) = var(X) + var(Z), we have

    var(Y) = p (1-p) + v

which can be solved for

    v = var(Y) - p (1-p)

where var(Y) is the variance of the observed data Y.  This estimate of v
could be problematic in practice, since it may come out negative depending
on the data values and the current estimate of p.  Also, no investigation
has been done here to prove that such an estimate of v would be optimal in
any sense.

Given estimates for X, we could set n1 as the count of the number of 1's in
X, and then use the M-step of the EM algorithm to produce a maximum
likelihood estimate of p using Eq.(1.2).  However, the algorithm just
described would not really be an EM algorithm, because the E-step is not
using expected values.  This flaw was revealed by numerical simulations in
which p converged to either 0 or 1 instead of the correct ML value in cases
of low SNR.

Before specifying the correct EM algorithm for this problem, we will first
show that the estimates from Eq.(5.7) are equivalent to those from a
thresholded expected-value approach.  To estimate X using expected values,
given p:

    (5.8)  E[x|y] = (0) P(x=0|y) + (1) P(x=1|y) = P(x=1|y)

So, using P(x=1|y) from Eq.(5.5):

    (5.9)  E[x|y] = p G(y;1,v) / f(y;p,v)

To generate n1, the count of the number of 1's in X needed for the M-step,
suppose we compare E[x|y] with 0.5 for each x, counting it as a 1 if E[x|y]
exceeds 0.5 and as a 0 otherwise.  Then we have:

    (5.10)  E[x|y] > 0.5

            p G(y;1,v) / f(y;p,v) > 0.5

            p G(y;1,v) > 0.5 f(y;p,v)

            p G(y;1,v) > 0.5 [ p G(y;1,v) + (1-p) G(y;0,v) ]

            p G(y;1,v) > (1-p) G(y;0,v)

At this point, note that the result is the same as the second step in (5.6),
so this formulation leads to the same threshold equation as (5.7).  In this
case, we have not estimated the X values as being 0 or 1.  We have just
estimated their expected values, so for each x we have E[x|y] as a number in
the range 0.0 to 1.0.  However, to produce n1 as an integer count, we made
the assumption that if E[x|y] > 0.5 then x is 1, otherwise x is 0.

To summarize the results so far, we know that determination of 0 and 1
values for X using either P(x=1|y) > P(x=0|y) or E[x|y] > 0.5 leads to the
same threshold decision on y, Eq.(5.7).
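As a quick numerical sanity check, a small Python snippet (with arbitrary
parameter values) can verify on a grid of y values that the posterior
comparison of (5.6), the E[x|y] > 0.5 rule of (5.10), and the threshold test
of (5.7) all make the same decision:

    import math

    def G(z, m, v):
        return math.exp(-(z - m)**2 / (2*v)) / math.sqrt(2*math.pi*v)

    def decide_posterior(y, p, v):
        """x = 1 if P(x=1|y) > P(x=0|y), as in (5.6)."""
        return 1 if p * G(y, 1, v) > (1 - p) * G(y, 0, v) else 0

    def decide_expectation(y, p, v):
        """x = 1 if E[x|y] > 0.5, as in (5.10)."""
        fy = p * G(y, 1, v) + (1 - p) * G(y, 0, v)
        return 1 if p * G(y, 1, v) / fy > 0.5 else 0

    def decide_threshold(y, p, v):
        """x = 1 if y exceeds the threshold of Eq. (5.7)."""
        return 1 if y > 0.5 + v * (math.log(1 - p) - math.log(p)) else 0

    p, v = 0.3, 0.5
    for i in range(-200, 300):
        y = i / 100.0
        d = decide_posterior(y, p, v)
        assert d == decide_expectation(y, p, v) == decide_threshold(y, p, v)
    print("all three decision rules agree")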
To continue, and produce a correct EM algorithm, we must assume that n1 does
not have to be an integer.  If X contains all 0 and 1 values, n1 can be
written as sum(X).  If we use the non-integral E[x|y] values as estimates of
X, then we can estimate n1 as sum(E[X|Y]).  Thus, when we write n1, we
really mean the expected value of n1.  So the EM algorithm is:

    E-step:  E[x|y] = p G(y;1,v) / f(y;p,v)

             n1 = sum(E[X|Y])

    M-step:  p = n1/N

To prove that this is an EM algorithm, consider the Q() function from (3.1):

    Q(phi'|phi) = E[ log f(X|phi') | Y,phi ]

In this example, phi' is p, and for phi we will use r, representing the
current estimate of p.  For f(X|p) we substitute P(X;p), using (5.2) for
P(xi;p), i = 1..N:

    (5.11)  Q(p|r) = E[ log P(X;p) | Y,r ]

    (5.12)  P(X;p) = P(x1;p) P(x2;p) ... P(xN;p)

                   = p^x1 (1-p)^(1-x1) p^x2 (1-p)^(1-x2) ... p^xN (1-p)^(1-xN)

                   = p^S (1-p)^(N-S)    where S = x1 + x2 + ... + xN

Evaluating the log of P(X;p) yields:

    (5.13)  Q(p|r) = E[ S log(p) + (N-S) log(1-p) | Y,r ]

Evaluating the expected value yields:

    (5.14)  Q(p|r) = E[S|Y,r] log(p) + (N - E[S|Y,r]) log(1-p)

            with  E[S|Y,r] = E[x1|Y,r] + E[x2|Y,r] + ... + E[xN|Y,r]

The formula for E[S|Y,r] needed here has already been derived in the E-step
above.  There, we used p to represent the current parameter estimate; the
same estimate is being used in (5.14), but we are calling it r because this
equation is also a function of the estimate to be evaluated in the next
step, which is called p here.  Rewriting the formula from the E-step above
in terms of the notation being used here, we have:

    (5.15)  E[xi|Y,r] = r G(yi;1,v) / f(yi;r,v)    i = 1..N

            n1 = E[S|Y,r] = sum i=1 to N { E[xi|Y,r] }

Substituting the result of the computations from (5.15) into (5.14), and
letting n0 = N - n1, we have:

    (5.16)  Q(p|r) = n1 log(p) + n0 log(1-p)

The M-step must produce the value of p which maximizes Q() in (5.16).  Set
the derivative of Q() with respect to p to 0 and solve for p:

    (5.17)  d(Q(p|r))/dp = 0 = n1/p - n0/(1-p)

              ==>>  p = n1/(n1+n0)

which is exactly the same formula as the M-step previously specified, and
confirms that the algorithm derived in this section is an EM algorithm.

Following are some numerical examples using N = 50 and p = 0.4, with 0.5 as
the initial estimate of p.  To check the values of p produced by the EM
algorithm, the function p_ml() computes the ML estimate of p directly by
numerically maximizing f(Y;p,v) with respect to p, where

    f(Y;p,v) = f(y1;p,v) f(y2;p,v) ... f(yN;p,v)

(The implementation of p_ml() actually minimizes -log(f(Y;p,v)) by first
evaluating it on a grid of 1001 points of p from 0.0 to 1.0 to get an
initial estimate of the minimum; then an iterative interval-halving method
is used to produce an estimate of the minimum with a relative error
tolerance of 1E-12.)

In this example, v = 0.1 (SNR = 3.8 dB):

    p = 0.5

    m:271> iter(5)
    p = 0.3717019
    p = 0.3527167
    p = 0.3497434
    p = 0.349271
    p = 0.3491957

    m:272> pml = p_ml()

    m:273> pml
    pml = 0.3491815

    m:274> p-pml
    ans = 1.426443e-05

    m:275> sum((ex>0.5)!=x_save)
    ans = 3

    m:276> n1
    n1 = 17.45979

    m:277> sum(x_save)
    ans = 17

After 5 iterations, the algorithm reached a value of p very close to the ML
estimate.  After convergence, we can estimate X as 0 or 1 values by
comparing ex (representing E[x|y]) with 0.5, which shows in this case that 3
of the estimated X values differ from the original generated data (x_save).
Note that the final estimate of n1 is not an integral value, but ends up
close to the actual number of 1's in the original generated data.
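For concreteness, a self-contained Python sketch of this E-step and M-step,
together with a simplified grid-only stand-in for p_ml() (no
interval-halving refinement), might look as follows; since the data are
regenerated here with an arbitrary seed, the numbers will not match the
transcripts exactly:

    import math
    import random

    def G(z, m, v):
        return math.exp(-(z - m)**2 / (2*v)) / math.sqrt(2*math.pi*v)

    def em_noisy(y, v, p=0.5, iters=5):
        """EM iterations: E-step computes E[x|y], M-step sets p = n1/N."""
        N = len(y)
        for _ in range(iters):
            ex = [p * G(yi, 1, v) / (p * G(yi, 1, v) + (1 - p) * G(yi, 0, v))
                  for yi in y]                  # E[x_i | y_i] at the current p
            n1 = sum(ex)                        # expected count of 1's
            p = n1 / N                          # M-step
            print("p =", p)
        return p, ex

    def p_ml_grid(y, v, points=1001):
        """Direct ML estimate of p: maximize log f(Y;p,v) over a grid of p."""
        def loglik(p):
            return sum(math.log(p * G(yi, 1, v) + (1 - p) * G(yi, 0, v))
                       for yi in y)
        grid = [i / (points - 1) for i in range(1, points - 1)]  # skip p = 0, 1
        return max(grid, key=loglik)

    # Generate data as in the examples: N = 50, p = 0.4, v = 0.1
    rng = random.Random(1)
    x_save = [1 if rng.random() < 0.4 else 0 for _ in range(50)]
    y = [xi + rng.gauss(0, math.sqrt(0.1)) for xi in x_save]
    p_em, ex = em_noisy(y, 0.1)
    print(p_em, p_ml_grid(y, 0.1), sum(x_save))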
Following is an example with low SNR, using v = 0.5 (SNR = -3.2 dB):

    p = 0.5

    m:271> iter(5)
    p = 0.4272446
    p = 0.3843555
    p = 0.3591917
    p = 0.3443184
    p = 0.3354554

    m:272> iter(10)
    p = 0.3301409
    p = 0.3269405
    p = 0.3250079
    p = 0.3238389
    p = 0.323131
    p = 0.322702
    p = 0.322442
    p = 0.3222843
    p = 0.3221887
    p = 0.3221307

    m:273> pml=p_ml()

    m:274> pml
    pml = 0.3220413

    m:275> p-pml
    ans = 8.942352e-05

    m:276> sum((ex>0.5)!=x_save)
    ans = 10

    m:277> n1
    n1 = 16.10653

    m:278> sum(x_save)
    ans = 17

This example converged slowly, but still reached an accurate ML estimate of
p.  However, the final estimates of X are not very good, since 10 out of 50
of them are wrong.  Nevertheless, n1 is fairly close to the number of 1's in
the original generated data.