Sequence Estimation EM Examples
R. Perry, 25 June 1998

______________________________________________________________________________

0. Introduction
---------------

A simple sequence estimation example is presented to illustrate the EM
algorithm.  The motivation for this example is the paper:

    C. N. Georghiades and J. C. Han, "Sequence Estimation in the Presence
    of Random Parameters Via the EM Algorithm", IEEE Trans. Commun.,
    vol. 45, pp. 300-308, March 1997.

which provides some excellent examples and results.  Desiring simpler
examples to explore implementations of the EM and Viterbi algorithms, I
made up the ones described here.

Section 1 derives an EM algorithm for a simple scaled binary sequence in
noise.  This problem appears difficult to solve directly for ML estimates
without using EM.  In Section 2 we constrain the binary sequence in such a
way that it becomes convenient to use the Viterbi algorithm for the M-step
of the EM algorithm.

______________________________________________________________________________

1. Scaled Binary Sequence
-------------------------

Let the observed data Y be a sequence of N 0's and 1's scaled by a fixed
but unknown non-zero parameter c and corrupted by additive Gaussian
noise Z:

(1.1)   Y = c B + Z

where B represents a sequence of deterministic binary 0 and 1 values.  The
probability density function for each element z of Z is G(z;0,v), where
G(z;m,v) represents the density function for a Gaussian random variable z
with mean m and variance v:

(1.2)   G(z;m,v) = exp( -(z-m)^2 / (2 v) ) / sqrt(2 pi v)

We would like to determine an ML estimate of B.  We can try to do this
directly by writing f(Y|B) and trying to maximize it.  Each element y of Y
is a Gaussian random variable with mean (c b) and variance v:

(1.3)   f(y|b) = G(y;(c b),v)
               = G(y;0,v)   if b = 0
               = G(y;c,v)   if b = 1

If G(y;c,v) > G(y;0,v) then we estimate b as 1, otherwise 0:

(1.4)   G(y;c,v) > G(y;0,v)

        -(y-c)^2 > -y^2        (by taking the log)

which reduces to:

(1.5)   y > c/2   if c > 0
        y < c/2   if c < 0

Since the value of c is unknown, it appears impossible to proceed with a
direct ML algorithm without having an estimate of c.  But using the EM
algorithm we can choose a set of complete data X, the obvious choice being
X = {Y, c} in this case, and the estimation step of the algorithm will
provide estimates of c to be used in the maximization step.

So let D represent an estimate of B, and let the complete data X be
{Y, c}.  Dempster's Q() function for EM can then be written as:

(1.6)   Q(B|D) = E[ log f(Y,c|B) | Y,D ]

Since c is not a random variable, f(Y,c|B) can be written as f(Y|c,B) and
is the product of G(y;(c b),v) for each y,b in Y,B:

(1.7)   f(Y|c,B) = G(y1;(c b1),v) G(y2;(c b2),v) ... G(yN;(c bN),v)

Taking the log of (1.7), dropping constant terms which are not a function
of B and will have no effect in the M-step, and substituting into (1.6):

(1.8)   Q(B|D) = E[ -(y1 - c b1)^2 - (y2 - c b2)^2 ... - (yN - c bN)^2 | Y,D ]

Letting (y - c b)^2 represent any term in the expected value operation of
(1.8), expanding the term and taking the expected value:

(1.9)   E[ (y - c b)^2 | Y,D ] = E[ y^2 - 2 c y b + c^2 b^2 | Y,D ]
                               = y^2 - 2 y b E[c|Y,D] + b^2 E[c^2|Y,D]

Since c is an unknown constant, not a random variable, E[c^2|Y,D] =
E[c|Y,D]^2, so all we really need to do to perform the expectation
operation in (1.8) is to replace c with E[c|Y,D].

To compute E[c|Y,D], note that D represents a sequence of 0 and 1 values
which are an estimate of B.
The elements of D which are 0 provide no information about c, since the
associated y values are zero-mean Gaussian in those cases (see (1.3)).  So
we only need to consider the elements of D which are 1.  Let D' represent
the subset of elements of D which are 1's, and let Y' represent the
corresponding Y values.  Each element of Y' is Gaussian with mean c, so
E[c|Y,D] is simply E[Y'], which can be computed from the data as the
average of the values in Y'.  Representing this average as "a", we now
have the E-step of the EM algorithm:

E-step: a = avg(Y')

        where Y' is the subset of Y values associated with the current
        estimates of B which are 1's.

Substituting into (1.8):

(1.10)  Q(B|D) = -(y1 - a b1)^2 - (y2 - a b2)^2 ... - (yN - a bN)^2

For the M-step of the EM algorithm, we must find B to maximize Q(B|D) from
(1.10).  Since the terms in (1.10) are independent and of the same form,
we can derive a maximization formula for a typical term, then apply that
formula to all of the terms separately to achieve the maximization.
Writing a typical term from (1.10) as -(y - a b)^2, we must decide whether
b=1 or b=0 produces a maximum by examining the inequality:

(1.11)  -(y - a 1)^2 > -(y - a 0)^2

Note that this inequality is the same as the one discussed previously in
(1.4) when considering direct ML estimation of B.  But in (1.4) we did not
have an estimate of c; here we do (i.e. a).  So the solution, using (1.5),
is:

M-step: For each b,y in B,Y

        set b = 1 if (a > 0 and y > a/2) or (a < 0 and y < a/2)

        otherwise set b = 0

Note that the E-step and M-step of this EM algorithm are very easily
implemented in any matrix-based programming environment such as octave,
scilab, matlab, or M.  For example, in M, the following loop performs n
iterations of the algorithm; y is the row-vector of observed data, b is
the row-vector of estimates of B, and c represents the estimate of c
called "a" above:

    for( i = 0; i < n; ++i)
    {
        /* E-step */
        j = find( b == 1);
        c = sum(y(j)) / cols(j);

        /* M-step */
        b = (c > 0) ? (y > c/2) : (y < c/2);
    }

As you can see, j is computed as a vector of indices for which B is 1, and
is used to select the subset Y' of Y as described in the E-step above.
(y > c/2) or (y < c/2) are performed as vector operations yielding a
vector of 0's and 1's as the estimate of B.

Following are some numerical examples using N = 50 data values randomly
generated using P(1) = 0.4 and c = 2.  In each case, the initial estimate
of B was set to all 1's.

In these examples, iter(i) is a function which performs i iterations of
the EM algorithm for the problem under consideration.  For each iteration
it prints the current estimated value of c and the number of 1's in B.
check_ml() is a function which verifies that the solution is at least a
local minimum of -log(f(Y|B,c)) by examining nearby points: with
individual bits flipped one at a time, or using 0.999*c or 1.001*c for c.

In the first example, v (the noise variance) is chosen to give
SNR = 12 dB:

    m:111> iter(5)
    c = 0.666958, sum(b) = 20
    c = 1.79068, sum(b) = 17
    c = 2.02121, sum(b) = 17
    c = 2.02121, sum(b) = 17
    c = 2.02121, sum(b) = 17
    m:112> sum(b!=b_save)
    ans = 0
    m:113> check_ml
    check_ml: min = 3.11318

The algorithm converged after just a few iterations.  Comparing the
estimated B values with the original generated data (b_save) shows that
all of the values are estimated correctly.  check_ml() prints the value of
-log(f(Y|B,c)) and does not find any nearby points which provide a better
solution.
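For readers who do not have M available, the following Python/NumPy sketch
generates data according to (1.1) and runs the same E-step/M-step loop.
It is only an illustration of the algorithm above, not the original
implementation; the variable names are mine, and the noise variance v is
an arbitrary placeholder since the exact SNR-to-variance mapping used in
the examples is not spelled out in this note.

    import numpy as np

    rng = np.random.default_rng(0)

    # test data per (1.1): Y = c B + Z, with N = 50, P(1) = 0.4, c = 2
    N, p1, c_true, v = 50, 0.4, 2.0, 0.25      # v is a placeholder value
    b_true = (rng.random(N) < p1).astype(int)
    y = c_true * b_true + rng.normal(0.0, np.sqrt(v), N)

    b = np.ones(N, dtype=int)                  # initial estimate of B: all 1's
    for i in range(5):
        # E-step: a = avg(Y'), the mean of y over the current 1 positions
        a = y[b == 1].mean()
        # M-step: threshold test from (1.5)/(1.11)
        b = (y > a / 2).astype(int) if a > 0 else (y < a / 2).astype(int)
        print(f"c = {a:g}, sum(b) = {b.sum()}")

    print("bit errors:", int(np.sum(b != b_true)))

With a small noise variance this should behave like the 12 dB example
above: the estimate of c settles after a few iterations.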
Following is an example with SNR = 3 dB:

    m:115> iter(5)
    c = 0.848997, sum(b) = 28
    c = 1.70155, sum(b) = 22
    c = 1.99095, sum(b) = 21
    c = 2.04252, sum(b) = 21
    c = 2.04252, sum(b) = 21
    m:116> sum(b!=b_save)
    ans = 3
    m:117> check_ml
    check_ml: min = 18.2503
    m:118> find(b!=b_save)
    ans =
      5  8  42
    m:119> [b_save(ans)' y(ans)' b(ans)']
    ans =
      0  1.138719  1
      0  1.250233  1
      1  0.908067  0

In this lower SNR case, 3 of the 50 data values are wrongly estimated when
compared to the original generated data.  Examining those estimates which
are off shows that they are correct ML estimates, based on a threshold of
c/2 ~= 1.

______________________________________________________________________________

2. Constrained Binary Sequence
------------------------------

To make this problem more interesting, and complicated enough so that it
becomes convenient to use the Viterbi algorithm for the M-step, we now
assume that the binary sequence B is constrained to never have two 1's in
a row.

Recall that B is deterministic but unknown.  Here we are assuming that the
process which generates B never produces two 1's in a row, i.e. whenever
it produces a 1 it must produce a 0 for the next value.  To obtain a true
ML estimate, we must incorporate this a priori information into the EM
algorithm.

This constraint affects the M-step maximization of Q(B|D) from (1.10):

        Q(B|D) = -(y1 - a b1)^2 - (y2 - a b2)^2 ... - (yN - a bN)^2

Now the terms in Q(B|D) are not independent.  For example, we can not
independently use the second term to estimate b2, since if b1 was
estimated as 1, then b2 must be 0 under the specified constraint.
Although we could devise a specialized optimization method for this
particular problem, it falls neatly into the category of problems which
are solvable by the Viterbi algorithm, so we will just use that.  See the
vit.m source code for a reference on the Viterbi algorithm and the
implementation used here.

For the Viterbi algorithm, we need to specify the costs associated with
moving from one state to the next in the state-transition trellis.  Since
the Viterbi algorithm is designed to minimize a cost function, we will use
-Q(B|D) as the function to minimize (equivalent to maximizing Q(B|D)).
There are two states, the possible 0 and 1 values for elements of B.
Considering, for example, the transition from an estimate of b1 to an
estimate of b2, we have the incremental costs:

        b1  b2  cost
        --  --  ----
         0   0  y2^2
         0   1  (y2 - a)^2
         1   0  y2^2
         1   1  Infinity

The costs come from (y2 - a b2)^2 with substitution of the appropriate
value of b2.  Since (b1,b2) = (1,1) is not allowed under our assumed
constraint of never having two 1's in a row, the cost of that transition
is represented as Infinity, so it will never be selected in the Viterbi
cost minimization.  The actual implementation uses some HUGE_VAL constant,
such as 1e99, to represent Infinity without causing numerical
difficulties.

For this constrained EM algorithm, the E-step is the same as in the
previous section, and the M-step simply uses the Viterbi algorithm to
estimate B.  The implementation in M is:

    for( i = 0; i < n; ++i)
    {
        /* E-step */
        j = find( b == 1);
        c = sum(y(j)) / cols(j);

        /* M-step */
        b = vit(y,2) - 1; /* map 1,2 from viterbi to 0,1 */
    }

where vit(y,2) invokes a generic Viterbi routine using data vector y and
two states.  vit() returns a vector of state indices representing the
lowest cost path; these indices have a value of 1 or 2 corresponding to
state #1 and state #2.  They are mapped to the associated 0 and 1 values
for B by simply subtracting 1.
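The vit.m source is not reproduced in this note.  As a rough illustration
of what this M-step does, here is a Python/NumPy sketch of the two-state
minimization using the cost table above; viterbi_m_step is a name made up
here and is not the vit() routine itself.

    import numpy as np

    def viterbi_m_step(y, a, huge=1e99):
        # Minimize -Q(B|D) from (1.10) over sequences with no two 1's in a
        # row, using a two-state trellis (states 0 and 1).
        N = len(y)
        emit = lambda k, b: (y[k] - a * b) ** 2   # incremental cost (y - a b)^2
        cost = np.zeros((N, 2))            # best path cost ending in state b
        prev = np.zeros((N, 2), dtype=int) # predecessor state for traceback
        cost[0, 0] = emit(0, 0)
        cost[0, 1] = emit(0, 1)
        for k in range(1, N):
            for b in (0, 1):
                c0 = cost[k - 1, 0] + emit(k, b)                        # from state 0
                c1 = cost[k - 1, 1] + (huge if b == 1 else emit(k, b))  # 1 -> 1 forbidden
                cost[k, b], prev[k, b] = min((c0, 0), (c1, 1))
        b_hat = np.zeros(N, dtype=int)
        b_hat[N - 1] = int(np.argmin(cost[N - 1]))
        for k in range(N - 1, 0, -1):      # trace the lowest-cost path back
            b_hat[k - 1] = prev[k, b_hat[k]]
        return b_hat

One constrained EM iteration in this sketch would then be
a = y[b == 1].mean() for the E-step followed by b = viterbi_m_step(y, a)
for the M-step.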
The generic Viterbi routine vit() calls a user-defined function get_cost()
to evaluate the cost of possible state transitions.  In this example,
get_cost() implements the cost table described above, which assigns an
infinite (HUGE_VAL) cost to state transitions that would produce a 1
followed by a 1.

Following are some numerical examples using N = 50 data values randomly
generated using P(1) = 0.8 and c = 2.  After initially generating the B
data randomly, it was fixed to conform to the constraint of never having
two 1's in a row by setting to 0 any value which was preceded by a 1 in
the sequence.  In each case, the initial estimate of B was set to all 1's.
For this problem, check_ml() was modified so that it skips bit-flipping
operations which would violate the constraint.

In the first example, the noise variance is chosen to give SNR = 12 dB:

    m:241> iter(5)
    c = 0.826411, sum(b) = 21
    c = 1.9967, sum(b) = 21
    c = 1.9967, sum(b) = 21
    c = 1.9967, sum(b) = 21
    c = 1.9967, sum(b) = 21
    m:242> sum(b!=b_save)
    ans = 0
    m:243> check_ml
    check_ml: min = 3.38763

The algorithm converged after just a few iterations.  Comparing the
estimated B values with the original generated data (b_save) shows that
all of the values are estimated correctly.  check_ml() prints the value of
-log(f(Y|B,c)) and does not find any nearby points which provide a better
solution.

Following is an example with SNR = 3 dB:

    m:241> iter(5)
    c = 0.801701, sum(b) = 21
    c = 2.01161, sum(b) = 19
    c = 2.14403, sum(b) = 19
    c = 2.14403, sum(b) = 19
    c = 2.14403, sum(b) = 19
    m:242> sum(b!=b_save)
    ans = 4
    m:243> check_ml
    check_ml: min = 22.789
    m:244> find(b!=b_save)
    ans =
      3  4  16  20
    m:245> i=[2 3 4]
    m:246> [b_save(i)' y(i)' b(i)']
    ans =
      0  -0.5869168  0
      1   0.6685796  0
      0   1.10774    1

In this lower SNR case, 4 of the 50 data values are wrongly estimated when
compared to the original generated data.  Nevertheless, check_ml()
confirms that no better estimate can be found in the neighborhood of this
solution.  Examining one of the estimates which is off shows that it is a
correct ML estimate, based on a threshold of c/2 ~= 1.

As an example of what check_ml() would show if a better solution can be
found, we try modifying one of the bits in the solution:

    m:248> b(1:5)
    ans =
      1  0  0  1  0
    m:249> b(1)=0
    m:250> check_ml
    check_ml: min = 25.6417
    check_ml: found better solution 25.64 with c = 2.14617
    check_ml: found better solution 22.789 with bit 1 flipped

In this case, check_ml() finds two better solutions: one with a slightly
higher value of c, and one with bit 1 flipped, corresponding to the
original EM solution before we manually flipped bit 1.

Now reset bit 1 to its original value, and set bits 3 and 4 to the
original generated data values:

    m:251> b(1)=1
    m:252> b(3:4)=[1 0]
    m:253> b(1:5)
    ans =
      1  0  1  0  0
    m:254> check_ml
    check_ml: min = 24.6721
    check_ml: found better solution 24.6703 with c = 2.14188
    check_ml: found better solution 22.9422 with bit 3 flipped

In this case, check_ml() correctly determines that flipping bit 3 produces
a better solution.  However, note that the resulting cost of 22.9422 is
still higher than the result from the original EM solution (22.789), which
has both bits 3 and 4 set to their ML values.
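The check_ml() source is likewise not reproduced here.  The following
Python sketch shows the kind of local check it performs: evaluate
-log f(Y|B,c) at the current solution and at nearby points, with bits
flipped one at a time (skipping flips that would violate the constraint)
and with c scaled by 0.999 or 1.001.  The function name and the explicit
variance argument v are mine, and constant terms of -log f(Y|B,c) are
dropped, so the printed values will not match the transcripts above.

    import numpy as np

    def check_ml(y, b, c, v, constrained=True):
        # negative log-likelihood, constant terms in v dropped
        nll = lambda bb, cc: np.sum((y - cc * bb) ** 2) / (2 * v)
        best = nll(b, c)
        print("check_ml: min =", best)
        for cc in (0.999 * c, 1.001 * c):            # perturb the scale factor
            if nll(b, cc) < best:
                print("check_ml: found better solution", nll(b, cc), "with c =", cc)
        for k in range(len(b)):                      # flip one bit at a time
            bb = b.copy()
            bb[k] ^= 1
            # skip flips that would violate the no-two-1's-in-a-row constraint
            if constrained and bb[k] == 1:
                if (k > 0 and bb[k - 1] == 1) or (k < len(b) - 1 and bb[k + 1] == 1):
                    continue
            if nll(bb, c) < best:
                # bit numbers are reported 1-based, as in the transcripts
                print("check_ml: found better solution", nll(bb, c), "with bit", k + 1, "flipped")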
A limited number of simulations were run to compare the bit-error-rate of
the EM solution with one using a known value of the scale factor c.  Using
N = 100 data values with P(1) = 0.8 and c = 2, 10 trials were performed
for each SNR value shown in the table below.  For the EM solution, B was
initialized to all 1's and 3 iterations of the algorithm were performed.
For the known-channel case, c was computed using the E-step formula on the
original generated data and B was estimated using the Viterbi algorithm:

                      BER
    SNR (dB)    Known      EM
    --------    -----    -----
       -6       0.239    0.279
       -5       0.230    0.241
       -4       0.186    0.202
       -3       0.195    0.210
       -2       0.172    0.192
       -1       0.113    0.127
        0       0.079    0.090
        1       0.086    0.100
        2       0.069    0.073
        3       0.041    0.044
        4       0.028    0.028
        5       0.019    0.020
        6       0.012    0.014
        7       0.005    0.004
        8       0.006    0.006
        9       0.000    0.000
       10       0.001    0.001

For SNR > 3 dB, the EM algorithm performs almost as well as ML estimation
with a known value of c.  For higher SNR values, more trials per case
would have to be performed to get accurate estimates of the low BER
values.

Note that in generating this table, the EM algorithm was fixed at 3
iterations rather than being run until convergence.  For very low SNR,
using 5 iterations would produce better results; for SNR > 3 dB, just 2
iterations would suffice.
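A single trial of this comparison can be sketched in Python as below,
reusing the viterbi_m_step() sketch from Section 2.  This is not the
original simulation code: the function name is mine, and it assumes
SNR (dB) = 10*log10(c^2 / v), which is a guess since this note does not
state its exact SNR definition.

    import numpy as np
    # assumes viterbi_m_step() from the earlier sketch is defined in the same file

    def ber_trial(snr_db, N=100, p1=0.8, c_true=2.0, iters=3, rng=None):
        # One trial of the constrained-sequence experiment;
        # returns (ber_known, ber_em).
        if rng is None:
            rng = np.random.default_rng()
        v = c_true ** 2 / 10 ** (snr_db / 10)      # assumed SNR definition
        b = (rng.random(N) < p1).astype(int)
        for k in range(1, N):                      # enforce the no-two-1's constraint
            if b[k - 1] == 1:
                b[k] = 0
        y = c_true * b + rng.normal(0.0, np.sqrt(v), N)

        # known-channel case: E-step formula on the true B, then one Viterbi pass
        a = y[b == 1].mean()
        ber_known = np.mean(viterbi_m_step(y, a) != b)

        # EM case: start from all 1's and run a few iterations
        d = np.ones(N, dtype=int)
        for _ in range(iters):
            a = y[d == 1].mean()
            d = viterbi_m_step(y, a)
        return ber_known, np.mean(d != b)

Averaging the two returned error rates over 10 trials per SNR value would
reproduce the shape of the table above, though the absolute numbers depend
on the actual SNR definition used in the original simulations.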