Note: this is an old internal report which is being superseded by various
journal and conference publications in progress.

______________________________________________________________________________

                              EM ISI Examples

                           R. Perry, 18 April 1999
______________________________________________________________________________

0. Introduction
---------------

In Ghosh and Weber, "Maximum-likelihood blind equalization", Optical
Engineering, June 1992, the following equations define the problem setup for
an InterSymbol Interference (ISI) problem:

  (GW2)  r = A h + n

  (GW3)  f(r|h,A) = exp(-|r-Ah|^2/(2s)) / (s 2pi)^(N/2)

where r is the vector of received data, the N-by-L matrix A contains the N
transmitted sequence values a(1)..a(N) arranged in an appropriate form, h is
the channel impulse response vector (unknown), and n is a Gaussian noise
vector with variance s.  |r-Ah| is the 2-norm, i.e.

  |r-Ah|^2 = (r-Ah)'(r-Ah)

They solve min |r-Ah|^2 as a joint channel and data estimation problem, to
produce ML estimates for A and h which maximize f(r|h,A).

Here we consider an EM (Expectation Maximization) approach, using the primary
reference:

  A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from
  Incomplete Data via the EM Algorithm", J. Roy. Stat. Soc., vol. 39, no. 1,
  1977, pp. 1-38.

In particular, we consider producing the ML estimate for the data only,
treating the channel coefficients as unknown parameters whose values need not
be determined optimally.  In sections 2, 3, and 4 we consider the specific
cases of deterministic, uniform, and Gaussian distributed channel
coefficients.  Section 5 examines MAP and ML estimators from basic
principles, and section 6 discusses an alternate form of (GW2) which may be
useful in a "soft-decisions" algorithm.

1. General EM Algorithm
-----------------------

Define Q(A|B):

  (1)  Q(A|B) = E[ -log f(r,h|A) | r,B]

where we desire the ML estimate of A, B is the current estimate of A, r is
the observed data, h is the hidden data, and (r,h) is the complete data.  We
use -log instead of log, in contrast to Dempster et al., so that the M-step
of EM will minimize Q(A|B) instead of maximizing it, which corresponds to the
Ghosh and Weber algorithm as well as the usual Viterbi algorithm for
minimizing a cost function.

The EM algorithm is:

  1. Initialize B to some estimate of A.
  2. E-step: construct Q(A|B).
  3. M-step: find A to minimize Q(A|B).
  4. Set B=A and repeat steps 2 and 3 until converged.

To construct Q(A|B), start with f(r,h|A):

  (2)  f(r,h|A) = f(r|h,A) f(h|A)

We can drop the f(h|A) term from this since h and A are independent, so
f(h|A) is not a function of A and will not affect the minimization in the
M-step.  Using f(r|h,A) from (GW3), and dropping constants which will not
affect the minimization:

  (3)  Q(A|B) = E[ |r-Ah|^2 | r,B] = integral over h { |r-Ah|^2 f(h|r,B) dh }

Given an a priori density function f(h) for the channel coefficients,
f(h|r,B) can be written as:

  (4)  f(h|r,B) = f(r|h,B) f(h|B) / f(r|B)

where the denominator term f(r|B) can be ignored since it simply serves to
normalize the distribution, and is not a function of A, so it will not affect
the minimization step.  Also, f(h|B) = f(h) since h and B are assumed to be
independent.  Using (GW3) for f(r|h,B), and combining constants into a scale
factor K, (4) becomes:

  (5)  f(h|r,B) = K exp(-|r-Bh|^2/(2s)) f(h)

where K can be determined by setting the integral of f(h|r,B) to 1.

To evaluate (3) we just need the mean and covariance of h given r and B using
the distribution of (5).
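As a concrete illustration of (GW2), (GW3), and the exponential term in (5),
the following Python/numpy sketch builds the N-by-L matrix A from a (-1,1)
data sequence (row k holding a(k), a(k-1), ..., a(k-L+1), with a(t) = 0 for
t <= 0, consistent with the vector c(k) defined below), simulates r = A h + n,
and evaluates the exponent of (5) for a current estimate B.  The function and
variable names are ours, not from Ghosh and Weber; this is only a sketch under
those assumptions.

    import numpy as np

    def build_A(a, L):
        """N-by-L matrix of (GW2): row k holds a(k), a(k-1), ..., a(k-L+1),
        with a(t) = 0 for t <= 0 (zero padding in the first L-1 rows)."""
        N = len(a)
        A = np.zeros((N, L))
        for k in range(N):
            for i in range(L):
                if k - i >= 0:
                    A[k, i] = a[k - i]
        return A

    rng = np.random.default_rng(0)
    N, L, s = 10, 3, 0.25                    # block size, channel length, noise variance
    a = rng.choice([-1.0, 1.0], size=N)      # (-1,1) data values
    h = np.array([1.0, 0.5, 0.25])           # example channel impulse response
    A = build_A(a, L)
    r = A @ h + rng.normal(scale=np.sqrt(s), size=N)   # (GW2): r = A h + n

    B = build_A(a, L)                        # B formed from the current data estimate
    exponent = -np.sum((r - B @ h) ** 2) / (2 * s)     # exponent of (5) for a given h

The E-step only needs the mean and covariance of h under (5); these are
defined symbolically next.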
We will derive these for specific cases of f(h) later; for now, we define
them symbolically as g and G:

  (6)  g == E[h|r,B]

  (7)  G == E[(h-g)(h-g)'|r,B] == {G(i,j)},

       G(i,j) = E[ (h(i)-g(i)) (h(j)-g(j)) | r,B],  i,j = 1..L

Note that although the channel coefficients may be independent in their
a priori distribution f(h), they are not independent in (5) given r and B.
Also, the noise variance s is needed in general to compute g and G.  If s is
not known it could be estimated from the variances of the transmitted and
received data.

The norm-squared term in (3) represents a summation of incremental state
transition costs which can be minimized by the Viterbi algorithm.  Rewriting
(3) in terms of the incremental costs:

  (8)  Q(A|B) = sum k=1..N { V(k) }

where V(k) is:

  (9)  V(k) = E[ |r(k) - c(k)'h|^2 | r,B]

c(k) is a column vector containing a(k-i+1), i=1..L, and a(t) == 0 for
t <= 0.  The expected value is taken over h using (5) for f(h|r,B).
Expanding the norm squared:

  (10)  |r(k) - c(k)'h|^2 = r(k)^2 - 2 r(k)c(k)'h + c(k)'h c(k)'h

                          = r(k)^2 - 2 r(k)c(k)'h + c(k)'(h-g)(h-g)'c(k)
                            - (c(k)'g)^2 + 2 c(k)'g c(k)'h

Taking the expected value:

  (11)  V(k) = r(k)^2 - 2 r(k)c(k)'g + c(k)'Gc(k) - (c(k)'g)^2
               + 2 c(k)'g c(k)'g

             = (r(k) - c(k)'g)^2 + c(k)'Gc(k)

Now we return to the evaluation of g and G using (5).  We cannot do this
without having a specific f(h) to use in (5), but we can put the exponential
term into a form which may be more useful.  Expand the norm-squared term from
(5):

  (12)  |r-Bh|^2 = r'r - 2r'Bh + h'B'Bh

                 = r'r - 2r'Bh + 2g'B'Bh - g'B'Bg + (h-g)'B'B(h-g)

Now, if g = pinv(B) r, where pinv(B) is the pseudo-inverse of B (i.e.
inv(B'B)B' if B has full rank), and if B has full rank, then (12) reduces to:

  (13)  |r-Bh|^2 = r'r - r'Bg + (h-g)'B'B(h-g)

which indicates that the exponential part of (5) is a joint Gaussian
distribution of h, with inverse autocorrelation matrix B'B/s.  If B does not
have full rank then some simplification of (12) into a form similar to (13)
may be performed, but that is not pursued here.  In the case of data values
which do not contain 0's, e.g. (-1,1) data values, B will always have full
rank due to the 0's in the first L-1 rows (initial data values for t <= 0)
which make the columns of B linearly independent.

Note that we have not proved or derived that g = pinv(B) r; we have only
shown that _if_ this is so, then the exponential part of (5) is Gaussian.
The actual value of g and overall form of (5) depend on f(h).  In this
section, we are only deriving general results, without assuming any specific
form for f(h).  The following sections use these results and consider some
specific cases of f(h).

2. Deterministic Channel Coefficients
-------------------------------------

If h is not random, then let f(h) = delta(h-g) where g is the deterministic
but unknown value of h, and delta() represents an impulse function.  In this
case, E[h] = E[h|r,B] = g, and using (5) to estimate g does not provide
anything useful since it reduces to:

  (14)  E[h|r,B] = integral over h { h K exp(-|r-Bh|^2/(2s)) delta(h-g) dh }

                 = g K exp(-|r-Bg|^2/(2s)) = g

and g is unknown.  So to estimate g, note that given r and B, h satisfies
(GW2):

  (15)  r = B h + n

Taking the expected value:

  (16)  r = B g

Solving for g:

  (17)  g = pinv(B) r

This seems to be a reasonable way to estimate g for the EM algorithm in this
case, even though (17) only provides a least-square-error solution to (16)
and not an exact solution.  And note that this is a maximum-likelihood
estimate of h given r and B since it minimizes |r-Bg|^2.
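A minimal sketch of the E-step estimate (17), assuming B has already been
formed from the current data estimate as in the earlier sketch; numpy's
least-squares solver is used rather than an explicit pseudo-inverse, which is
numerically preferable but equivalent for full-rank B:

    import numpy as np

    def estimate_g(B, r):
        """Eq. (17): g = pinv(B) r, the least-square-error solution of r = B g."""
        g, *_ = np.linalg.lstsq(B, r, rcond=None)
        return g

With G = 0 in this deterministic case, only g enters the incremental Viterbi
cost, as shown next.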
Although producing an optimal estimate of the channel coefficients was not a
goal in this EM algorithm, it seems to do so anyway.

In this case, G = 0, and the incremental Viterbi cost function from (11) is:

  (18)  V(k) = (r(k) - c(k)'g)^2

The detailed EM algorithm is:

  1. Initialize B to some estimate of A.  We could initialize B randomly, or
     to all 1's, etc.  But to reduce the number of iterations required, and
     to help the algorithm converge to a proper solution, it makes more sense
     to initialize B with some reasonable estimate of A.  This can be done by
     initially using a pure delay channel, i.e. using g = [0 .. 0 1 0 .. 0]
     in (18), then using the Viterbi algorithm to estimate A, then setting
     B=A.

  2. E-step: use (17) to estimate g:  g = pinv(B) r

  3. M-step: use the Viterbi algorithm with incremental cost function (18) to
     estimate A.

  4. Set B=A and repeat steps 2 and 3 until converged.

This is equivalent to the Ghosh and Weber algorithm:

  1. Start with an initial estimate of g.

  2. Use the Viterbi algorithm with incremental cost function (18) to
     estimate A.

  3. Set B=A and use (17) to produce a new estimate of g:  g = pinv(B) r

  4. Repeat steps 2 and 3 until converged.

The EM step 1 corresponds to Ghosh and Weber steps 1 and 2.  The EM steps 2
and 3 correspond to Ghosh and Weber steps 3 and 2.

3. Uniform Channel Coefficient Distribution
-------------------------------------------

The previous section presented a solution for the case of deterministic
channel coefficients, using an impulse function for f(h).  In this section we
analyze a similar situation, where no information about h is available, so we
use a uniform distribution for f(h).  Although impulse and uniform
distributions for f(h) do not seem similar, they are similar in this
application because the location of the impulse (mean of h) is unknown in the
previous section.

If f(h) is uniform over some finite interval, then computation of g and G
could be performed numerically using f(h|r,B) from (5).  If we let the domain
of h be infinite, then we can compute g and G analytically.  In this case
f(h) is improper, since a uniform density of infinite extent would have to be
zero for the area under it to be one.  Nevertheless, assuming that f(h) is
uniform and of infinite extent is useful, since it is practically equivalent
to letting f(h) be uniform over some range of several standard deviations for
which the exponential term in (5) is significant.  Outside of some finite
region, the exponential term in (5) is practically zero.  So we propose that
letting f(h)=1 in (5) is reasonable, which reduces (5) to:

  (19)  f(h|r,B) = K exp(-|r-Bh|^2/(2s))

Now using (13):

  (20)  f(h|r,B) = K exp( -(r'r - r'Bg + (h-g)'B'B(h-g))/(2s))

                 = K2 exp( -(h-g)'B'B(h-g)/(2s))

where K2 combines the constants K and exp( -(r'r - r'Bg)/(2s)).  This shows
that h has a joint Gaussian distribution with mean g = pinv(B) r, and inverse
autocorrelation matrix B'B/s, if B has full rank.

Therefore, for the case of uniformly distributed channel coefficients, the EM
algorithm is almost the same as for deterministic channel coefficients,
except that the incremental Viterbi cost function from (11), which includes
G, is used instead of (18), and G = inv(B'B/s) in this case.

The detailed EM algorithm is:

  1. Initialize B to some estimate of A by initially estimating
     g = [0 .. 0 1 0 .. 0] and G = diag{s/(N-i+1)}, i=1..L, in (11), then
     using the Viterbi algorithm to estimate A, then setting B=A.  The
     initial estimate of G is based on using just the diagonal elements of
     B'B for the case of (-1,1) data values.

  2. E-step: use (17) to estimate g:  g = pinv(B) r

     and estimate G using:  G = inv(B'B/s)

  3. M-step: use the Viterbi algorithm with incremental cost function (11) to
     estimate A.

  4. Set B=A and repeat steps 2 and 3 until converged.
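For reference, here is a minimal sketch of the M-step shared by the
algorithms of sections 2 and 3 (and section 4 below): a Viterbi search over
(-1,1) data sequences using the incremental cost (11), which reduces to (18)
when G = 0.  The state and trellis bookkeeping are our own choices; the
report does not prescribe a particular implementation.

    import numpy as np

    def viterbi_mstep(r, g, G):
        """Minimize sum_k V(k) of (8) over (-1,1) data sequences, where
        V(k) = (r(k) - c(k)'g)^2 + c(k)' G c(k)   (eq. (11); G = 0 gives (18))
        and c(k) = [a(k), a(k-1), ..., a(k-L+1)] with a(t) = 0 for t <= 0."""
        N, L = len(r), len(g)
        # survivors: state (a(k-1), ..., a(k-L+1)) -> (accumulated cost, decided symbols)
        survivors = {(0.0,) * (L - 1): (0.0, [])}
        for k in range(N):
            new_survivors = {}
            for state, (cost, path) in survivors.items():
                for a_k in (-1.0, 1.0):
                    c = np.array((a_k,) + state)          # c(k)
                    e = r[k] - c @ g
                    v = e * e + c @ G @ c                 # incremental cost V(k)
                    new_state = ((a_k,) + state)[:L - 1]
                    cand = (cost + v, path + [a_k])
                    if new_state not in new_survivors or cand[0] < new_survivors[new_state][0]:
                        new_survivors[new_state] = cand
            survivors = new_survivors
        best_cost, best_path = min(survivors.values(), key=lambda t: t[0])
        return np.array(best_path)

For the deterministic case of section 2, pass G = np.zeros((L, L)); for the
uniform case of this section, G = s * np.linalg.inv(B.T @ B).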
Note that the expected value of (18) is just s, the noise variance.  The
expected value of (11) is s plus an extra term which depends on G; for N >> L
this extra term is approximately sL/N and is small compared to s, so it will
not have a significant effect on the algorithm.

4. Gaussian Channel Coefficient Distribution
--------------------------------------------

Let f(h) be a joint Gaussian distribution with mean d and autocovariance Rh:

  (21)  f(h) = K3 exp(-(h-d)'inv(Rh)(h-d)/2)

where K3 is a normalizing constant.  Using (5):

  (22)  f(h|r,B) = K exp(-|r-Bh|^2/(2s)) K3 exp(-(h-d)'inv(Rh)(h-d)/2)

                 = K4 exp(-(h-g)'inv(G)(h-g)/2)

where K4 is a normalizing constant and:

  (23)  G = inv(B'B/s + inv(Rh))

  (24)  g = G (B'r/s + inv(Rh)d)

So in this case, the distribution of h given r and B is also Gaussian.  Note
that the formula in (24) for the mean of h given r and B does not use
pinv(B)r as in the previous deterministic and uniform cases.

The Gaussian distribution of h is intermediate between deterministic and
uniform h.  In the limit as Rh goes to 0, G approaches Rh and g approaches d,
which corresponds to the Gaussian distribution approaching delta(h-d) as in
the case for deterministic h.  In the limit as Rh goes to infinity, G
approaches inv(B'B/s) and g approaches pinv(B)r, which corresponds to the
Gaussian distribution approaching a uniform distribution.

In general, the EM algorithm for Gaussian f(h) is:

  1. Initialize B to some estimate of A by initially estimating g and G, then
     using the Viterbi algorithm to estimate A, then setting B=A.

  2. E-step: use (23) and (24) to estimate G and g.

  3. M-step: use the Viterbi algorithm with incremental cost function (11) to
     estimate A.

  4. Set B=A and repeat steps 2 and 3 until converged.

Simulations were run to compare the EM algorithms for deterministic and
Gaussian h.  To ensure that the algorithms converge to the ML solution, the
actual value of h (i.e. the value used to produce the simulated received
data) was used to initialize g.  In this case, the different Viterbi
incremental cost functions from (11) and (18) are responsible for the
differences in the results.  Using 100000 trials, with two iterations of each
algorithm for each trial, and L=3, with h generated as independent Gaussian
random variables with mean 1 and variance 0.5, and (-1,1) data values
generated with p(1)=p(-1)=0.5 for each trial, BER results are:

                        EM algorithm       EM algorithm
    N   SNR           (deterministic h)    (Gaussian h)    known channel
    ---------------------------------------------------------------------
    6    6                0.10426            0.10278          0.099273
    6    9                0.04897            0.047355         0.04497
   10    6                0.085894           0.085592         0.082203
   10    9                0.033689           0.033503         0.031556

Although both EM algorithms were initialized with h equal to the known
channel, the EM iterations can produce final results different from the known
channel case after convergence.  These results indicate that the EM algorithm
for Gaussian h works better than the EM algorithm for deterministic h (which
is equivalent to the ad-hoc methods for joint h,A optimization) for low
values of N and SNR.
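For completeness, a minimal sketch of the E-step updates (23) and (24) used
in step 2 of the algorithm above, assuming d, Rh, and the noise variance s
are known; explicit inverses are used to mirror the equations, although a
linear solver could be substituted:

    import numpy as np

    def estep_gaussian(B, r, d, Rh, s):
        """Eqs. (23) and (24):
           G = inv(B'B/s + inv(Rh)),   g = G (B'r/s + inv(Rh) d)."""
        Rh_inv = np.linalg.inv(Rh)
        G = np.linalg.inv(B.T @ B / s + Rh_inv)
        g = G @ (B.T @ r / s + Rh_inv @ d)
        return g, G

As noted above, letting Rh grow without bound recovers the uniform-case
values g = pinv(B) r and G = inv(B'B/s), while letting Rh go to 0 returns
g = d.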
5. MAP and ML Derivations
-------------------------

Here we examine the background of the ISI problem with unknown channel
coefficients starting from basics.  To minimize BER, we must maximize f(A|r),
where A contains the transmitted data and r is the received data vector.
Using Bayes rule:

  (25)  f(A|r) = f(r|A) f(A) / f(r) = K f(r|A) f(A)

where K is a constant which takes into account the effect of f(r) for
normalizing the distribution.  MAP estimators minimize BER by maximizing
f(r|A) f(A).  If f(A) is unknown or assumed to be uniform, then the ML
estimator which maximizes f(r|A) can be used instead.

All of the EM algorithms derived in the previous sections produce iterative
solutions to the ML problem of maximizing f(r|A).  If the channel
coefficients are not deterministic, then this ML problem is practically
impossible to solve without using EM, as discussed below.  Furthermore, if
f(A) is known and not uniform, it can be incorporated into the EM algorithms
as an additive term in Q(A|B), thus producing a MAP estimator.  If the
transmitted data values are independent with a known distribution, or if the
Viterbi state transitions have a known a priori Markov model, then f(A) can
easily be incorporated into the incremental Viterbi cost function.

Now we examine the dependence of f(r|A) on the channel coefficients.  In
general:

  (26)  f(r|A) = E[f(r|h,A)]

              = integral over h { dh f(h) exp(-|r-Ah|^2/(2s)) / (s 2pi)^(N/2) }

Evaluating this integral and then maximizing the result with respect to A
seems to be an intractable problem in general.  Yet this is exactly the
problem which the EM algorithms solve iteratively.

The only apparent case where direct evaluation and maximization of f(r|A)
from (26) seems tractable is if h is deterministic.  Then f(h) = delta(h-g)
and the negative log of the integral reduces to (ignoring constants):

  (27)  -log f(r|A) = |r-Ag|^2

where g is the deterministic but unknown value of h, as discussed in
section 2.  This is the basis for the joint h,A estimation algorithms such as
Ghosh and Weber.

Since the EM algorithms do not use any estimate of h itself, only estimates
of the mean and covariance of h, they turn out to be different from the joint
h,A estimation solution when h is not deterministic.  But if N >> L, the
difference between the algorithms has little effect on the final solutions,
since the effect of the covariance estimate G in (11) becomes vanishingly
small.  This indicates that if f(h) is known, and if the data block size is
small so that N is not much greater than L, then the EM algorithms will
provide better solutions.
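To make the intractability of (26) concrete: even for a single candidate A,
the integral generally has to be approximated, for example by Monte Carlo
averaging over draws of h, and maximizing over A would still require
searching all 2^N data sequences.  The sketch below is our own illustration
for Gaussian f(h) with mean d and covariance Rh (the names and sample count
are arbitrary), not something taken from the report or from Ghosh and Weber.

    import numpy as np

    def marginal_likelihood(r, A, d, Rh, s, ndraws=10000, rng=None):
        """Monte Carlo approximation of (26): f(r|A) = E[ f(r|h,A) ] over f(h),
        here taken to be Gaussian with mean d and covariance Rh."""
        rng = np.random.default_rng() if rng is None else rng
        N = len(r)
        h_draws = rng.multivariate_normal(d, Rh, size=ndraws)   # samples from f(h)
        sq = np.sum((r - h_draws @ A.T) ** 2, axis=1)           # |r - A h|^2 per draw
        return np.mean(np.exp(-sq / (2 * s))) / (s * 2 * np.pi) ** (N / 2)

An exhaustive ML search would evaluate this for every candidate (-1,1)
sequence, which is exactly the search the EM iterations avoid.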
6. Alternate Form
-----------------

In his Ph.D. thesis, "Modulation and Coding Concepts for Channels with
Memory" (Rensselaer Polytechnic Institute, May 1993), Kumar Ramaswamy states
on page 80 that the proof of convergence of EM assumes a real Euclidean
parameter space, and so is not directly applicable where the parameter space
is discrete.  Ramaswamy avoids the convergence issue by using
"soft-decisions" in the M-step of EM, i.e. he produces real (or complex)
floating-point estimates of the data values, then quantizes the results after
the algorithm converges.  He then has the EM guarantee of convergence.
However, minimizing some function f(x) for continuous x, then quantizing x,
may not produce the same results as minimizing f(x) over a discrete space of
x.  Nevertheless, Ramaswamy's method seems to work very well, with greatly
reduced complexity compared to using the Viterbi algorithm in most cases,
especially with large channel lengths.

To investigate the use of soft-decisions, (GW2) can be written in an
alternate form, with the data values in vector a, and the channel
coefficients in matrix H:

  (28)  r = H a + n

where H is an N-by-N matrix containing h(1) on the diagonal, h(2) on the
first sub-diagonal, etc.  Given an estimate of h, (28) can be solved for a
linear least-squares "soft" (non-discrete) estimate of a using pinv(H) r.
This estimate of a could then be used in (GW2) to solve for a new estimate of
h = pinv(A) r.  (28) and (GW2) would then be alternately solved until
convergence.  The Viterbi algorithm would not be needed.  A sketch of this
alternating scheme is given below.

Preliminary investigation of such an alternating optimization algorithm
indicates that it does not work well, even if a is discretized before solving
(GW2) during the iterations.  But note that no theoretical investigation has
been performed here to justify the use of such an algorithm.  The main point
of this section is simply to note the alternate form (28) of (GW2), and to
suggest that use of this form, together with soft-decisions, may be worth
investigating further.
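The sketch below makes the alternating scheme concrete, assuming real (-1,1)
data, a known channel length L, and a fixed iteration count (all our own
choices).  It is included only for illustration, not as a recommended
algorithm, since as noted above it did not work well in preliminary
experiments.

    import numpy as np

    def build_H(h, N):
        """N-by-N matrix of (28): h(1) on the diagonal, h(2) on the first
        sub-diagonal, ..., h(L) on sub-diagonal L-1."""
        H = np.zeros((N, N))
        for i, hi in enumerate(h):
            H += np.diag(np.full(N - i, hi), -i)
        return H

    def soft_alternating(r, h0, L, niter=10):
        """Alternate least-squares solutions of (28) and (GW2):
        a = pinv(H) r given h, then h = pinv(A) r given a.
        Quantization of a is applied only after the final iteration."""
        N, h = len(r), np.asarray(h0, dtype=float)
        for _ in range(niter):
            a, *_ = np.linalg.lstsq(build_H(h, N), r, rcond=None)   # soft data estimate
            # form the N-by-L matrix of (GW2) from the soft data values
            A = np.column_stack([np.concatenate([np.zeros(i), a[:N - i]])
                                 for i in range(L)])
            h, *_ = np.linalg.lstsq(A, r, rcond=None)               # new channel estimate
        return np.sign(a), h

Discretizing a inside the loop (e.g. applying np.sign(a) before forming A)
corresponds to the variant mentioned above, which also did not work well.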