Note: this is an old internal report which is being superseded by various
journal and conference publications in progress.

______________________________________________________________________________

                              EM ISI Examples

                           R. Perry, 18 April 1999
______________________________________________________________________________

0. Introduction
---------------

In Ghosh and Weber, "Maximum-likelihood blind equalization", Optical
Engineering, June 1992, the following equations define the problem setup for
an InterSymbol Interference (ISI) problem:

  (GW2)  r = A h + n

  (GW3)  f(r|h,A) = exp(-|r-Ah|^2/(2s)) / (s 2pi)^(N/2)

where r is the vector of received data, the N-by-L matrix A contains the N
transmitted sequence values a(1)..a(N) arranged in an appropriate form, h is
the channel impulse response vector (unknown), and n is a Gaussian noise
vector with variance s.  |r-Ah| is the 2-norm, i.e.

  |r-Ah|^2 = (r-Ah)'(r-Ah)

They solve min |r-Ah|^2 as a joint channel and data estimation problem, to
produce ML estimates for A and h which maximize f(r|h,A).

Here we consider an EM (Expectation Maximization) approach, using the primary
reference:

  A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from
  Incomplete Data via the EM Algorithm", J. Roy. Stat. Soc., vol. 39, no. 1,
  1977, pp. 1-38.

In particular, we consider producing the ML estimate for the data only,
treating the channel coefficients as unknown parameters whose values need not
be determined optimally.  In sections 2, 3, and 4 we consider the specific
cases of deterministic, uniform, and Gaussian distributed channel
coefficients.  Section 5 examines MAP and ML estimators from basic
principles, and section 6 discusses an alternate form of (GW2) which may be
useful in a "soft-decisions" algorithm.

1. General EM Algorithm
-----------------------

Define Q(A|B):

  (1)  Q(A|B) = E[ -log f(r,h|A) | r,B]

where we desire the ML estimate of A, B is the current estimate of A, r is
the observed data, h is the hidden data, and (r,h) is the complete data.  We
use -log instead of log, in contrast to Dempster et al., so that the M-step
of EM will minimize Q(A|B) instead of maximizing it, which corresponds to the
Ghosh and Weber algorithm as well as the usual Viterbi algorithm for
minimizing a cost function.

The EM algorithm is:

  1. Initialize B to some estimate of A.
  2. E-step: construct Q(A|B).
  3. M-step: find A to minimize Q(A|B).
  4. Set B=A and repeat steps 2 and 3 until converged.

To construct Q(A|B), start with f(r,h|A):

  (2)  f(r,h|A) = f(r|h,A) f(h|A)

We can drop the f(h|A) term from this since h and A are independent, so
f(h|A) is not a function of A and will not affect the minimization in the
M-step.  Using f(r|h,A) from (GW3), and dropping constants which will not
affect the minimization:

  (3)  Q(A|B) = E[ |r-Ah|^2 | r,B] = integral over h { |r-Ah|^2 f(h|r,B) dh }

Given an a priori density function f(h) for the channel coefficients,
f(h|r,B) can be written as:

  (4)  f(h|r,B) = f(r|h,B) f(h|B) / f(r|B)

where the denominator term f(r|B) can be ignored since it simply serves to
normalize the distribution, and is not a function of A, so it will not affect
the minimization step.  Also, f(h|B) = f(h) since h and B are assumed to be
independent.  Using (GW3) for f(r|h,B), and combining constants into a scale
factor K, (4) becomes:

  (5)  f(h|r,B) = K exp(-|r-Bh|^2/(2s)) f(h)

where K can be determined by setting the integral of f(h|r,B) to 1.

To evaluate (3) we just need the mean and covariance of h given r and B using
the distribution of (5).
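As a concrete illustration of (GW2), (GW3), and the exponential term in (5),
the following Python/numpy sketch builds the N-by-L matrix A from a (-1,1)
data sequence (row k holding a(k), a(k-1), ..., a(k-L+1), with a(t) = 0 for
t <= 0, consistent with the vector c(k) defined below), simulates r = A h + n,
and evaluates the exponent of (5) for a current estimate B.  The function and
variable names are ours, not from Ghosh and Weber; this is only a sketch under
those assumptions.

    import numpy as np

    def build_A(a, L):
        """N-by-L matrix of (GW2): row k holds a(k), a(k-1), ..., a(k-L+1),
        with a(t) = 0 for t <= 0 (zero padding in the first L-1 rows)."""
        N = len(a)
        A = np.zeros((N, L))
        for k in range(N):
            for i in range(L):
                if k - i >= 0:
                    A[k, i] = a[k - i]
        return A

    rng = np.random.default_rng(0)
    N, L, s = 10, 3, 0.25                    # block size, channel length, noise variance
    a = rng.choice([-1.0, 1.0], size=N)      # (-1,1) data values
    h = np.array([1.0, 0.5, 0.25])           # example channel impulse response
    A = build_A(a, L)
    r = A @ h + rng.normal(scale=np.sqrt(s), size=N)   # (GW2): r = A h + n

    B = build_A(a, L)                        # B formed from the current data estimate
    exponent = -np.sum((r - B @ h) ** 2) / (2 * s)     # exponent of (5) for a given h

The E-step only needs the mean and covariance of h under (5); these are
defined symbolically next.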
We will derive these for specific cases of f(h) later; for now, we define
them symbolically as g and G:

  (6)  g == E[h|r,B]

  (7)  G == E[(h-g)(h-g)'|r,B] == {G(i,j)},

       G(i,j) = E[ (h(i)-g(i)) (h(j)-g(j)) | r,B],  i,j = 1..L

Note that although the channel coefficients may be independent in their
a priori distribution f(h), they are not independent in (5) given r and B.
Also, the noise variance s is needed in general to compute g and G.  If s is
not known it could be estimated from the variances of the transmitted and
received data.

The norm-squared term in (3) represents a summation of incremental state
transition costs which can be minimized by the Viterbi algorithm.  Rewriting
(3) in terms of the incremental costs:

  (8)  Q(A|B) = sum k=1..N { V(k) }

where V(k) is:

  (9)  V(k) = E[ |r(k) - c(k)'h|^2 | r,B]

c(k) is a column vector containing a(k-i+1), i=1..L, and a(t) == 0 for
t <= 0.  The expected value is taken over h using (5) for f(h|r,B).
Expanding the norm squared:

  (10)  |r(k) - c(k)'h|^2 = r(k)^2 - 2 r(k)c(k)'h + c(k)'h c(k)'h

                          = r(k)^2 - 2 r(k)c(k)'h + c(k)'(h-g)(h-g)'c(k)
                            - (c(k)'g)^2 + 2 c(k)'g c(k)'h

Taking the expected value:

  (11)  V(k) = r(k)^2 - 2 r(k)c(k)'g + c(k)'Gc(k) - (c(k)'g)^2
               + 2 c(k)'g c(k)'g

             = (r(k) - c(k)'g)^2 + c(k)'Gc(k)

Now we return to the evaluation of g and G using (5).  We cannot do this
without having a specific f(h) to use in (5), but we can put the exponential
term into a form which may be more useful.  Expand the norm-squared term from
(5):

  (12)  |r-Bh|^2 = r'r - 2r'Bh + h'B'Bh

                 = r'r - 2r'Bh + 2g'B'Bh - g'B'Bg + (h-g)'B'B(h-g)

Now, if g = pinv(B) r, where pinv(B) is the pseudo-inverse of B (i.e.
inv(B'B)B' if B has full rank), and if B has full rank, then (12) reduces to:

  (13)  |r-Bh|^2 = r'r - r'Bg + (h-g)'B'B(h-g)

which indicates that the exponential part of (5) is a joint Gaussian
distribution of h, with inverse autocorrelation matrix B'B/s.  If B does not
have full rank then some simplification of (12) into a form similar to (13)
may be performed, but that is not pursued here.  In the case of data values
which do not contain 0's, e.g. (-1,1) data values, B will always have full
rank due to the 0's in the first L-1 rows (initial data values for t <= 0)
which make the columns of B linearly independent.

Note that we have not proved or derived that g = pinv(B) r; we have only
shown that _if_ this is so, then the exponential part of (5) is Gaussian.
The actual value of g and overall form of (5) depend on f(h).  In this
section, we are only deriving general results, without assuming any specific
form for f(h).  The following sections use these results and consider some
specific cases of f(h).

2. Deterministic Channel Coefficients
-------------------------------------

If h is not random, then let f(h) = delta(h-g) where g is the deterministic
but unknown value of h, and delta() represents an impulse function.  In this
case, E[h] = E[h|r,B] = g, and using (5) to estimate g does not provide
anything useful since it reduces to:

  (14)  E[h|r,B] = integral over h { h K exp(-|r-Bh|^2/(2s)) delta(h-g) dh }

                 = g K exp(-|r-Bg|^2/(2s)) = g

and g is unknown.  So to estimate g, note that given r and B, h satisfies
(GW2):

  (15)  r = B h + n

Taking the expected value:

  (16)  r = B g

Solving for g:

  (17)  g = pinv(B) r

This seems to be a reasonable way to estimate g for the EM algorithm in this
case, even though (17) only provides a least-square-error solution to (16)
and not an exact solution.  And note that this is a maximum-likelihood
estimate of h given r and B since it minimizes |r-Bg|^2.
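A minimal sketch of the E-step estimate (17), assuming B has already been
formed from the current data estimate as in the earlier sketch; numpy's
least-squares solver is used rather than an explicit pseudo-inverse, which is
numerically preferable but equivalent for full-rank B:

    import numpy as np

    def estimate_g(B, r):
        """Eq. (17): g = pinv(B) r, the least-square-error solution of r = B g."""
        g, *_ = np.linalg.lstsq(B, r, rcond=None)
        return g

With G = 0 in this deterministic case, only g enters the incremental Viterbi
cost, as shown next.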
Although producing an optimal estimate of the channel coefficients was not a
goal in this EM algorithm, it seems to do so anyway.

In this case, G = 0, and the incremental Viterbi cost function from (11) is:

  (18)  V(k) = (r(k) - c(k)'g)^2

The detailed EM algorithm is:

  1. Initialize B to some estimate of A.  We could initialize B randomly, or
     to all 1's, etc.  But to reduce the number of iterations required, and
     to help the algorithm converge to a proper solution, it makes more sense
     to initialize B with some reasonable estimate of A.  This can be done by
     initially using a pure delay channel, i.e. using g = [0 .. 0 1 0 .. 0]
     in (18), then using the Viterbi algorithm to estimate A, then setting
     B=A.

  2. E-step: use (17) to estimate g:  g = pinv(B) r

  3. M-step: use the Viterbi algorithm with incremental cost function (18) to
     estimate A.

  4. Set B=A and repeat steps 2 and 3 until converged.

This is equivalent to the Ghosh and Weber algorithm:

  1. Start with an initial estimate of g.

  2. Use the Viterbi algorithm with incremental cost function (18) to
     estimate A.

  3. Set B=A and use (17) to produce a new estimate of g:  g = pinv(B) r

  4. Repeat steps 2 and 3 until converged.

The EM step 1 corresponds to Ghosh and Weber steps 1 and 2.  The EM steps 2
and 3 correspond to Ghosh and Weber steps 3 and 2.

3. Uniform Channel Coefficient Distribution
-------------------------------------------

The previous section presented a solution for the case of deterministic
channel coefficients, using an impulse function for f(h).  In this section we
analyze a similar situation, where no information about h is available, so we
use a uniform distribution for f(h).  Although impulse and uniform
distributions for f(h) do not seem similar, they are similar in this
application because the location of the impulse (mean of h) is unknown in the
previous section.

If f(h) is uniform over some finite interval, then computation of g and G
could be performed numerically using f(h|r,B) from (5).  If we let the domain
of h be infinite, then we can compute g and G analytically.  In this case
f(h) is improper, since a uniform density of infinite extent would have to be
zero for the area under it to be one.  Nevertheless, assuming that f(h) is
uniform and of infinite extent is useful, since it is practically equivalent
to letting f(h) be uniform over some range of several standard deviations for
which the exponential term in (5) is significant.  Outside of some finite
region, the exponential term in (5) is practically zero.  So we propose that
letting f(h)=1 in (5) is reasonable, which reduces (5) to:

  (19)  f(h|r,B) = K exp(-|r-Bh|^2/(2s))

Now using (13):

  (20)  f(h|r,B) = K exp( -(r'r - r'Bg + (h-g)'B'B(h-g))/(2s))

                 = K2 exp( -(h-g)'B'B(h-g)/(2s))

where K2 combines the constants K and exp( -(r'r - r'Bg)/(2s)).  This shows
that h has a joint Gaussian distribution with mean g = pinv(B) r, and inverse
autocorrelation matrix B'B/s, if B has full rank.

Therefore, for the case of uniformly distributed channel coefficients, the EM
algorithm is almost the same as for deterministic channel coefficients,
except that the incremental Viterbi cost function from (11), which includes
G, is used instead of (18), and G = inv(B'B/s) in this case.

The detailed EM algorithm is:

  1. Initialize B to some estimate of A by initially estimating
     g = [0 .. 0 1 0 .. 0] and G = diag{s/(N-i+1)}, i=1..L, in (11), then
     using the Viterbi algorithm to estimate A, then setting B=A.  The
     initial estimate of G is based on using just the diagonal elements of
     B'B for the case of (-1,1) data values.

  2. E-step: use (17) to estimate g:  g = pinv(B) r

     and estimate G using:  G = inv(B'B/s)

  3. M-step: use the Viterbi algorithm with incremental cost function (11) to
     estimate A.

  4. Set B=A and repeat steps 2 and 3 until converged.
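For reference, here is a minimal sketch of the M-step shared by the
algorithms of sections 2 and 3 (and section 4 below): a Viterbi search over
(-1,1) data sequences using the incremental cost (11), which reduces to (18)
when G = 0.  The state and trellis bookkeeping are our own choices; the
report does not prescribe a particular implementation.

    import numpy as np

    def viterbi_mstep(r, g, G):
        """Minimize sum_k V(k) of (8) over (-1,1) data sequences, where
        V(k) = (r(k) - c(k)'g)^2 + c(k)' G c(k)   (eq. (11); G = 0 gives (18))
        and c(k) = [a(k), a(k-1), ..., a(k-L+1)] with a(t) = 0 for t <= 0."""
        N, L = len(r), len(g)
        # survivors: state (a(k-1), ..., a(k-L+1)) -> (accumulated cost, decided symbols)
        survivors = {(0.0,) * (L - 1): (0.0, [])}
        for k in range(N):
            new_survivors = {}
            for state, (cost, path) in survivors.items():
                for a_k in (-1.0, 1.0):
                    c = np.array((a_k,) + state)          # c(k)
                    e = r[k] - c @ g
                    v = e * e + c @ G @ c                 # incremental cost V(k)
                    new_state = ((a_k,) + state)[:L - 1]
                    cand = (cost + v, path + [a_k])
                    if new_state not in new_survivors or cand[0] < new_survivors[new_state][0]:
                        new_survivors[new_state] = cand
            survivors = new_survivors
        best_cost, best_path = min(survivors.values(), key=lambda t: t[0])
        return np.array(best_path)

For the deterministic case of section 2, pass G = np.zeros((L, L)); for the
uniform case of this section, G = s * np.linalg.inv(B.T @ B).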
Note that the expected value of (18) is just s, the noise variance.  The
expected value of (11) is s plus an extra term which depends on G; for N >> L
this extra term is approximately sL/N and is small compared to s, so it will
not have a significant effect on the algorithm.

4. Gaussian Channel Coefficient Distribution
--------------------------------------------

Let f(h) be a joint Gaussian distribution with mean d and autocovariance Rh:

  (21)  f(h) = K3 exp(-(h-d)'inv(Rh)(h-d)/2)

where K3 is a normalizing constant.  Using (5):

  (22)  f(h|r,B) = K exp(-|r-Bh|^2/(2s)) K3 exp(-(h-d)'inv(Rh)(h-d)/2)

                 = K4 exp(-(h-g)'inv(G)(h-g)/2)

where K4 is a normalizing constant and:

  (23)  G = inv(B'B/s + inv(Rh))

  (24)  g = G (B'r/s + inv(Rh)d)

So in this case, the distribution of h given r and B is also Gaussian.  Note
that the formula in (24) for the mean of h given r and B does not use
pinv(B)r as in the previous deterministic and uniform cases.

The Gaussian distribution of h is intermediate between deterministic and
uniform h.  In the limit as Rh goes to 0, G approaches Rh and g approaches d,
which corresponds to the Gaussian distribution approaching delta(h-d) as in
the case for deterministic h.  In the limit as Rh goes to infinity, G
approaches inv(B'B/s) and g approaches pinv(B)r, which corresponds to the
Gaussian distribution approaching a uniform distribution.

In general, the EM algorithm for Gaussian f(h) is:

  1. Initialize B to some estimate of A by initially estimating g and G, then
     using the Viterbi algorithm to estimate A, then setting B=A.

  2. E-step: use (23) and (24) to estimate G and g.

  3. M-step: use the Viterbi algorithm with incremental cost function (11) to
     estimate A.

  4. Set B=A and repeat steps 2 and 3 until converged.

Simulations were run to compare the EM algorithms for deterministic and
Gaussian h.  To ensure that the algorithms converge to the ML solution, the
actual value of h (i.e. the value used to produce the simulated received
data) was used to initialize g.  In this case, the different Viterbi
incremental cost functions from (11) and (18) are responsible for the
differences in the results.  Using 100000 trials, with two iterations of each
algorithm for each trial, and L=3, with h generated as independent Gaussian
random variables with mean 1 and variance 0.5, and (-1,1) data values
generated with p(1)=p(-1)=0.5 for each trial, BER results are:

                        EM algorithm       EM algorithm
    N   SNR           (deterministic h)    (Gaussian h)    known channel
    ---------------------------------------------------------------------
    6    6                0.10426            0.10278          0.099273
    6    9                0.04897            0.047355         0.04497
   10    6                0.085894           0.085592         0.082203
   10    9                0.033689           0.033503         0.031556

Although both EM algorithms were initialized with h equal to the known
channel, the EM iterations can produce final results different from the known
channel case after convergence.  These results indicate that the EM algorithm
for Gaussian h works better than the EM algorithm for deterministic h (which
is equivalent to the ad-hoc methods for joint h,A optimization) for low
values of N and SNR.
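For completeness, a minimal sketch of the E-step updates (23) and (24) used
in step 2 of the algorithm above, assuming d, Rh, and the noise variance s
are known; explicit inverses are used to mirror the equations, although a
linear solver could be substituted:

    import numpy as np

    def estep_gaussian(B, r, d, Rh, s):
        """Eqs. (23) and (24):
           G = inv(B'B/s + inv(Rh)),   g = G (B'r/s + inv(Rh) d)."""
        Rh_inv = np.linalg.inv(Rh)
        G = np.linalg.inv(B.T @ B / s + Rh_inv)
        g = G @ (B.T @ r / s + Rh_inv @ d)
        return g, G

As noted above, letting Rh grow without bound recovers the uniform-case
values g = pinv(B) r and G = inv(B'B/s), while letting Rh go to 0 returns
g = d.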
5. MAP and ML Derivations
-------------------------

Here we examine the background of the ISI problem with unknown channel
coefficients starting from basics.  To minimize BER, we must maximize f(A|r),
where A contains the transmitted data and r is the received data vector.
Using Bayes rule:

  (25)  f(A|r) = f(r|A) f(A) / f(r) = K f(r|A) f(A)

where K is a constant which takes into account the effect of f(r) for
normalizing the distribution.  MAP estimators minimize BER by maximizing
f(r|A) f(A).  If f(A) is unknown or assumed to be uniform, then the ML
estimator which maximizes f(r|A) can be used instead.

All of the EM algorithms derived in the previous sections produce iterative
solutions to the ML problem of maximizing f(r|A).  If the channel
coefficients are not deterministic, then this ML problem is practically
impossible to solve without using EM, as discussed below.  Furthermore, if
f(A) is known and not uniform, it can be incorporated into the EM algorithms
as an additive term in Q(A|B), thus producing a MAP estimator.  If the
transmitted data values are independent with a known distribution, or if the
Viterbi state transitions have a known a priori Markov model, then f(A) can
easily be incorporated into the incremental Viterbi cost function.

Now we examine the dependence of f(r|A) on the channel coefficients.  In
general:

  (26)  f(r|A) = E[f(r|h,A)]

              = integral over h { dh f(h) exp(-|r-Ah|^2/(2s)) / (s 2pi)^(N/2) }

Evaluating this integral and then maximizing the result with respect to A
seems to be an intractable problem in general.  Yet this is exactly the
problem which the EM algorithms solve iteratively.

The only apparent case where direct evaluation and maximization of f(r|A)
from (26) seems tractable is if h is deterministic.  Then f(h) = delta(h-g)
and the negative log of the integral reduces to (ignoring constants):

  (27)  -log f(r|A) = |r-Ag|^2

where g is the deterministic but unknown value of h, as discussed in
section 2.  This is the basis for the joint h,A estimation algorithms such as
Ghosh and Weber.

Since the EM algorithms do not use any estimate of h itself, only estimates
of the mean and covariance of h, they turn out to be different from the joint
h,A estimation solution when h is not deterministic.  But if N >> L, the
difference between the algorithms has little effect on the final solutions,
since the effect of the covariance estimate G in (11) becomes vanishingly
small.  This indicates that if f(h) is known, and if the data block size is
small so that N is not much greater than L, then the EM algorithms will
provide better solutions.
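To make the intractability of (26) concrete: even for a single candidate A,
the integral generally has to be approximated, for example by Monte Carlo
averaging over draws of h, and maximizing over A would still require
searching all 2^N data sequences.  The sketch below is our own illustration
for Gaussian f(h) with mean d and covariance Rh (the names and sample count
are arbitrary), not something taken from the report or from Ghosh and Weber.

    import numpy as np

    def marginal_likelihood(r, A, d, Rh, s, ndraws=10000, rng=None):
        """Monte Carlo approximation of (26): f(r|A) = E[ f(r|h,A) ] over f(h),
        here taken to be Gaussian with mean d and covariance Rh."""
        rng = np.random.default_rng() if rng is None else rng
        N = len(r)
        h_draws = rng.multivariate_normal(d, Rh, size=ndraws)   # samples from f(h)
        sq = np.sum((r - h_draws @ A.T) ** 2, axis=1)           # |r - A h|^2 per draw
        return np.mean(np.exp(-sq / (2 * s))) / (s * 2 * np.pi) ** (N / 2)

An exhaustive ML search would evaluate this for every candidate (-1,1)
sequence, which is exactly the search the EM iterations avoid.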
6. Alternate Form
-----------------

In his Ph.D. thesis, "Modulation and Coding Concepts for Channels with
Memory" (Rensselaer Polytechnic Institute, May 1993), Kumar Ramaswamy states
on page 80 that the proof of convergence of EM assumes a real Euclidean
parameter space, and so is not directly applicable where the parameter space
is discrete.  Ramaswamy avoids the convergence issue by using
"soft-decisions" in the M-step of EM, i.e. he produces real (or complex)
floating-point estimates of the data values, then quantizes the results after
the algorithm converges.  He then has the EM guarantee of convergence.
However, minimizing some function f(x) for continuous x, then quantizing x,
may not produce the same results as minimizing f(x) over a discrete space of
x.  Nevertheless, Ramaswamy's method seems to work very well, with greatly
reduced complexity compared to using the Viterbi algorithm in most cases,
especially with large channel lengths.

To investigate the use of soft-decisions, (GW2) can be written in an
alternate form, with the data values in vector a, and the channel
coefficients in matrix H:

  (28)  r = H a + n

where H is an N-by-N matrix containing h(1) on the diagonal, h(2) on the
first sub-diagonal, etc.  Given an estimate of h, (28) can be solved for a
linear least-squares "soft" (non-discrete) estimate of a using pinv(H) r.
This estimate of a could then be used in (GW2) to solve for a new estimate of
h = pinv(A) r.  (28) and (GW2) would then be alternately solved until
convergence.  The Viterbi algorithm would not be needed.  A sketch of this
alternating scheme is given below.

Preliminary investigation of such an alternating optimization algorithm
indicates that it does not work well, even if a is discretized before solving
(GW2) during the iterations.  But note that no theoretical investigation has
been performed here to justify the use of such an algorithm.  The main point
of this section is simply to note the alternate form (28) of (GW2), and to
suggest that use of this form, together with soft-decisions, may be worth
investigating further.
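The sketch below makes the alternating scheme concrete, assuming real (-1,1)
data, a known channel length L, and a fixed iteration count (all our own
choices).  It is included only for illustration, not as a recommended
algorithm, since as noted above it did not work well in preliminary
experiments.

    import numpy as np

    def build_H(h, N):
        """N-by-N matrix of (28): h(1) on the diagonal, h(2) on the first
        sub-diagonal, ..., h(L) on sub-diagonal L-1."""
        H = np.zeros((N, N))
        for i, hi in enumerate(h):
            H += np.diag(np.full(N - i, hi), -i)
        return H

    def soft_alternating(r, h0, L, niter=10):
        """Alternate least-squares solutions of (28) and (GW2):
        a = pinv(H) r given h, then h = pinv(A) r given a.
        Quantization of a is applied only after the final iteration."""
        N, h = len(r), np.asarray(h0, dtype=float)
        for _ in range(niter):
            a, *_ = np.linalg.lstsq(build_H(h, N), r, rcond=None)   # soft data estimate
            # form the N-by-L matrix of (GW2) from the soft data values
            A = np.column_stack([np.concatenate([np.zeros(i), a[:N - i]])
                                 for i in range(L)])
            h, *_ = np.linalg.lstsq(A, r, rcond=None)               # new channel estimate
        return np.sign(a), h

Discretizing a inside the loop (e.g. applying np.sign(a) before forming A)
corresponds to the variant mentioned above, which also did not work well.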