Sequence Estimation EM Examples
R. Perry, 25 June 1998

______________________________________________________________________________

0. Introduction
---------------

A simple sequence estimation example is presented to illustrate the EM
algorithm.  The motivation for this example is the paper:

    C. N. Georghiades and J. C. Han, "Sequence Estimation in the Presence
    of Random Parameters Via the EM Algorithm", IEEE Trans. Commun.,
    vol. 45, pp. 300-308, March 1997.

which provides some excellent examples and results.  Desiring simpler
examples to explore implementations of the EM and Viterbi algorithms, I
made up the ones described here.

Section 1 derives an EM algorithm for a simple scaled binary sequence in
noise.  This problem appears difficult to solve directly for ML estimates
without using EM.  In Section 2 we constrain the binary sequence in such a
way that it becomes convenient to use the Viterbi algorithm for the M-step
of the EM algorithm.

______________________________________________________________________________

1. Scaled Binary Sequence
-------------------------

Let the observed data Y be a sequence of N 0's and 1's scaled by a fixed
but unknown non-zero parameter c and corrupted by additive Gaussian
noise Z:

(1.1)   Y = c B + Z

where B represents a sequence of deterministic binary 0 and 1 values.  The
probability density function for each element z of Z is G(z;0,v), where
G(z;m,v) represents the density function for a Gaussian random variable z
with mean m and variance v:

(1.2)   G(z;m,v) = exp( -(z-m)^2 / (2 v) ) / sqrt(2 pi v)

We would like to determine an ML estimate of B.  We can try to do this
directly by writing f(Y|B) and trying to maximize it.  Each element y of Y
is a Gaussian random variable with mean (c b) and variance v:

(1.3)   f(y|b) = G(y;(c b),v)
               = G(y;0,v)   if b = 0
               = G(y;c,v)   if b = 1

If G(y;c,v) > G(y;0,v) then we estimate b as 1, otherwise 0:

(1.4)   G(y;c,v) > G(y;0,v)

        -(y-c)^2 > -y^2        (by taking the log)

which reduces to:

(1.5)   y > c/2   if c > 0
        y < c/2   if c < 0

Since the value of c is unknown, it appears impossible to proceed with a
direct ML algorithm without having an estimate of c.  But using the EM
algorithm we can choose a set of complete data X, the obvious choice being
X = {Y, c} in this case, and the estimation step of the algorithm will
provide estimates of c to be used in the maximization step.

So let D represent an estimate of B, and let the complete data X be
{Y, c}.  Dempster's Q() function for EM can then be written as:

(1.6)   Q(B|D) = E[ log f(Y,c|B) | Y,D ]

Since c is not a random variable, f(Y,c|B) can be written as f(Y|c,B) and
is the product of G(y;(c b),v) for each y,b in Y,B:

(1.7)   f(Y|c,B) = G(y1;(c b1),v) G(y2;(c b2),v) ... G(yN;(c bN),v)

Taking the log of (1.7), dropping constant terms which are not a function
of B and will have no effect in the M-step, and substituting into (1.6):

(1.8)   Q(B|D) = E[ -(y1 - c b1)^2 - (y2 - c b2)^2 ... - (yN - c bN)^2 | Y,D ]

Letting (y - c b)^2 represent any term in the expected value operation of
(1.8), expanding the term and taking the expected value:

(1.9)   E[ (y - c b)^2 | Y,D ] = E[ y^2 - 2 c y b + c^2 b^2 | Y,D ]
                               = y^2 - 2 y b E[c|Y,D] + b^2 E[c^2|Y,D]

Since c is an unknown constant, not a random variable, E[c^2|Y,D] =
E[c|Y,D]^2, so all we really need to do to perform the expectation
operation in (1.8) is to replace c with E[c|Y,D].

To compute E[c|Y,D], note that D represents a sequence of 0 and 1 values
which are an estimate of B.
The elements of D which are 0 provide no information about c, since the
associated y values are zero-mean Gaussian in those cases (see (1.3)).  So
we only need to consider the elements of D which are 1.  Let D' represent
the subset of elements of D which are 1's, and let Y' represent the
corresponding Y values.  Each element of Y' is Gaussian with mean c, so
E[c|Y,D] is simply E[Y'], which can be computed from the data as the
average of the values in Y'.  Representing this average as "a", we now
have the E-step of the EM algorithm:

E-step: a = avg(Y')

        where Y' is the subset of Y values associated with the current
        estimates of B which are 1's.

Substituting into (1.8):

(1.10)  Q(B|D) = -(y1 - a b1)^2 - (y2 - a b2)^2 ... - (yN - a bN)^2

For the M-step of the EM algorithm, we must find B to maximize Q(B|D) from
(1.10).  Since the terms in (1.10) are independent and of the same form,
we can derive a maximization formula for a typical term, then apply that
formula to all of the terms separately to achieve the maximization.
Writing a typical term from (1.10) as -(y - a b)^2, we must decide whether
b=1 or b=0 produces a maximum by examining the inequality:

(1.11)  -(y - a 1)^2 > -(y - a 0)^2

Note that this inequality is the same as the one discussed previously in
(1.4) when considering direct ML estimation of B.  But in (1.4) we did not
have an estimate of c; here we do (i.e. a).  So the solution, using (1.5),
is:

M-step: For each b,y in B,Y

        set b = 1 if (a > 0 and y > a/2) or (a < 0 and y < a/2)

        otherwise set b = 0

Note that the E-step and M-step of this EM algorithm are very easily
implemented in any matrix-based programming environment such as octave,
scilab, matlab, or M.  For example, in M, the following loop performs n
iterations of the algorithm; y is the row-vector of observed data, b is
the row-vector of estimates of B, and c represents the estimate of c
called "a" above:

    for( i = 0; i < n; ++i)
    {
        /* E-step */
        j = find( b == 1);
        c = sum(y(j)) / cols(j);

        /* M-step */
        b = (c > 0) ? (y > c/2) : (y < c/2);
    }

As you can see, j is computed as a vector of indices for which B is 1, and
is used to select the subset Y' of Y as described in the E-step above.
(y > c/2) or (y < c/2) are performed as vector operations yielding a
vector of 0's and 1's as the estimate of B.

Following are some numerical examples using N = 50 data values randomly
generated using P(1) = 0.4 and c = 2.  In each case, the initial estimate
of B was set to all 1's.

In these examples, iter(i) is a function which performs i iterations of
the EM algorithm for the problem under consideration.  For each iteration
it prints the current estimated value of c and the number of 1's in B.
check_ml() is a function which verifies that the solution is at least a
local minimum of -log(f(Y|B,c)) by examining nearby points: with
individual bits flipped one at a time, or using 0.999*c or 1.001*c for c.

In the first example, v (the noise variance) is chosen to give
SNR = 12 dB:

    m:111> iter(5)
    c = 0.666958, sum(b) = 20
    c = 1.79068, sum(b) = 17
    c = 2.02121, sum(b) = 17
    c = 2.02121, sum(b) = 17
    c = 2.02121, sum(b) = 17
    m:112> sum(b!=b_save)
    ans = 0
    m:113> check_ml
    check_ml: min = 3.11318

The algorithm converged after just a few iterations.  Comparing the
estimated B values with the original generated data (b_save) shows that
all of the values are estimated correctly.  check_ml() prints the value of
-log(f(Y|B,c)) and does not find any nearby points which provide a better
solution.
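For readers who do not have M available, the following Python/NumPy sketch
generates data according to (1.1) and runs the same E-step/M-step loop.
It is only an illustration of the algorithm above, not the original
implementation; the variable names are mine, and the noise variance v is
an arbitrary placeholder since the exact SNR-to-variance mapping used in
the examples is not spelled out in this note.

    import numpy as np

    rng = np.random.default_rng(0)

    # test data per (1.1): Y = c B + Z, with N = 50, P(1) = 0.4, c = 2
    N, p1, c_true, v = 50, 0.4, 2.0, 0.25      # v is a placeholder value
    b_true = (rng.random(N) < p1).astype(int)
    y = c_true * b_true + rng.normal(0.0, np.sqrt(v), N)

    b = np.ones(N, dtype=int)                  # initial estimate of B: all 1's
    for i in range(5):
        # E-step: a = avg(Y'), the mean of y over the current 1 positions
        a = y[b == 1].mean()
        # M-step: threshold test from (1.5)/(1.11)
        b = (y > a / 2).astype(int) if a > 0 else (y < a / 2).astype(int)
        print(f"c = {a:g}, sum(b) = {b.sum()}")

    print("bit errors:", int(np.sum(b != b_true)))

With a small noise variance this should behave like the 12 dB example
above: the estimate of c settles after a few iterations.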
Following is an example with SNR = 3 dB:

    m:115> iter(5)
    c = 0.848997, sum(b) = 28
    c = 1.70155, sum(b) = 22
    c = 1.99095, sum(b) = 21
    c = 2.04252, sum(b) = 21
    c = 2.04252, sum(b) = 21
    m:116> sum(b!=b_save)
    ans = 3
    m:117> check_ml
    check_ml: min = 18.2503
    m:118> find(b!=b_save)
    ans =
      5  8  42
    m:119> [b_save(ans)' y(ans)' b(ans)']
    ans =
      0  1.138719  1
      0  1.250233  1
      1  0.908067  0

In this lower SNR case, 3 of the 50 data values are wrongly estimated when
compared to the original generated data.  Examining those estimates which
are off shows that they are correct ML estimates, based on a threshold of
c/2 ~= 1.

______________________________________________________________________________

2. Constrained Binary Sequence
------------------------------

To make this problem more interesting, and complicated enough so that it
becomes convenient to use the Viterbi algorithm for the M-step, we now
assume that the binary sequence B is constrained to never have two 1's in
a row.

Recall that B is deterministic but unknown.  Here we are assuming that the
process which generates B never produces two 1's in a row, i.e. whenever
it produces a 1 it must produce a 0 for the next value.  To obtain a true
ML estimate, we must incorporate this a priori information into the EM
algorithm.

This constraint affects the M-step maximization of Q(B|D) from (1.10):

        Q(B|D) = -(y1 - a b1)^2 - (y2 - a b2)^2 ... - (yN - a bN)^2

Now the terms in Q(B|D) are not independent.  For example, we can not
independently use the second term to estimate b2, since if b1 was
estimated as 1, then b2 must be 0 under the specified constraint.
Although we could devise a specialized optimization method for this
particular problem, it falls neatly into the category of problems which
are solvable by the Viterbi algorithm, so we will just use that.  See the
vit.m source code for a reference on the Viterbi algorithm and the
implementation used here.

For the Viterbi algorithm, we need to specify the costs associated with
moving from one state to the next in the state-transition trellis.  Since
the Viterbi algorithm is designed to minimize a cost function, we will use
-Q(B|D) as the function to minimize (equivalent to maximizing Q(B|D)).
There are two states, the possible 0 and 1 values for elements of B.
Considering, for example, the transition from an estimate of b1 to an
estimate of b2, we have the incremental costs:

        b1  b2  cost
        --  --  ----
         0   0  y2^2
         0   1  (y2 - a)^2
         1   0  y2^2
         1   1  Infinity

The costs come from (y2 - a b2)^2 with substitution of the appropriate
value of b2.  Since (b1,b2) = (1,1) is not allowed under our assumed
constraint of never having two 1's in a row, the cost of that transition
is represented as Infinity, so it will never be selected in the Viterbi
cost minimization.  The actual implementation uses some HUGE_VAL constant,
such as 1e99, to represent Infinity without causing numerical
difficulties.

For this constrained EM algorithm, the E-step is the same as in the
previous section, and the M-step simply uses the Viterbi algorithm to
estimate B.  The implementation in M is:

    for( i = 0; i < n; ++i)
    {
        /* E-step */
        j = find( b == 1);
        c = sum(y(j)) / cols(j);

        /* M-step */
        b = vit(y,2) - 1; /* map 1,2 from viterbi to 0,1 */
    }

where vit(y,2) invokes a generic Viterbi routine using data vector y and
two states.  vit() returns a vector of state indices representing the
lowest cost path; these indices have a value of 1 or 2 corresponding to
state #1 and state #2.  They are mapped to the associated 0 and 1 values
for B by simply subtracting 1.
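The vit.m source is not reproduced in this note.  As a rough illustration
of what this M-step does, here is a Python/NumPy sketch of the two-state
minimization using the cost table above; viterbi_m_step is a name made up
here and is not the vit() routine itself.

    import numpy as np

    def viterbi_m_step(y, a, huge=1e99):
        # Minimize -Q(B|D) from (1.10) over sequences with no two 1's in a
        # row, using a two-state trellis (states 0 and 1).
        N = len(y)
        emit = lambda k, b: (y[k] - a * b) ** 2   # incremental cost (y - a b)^2
        cost = np.zeros((N, 2))            # best path cost ending in state b
        prev = np.zeros((N, 2), dtype=int) # predecessor state for traceback
        cost[0, 0] = emit(0, 0)
        cost[0, 1] = emit(0, 1)
        for k in range(1, N):
            for b in (0, 1):
                c0 = cost[k - 1, 0] + emit(k, b)                        # from state 0
                c1 = cost[k - 1, 1] + (huge if b == 1 else emit(k, b))  # 1 -> 1 forbidden
                cost[k, b], prev[k, b] = min((c0, 0), (c1, 1))
        b_hat = np.zeros(N, dtype=int)
        b_hat[N - 1] = int(np.argmin(cost[N - 1]))
        for k in range(N - 1, 0, -1):      # trace the lowest-cost path back
            b_hat[k - 1] = prev[k, b_hat[k]]
        return b_hat

One constrained EM iteration in this sketch would then be
a = y[b == 1].mean() for the E-step followed by b = viterbi_m_step(y, a)
for the M-step.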
The generic Viterbi routine vit() calls a user-defined function get_cost()
to evaluate the cost of possible state transitions.  In this example,
get_cost() implements the cost table described above, which assigns an
infinite (HUGE_VAL) cost to state transitions that would produce a 1
followed by a 1.

Following are some numerical examples using N = 50 data values randomly
generated using P(1) = 0.8 and c = 2.  After initially generating the B
data randomly, it was fixed to conform to the constraint of never having
two 1's in a row by setting to 0 any value which was preceded by a 1 in
the sequence.  In each case, the initial estimate of B was set to all 1's.
For this problem, check_ml() was modified so that it skips bit-flipping
operations which would violate the constraint.

In the first example, the noise variance is chosen to give SNR = 12 dB:

    m:241> iter(5)
    c = 0.826411, sum(b) = 21
    c = 1.9967, sum(b) = 21
    c = 1.9967, sum(b) = 21
    c = 1.9967, sum(b) = 21
    c = 1.9967, sum(b) = 21
    m:242> sum(b!=b_save)
    ans = 0
    m:243> check_ml
    check_ml: min = 3.38763

The algorithm converged after just a few iterations.  Comparing the
estimated B values with the original generated data (b_save) shows that
all of the values are estimated correctly.  check_ml() prints the value of
-log(f(Y|B,c)) and does not find any nearby points which provide a better
solution.

Following is an example with SNR = 3 dB:

    m:241> iter(5)
    c = 0.801701, sum(b) = 21
    c = 2.01161, sum(b) = 19
    c = 2.14403, sum(b) = 19
    c = 2.14403, sum(b) = 19
    c = 2.14403, sum(b) = 19
    m:242> sum(b!=b_save)
    ans = 4
    m:243> check_ml
    check_ml: min = 22.789
    m:244> find(b!=b_save)
    ans =
      3  4  16  20
    m:245> i=[2 3 4]
    m:246> [b_save(i)' y(i)' b(i)']
    ans =
      0  -0.5869168  0
      1   0.6685796  0
      0   1.10774    1

In this lower SNR case, 4 of the 50 data values are wrongly estimated when
compared to the original generated data.  Nevertheless, check_ml()
confirms that no better estimate can be found in the neighborhood of this
solution.  Examining one of the estimates which is off shows that it is a
correct ML estimate, based on a threshold of c/2 ~= 1.

As an example of what check_ml() would show if a better solution can be
found, we try modifying one of the bits in the solution:

    m:248> b(1:5)
    ans =
      1  0  0  1  0
    m:249> b(1)=0
    m:250> check_ml
    check_ml: min = 25.6417
    check_ml: found better solution 25.64 with c = 2.14617
    check_ml: found better solution 22.789 with bit 1 flipped

In this case, check_ml() finds two better solutions: one with a slightly
higher value of c, and one with bit 1 flipped, corresponding to the
original EM solution before we manually flipped bit 1.

Now reset bit 1 to its original value, and set bits 3 and 4 to the
original generated data values:

    m:251> b(1)=1
    m:252> b(3:4)=[1 0]
    m:253> b(1:5)
    ans =
      1  0  1  0  0
    m:254> check_ml
    check_ml: min = 24.6721
    check_ml: found better solution 24.6703 with c = 2.14188
    check_ml: found better solution 22.9422 with bit 3 flipped

In this case, check_ml() correctly determines that flipping bit 3 produces
a better solution.  However, note that the resulting cost of 22.9422 is
still higher than the result from the original EM solution (22.789), which
has both bits 3 and 4 set to their ML values.
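The check_ml() source is likewise not reproduced here.  The following
Python sketch shows the kind of local check it performs: evaluate
-log f(Y|B,c) at the current solution and at nearby points, with bits
flipped one at a time (skipping flips that would violate the constraint)
and with c scaled by 0.999 or 1.001.  The function name and the explicit
variance argument v are mine, and constant terms of -log f(Y|B,c) are
dropped, so the printed values will not match the transcripts above.

    import numpy as np

    def check_ml(y, b, c, v, constrained=True):
        # negative log-likelihood, constant terms in v dropped
        nll = lambda bb, cc: np.sum((y - cc * bb) ** 2) / (2 * v)
        best = nll(b, c)
        print("check_ml: min =", best)
        for cc in (0.999 * c, 1.001 * c):            # perturb the scale factor
            if nll(b, cc) < best:
                print("check_ml: found better solution", nll(b, cc), "with c =", cc)
        for k in range(len(b)):                      # flip one bit at a time
            bb = b.copy()
            bb[k] ^= 1
            # skip flips that would violate the no-two-1's-in-a-row constraint
            if constrained and bb[k] == 1:
                if (k > 0 and bb[k - 1] == 1) or (k < len(b) - 1 and bb[k + 1] == 1):
                    continue
            if nll(bb, c) < best:
                # bit numbers are reported 1-based, as in the transcripts
                print("check_ml: found better solution", nll(bb, c), "with bit", k + 1, "flipped")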
A limited number of simulations were run to compare the bit-error-rate of
the EM solution with one using a known value of the scale factor c.  Using
N = 100 data values with P(1) = 0.8 and c = 2, 10 trials were performed
for each SNR value shown in the table below.  For the EM solution, B was
initialized to all 1's and 3 iterations of the algorithm were performed.
For the known-channel case, c was computed using the E-step formula on the
original generated data and B was estimated using the Viterbi algorithm:

                      BER
    SNR (dB)    Known      EM
    --------    -----    -----
       -6       0.239    0.279
       -5       0.230    0.241
       -4       0.186    0.202
       -3       0.195    0.210
       -2       0.172    0.192
       -1       0.113    0.127
        0       0.079    0.090
        1       0.086    0.100
        2       0.069    0.073
        3       0.041    0.044
        4       0.028    0.028
        5       0.019    0.020
        6       0.012    0.014
        7       0.005    0.004
        8       0.006    0.006
        9       0.000    0.000
       10       0.001    0.001

For SNR > 3 dB, the EM algorithm performs almost as well as ML estimation
with a known value of c.  For higher SNR values, more trials per case
would have to be performed to get accurate estimates of the low BER
values.

Note that in generating this table, the EM algorithm was fixed at 3
iterations rather than being run until convergence.  For very low SNR,
using 5 iterations would produce better results; for SNR > 3 dB, just 2
iterations would suffice.
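A single trial of this comparison can be sketched in Python as below,
reusing the viterbi_m_step() sketch from Section 2.  This is not the
original simulation code: the function name is mine, and it assumes
SNR (dB) = 10*log10(c^2 / v), which is a guess since this note does not
state its exact SNR definition.

    import numpy as np
    # assumes viterbi_m_step() from the earlier sketch is defined in the same file

    def ber_trial(snr_db, N=100, p1=0.8, c_true=2.0, iters=3, rng=None):
        # One trial of the constrained-sequence experiment;
        # returns (ber_known, ber_em).
        if rng is None:
            rng = np.random.default_rng()
        v = c_true ** 2 / 10 ** (snr_db / 10)      # assumed SNR definition
        b = (rng.random(N) < p1).astype(int)
        for k in range(1, N):                      # enforce the no-two-1's constraint
            if b[k - 1] == 1:
                b[k] = 0
        y = c_true * b + rng.normal(0.0, np.sqrt(v), N)

        # known-channel case: E-step formula on the true B, then one Viterbi pass
        a = y[b == 1].mean()
        ber_known = np.mean(viterbi_m_step(y, a) != b)

        # EM case: start from all 1's and run a few iterations
        d = np.ones(N, dtype=int)
        for _ in range(iters):
            a = y[d == 1].mean()
            d = viterbi_m_step(y, a)
        return ber_known, np.mean(d != b)

Averaging the two returned error rates over 10 trials per SNR value would
reproduce the shape of the table above, though the absolute numbers depend
on the actual SNR definition used in the original simulations.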