EM Algorithms

Here we summarize results from [4] and discuss EM convergence for sequence estimation. To maximize (3) using EM, define the auxiliary function [9]:

 
\begin{displaymath}
Q({\bf A}\vert{\bf B}) = E [ \log ( f({\bf r},{\bf H}\vert{\bf A}) ) \ \vert \ {\bf r},{\bf B} ]
= \int_{\bf H} \log ( f({\bf r},{\bf H}\vert{\bf A}) ) \, f({\bf H}\vert{\bf r},{\bf B}) \, d{\bf H} ,
\end{displaymath} (8)

where we desire the MAP estimate of $\bf A$, and ${\bf B}$ represents the current estimate of $\bf A$. In EM terminology, ${\bf r}$ is the observed data, $\bf H$ is the hidden data, and $({\bf r},{\bf H})$ is the complete data. The general EM algorithm for this problem is: (i) initialize ${\bf B}$ to an estimate of $\bf A$; (ii) the E-step (Expectation): construct $Q({\bf A}\vert{\bf B})$; (iii) the M-step (Maximization): maximize $Q({\bf A}\vert{\bf B})$ with respect to $\bf A$; and (iv) set ${\bf B}={\bf A}$ and repeat steps (ii) and (iii) until convergence. A minimal code sketch of this iteration is given below.
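As an informal illustration only (not part of the derivation), the iteration can be organized as in the following sketch; the names em_iterate, e_step, and m_step are hypothetical placeholders for the problem-specific routines that construct and maximize $Q({\bf A}\vert{\bf B})$.

\begin{verbatim}
# Minimal sketch of the generic EM iteration (hypothetical helper names;
# the E- and M-steps are supplied by the specific problem).
def em_iterate(r, A_init, e_step, m_step, max_iters=100):
    B = A_init                   # step (i): initialize B to an estimate of A
    for _ in range(max_iters):
        Q = e_step(r, B)         # step (ii): E-step, construct Q(A | B)
        A = m_step(Q)            # step (iii): M-step, maximize Q(A | B) over A
        if A == B:               # stop when the estimate no longer changes
            break
        B = A                    # step (iv): set B = A and repeat
    return B
\end{verbatim}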

$Q({\bf A}\vert{\bf B})$ may be written as

\begin{displaymath}
Q({\bf A}\vert{\bf B}) \doteq
\int_{\bf H} \log ( f({\bf r}\vert{\bf H},{\bf A}) ) \,
f({\bf r}\vert{\bf H},{\bf B}) \, f({\bf H}) \, d{\bf H} .
\end{displaymath} (9)

Note that in the above steps, if $Q( {\bf A}\vert {\bf B}) \geq Q( {\bf B}\vert {\bf B})$ then according to Jensen's inequality, $f( \bf {r} \vert A ) \geq f( \bf {r} \vert B )$ [9]. However, this does not prove convergence of EM to a local maximum of the log likelihood function. For discrete parameter estimation, we will address this issue below.
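The standard argument behind this monotonicity, sketched here for reference, is as follows. Writing $\log f({\bf r}\vert{\bf A}) = \log f({\bf r},{\bf H}\vert{\bf A}) - \log f({\bf H}\vert{\bf r},{\bf A})$ and taking the expectation with respect to $f({\bf H}\vert{\bf r},{\bf B})$ gives

\begin{displaymath}
\log f({\bf r}\vert{\bf A}) - \log f({\bf r}\vert{\bf B})
= \left[ Q({\bf A}\vert{\bf B}) - Q({\bf B}\vert{\bf B}) \right]
+ \int_{\bf H} \log \frac{ f({\bf H}\vert{\bf r},{\bf B}) }{ f({\bf H}\vert{\bf r},{\bf A}) } \,
f({\bf H}\vert{\bf r},{\bf B}) \, d{\bf H} ,
\end{displaymath}

where the integral is a Kullback-Leibler divergence and hence non-negative by Jensen's inequality; thus $Q({\bf A}\vert{\bf B}) \geq Q({\bf B}\vert{\bf B})$ implies $f({\bf r}\vert{\bf A}) \geq f({\bf r}\vert{\bf B})$.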

Using (2) in (9) and dropping constants:

\begin{displaymath}
Q({\bf A}\vert{\bf B}) \doteq
\int_{\bf H} - \sum_{k=1}^{n} \vert r_k - {\bf a}_k^{T}{\bf h}_k\vert^{2} \,
f({\bf H}\vert{\bf r},{\bf B}) \, d{\bf H} ,
\end{displaymath} (10)



where
\begin{displaymath}
f({\bf H}\vert{\bf r},{\bf B}) \doteq
f({\bf H}) \, \prod_{k=1}^{n}
e^{ - \frac{ \vert r_k - {\bf b}_k^{T}{\bf h}_k\vert^{2} }{ \sigma^{2} } } .
\end{displaymath} (11)

If we assume that the elements of $\bf H$ are independent over time, then $f({\bf H})$ can be written as in (4), and $f({\bf H}\vert{\bf r},{\bf B})$ becomes

\begin{displaymath}
f({\bf H}\vert{\bf r},{\bf B}) \doteq
\prod_{k=1}^{n} f({\bf h}_k) \,
e^{ - \frac{ \vert r_k - {\bf b}_k^{T}{\bf h}_k\vert^{2} }{ \sigma^{2} } } .
\end{displaymath} (12)

Using (12) in (10) we obtain

 \begin{displaymath}Q({\bf A}\vert{\bf B}) \doteq
- \sum_{k=1}^n V_k ,
\end{displaymath} (13)



   
\begin{displaymath}
V_k \stackrel{\triangle}{=} \vert r_k - {\bf a}_k^T {\bf g}_k\vert^2
+ {\bf a}_k^{T} {\bf G}_k {\bf a}_k^* ,
\end{displaymath} (14)
\begin{displaymath}
{\bf g}_k \stackrel{\triangle}{=} E[{\bf h}_k \vert {\bf r},{\bf B}] ,
\end{displaymath} (15)
\begin{displaymath}
{\bf G}_k \stackrel{\triangle}{=} E[({\bf h}_k - {\bf g}_k) ({\bf h}_k - {\bf g}_k)^{*T}
\vert {\bf r},{\bf B}] .
\end{displaymath} (16)

$V_k$ represents the Viterbi algorithm incremental cost. In [4] we provide a detailed derivation for both this independent Gaussian channel and a more realistic Gauss-Markov channel model.
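For intuition, the step from (10) and (12) to (13)-(16) rests on the identity (which requires only the conditional mean and covariance of ${\bf h}_k$)
\begin{displaymath}
\int_{{\bf h}_k} \vert r_k - {\bf a}_k^{T}{\bf h}_k\vert^{2} \,
f({\bf h}_k\vert{\bf r},{\bf B}) \, d{\bf h}_k
= \vert r_k - {\bf a}_k^{T}{\bf g}_k\vert^{2} + {\bf a}_k^{T}{\bf G}_k{\bf a}_k^{*} ,
\end{displaymath}
i.e. the conditional mean squared error splits into the error against the conditional mean plus a term involving the conditional covariance.

As a rough illustration of how (14)-(16) would be used in a trellis search, the following sketch (assuming numpy; the function name is hypothetical) evaluates the incremental cost $V_k$ for one candidate symbol vector ${\bf a}_k$ given the conditional moments ${\bf g}_k$ and ${\bf G}_k$.

\begin{verbatim}
import numpy as np

# Incremental cost V_k of eq. (14) for one candidate symbol vector a_k,
# given the conditional channel mean g_k and covariance G_k (eqs. (15)-(16)).
def incremental_cost(r_k, a_k, g_k, G_k):
    a_k = np.asarray(a_k, dtype=complex)
    err = r_k - a_k @ g_k                        # r_k - a_k^T g_k
    penalty = np.real(a_k @ G_k @ a_k.conj())    # a_k^T G_k a_k^*
    return np.abs(err) ** 2 + penalty
\end{verbatim}

A Viterbi search over the symbol trellis would then accumulate these costs along each surviving path.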

A Note on EM Convergence: For discrete or continuous parameter estimation, it was pointed out above that Jensen's inequality holds, proving that the negative log likelihood function is not increased at any iteration of the EM algorithm. For continuous parameter estimation, it is well known that the EM algorithm converges to a minimum or saddle point of the negative log likelihood function (for a proof, see [10]). That proof is based on the global convergence theorem [12], which is now stated.

Let ${\cal E}$ denote an algorithm on ${\cal A}$ which, given an initial estimate ${\bf A}_0$, generates a sequence $\{ {\bf A}_k ; \ k=1,2,\ldots \}$ via ${\bf A}_{k+1} \in {\cal E} ( {\bf A}_k )$. Let $\Gamma \subset {\cal A}$ be the solution set. Under the conditions:

1.
all points ${\bf A}_k$ are contained in a compact set ${\cal S} \subset {\cal A}$;
2.
there is a continuous function Z on ${\cal A}$ such that
(a)
if ${\bf A}_k \notin \Gamma$, then $Z({\bf A}_{k+1} ) < Z({\bf A}_{k} )$ for all ${\bf A}_{k+1} \in {\cal E} ( {\bf A}_k )$;
(b)
if ${\bf A}_k \in \Gamma$, then $Z({\bf A}_{k+1} ) \leq Z({\bf A}_{k} )$ for all ${\bf A}_{k+1} \in {\cal E} ( {\bf A}_k )$;
3.
the mapping ${\cal E}$ is closed at points outside $\Gamma$,
then the limit of any convergent subsequence of $\{ {\bf A}_k \}$ is a solution in $\Gamma$.

Let $\{ {\bf A}_k \} \in {\cal A}$ denote a sequence in the discrete space ${\cal A}$ of possible sequence estimates, let ${\cal E}$ represent the EM algorithm, and let $Z ({\bf A}_k ) = - \log f( {\bf r} \vert {\bf A}_k )$. For discrete parameter estimation, condition 1 does not hold: since ${\cal A}$ is a discrete set, there is no compact subset ${\cal S} \subset {\cal A}$ of the required form. As illustrated below, condition 2(a) also does not hold, because ${\bf A}_k \in {\cal E} ( {\bf A}_k )$ is possible when ${\bf A}_k \notin \Gamma$. Thus, the global convergence theorem is not generally applicable to the EM algorithm for discrete parameter estimation.

We now show by example that EM does not necessarily converge to a local discrete minimum of the negative log likelihood function. For the independent Gaussian channel model, let $n=1$ and $M=3$, which are small enough to analyze completely by hand and confirm the results. A single BPSK symbol $a_1$ at time 1 is to be estimated as either +1 or -1. The channel memory is initialized so that ${\bf a}_1 = [ a_1 ~ a_0 ~ a_{-1} ]^T = [ a_1 ~ {-1} ~ {-1} ]^T$. For the Gaussian model of the channel coefficients we assume mean ${\bf d}_1 = [ 1 ~ 1 ~ 1 ]^T$ and covariance ${\bf C}_1 = v {\bf I}$ with $v = 0.05$. The noise variance is $\sigma^2 = 0.3$. The negative log-likelihood cost function of the received data $r_1$ given ${\bf a}_1$ is obtained from (6) and (7). Figure 1(d) illustrates the MAP cost function for $r_1 = -1.8$, which is in the range of $r_1$ where EM breaks down, as discussed below.


  
Figure 1: Estimates of a1 vs. r1: (a) Discrete MAP, (b) Discrete EM, (c) Continuous MAP and EM; (d) MAP cost vs. a1.
[Figure: a1.eps]

For the EM algorithm, letting ${\bf b}_1$ represent an estimate of ${\bf a}_1$, the conditional mean and covariance of the channel coefficients given the received data are

\begin{displaymath}
{\bf g}_1 = {\bf d}_1 +
\frac { {\bf C}_1 {\bf b}_1 (r_1 - {\bf b}_1^T {\bf d}_1) } { \sigma_1^2 } ,
\end{displaymath} (17)
\begin{displaymath}
{\bf G}_1 = {\bf C}_1 -
\frac { {\bf C}_1 {\bf b}_1 {\bf b}_1^T {\bf C}_1 } { \sigma_1^2 } ,
\end{displaymath} (18)

with $\sigma_1^2 = \sigma^2 + {\bf b}_1^T {\bf C}_1 {\bf b}_1$. The EM cost function is

\begin{displaymath}V_1 = - Q({\bf a}_1\vert{\bf b}_1) \doteq
\vert r_1 - {\bf a}_1^T {\bf g}_1\vert^2 + {\bf a}_1^T {\bf G}_1 {\bf a}_1 .
\end{displaymath} (19)

The EM algorithm was initialized using ${\bf b}_1 = [ {-1} ~ {-1} ~ {-1} ]^T$. Figures 1(a)-(b) show the MAP and EM discrete estimates of $a_1$ vs. received data $r_1$. Note the range of values $-2 \le r_1 \le -1.7$ over which the discrete EM estimate is ``wrong'' in Figure 1(b): not only does it differ from the MAP solution, it is not even a local minimum or stationary point of the negative log-likelihood function, as can be seen in Figure 1(d). Over the range $-2 \le r_1 \le -1.7$, the EM algorithm does not iterate; it has converged immediately after initialization. As an aside, Figure 1(c) shows the EM and MAP estimates (which are identical) for $a_1$ estimated as a continuous parameter.
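For reference, the single-symbol experiment can be reproduced with a short script; the sketch below implements (17)-(19), and for the discrete MAP decision it uses the fact that, under this model, $r_1$ given ${\bf a}_1$ is Gaussian with mean ${\bf a}_1^T {\bf d}_1$ and variance $\sigma^2 + {\bf a}_1^T {\bf C}_1 {\bf a}_1$ (our reading of how (6)-(7) specialize here). Sweeping $r_1$ over a grid should reproduce the curves in Figures 1(a)-(b).

\begin{verbatim}
import numpy as np

# Single-symbol example: n = 1, M = 3, BPSK symbol a1 in {+1, -1},
# symbol vector a_1 = [a1, -1, -1]^T, channel prior h_1 ~ N(d_1, C_1).
d1 = np.ones(3)               # channel mean d_1
C1 = 0.05 * np.eye(3)         # channel covariance C_1 = v I, v = 0.05
sigma2 = 0.3                  # noise variance sigma^2

def map_cost(r1, a1):
    """Negative log-likelihood of r1 for hypothesis a1 (h_1 marginalized out)."""
    a = np.array([a1, -1.0, -1.0])
    var = sigma2 + a @ C1 @ a
    return (r1 - a @ d1) ** 2 / var + np.log(var)

def em_cost(r1, a1, b1):
    """EM cost V_1 of eq. (19) for hypothesis a1, given current estimate b1."""
    a = np.array([a1, -1.0, -1.0])
    b = np.array([b1, -1.0, -1.0])
    s1 = sigma2 + b @ C1 @ b                  # sigma_1^2
    g1 = d1 + C1 @ b * (r1 - b @ d1) / s1     # eq. (17)
    G1 = C1 - np.outer(C1 @ b, b @ C1) / s1   # eq. (18)
    return (r1 - a @ g1) ** 2 + a @ G1 @ a    # eq. (19)

r1 = -1.8                                     # in the range where EM breaks down
a1_map = min([+1, -1], key=lambda a1: map_cost(r1, a1))
b1, a1_em = -1.0, -1.0                        # EM initialization b_1 = [-1 -1 -1]^T
for _ in range(20):
    a1_em = min([+1, -1], key=lambda a1: em_cost(r1, a1, b1))
    if a1_em == b1:
        break
    b1 = a1_em
print("MAP estimate:", a1_map, " EM estimate:", a1_em)
\end{verbatim}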

This example illustrates that for discrete parameter estimation EM can get stuck at a solution $\bf A$ which is not a local discrete minimum of the negative log likelihood function; by this we mean that there may be an estimate ${\bf A}^{'}$ that differs from the EM solution $\bf A$ in only one element and yet has a lower negative log likelihood cost. We have observed this phenomenon in EM algorithm solutions to several communications and tracking related discrete parameter estimation problems. Nonetheless, researchers have found EM to be useful for discrete parameter estimation. The point here is that care must be taken when employing EM for discrete parameter estimation, since convergence to a minimum of the cost cannot be guaranteed.


Rick Perry
2001-04-06