2D PET Image Reconstruction Using Robust L1 Estimation of the Gaussian Mixture Model

An image or volume of interest in positron emission tomography (PET) is reconstructed from pairs of gamma rays emitted from a radioactive substance. Many image reconstruction methods are based on estimation of pixels or voxels on some predefined grid. Such an approach is usually associated with limited resolution of the reconstruction, high computational complexity due to slow convergence and noisy results. This paper explores reconstruction of PET images using the underlying assumption that the originals of interest can be modeled using Gaussian mixture models. A robust segmentation method based on statistical properties of the model is presented, with an iterative algorithm resembling the expectation-maximization algorithm. Use of parametric models for image description instead of pixels circumvent some of the mentioned limitations.


I. INTRODUCTION
Positron emission tomography (PET) scanners detect the annihilation photon pairs arising from the positron emissions. Image reconstruction implies generating a two-or three-dimensional image of a radiotracer's concentration to estimate physiologic parameters for volumes of interest in vivo. In three-dimensional scanners, we can consider the tube (parallelepiped) joining any two detector elements as a volume of response (VOR). In the absence of physical effects and noise, the total number of coincidence events detected will be proportional to the total amount of tracer contained in the volume of response. In the two-dimensional case, we consider only lines of response (LORs) joining two detector elements, lying within a specified imaging plane. The data are recorded as event histograms (sinograms or projected data) or as a list of recorded photon-pair events (list-mode data). For a general overview of standard PET image reconstruction methodology, see e.g. [1], [2] or [3].
Modern image reconstruction methods are mostly based on maximum-likelihood expectation-maximization (MLEM) iterations. Maximum likelihood is used as the optimization criterion, combined with the expectation-maximization algorithm for finding its solution. To overcome the computational complexity and slow convergence of the MLEM, the ordered subsets expectation-maximization (OSEM) algorithm has been recently introduced. Since MLEM or OSEM estimation of pixels or voxels is usually noisy [1], use of post-filtering methods is necessary. On the one hand, image reconstruction from its projections is a mature research field with well known methods and proven results. On the other hand, known limitations made a challenge for a different approach presented in this paper.
In image segmentation, a number of algorithms based on model-based techniques utilizing prior knowledge have been proposed to model uncertainty (see e.g. [4], [5]). The Gaussian mixture model (GMM) is a well-known and widely used model in a variety of segmentation and classification problems ( [6], [7]).
In this paper, we propose a new robust algorithm for efficient reconstruction of an object from a PET image. Our method utilizes a novel L 1 minimizing algorithm to estimate the parameters of a Gaussian mixture model within framework similar to the standard expectation-maximization (EM) algorithm. Instead of voxels, we estimate parameters of the object's model. We focus on the two-dimensional case for clarity, but an extension to three dimensions would follow in a similar fashion. This paper is organized into six sections. Section II gives an overview of traditional methodology in 2D PET imaging. In Section III an introduction to Gaussian mixture models is presented. Section IV describes the estimation of GMM parameters from PET data. Section V shows the implementation of iterative estimates in the EM-like algorithm. Finally, experimental results with discussion conclude the paper.

II. TWO-DIMENSIONAL PET IMAGING
In the two-dimensional case, the acquired data are collected along LORs through a two-dimensional object. Traditionally, data are organized into sets of projections, integrals along all LORs for a fixed direction φ. The collection of all projections for 0 ≤ φ < 2π forms a two dimensional function of the distance of the LOR from the origin, denoted by s, and angle φ. The This work has been fully supported by Croatian Science Foundation under the project IP-2014-09-2625. line-integral transform of f (x, y) → p(s, φ) is called the X-ray transform [8], which in 2D is the same as the Radon transform [9]. Since a fixed point traces a sinusoidal path in the projection space, the superposition of all sinusoids corresponding to each point of activity in a general object is called a sinogram. This is illustrated in Fig. 1.
A classical image reconstruction method is the filtered backprojection (FBP), which is based on the projection theorem of Fourier analysis. The algorithm reconstructs the image by calculating the inverse Fourier transform of the 2D Fourier transform of the backprojected image (see e.g. [2], [8], [10]). This technique does not include or depend on any assumptions about the data. Assumptions such as form, volume and dependence could be utilized to obtain more precise estimates.

III. GAUSSIAN MIXTURE MODELS
A Gaussian mixture model [11] is a weighted sum of K component Gaussian densities as given by the equation: where x is a d-dimensional observation, τ i (i = 1, ..., K) are the weights of each Gaussian component, and g(x|µ k , Σ k ) are the Gaussian component densities. We assume that a given observation x is a realization from exactly one of the K Gaussian mixture components, and each component density is a d-variate Gaussian function of the form: with |Σ k | denoting the determinant of Σ k . The set of probabilities {τ k } such that K k=1 τ k = 1 defines the probabilities that x belongs to the corresponding Gaussian component. The complete Gaussian mixture model is parameterized by the mean vectors {µ k }, covariance matrices {Σ k } and mixture weights {τ k } for all Gaussian component densities.
Traditionally, in image segmentation GMMs are used to model values at points of observation, e.g. activity concentration in voxels [12] or pixel values [13]. Those models do not take into account the spatial correlation between observations, which can be corrected by introducing Markov random fields [12], [13].
We propose applying the GMM directly to the spatial data, focusing only on the locations and not the values at those locations.
In this scenario, the K Gaussian components represent potential points of origin for activity. The points x that are realizations from these components are latent (unobserved). Our observations are lines (LORs) through these points at random angles ψ, however using convenient properties of Gaussian distributions we will still be able to accurately estimate the parameters.

IV. ESTIMATING GAUSSIAN PARAMETERS
For clarity, in this section we focus only on one Gaussian component. Assuming we know a set of N LORs whose points of origin come from a Gaussian distribution with parameters (µ, Σ), we can estimate those parameters.
A. Estimating µ using minimal distance The mean vector µ = [µ x µ y ] T can be estimated in multiple ways. One method is to find the point in space whose total squared distance from all LORs is minimal. Clearly, the solution depends on our definition of distance.
Each LOR in two dimensions is uniquely given by its slope k = tan ψ and an intercept l (or by any two equivalent parameters) and we can write the LOR equations as In general, we can define the squared distance between d-dimensional vectors v 1 and v 2 as for some d × d weight matrix W . In particular, when W = I the distance in (4) is the Euclidian distance, and for W = Σ −1 we get the Mahalanobis distance. In our case d = 2 and a planar line is given by one equation, but similar results would follow for higher dimensions. For a given µ, the point on the i-th line that is nearest to µ, i.e. the solution to Now the point µ that is nearest to all lines is the solution of where x m i is given in (5). This can be shown to be (see Appendix A) the solution to: where Note that the zeros and null-vectors added in expressions above appear for calculation purposes (see Appendix A) and do not change the original problem. It can also be shown that, for Σ known, the estimate obtained using the Mahalanobis distance is also the maximum likelihood estimate for µ.

B. Estimating Σ using 1D projections
In the two-dimensional setting, since Σ is a symmetric matrix, we need to estimate three parameters: Σ 11 , Σ 12 and Σ 22 . Observe a single line of response, and recall that one of the parameters determining the LOR is the slope tan ψ, where ψ is the angle between the line and the x-axis. That means that by rotating the xy-coordinate system by ψ − π 2 , the LOR would be parallel to the new y-axis. This is equivalent to the rotation of the Gaussian distribution by ϕ = π 2 − ψ. This is illustrated Fig. 2. Rotation of the coordinate system, y-axis parallel to the LOR.
in Fig. 2. Since a rotated Gaussian distribution is again Gaussian, in this new coordinate system the LOR is parallel to the y-axis, and the Gaussian distribution has parameters Given the original LOR parameters, the coordinates of the point at which it intersects the new x-axis are (−l sin ϕ, 0). Since marginal distributions of a Gaussian are again Gaussian, a 1D projection onto the new x-axis is a Gaussian random variable with expectation (Rµ) 1 = cos ϕµ x − sin ϕµ y , and variance (RΣR T ) 11 = cos 2 ϕΣ 11 − 2 cos ϕ sin ϕΣ 12 + sin 2 ϕΣ 22 , i.e. we obtain the one-dimensional mean and variance simply by omitting rows and columns from their multidimensional counterparts. Therefore, each LOR gives us a one-dimensional projection whose squared (Euclidian) distance from the mean, (cos ϕµ x − sin ϕµ y + l sin ϕ) 2 , can be used to estimate the variance in (8). This gives us a system of equations: A =      cos 2 ϕ 1 −2 sin ϕ 1 cos ϕ 1 sin 2 ϕ 1 cos 2 ϕ 2 −2 sin ϕ 2 cos ϕ 2 sin 2 ϕ 2 . . . . . . . . .
Since there are (dozens of) thousands of measurements, the problem in (9) is overdetermined. It does not have an exact solution, but we can find the best approximation, i.e. the solution to Note that the solution will depend on the type of norm used in (10). Following classical methodology, the ordinary least squares (OLS) method gives the solution that minimizes the L 2 norm. This solution can be found by solving the equivalent problem As we will show in Sec. VI, OLS performs poorly when we introduce more than one component, so we will need to use alternative methods. It can be shown that in some cases L 1 minimization is preferred to the more traditional L 2 minimization because it is less sensitive, i.e. more resistant to gross and systematic errors [15]. The main argument against L 1 minimisation would be computational complexity, which can be alleviated by using iterative methods. In this paper, the solution is obtained by using the L 1 minimization algorithm proposed in [16]. We use it to solve a modified problem where k is a constant corrective scale factor. In a sufficiently large sample, one would have enough data points in each direction φ to obtain a one-dimensional variance estimate. In the absence of that, we are able to obtain a robust estimator for the parameters of the covariance matrix following the median absolute deviation (MAD) estimator of deviation σ [17], i.e.
where Φ denotes the distribution function of the standard normal random variable.

V. ITERATIVE EM-LIKE ALGORITHM
The expectation-maximization (EM) algorithm, first explained in [18], is an iterative method used to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models. The expectation (E) step creates a function for the expectation of the log-likelihood evaluated using the current estimate of parameters. The maximization (M) step then computes parameters maximizing the expected log-likelihood from the E step. Conversely, these estimates are then used in the next E step. In traditional applications of the EM algorithm to GMMs, the E step assigns each data point its membership probabilities {τ k }, i.e. the probabilities that the point belongs to each of the mixture components. In the M step, parameters of each component are estimated using the points "belonging" to that component. As already mentioned, in the PET setting observations are lines, however we can replicate the iterative steps using estimates from Sec. IV. Alternatively, this could also be considered a Lloyd-like algorithm, where the difference from the conventional Lloyd's algorithm [19] is that we allow different distance functions of the form (4). The algorithm initializes parameters arbitrarily, and then alternates between the following steps: 1) Compute class membership probabilities. For each LOR, compute the squared distance from each component mean. We distinguish between a hard classification where we assign the LOR to its nearest component, and a soft classification where membership probability is inversely proportional to the squared distance. 2) Estimate component parameters. Either from a hard or soft classification, where each LOR participates with its proportional share, parameters of each component are estimated using methodology from Sec. IV. The L 1 minimization algorithm recursively reduces and increases dimensionality of the observed subspace and uses weighted median to efficiently find the global minimum, and has shown to overperform state-of-the-art competitive methods when there are relatively few parameters to be estimated from a very high number of equations. For details, see Appendix B and [16].
Initial steps of the iterative algorithm use Euclidian distance in (4). Since later iterations improve the estimates, the distance gradually transforms into the Mahalanobis distance, i.e.
where α increases from 0 to 1. Therefore, in later iterations we obtain an MLE-like estimate of µ.

VI. RESULTS AND REMARKS
To evaluate the methodology, we experimented in the two-dimensional setting with K = 1 and K = 2 components. First, for proof of concept we show that the method in Sec. IV provides good estimates with both L 1 and L 2 minimization, for several covariance matrices with varying corresponding correlation coefficients: Since the 2D covariance matrix is of the form it is determined by three parameters -σ x , σ y and ρ(= ρ xy ). This corresponds to the three-dimensional vector x ρσ x σ y σ 2 y ] T in (9). Instead of observing true and estimated Σ matrices we will calculate the error in estimation of s for each s that corresponds to matrices above: Note that the corresponding correlation coefficients can be calculated from these vectors, and they are ρ 1 = 0, ρ 2 ≈ −0.3, ρ 3 ≈ 0.9.
For each of these types we simulated a measurement from N = 1000 and N = 10000 points and calculated the average estimate for L 1 and L 2 estimates separately. We repeated the experiment 1000. Accuracy of an estimate can be assessed in many ways, for illustrative purposes we chose relative error i.e.
where · denotes the standard Euclidian norm in both expressions. Results are given in Table I. Given that the variance of the traditional standard deviation estimator (from points) equals σ 4 n−1 , estimations within 10% from N = 1000 LORs seem acceptable, which justifies the methodology described in Sec. IV. Significantly more accurate estimation from N = 10000 LORs further confirms this. It is also notable that the accuracy of the estimate increases as |ρ|| increases from 0 to 1.
For K = 2 components we repeated the experiment as described in Sec. V for various combinations of types of vector s. We used hard classification, where we assign each line to at most one component. We experimented with various constraints, from simply assigning LORs to more likely components to assignations only when the probability of belonging is above a certain threshold.
For synthetic measurements from a variety of original (real) GMMs the algorithm proved robust regardless of the values of initial parameters, with estimation using L 1 minimization and the scaling factor k correcting the bias. The L 2 minimization method proved inefficient, since wrongly assigned LORs would cause unstable estimations and "breaks" in the algorithm. An illustration of the results for K = 2, N = 4000 is shown in Fig. 3, along with the corresponding classical FBP method. At the end, we would like to draw the readers attention to the fact that the reconstructed image is given by its parametric model: mean vectors {µ k }, covariance matrices {Σ k } and mixture weights {τ k }. It is virtually of infinite resolution, since the Gaussian components can be evaluated at each spatial point. The model is sparse: it consists from only a few parameters needed for the successful object representation. Hence, future research will be oriented to compressed sensing approach ( [7], [20], [21]) for reducing the number of projections, in this case reduced radiotracer's concentration. Due to robustness of the proposed reconstruction method, post-filtering step is not needed.

APPENDIX A
We will find the solution to (6) using classical matrix calculus techniques. First note that the solution to (5) is To accommodate this, we will denote the expression in (6) by d 2 and expand it: For simplicity, denote which we will differentiate piecewise to find the minimum. We have: Note that in all calculations we use the fact that W and, by extension, W are symmetric. By plugging these equations into d dµ and equating that with 0 to obtain the minimum, we get:

From the previous equation we now have
from which it follows that It remains to verify that this stationary point µ is also a turning point. However, since a point whose sum of squared distances from all lines is minimal must exist from a geometrical perspective, the solution in (7) is indeed the (global) minimum.

APPENDIX B
The L 1 minimization algorithm recursively reduces and increases dimensionality of the observed subspace and uses weighted median to efficiently find the global minimum. Reduction of dimensionality is achieved by extracting of parameters and inserting them into remaining equations in (11). If [A i1 A i2 A i3 ] is the i-th row of matrix A, and b i the i-th element of vector k · b, the i-th equation of the system is We choose equation j 1 from the set i = 1, ..., N and extract one of its parameters, e.g. Σ 11 : We insert it into all other equations and get a new system with only two unknown parameters: for i = j 1 . Now, we choose some other equation j 2 , j 2 = j 1 and extract one of the remaining parameters, e.g. Σ 12 : We insert Σ 12 into all other equations and get the system of N − 2 equations with only one unknown parameter where Parameter Σ 22 is calculated by minimizing L 1 norm The value of parameter Σ 22 is given by the weighted median (MED): where j 3 is an ordinal number of the concomitant equation and ♦ is the replication operator. The weighted median can be obtained using the algorithm given in [16]. The value of parameter Σ 22 is an element of set {b (1) i /A (1) i }. Chosen equations {j 1 , j 2 , j 3 } define a local minimum in 1D, a vertex.
Return to a higher dimension is achieved by putting calculated parameter value Σ 22 in (13). To further descending in L 1 cost surface, we fix equations j 1 and j 3 , and try to find new j 4 using (14)- (16). If j 4 = j 2 , new vertex is defined by {j 1 , j 2 , j 3 } = {j 1 , j 3 , j 4 } and we repeat the same procedure. If j 4 = j 2 , we conclude that the vertex {j 1 , j 2 , j 3 } is a local minimum in observed 2D subspace, thus we return to 3D by fixing j 2 and j 3 and try to find new j 4 instead of j 1 . We repeat the previous procedure until we cannot find new equation. Since the L 1 cost surface is convex, the global minimum is reached. Parameters Σ 11 , Σ 12 and Σ 22 are given by (12), (13) and (16).