1 Introduction
In positron emission tomography (PET), image reconstruction means generating an image of a radiotracer’s concentration in order to estimate physiologic parameters for objects (volumes) of interest. Pairs of photons arise from positron–electron annihilation events, and crystals placed in the scanner detect these pairs. When two crystals are activated at the same time, the device records a coincidence event. Each possible pair of crystals is connected by a line or volume of response (LOR, VOR), depending on whether the scan is 2D or 3D. Assuming there are no contaminating physical effects or noise, the total number of coincidence events detected along a given LOR (or VOR) is proportional to the total amount of tracer contained in that tube of response. In the 2D case, we observe events along lines of response (LORs) joining two detector elements, lying within a specified imaging plane. The data are recorded as event histograms (sinograms or projection data) or as a list of recorded photon-pair events (list-mode data). Classic PET image reconstruction methodology is explained in more detail in, e.g. Tong et al. (2010), Alessio and Kinahan (2006) or Reader and Zaidi (2007).
Modern image reconstruction methods are mostly based on maximum-likelihood expectation-maximization (MLEM) iterations. Maximum likelihood is used as the optimization criterion, combined with the expectation-maximization algorithm for finding its solution. To overcome the computational complexity and slow convergence of MLEM, the ordered subsets expectation-maximization (OSEM) algorithm has been introduced (Hudson and Larkin, 1994). Since MLEM or OSEM estimation of pixels or voxels is usually noisy (Tong et al., 2010), post-filtering methods are necessary. On the one hand, reconstruction of an image from its projections is a mature research field with well-known methods and proven results. On the other hand, its known limitations motivated the different approach presented in this paper.
In image segmentation, a number of model-based techniques utilizing prior knowledge have been proposed to model uncertainty, cf. Zhang et al. (2001, 1994). The Gaussian mixture model (GMM) is well known and widely used in a variety of segmentation and classification problems (Friedman and Russell, 1997; Ralašić et al., 2018), as many observed quantities exhibit behaviour congruent with the model. A good overview of the application of GMMs and their generalizations to problems in image classification, image annotation and image retrieval can be found, e.g. in Tian (2018). However, in those problems the sample is drawn from the GMM itself, whereas in our case the observed data are lower-dimensional (i.e. lines) and the points of origin are unknown.
In this paper, we propose a new robust method for reconstructing an object from simulated PET data. The challenge of estimating Gaussian parameters from lower-dimensional data is solved by utilizing a novel ${L_{1}}$-minimization algorithm. With it, our framework becomes similar to the standard Expectation-Maximization (EM) algorithm. Instead of values in individual pixels or voxels, we estimate the parameters of the object’s model. This enables us to model one or more objects of interest, each as a mixture of one or more components, assuming that the entire object is a source of radiation with varying intensity. We focus on the two-dimensional case for clarity, but an extension to three dimensions would follow in a similar fashion. To the best of our knowledge, this is the first attempt to use parametric models in PET image reconstruction in this way. Parameters of Gaussian mixture models have not previously been estimated from lower-dimensional data due to computational complexity, which is circumvented here by an algorithm that efficiently finds global minima. This paper is organized into six sections. Section 2 gives an overview of traditional methodology in 2D PET imaging. Section 3 describes the estimation of Gaussian mixture model parameters from PET data. Section 4 shows the implementation of iterative estimates in the EM-like algorithm. Finally, experimental results in Section 5 and a concluding discussion close the paper.
2 Two-Dimensional PET Imaging
Data are collected along lines lying within a specific imaging plane. Traditionally, data are organized into sets of projections, integrals along all lines (LORs) for a fixed direction ϕ. The collection of all projections for $0\leqslant \phi <2\pi $ forms a two-dimensional function of the distance of the LOR from the origin, denoted by s, and the angle ϕ. The line-integral transform $f(x,y)\to p(s,\phi )$ is called the X-ray transform (Natterer, 1986), which in 2D is the same as the Radon transform (Helgason, 1999). Any fixed point of activity traces a path in the projection space that resembles a sinusoidal curve. This is traditionally represented by a sinogram, a graph where these paths are drawn for all points of activity simultaneously. Figure 1 illustrates one synthetic measurement and the corresponding sinogram. Classical reconstruction methods, most notably the filtered backprojection (FBP), rely on the integral transform, cf. Natterer (1986), O’Sullivan and Pawitan (1993), Alessio and Kinahan (2006). They neither include nor depend on any assumptions about the data; assumptions such as shape, volume and dependence could be utilized to obtain more precise estimates.
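To make the classical pipeline concrete, the following sketch is a minimal illustration of the sinogram and FBP reconstruction of a synthetic activity map; it assumes NumPy and scikit-image are available, and the phantom, grid size and angle set are arbitrary choices rather than the settings used in our experiments.

```python
import numpy as np
from skimage.transform import radon, iradon  # assumes scikit-image is installed

# Synthetic 2D activity map: a single Gaussian "object" on a 128 x 128 grid.
xx, yy = np.meshgrid(np.linspace(-1, 1, 128), np.linspace(-1, 1, 128))
image = np.exp(-((xx - 0.2) ** 2 + (yy + 0.1) ** 2) / (2 * 0.15 ** 2))

theta = np.linspace(0.0, 180.0, 180, endpoint=False)  # projection angles (degrees)
sinogram = radon(image, theta=theta, circle=False)    # p(s, phi): one column per angle
fbp = iradon(sinogram, theta=theta, circle=False)     # classical filtered backprojection
```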
3 Estimating the Gaussian Mixture Model
A Gaussian mixture model (Reynolds, 2015) is a weighted sum of K component Gaussian densities as given by the equation:
(1)
\[ p(\boldsymbol{x}|{\tau _{k}},{\boldsymbol{\mu }_{k}},{\boldsymbol{\Sigma }_{k}})={\sum \limits_{k=1}^{K}}{\tau _{k}}\hspace{2.5pt}f(\boldsymbol{x}|{\boldsymbol{\mu }_{k}},{\boldsymbol{\Sigma }_{k}}),\]
where $\boldsymbol{x}$ is a d-dimensional observation, ${\tau _{k}}$ $(k=1,\dots ,K)$ are the weights of the Gaussian components, and $f(\boldsymbol{x}|{\boldsymbol{\mu }_{k}},{\boldsymbol{\Sigma }_{k}})$ are the Gaussian component densities. We assume that a given observation $\boldsymbol{x}$ is a realization from exactly one of the K Gaussian mixture components, and each component density is a d-variate Gaussian function of the form:
(2)
\[ f(\boldsymbol{x}|{\boldsymbol{\mu }_{k}},{\boldsymbol{\Sigma }_{k}})=\frac{1}{\sqrt{{(2\pi )^{d}}|{\boldsymbol{\Sigma }_{k}}|}}\exp \bigg(-\frac{1}{2}{(\boldsymbol{x}-{\boldsymbol{\mu }_{k}})^{T}}{\boldsymbol{\Sigma }_{k}^{-1}}(\boldsymbol{x}-{\boldsymbol{\mu }_{k}})\bigg).\]
The set of probabilities $\{{\tau _{k}}\}$ such that ${\textstyle\sum _{k=1}^{K}}{\tau _{k}}=1$ defines the probabilities that $\boldsymbol{x}$ belongs to the corresponding Gaussian component. The complete Gaussian mixture model is parameterized by the mean vectors $\{{\boldsymbol{\mu }_{k}}\}$, covariance matrices $\{{\boldsymbol{\Sigma }_{k}}\}$ and mixture weights $\{{\tau _{k}}\}$ for all Gaussian component densities.
Traditionally in this setting, GMMs are used to model activity concentration in voxels (Layer et al., 2015) or pixel values (Nguyen and Wu, 2013). Those models do not take into account the spatial correlation between observations, which can be corrected by introducing Markov random fields (Layer et al., 2015; Nguyen and Wu, 2013). In other image segmentation problems, this can be addressed by modelling the mixture weights as probability vectors, thereby creating a spatially varying finite mixture model (SVFMM) (Xiong and Huang, 2015).
Here, we apply the GMM directly to the spatial data, focusing only on the locations and not on the values at those locations. In other words, the location of the tracer, i.e. of the particles that originate the gamma rays, is represented by the K Gaussian components. The points $\boldsymbol{x}$ that are realizations from these components are latent (unobserved). Our observations, i.e. events, are lines through these points at random angles φ; however, using convenient properties of Gaussian distributions we will still be able to accurately estimate the parameters.
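The following sketch makes this data model concrete for a single component: latent points of origin are drawn from a 2D Gaussian, and each recorded event is the line through such a point at a uniformly random angle. The helper name simulate_events and the parameter values are illustrative assumptions, not part of our experimental setup; only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_events(mu, Sigma, n_events):
    """Simulate PET-like events: latent points of origin are drawn from a 2D
    Gaussian, and each event is the line through that point at a random angle
    phi, parameterized by slope k = tan(phi) and intercept l."""
    pts = rng.multivariate_normal(mu, Sigma, size=n_events)
    phi = rng.uniform(0.0, np.pi, size=n_events)   # random LOR orientation
    k = np.tan(phi)                                # slope (numerically unstable near pi/2)
    l = pts[:, 1] - k * pts[:, 0]                  # intercept from y = k*x + l
    return phi, k, l

phi, k, l = simulate_events(mu=[0.5, -0.2],
                            Sigma=[[0.05, 0.0], [0.0, 0.05]],
                            n_events=1000)
```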
As mentioned earlier, Gaussian mixture models have naturally appeared in many signal processing (Yu and Sapiro, 2011; Léger et al., 2011) and biometrics problems (Reynolds, 2015; Reynolds et al., 2000). Due to the flexibility of mixtures, GMMs are convenient and effective for modelling various types of signals, as well as for image inverse problems and missing/noisy data problems.
Note that in our simulations we assume that K, the number of mixture components, is known. In practical applications it should also be estimated, which is a further (and not insignificant) problem. For mixture components that are sufficiently separated in space, such as in Fig. 1, the sinogram itself could give insight into the most adequate number of components, otherwise ideas suggested in, e.g. Leroux (1992), Chen (1995), Cheng (2005), should be explored.
For clarity, in this section we focus only on one Gaussian component. Assuming we know a set of N events whose points of origin come from a Gaussian distribution with parameters $(\boldsymbol{\mu },\boldsymbol{\Sigma })$, we can estimate those parameters.
3.1 Estimation of μ Using Minimal Distance
Each event in two dimensions is uniquely given by its slope $k=\tan \varphi $ and an intercept l (or by any two equivalent parameters) and we can write the event equations as
(3)
\[ {\boldsymbol{a}_{i}^{T}}\boldsymbol{x}+{l_{i}}=0,\hspace{2em}{\boldsymbol{a}_{i}}={[\tan {\varphi _{i}}\hspace{2.5pt}-1]^{T}},\hspace{1em}i=1,\dots ,N.\]
The mean vector $\boldsymbol{\mu }={[{\mu _{x}}\hspace{2.5pt}{\mu _{y}}]^{T}}$ can be estimated in multiple ways. One method is to find the point in space whose total squared distance from all events is minimal. Clearly, the solution depends on our definition of distance. Formally, one would prefer to find the point whose total distance (not squared) is minimal, but that presents a nonlinear problem which is significantly more difficult to solve, and we will show that the linear problem provides a satisfying solution.
In general, the squared distance between d-dimensional vectors is defined as
(4)
\[ {d^{2}}({\boldsymbol{v}_{1}},{\boldsymbol{v}_{2}})={({\boldsymbol{v}_{1}}-{\boldsymbol{v}_{2}})^{T}}\boldsymbol{W}({\boldsymbol{v}_{1}}-{\boldsymbol{v}_{2}}),\]
for some $d\times d$ weight matrix $\boldsymbol{W}$. In particular, when $\boldsymbol{W}=\boldsymbol{I}$ the distance in (4) is Euclidean, and $\boldsymbol{W}={\boldsymbol{\Sigma }^{-1}}$ gives the Mahalanobis distance. In our case, $d=2$ and a planar line is given by one equation, but similar results would follow for higher dimensions.
First, note that for a fixed $\boldsymbol{\mu }$ the solution to
\[ \min {d^{2}}(\boldsymbol{x},\boldsymbol{\mu })\hspace{1em}\text{such that}\hspace{2.5pt}{\boldsymbol{a}_{i}^{T}}\boldsymbol{x}+{l_{i}}=0,\]
is (Golub and Van Loan, 2012) the solution to
(5)
\[ \left[\begin{array}{c@{\hskip4.0pt}c}2\boldsymbol{W}& {\boldsymbol{a}_{i}}\\ {} {\boldsymbol{a}_{i}^{T}}& 0\end{array}\right]\cdot \left[\begin{array}{c}{\boldsymbol{x}_{i}^{\mu }}\\ {} \lambda \end{array}\right]=\left[\begin{array}{c}2\boldsymbol{W}\boldsymbol{\mu }\\ {} -{l_{i}}\end{array}\right].\]
Now the optimal $\boldsymbol{\mu }$ is the one that minimizes the expression
(6)
\[ \min {\sum \limits_{i=1}^{N}}{\big({\boldsymbol{x}_{i}^{\mu }}-\boldsymbol{\mu }\big)^{T}}\boldsymbol{W}\big({\boldsymbol{x}_{i}^{\mu }}-\boldsymbol{\mu }\big),\]
where ${\boldsymbol{x}_{i}^{\mu }}$ is determined by (5). To solve this, first note that the solution to (5) is
\[ \left[\begin{array}{c}{\boldsymbol{x}_{i}^{\mu }}\\ {} \lambda \end{array}\right]={\left[\begin{array}{c@{\hskip4.0pt}c}2\boldsymbol{W}& {\boldsymbol{a}_{i}}\\ {} {\boldsymbol{a}_{i}^{T}}& 0\end{array}\right]^{-1}}\cdot \left[\begin{array}{c}2\boldsymbol{W}\boldsymbol{\mu }\\ {} -{l_{i}}\end{array}\right].\]
To accommodate this, we will denote the expression in (6) by ${d^{2}}$ and expand it:
\[\begin{aligned}{}{d^{2}}& ={\sum \limits_{i=1}^{N}}{\big({\boldsymbol{x}_{i}^{\mu }}-\boldsymbol{\mu }\big)^{T}}\boldsymbol{W}\big({\boldsymbol{x}_{i}^{\mu }}-\boldsymbol{\mu }\big)\\ {} & ={\sum \limits_{i=1}^{N}}{\left(\left[\begin{array}{c}{\boldsymbol{x}_{i}^{\mu }}\\ {} \lambda \end{array}\right]-\left[\begin{array}{c}\boldsymbol{\mu }\\ {} \lambda \end{array}\right]\right)^{T}}\cdot \left[\begin{array}{c@{\hskip4.0pt}c}\boldsymbol{W}& \mathbf{0}\\ {} \mathbf{0}& 0\end{array}\right]\cdot \left(\left[\begin{array}{c}{\boldsymbol{x}_{i}^{\mu }}\\ {} \lambda \end{array}\right]-\left[\begin{array}{c}\boldsymbol{\mu }\\ {} \lambda \end{array}\right]\right).\end{aligned}\]
For simplicity, denote
\[ {\boldsymbol{x}_{i\lambda }}=\left[\begin{array}{c}{\boldsymbol{x}_{i}^{\mu }}\\ {} \lambda \end{array}\right],\hspace{2em}{\boldsymbol{\mu }_{\lambda }}=\left[\begin{array}{c}\boldsymbol{\mu }\\ {} \lambda \end{array}\right]\hspace{1em}\text{and}\hspace{1em}\widetilde{\boldsymbol{W}}=\left[\begin{array}{c@{\hskip4.0pt}c}\boldsymbol{W}& \mathbf{0}\\ {} \mathbf{0}& 0\end{array}\right].\]
Now ${d^{2}}$ equals
\[ {\sum \limits_{i=1}^{N}}\big({\boldsymbol{x}_{i\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{x}_{i\lambda }}-{\boldsymbol{x}_{i\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{\mu }_{\lambda }}-{\boldsymbol{\mu }_{\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{x}_{i\lambda }}+{\boldsymbol{\mu }_{\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{\mu }_{\lambda }}\big),\]
which we will differentiate piecewise to find the minimum. We have:
1)\[ \frac{d}{d\boldsymbol{\mu }}{\boldsymbol{x}_{i\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{x}_{i\lambda }}=2{\left[\begin{array}{c}2\boldsymbol{W}\boldsymbol{\mu }\\ {} -{l_{i}}\end{array}\right]^{T}}{\boldsymbol{B}_{i}}{\left[\begin{array}{c@{\hskip4.0pt}c}2\boldsymbol{W}& {\boldsymbol{a}_{i}}\\ {} {\boldsymbol{a}_{i}^{T}}& 0\end{array}\right]^{-1}}\left[\begin{array}{c}2\boldsymbol{W}\\ {} \mathbf{0}\end{array}\right],\]where
\[ {\boldsymbol{B}_{i}}={\left[\begin{array}{c@{\hskip4.0pt}c}2\boldsymbol{W}& {\boldsymbol{a}_{i}}\\ {} {\boldsymbol{a}_{i}^{T}}& 0\end{array}\right]^{-1}}\widetilde{\boldsymbol{W}};\]
2)\[ \frac{d}{d\boldsymbol{\mu }}{\boldsymbol{x}_{i\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{\mu }_{\lambda }}=\frac{d}{d\boldsymbol{\mu }}{\boldsymbol{\mu }_{\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{x}_{i\lambda }}={\boldsymbol{\mu }_{\lambda }^{T}}{\boldsymbol{B}_{i}^{T}}\left[\begin{array}{c}2\boldsymbol{W}\\ {} \mathbf{0}\end{array}\right]+{\left[\begin{array}{c}2\boldsymbol{W}\boldsymbol{\mu }\\ {} -{l_{i}}\end{array}\right]^{T}}{\boldsymbol{B}_{i}}\left[\begin{array}{c}\boldsymbol{I}\\ {} \mathbf{0}\end{array}\right],\]since both cross terms are the same scalar;
3)\[ \frac{d}{d\boldsymbol{\mu }}{\boldsymbol{\mu }_{\lambda }^{T}}\widetilde{\boldsymbol{W}}{\boldsymbol{\mu }_{\lambda }}=2{\boldsymbol{\mu }_{\lambda }^{T}}\widetilde{\boldsymbol{W}}\left[\begin{array}{c}\boldsymbol{I}\\ {} \mathbf{0}\end{array}\right].\]
Note that in all calculations we use the fact that $\boldsymbol{W}$ and, by extension, $\widetilde{\boldsymbol{W}}$ are symmetric. By plugging these derivatives into $\frac{d}{d\boldsymbol{\mu }}{d^{2}}$ and equating it with 0 to obtain the minimum, we get:
\[\begin{aligned}{}\frac{d}{d\boldsymbol{\mu }}{d^{2}}& ={\sum \limits_{i=1}^{N}}\left({\left[\begin{array}{c}2\boldsymbol{W}\boldsymbol{\mu }\\ {} -{l_{i}}\end{array}\right]^{T}}\cdot {\boldsymbol{B}_{i}}\left({\left[\begin{array}{c@{\hskip4.0pt}c}2W& {\boldsymbol{a}_{i}}\\ {} {\boldsymbol{a}_{i}^{T}}& 0\end{array}\right]^{-1}}\left[\begin{array}{c}2\boldsymbol{W}\\ {} \mathbf{0}\end{array}\right]-\left[\begin{array}{c}\boldsymbol{I}\\ {} \mathbf{0}\end{array}\right]\right)-\right.\\ {} & \hspace{1em}-\left.{\boldsymbol{\mu }_{\lambda }^{T}}\left({\boldsymbol{B}_{i}^{T}}\left[\begin{array}{c}2\boldsymbol{W}\\ {} \mathbf{0}\end{array}\right]-\widetilde{\boldsymbol{W}}\left[\begin{array}{c}\boldsymbol{I}\\ {} \mathbf{0}\end{array}\right]\right)\right)=0.\end{aligned}\]
Define
\[\begin{aligned}{}& \left[\begin{array}{c}{\boldsymbol{M}_{i}}\\ {} {\boldsymbol{m}_{i}}\end{array}\right]={\boldsymbol{B}_{i}}\left({\left[\begin{array}{c@{\hskip4.0pt}c}2\boldsymbol{W}& {\boldsymbol{a}_{i}}\\ {} {\boldsymbol{a}_{i}^{T}}& 0\end{array}\right]^{-1}}\left[\begin{array}{c}2\boldsymbol{W}\\ {} \mathbf{0}\end{array}\right]-\left[\begin{array}{c@{\hskip4.0pt}c}1& 0\\ {} 0& 1\\ {} 0& 0\end{array}\right]\right),\\ {} & \left[\begin{array}{c}{\boldsymbol{N}_{i}}\\ {} \mathbf{0}\end{array}\right]={\boldsymbol{B}_{i}^{T}}\left[\begin{array}{c}2\boldsymbol{W}\\ {} \mathbf{0}\end{array}\right]-\widetilde{\boldsymbol{W}}\left[\begin{array}{c@{\hskip4.0pt}c}1& 0\\ {} 0& 1\\ {} 0& 0\end{array}\right].\end{aligned}\]
From the previous equation we now have
\[\begin{aligned}{}& {\sum \limits_{i=1}^{N}}\left(\big[2{\boldsymbol{\mu }^{T}}{\boldsymbol{W}^{T}},-{l_{i}}\big]\cdot \left[\begin{array}{c}{\boldsymbol{M}_{i}}\\ {} {\boldsymbol{m}_{i}}\end{array}\right]-\big[{\boldsymbol{\mu }^{T}},{\lambda _{i}}\big]\cdot \left[\begin{array}{c}{\boldsymbol{N}_{i}}\\ {} \mathbf{0}\end{array}\right]\right)=0,\\ {} & {\sum \limits_{i=1}^{N}}\big(2{\boldsymbol{\mu }^{T}}{\boldsymbol{W}^{T}}{\boldsymbol{M}_{i}}-{l_{i}}{\boldsymbol{m}_{i}}-{\boldsymbol{\mu }^{T}}{\boldsymbol{N}_{i}}\big)=0,\\ {} & {\boldsymbol{\mu }^{T}}{\sum \limits_{i=1}^{N}}\big(2{\boldsymbol{W}^{T}}{\boldsymbol{M}_{i}}-{\boldsymbol{N}_{i}}\big)={\sum \limits_{i=1}^{N}}{l_{i}}{\boldsymbol{m}_{i}},\end{aligned}\]
from which it follows that
(7)
\[ {\boldsymbol{\mu }^{T}}=\Bigg({\sum \limits_{i=1}^{N}}{l_{i}}{\boldsymbol{m}_{i}}\Bigg)\cdot {\Bigg({\sum \limits_{i=1}^{N}}\big(2{\boldsymbol{W}^{T}}{\boldsymbol{M}_{i}}-{\boldsymbol{N}_{i}}\big)\Bigg)^{-1}}.\]
It remains to verify that this stationary point $\boldsymbol{\mu }$ is also a turning point. However, since a point whose sum of squared distances from all lines is minimal must exist from a geometrical perspective, the solution in (7) is indeed the (global) minimum.
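As an illustration, the minimizer in (7) can also be computed through an equivalent weighted least-squares closed form, since the squared W-distance from a point $\boldsymbol{x}$ to the line ${\boldsymbol{a}_{i}^{T}}\boldsymbol{x}+{l_{i}}=0$ equals ${({\boldsymbol{a}_{i}^{T}}\boldsymbol{x}+{l_{i}})^{2}}/({\boldsymbol{a}_{i}^{T}}{\boldsymbol{W}^{-1}}{\boldsymbol{a}_{i}})$. The sketch below is a minimal implementation of that form, not of the matrix recursion above; the helper name estimate_mu is hypothetical, NumPy is assumed, and the variables phi, l come from the earlier simulation sketch.

```python
import numpy as np

def estimate_mu(phi, l, W=np.eye(2)):
    """Estimate the component mean as the point minimizing the sum of squared
    W-weighted distances to all event lines a_i^T x + l_i = 0,
    with a_i = [tan(phi_i), -1]; equivalent to the minimizer in (7)."""
    a = np.stack([np.tan(phi), -np.ones_like(phi)], axis=1)   # line normals, shape (N, 2)
    W_inv = np.linalg.inv(W)
    w = 1.0 / np.einsum('ni,ij,nj->n', a, W_inv, a)           # per-line weights 1 / (a^T W^{-1} a)
    A = np.einsum('n,ni,nj->ij', w, a, a)                     # sum_i w_i a_i a_i^T
    b = -np.einsum('n,n,ni->i', w, l, a)                      # -sum_i w_i l_i a_i
    return np.linalg.solve(A, b)

mu_hat = estimate_mu(phi, l)   # Euclidean distance (W = I), as in early iterations
```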
3.2 Estimation of Σ Using 1D Projections
In the two-dimensional setting, since Σ is a symmetric matrix,
\[ \boldsymbol{\Sigma }=\left[\begin{array}{c@{\hskip4.0pt}c}{\Sigma _{11}}& {\Sigma _{12}}\\ {} {\Sigma _{12}}& {\Sigma _{22}}\end{array}\right],\]
we need to estimate three parameters: ${\Sigma _{11}}$, ${\Sigma _{12}}$ and ${\Sigma _{22}}$.
Figure 2 depicts a single event and its corresponding LOR. Recall that the line is determined by two parameters, one of which is the slope $\tan \varphi $, where φ is the angle between the line and the x-axis. If we rotate the coordinate system by $\varphi -\frac{\pi }{2}$, in the new coordinate system the y-axis is parallel to the event. If we represent the Gaussian distribution as in the figure, in this new setup it is rotated by $\psi =\frac{\pi }{2}-\varphi $. Since a rotated Gaussian distribution is again Gaussian, in this new coordinate system the event is parallel to the y-axis, and the Gaussian distribution has parameters
(8)
\[ {\boldsymbol{\mu }^{\prime }}={\boldsymbol{R}_{\psi }}\boldsymbol{\mu },\hspace{2em}{\boldsymbol{\Sigma }^{\prime }}={\boldsymbol{R}_{\psi }}\boldsymbol{\Sigma }{\boldsymbol{R}_{\psi }^{T}},\hspace{1em}\text{where}\hspace{2.5pt}{\boldsymbol{R}_{\psi }}=\left[\begin{array}{c@{\hskip4.0pt}c}\cos \psi & -\sin \psi \\ {} \sin \psi & \cos \psi \end{array}\right].\]
Given the original event parameters, the coordinates of the intersection of the event and the new x-axis are $(-l\sin \psi ,0)$. The projection of the Gaussian distribution onto the new x-axis remains Gaussian, and the parameters of this one-dimensional distribution are components of the original two-dimensional Gaussian. From (8) we see that the expectation is given by
\[ \cos \psi \hspace{2.5pt}{\mu _{x}}-\sin \psi \hspace{2.5pt}{\mu _{y}},\]
and the variance is
(9)
\[ {\cos ^{2}}\psi \hspace{2.5pt}{\Sigma _{11}}-2\sin \psi \cos \psi \hspace{2.5pt}{\Sigma _{12}}+{\sin ^{2}}\psi \hspace{2.5pt}{\Sigma _{22}}.\]
We can use these one-dimensional projections and their squared (Euclidean) distance from the mean to estimate the variance in (9). This squared distance is equal to ${(\cos \psi \hspace{2.5pt}{\mu _{x}}-\sin \psi \hspace{2.5pt}{\mu _{y}}+l\sin \psi )^{2}}$, which gives us a system of equations:
(10)
\[\begin{aligned}{}& \boldsymbol{A}\boldsymbol{s}=\boldsymbol{b},\hspace{1em}\text{where}\hspace{2.5pt}\boldsymbol{s}=\left[\begin{array}{c}{\Sigma _{11}}\\ {} {\Sigma _{12}}\\ {} {\Sigma _{22}}\end{array}\right],\\ {} & \boldsymbol{A}=\left[\begin{array}{c@{\hskip4.0pt}c@{\hskip4.0pt}c}{\cos ^{2}}{\psi _{1}}& -2\sin {\psi _{1}}\cos {\psi _{1}}& {\sin ^{2}}{\psi _{1}}\\ {} {\cos ^{2}}{\psi _{2}}& -2\sin {\psi _{2}}\cos {\psi _{2}}& {\sin ^{2}}{\psi _{2}}\\ {} \vdots & \vdots & \vdots \\ {} {\cos ^{2}}{\psi _{N}}& -2\sin {\psi _{N}}\cos {\psi _{N}}& {\sin ^{2}}{\psi _{N}}\end{array}\right],\\ {} & \text{and}\hspace{2.5pt}\boldsymbol{b}=\left[\begin{array}{c}{(\cos {\psi _{1}}{\mu _{x}}-\sin {\psi _{1}}{\mu _{y}}+{l_{1}}\sin {\psi _{1}})^{2}}\\ {} \vdots \\ {} {(\cos {\psi _{N}}{\mu _{x}}-\sin {\psi _{N}}{\mu _{y}}+{l_{N}}\sin {\psi _{N}})^{2}}\end{array}\right].\end{aligned}\]
This problem is clearly overdetermined, since there are (dozens of) thousands of measurements. There is no exact solution, and the best approximation is found by solving
(11)
\[ \underset{\boldsymbol{s}}{\min }\| \boldsymbol{A}\boldsymbol{s}-\boldsymbol{b}\| .\]
Note that the solution will depend on the type of norm used in (11). Following classical methodology, the ordinary least squares (OLS) method gives the solution that minimizes the ${L_{2}}$ norm. This solution can be found by solving the equivalent problem
\[ {\boldsymbol{A}^{\boldsymbol{T}}}\boldsymbol{A}\boldsymbol{s}={\boldsymbol{A}^{\boldsymbol{T}}}\cdot \boldsymbol{b},\hspace{1em}\text{i.e.}\hspace{2.5pt}\boldsymbol{s}={\big({\boldsymbol{A}^{\boldsymbol{T}}}\boldsymbol{A}\big)^{-1}}{\boldsymbol{A}^{\boldsymbol{T}}}\cdot \boldsymbol{b}.\]
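For illustration, a minimal sketch of assembling the system (10) and computing its OLS solution follows; it assumes NumPy and reuses the hypothetical variables phi, l and mu_hat from the earlier sketches.

```python
import numpy as np

def build_system(phi, l, mu):
    """Assemble the overdetermined system (10): one equation per event,
    relating the projected 1D variance to the squared distance of the
    projected event from the projected mean."""
    psi = np.pi / 2 - phi
    c, s = np.cos(psi), np.sin(psi)
    A = np.stack([c**2, -2.0 * s * c, s**2], axis=1)   # coefficients of [S11, S12, S22]
    b = (c * mu[0] - s * mu[1] + l * s) ** 2           # squared projected distances
    return A, b

A, b = build_system(phi, l, mu_hat)
s_ols, *_ = np.linalg.lstsq(A, b, rcond=None)          # L2 (OLS) solution of (11)
Sigma_ols = np.array([[s_ols[0], s_ols[1]], [s_ols[1], s_ols[2]]])
```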
As we will show in Section 5, OLS performs poorly when we introduce more than one component, so we will need to use alternative methods. ${L_{1}}$ minimization is a popular approach in engineering problems because it gives sparse solutions, but it has also been shown to be less sensitive to outliers than the more traditional ${L_{2}}$ minimization (Bektaş and Şişman, 2010). The main argument against ${L_{1}}$ minimization would be its computational complexity, which can be alleviated by using iterative methods. In this paper, the ${L_{1}}$ minimization algorithm proposed in Sović Kržić and Seršić (2018) is used to solve a modified problem
(12)
\[ \underset{\boldsymbol{s}}{\min }\| \boldsymbol{A}\boldsymbol{s}-k\boldsymbol{b}{\| _{1}},\]
where k is a constant corrective scale factor. In a sufficiently large sample, one would have enough data points in each direction ϕ to obtain a one-dimensional variance estimate. In the absence of that, we are able to obtain a robust estimator for the parameters of the covariance matrix following the median absolute deviation (MAD) estimator of deviation σ (Rousseeuw and Croux, 1993), i.e.
\[ \hat{\sigma }=\frac{\operatorname{MAD}}{{\Phi ^{-1}}(3/4)}\approx 1.4826\operatorname{MAD},\]
where Φ denotes the distribution function of the standard normal random variable.
The ${L_{1}}$ minimization algorithm recursively reduces and increases the dimensionality of the observed subspace and uses the weighted median to efficiently find the global minimum; it has been shown to outperform state-of-the-art competing methods when relatively few parameters are to be estimated from a very high number of equations. For details, see the Appendix and Sović Kržić and Seršić (2018).
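The dedicated weighted-median-based algorithm of Sović Kržić and Seršić (2018) is not reproduced here. As a generic stand-in, an ${L_{1}}$ (least absolute deviations) fit of the same system can be obtained with a standard linear-programming formulation; the sketch below assumes SciPy, uses a placeholder value for the corrective factor k, and is only an illustration of the minimization problem, not of the cited algorithm.

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(A, b):
    """Generic L1 (least absolute deviations) fit of A s ~ b via linear
    programming: minimize sum(t) subject to -t <= A s - b <= t."""
    n, p = A.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])           # objective: sum of residual bounds t
    A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])    # A s - t <= b  and  -A s - t <= -b
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * p + [(0, None)] * n           # s free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

k_corr = 1.0                   # placeholder for the corrective scale factor k
s_l1 = l1_fit(A, k_corr * b)   # robust estimate of [S11, S12, S22]
```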
4 EM-Like Algorithm
In many statistical models, maximum likelihood parameter estimation cannot be performed directly, since most likelihood equations do not have an explicit closed-form solution. Several iterative methods have been developed to combat this, most notably the expectation-maximization (EM) algorithm. It had been proposed and used in many different circumstances before its proper introduction in Dempster et al. (1977). The name comes from its two-step setup, namely the expectation (E) and maximization (M) steps, which alternate until an acceptable solution is found. In the context of GMMs, and generally in emission tomography (Shepp and Vardi, 1982), the issue is that the true mixture membership is unknown (latent). For given mixture parameters, the E step estimates the membership probabilities $\{{\tau _{k}}\}$ in (1)–(2) for each data point. The M step then (re)estimates the mixture parameters from the points assigned to each mixture component with their corresponding probabilities. As already mentioned, our observations are lines and the true data points are latent; however, we can replicate the iterative steps using the estimates from Section 3. Alternatively, this could also be considered a Lloyd-like algorithm, where the difference from the conventional Lloyd’s algorithm (Lloyd, 1982) is that we allow different distance functions of the form (4). The algorithm initializes parameters arbitrarily, and then alternates between the following steps:
1. Compute class membership probabilities. For each event, compute the squared distance from each component mean. We distinguish between a hard classification, where we assign the event to its nearest component, and a soft classification, where membership probability is inversely proportional to the squared distance.
2. Estimate component parameters. Either from a hard or a soft classification, where each event participates with its proportional share, the parameters of each component are estimated using the methodology from Section 3.
This approach allows for different variants, depending on the type of distance used (Tafro and Seršić, 2019).
In this paper, initial steps of the iterative algorithm use the Euclidean distance in (4). Since later iterations improve the estimates, the distance gradually transforms into the Mahalanobis distance, i.e.
(13)
\[ \boldsymbol{W}=(1-\alpha )\boldsymbol{I}+\alpha {\widehat{\boldsymbol{\Sigma }}^{-1}},\]
where α increases from 0 to 1. It is known that, for Σ given, the estimate obtained using the Mahalanobis distance is also the maximum likelihood estimate for $\boldsymbol{\mu }$. Therefore, in later iterations we obtain an MLE-like estimate of $\boldsymbol{\mu }$.
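Putting the pieces together, the following is a minimal sketch of the EM-like loop with hard classification, reusing the hypothetical helpers estimate_mu, build_system and l1_fit from the earlier sketches; initialization strategy, component weights and handling of empty or degenerate assignments are omitted for brevity.

```python
import numpy as np

def em_like(phi, l, K, n_iter=10):
    """EM-like loop: alternate hard assignment of events to the nearest
    component (step 1) and re-estimation of component parameters (step 2),
    with the distance weight W blending from Euclidean to Mahalanobis (13)."""
    rng = np.random.default_rng(1)
    mus = [rng.normal(size=2) for _ in range(K)]      # arbitrary initialization
    Sigmas = [np.eye(2) for _ in range(K)]
    for it in range(n_iter):
        alpha = it / max(n_iter - 1, 1)               # alpha goes from 0 to 1
        Ws = [(1 - alpha) * np.eye(2) + alpha * np.linalg.inv(S) for S in Sigmas]
        # Step 1: squared W-distance of every event line to every component mean.
        a = np.stack([np.tan(phi), -np.ones_like(phi)], axis=1)
        d2 = np.stack([(a @ m + l) ** 2 / np.einsum('ni,ij,nj->n', a, np.linalg.inv(W), a)
                       for m, W in zip(mus, Ws)], axis=1)
        labels = np.argmin(d2, axis=1)                # hard classification
        # Step 2: re-estimate each component from its assigned events.
        for kk in range(K):
            idx = labels == kk
            mus[kk] = estimate_mu(phi[idx], l[idx], Ws[kk])
            A, b = build_system(phi[idx], l[idx], mus[kk])
            s = l1_fit(A, b)
            Sigmas[kk] = np.array([[s[0], s[1]], [s[1], s[2]]])
    return mus, Sigmas
```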
5 Results
To evaluate the methodology, we experimented in the two-dimensional setting with $K=1$ and $K=2$ components.
First, for proof of concept we show that the method in Section 3 provides good estimates with both ${L_{1}}$ and ${L_{2}}$ minimization, for several covariance matrices with varying corresponding correlation coefficients:
\[ {\boldsymbol{\Sigma }_{1}}=0.05\boldsymbol{I},\hspace{2em}{\boldsymbol{\Sigma }_{2}}=\left[\begin{array}{c@{\hskip4.0pt}c}0.02& -0.01\\ {} -0.01& 0.05\end{array}\right],\hspace{2em}{\boldsymbol{\Sigma }_{3}}=\left[\begin{array}{c@{\hskip4.0pt}c}0.01& 0.02\\ {} 0.02& 0.05\end{array}\right].\]
These matrices represent different shapes, i.e. different types of dependence, from independent (${\boldsymbol{\Sigma }_{1}}$) to strongly dependent (${\boldsymbol{\Sigma }_{3}}$) variables. Since the 2D covariance matrix is of the form
\[ \boldsymbol{\Sigma }=\left[\begin{array}{c@{\hskip4.0pt}c}{\sigma _{x}^{2}}& \rho {\sigma _{x}}{\sigma _{y}}\\ {} \rho {\sigma _{x}}{\sigma _{y}}& {\sigma _{y}^{2}}\end{array}\right],\]
it is determined by three parameters – ${\sigma _{x}}$, ${\sigma _{y}}$ and $\rho (={\rho _{xy}})$. This corresponds to the three-dimensional vector
\[ \boldsymbol{s}={[{\Sigma _{11}}\hspace{2.5pt}{\Sigma _{12}}\hspace{2.5pt}{\Sigma _{22}}]^{T}}={\big[{\sigma _{x}^{2}}\hspace{2.5pt}\rho {\sigma _{x}}{\sigma _{y}}\hspace{2.5pt}{\sigma _{y}^{2}}\big]^{T}}\]
in (10). Instead of observing true and estimated Σ matrices we will calculate the error in estimation of $\boldsymbol{s}$ for each $\boldsymbol{s}$ that corresponds to matrices above:
\[ {\boldsymbol{s}_{1}}=\left[\begin{array}{c}0.05\\ {} 0\\ {} 0.05\end{array}\right],\hspace{2em}{\boldsymbol{s}_{2}}=\left[\begin{array}{c}0.02\\ {} -0.01\\ {} 0.05\end{array}\right],\hspace{2em}{\boldsymbol{s}_{3}}=\left[\begin{array}{c}0.01\\ {} 0.02\\ {} 0.05\end{array}\right].\]
Note that the corresponding correlation coefficients can be calculated from these vectors, and they are ${\rho _{1}}=0$, ${\rho _{2}}\approx -0.3$, ${\rho _{3}}\approx 0.9$.
Fig. 3. Left: Relative error in estimation of $\boldsymbol{\mu }$ from a single measurement ($N=1000$ events). Right: Average relative error for increasing sample size, for three types of covariance.
5.1 One-Dimensional Estimation
Accuracy of an estimate $\hat{\boldsymbol{v}}$ of a vector $\boldsymbol{v}$ can be assessed in many ways; for illustrative purposes we chose the relative error, i.e.
\[ \frac{\| \hat{\boldsymbol{v}}-\boldsymbol{v}\| }{\| \boldsymbol{v}\| },\]
where $\| \cdot \| $ denotes the standard Euclidean norm in both expressions.
The initial value of α in the distance weight matrix was set to zero, and proportionally increased to 1 in the final iteration. The left side of Fig. 3 illustrates how the errors in the estimation of $\boldsymbol{\mu }$ change with the number of iterations in a single experiment with $N=1000$ points, for all three types of covariance matrices. The accuracy increases as $|\rho |$ increases, but in all three scenarios the estimation is stable within fewer than 10 iterations, and sufficiently accurate. In further experiments, the number of iterations in the algorithm was set to 10, with the distance metric varying according to (13). The right side of Fig. 3 shows how the accuracy in the estimation of $\boldsymbol{\mu }$ increases with the sample size, with the number of experiments set to 100 for each sample size. The relative error decreases as the number of points increases, as expected, but we can also notice that the error is fairly similar even for sample sizes that are very small (compared to the usual sample sizes in PET imagery).
For each of these types of covariances we then simulated a measurement from $N=1000$ and $N=10000$ points and calculated the average estimate of vector $\boldsymbol{s}$ using ${L_{1}}$ and ${L_{2}}$ minimization separately. We repeated the experiment 1000 times in each case. Results are given in Table 1.
Table 1
Mean estimation error, $K=1$.
$N=1000$ | ${\boldsymbol{s}_{1}}$ | ${\boldsymbol{s}_{2}}$ | ${\boldsymbol{s}_{3}}$ |
${L_{1}}$ method | 13.78% | 11.88% | 9.28% |
${L_{2}}$ method | 8.27% | 7.61% | 7.6% |
$N=10000$ | ${\boldsymbol{s}_{1}}$ | ${\boldsymbol{s}_{2}}$ | ${\boldsymbol{s}_{3}}$ |
${L_{1}}$ method | 4.28% | 3.66% | 2.93% |
${L_{2}}$ method | 2.61% | 2.38% | 2.37% |
Given that the variance of the traditional sample variance estimator (computed directly from the points of origin) equals $2{\sigma ^{4}}/(n-1)$ for Gaussian data, i.e. a relative standard deviation of roughly $\sqrt{2/(n-1)}\approx 4.5\% $ for $n=1000$, estimations within $10\% $ from $N=1000$ events are acceptable, which justifies the methodology described in Section 3. Significantly more accurate estimation from $N=10000$ events further confirms this. It is also notable that, in general, the accuracy of the estimate increases as $|\rho |$ increases from 0 to 1, i.e. when the set of points of origin is more elongated in shape.
5.2 Two-Dimensional Estimation
For $K=2$ components we repeated the experiment as described in Section 4 for various combinations of types of vector $\boldsymbol{s}$. We used hard classification, where we assign each line to at most one component. We experimented with various constraints, from simply assigning each event to the more likely component to assigning it only when the probability of belonging is above a certain threshold.
For synthetic measurements from a variety of original (real) GMMs, the algorithm proved robust regardless of the values of the initial parameters, with the ${L_{1}}$ minimization estimation and the scaling factor k correcting the bias. The ${L_{2}}$ minimization method proved unreliable, since wrongly assigned events would cause unstable estimates and “breaks” in the algorithm.
Table 2 shows the percentage of correctly classified events for several combinations of covariance matrices, where a total of $N=4000$ events were simulated from two mixtures (${N_{1}}=2500$, ${N_{2}}=1500$) for each combination. Events were classified with no probability thresholds, i.e. each event was assigned to the more likely component. We can conclude overall that the algorithm has a very good classification rate which, combined with the results for individual components, gives reason to expect accurate estimations for multiple components.
Table 2
Correct classification, $K=2$.
Cov. 1 | Cov. 2 | Comp. 1 | Comp. 2 | Total |
${\Sigma _{1}}$ | ${\Sigma _{2}}$ | 87.63% | 93.49% | 89.83% |
${\Sigma _{2}}$ | ${\Sigma _{3}}$ | 93.52% | 92.13% | 93.00% |
${\Sigma _{3}}$ | ${\Sigma _{1}}$ | 91.26% | 95.83% | 92.98% |
An illustration of the results for $K=2$, $N=4000$ is shown in Fig. 4, along with the corresponding classical FBP method.
6 Conclusion
These results are a proof of concept that it is possible to reconstruct data from latent mixture models using lower-dimensional observations and state-of-the-art computational techniques. Estimation from lower-dimensional measurements has not been thoroughly developed for Gaussian mixture models in general, and especially not in this context. This paper shows that the main obstacle, slow computational speed, can be evaded by an alternative approach to minimization, which opens the door to various new procedures. Further work includes exploring other metrics and membership calculations, as well as utilizing traditional methods to obtain the number of mixture components and optimal initial values.
The fact that the reconstructed image is given by its parametric model (mean vectors $\{{\boldsymbol{\mu }_{k}}\}$, covariance matrices $\{{\boldsymbol{\Sigma }_{k}}\}$ and mixture weights $\{{\tau _{k}}\}$) has many benefits. This approach would allow for accurate reconstruction from significantly fewer measurements, thus decreasing patients’ exposure to radiation. The reconstruction is of virtually infinite resolution, since the Gaussian components can be evaluated at any spatial point. The model is sparse: it consists of only a few parameters needed for successful object representation. Hence, future research will be oriented towards a compressed sensing approach (Rani et al., 2018; Ralašić et al., 2018, 2019) for reducing the number of projections, in this case allowing a reduced radiotracer concentration. Due to the robustness of the proposed reconstruction method, a post-filtering step is not needed. This novel point of view aims to match the accuracy of iterative parametric models with the computational speed of the analytic approach by combining advantageous statistical models with state-of-the-art techniques from other image reconstruction problems.