3.1 The Autoregressive Model-Based Inverse Filtering Model
The glottal flow can be obtained from voiced speech segments. Under the source-filter approach, the speech signal $s(t)$ can be expressed as the convolution of the glottal flow $g(t)$ and the vocal tract filter $h(t)$:
$$s(t)=g(t)\ast h(t).\tag{1}$$
Here the lip radiation effect (modelled as a first-order differentiating filter) is included in the vocal tract processing and is not considered separately. Traditionally, the vocal tract is modelled using an all-pole filter for speech analysis purposes.
If we obtain an estimate of the inverse vocal tract filter ${\hat{h}^{-1}}(t)$ and apply it to the analysed speech signal $s(t)$, we eliminate the effect of the vocal tract and thus obtain an estimate of the glottal flow:
$$\hat{g}(t)=s(t)\ast {\hat{h}^{-1}}(t).\tag{2}$$
In this study, we applied the AR model to model the vocal tract. This choice was made for the following reasons:
• The AR model (also known as the Linear Predictive Coding model) is an all-pole filter and has had great success in speech applications. The adequacy of the AR model parameter estimation technique (Kaukėnas, 1983) for the speech signal was shown in Kaukėnas and Tamulevičius (2016) and Tamulevičius and Kaukėnas (2016).
• The linearity of the filter makes it very easy to obtain its inverse.
• The chosen parameter estimation technique yields a variable model order that is suited to the individual characteristics of a speaker's voice. Therefore, we can expect a more accurate estimation of the glottal flow.
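The ease of inversion noted in the second point can be illustrated with a minimal Python sketch. It assumes the common all-pole convention $s_t = e_t + \sum_k a_k s_{t-k}$; the sign convention, the coefficient values, the toy pulse-train excitation and the function names `all_pole_filter` and `inverse_filter` are illustrative assumptions, not taken from this study. Under that convention the inverse of the all-pole filter is simply an FIR filter built from the same coefficients.

```python
def all_pole_filter(excitation, a):
    """Synthesis filter: s[t] = e[t] + sum_k a[k-1] * s[t-k] (assumed convention)."""
    s = []
    for t, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if t - k >= 0:
                acc += ak * s[t - k]
        s.append(acc)
    return s


def inverse_filter(signal, a):
    """Inverse (FIR) filter: e[t] = s[t] - sum_k a[k-1] * s[t-k]."""
    e = []
    for t, st in enumerate(signal):
        acc = st
        for k, ak in enumerate(a, start=1):
            if t - k >= 0:
                acc -= ak * signal[t - k]
        e.append(acc)
    return e


# Hypothetical second-order "vocal tract" (poles inside the unit circle).
a = [1.3, -0.8]
# Toy stand-in for the glottal excitation: a periodic pulse train.
excitation = [1.0 if t % 8 == 0 else 0.0 for t in range(32)]

speech = all_pole_filter(excitation, a)
recovered = inverse_filter(speech, a)

# Largest deviation between the recovered and the original excitation.
max_error = max(abs(r - e) for r, e in zip(recovered, excitation))
```

With known coefficients the excitation is recovered up to rounding error; in practice the coefficients are only estimates, so the recovered signal is an estimate of the glottal flow.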
The AR model parameter estimation technique is presented in the next subsection.
3.2 Estimation of the AR Model
Let us treat the speech signal as a zero-mean process $\{{S_{t}}\}$ and describe it using the AR model
where N is the length of the signal ${S_{t}}$ and $\{{V_{t}},\hspace{2.5pt}t=1,2,\dots \}$ is a process of mutually independent, normally distributed random variables.
Our task is to estimate the model order M, the parameters $\{{a_{1}},{a_{2}},\dots ,{a_{M}}\}$ and b of the AR model.
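For orientation, the estimation of the parameters and of $b^{2}$ at a fixed order can be sketched in Python with the textbook Levinson-Durbin recursion. This is a swapped-in standard technique, not the recurrent procedure of Kaukėnas (1983) used in this study, and it assumes the conventional AR form $S_t = \sum_{i=1}^{M} a_i S_{t-i} + b V_t$. The function names, the simulated signal and the "true" parameter values are hypothetical.

```python
import random


def autocorr(x, max_lag):
    """Biased sample autocorrelation r[0..max_lag] of a zero-mean sequence."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err): a[i-1] multiplies S_{t-i}; err estimates b^2."""
    a = []
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for order m.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients and append the new one.
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


# Simulate a hypothetical AR(2) process with known parameters.
random.seed(0)
true_a, true_b = [1.3, -0.8], 0.5
s = [0.0, 0.0]
for _ in range(20000):
    s.append(true_a[0] * s[-1] + true_a[1] * s[-2]
             + true_b * random.gauss(0.0, 1.0))
s = s[2:]

a_hat, b2_hat = levinson_durbin(autocorr(s, 2), 2)
```

For a long enough signal, `a_hat` should be close to `true_a` and `b2_hat` close to `true_b ** 2`.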
From (3) we can obtain
If we denote
we get the following expression of the AR model
The equation is solved using the recurrent evaluation approach (Kaukėnas, 1983). The Efroymson matrix is composed as
where R is the cross-correlation matrix of ${X_{i}}$ and ${X_{j}}$, $i,j=1,\dots ,M$; T is the cross-correlation vector of Y and ${X_{i}}$, $i=1,\dots ,M$; O denotes zero vectors and matrices; and I is a unit matrix.
Each new sequence ${X_{i}}$ is included during the recurrent modification of the Efroymson matrix
Here, $E(i,j)$ denotes the Efroymson matrix before ${X_{i}}$ is included, and $E{(i,j)^{\prime }}$ is the updated Efroymson matrix with ${X_{i}}$ included.
Finally, the model parameter ${a_{i}}$ is estimated as
The estimate of ${b^{2}}$ is obtained as follows
The model order estimation task is solved by comparing models of orders ${M_{1}}$ and ${M_{2}}$. Usually, the estimated prediction error variances ${\hat{b}_{M1}^{2}}$ and ${\hat{b}_{M2}^{2}}$ are compared. The following estimator for the model order was formulated in Kaukėnas (1983)
where ${F_{cr}}(1,N-i)$ is the quantile of the Fisher distribution with 1 and $(N-i)$ degrees of freedom; ${\hat{b}_{0}^{2}}$ is the estimate of variance at order zero, i.e. ${\hat{b}_{0}^{2}}=\hat{D}$, where $\hat{D}$ is the estimated variance of the process; and ${M_{\max }}$ is the maximum model order, chosen on the basis of empirical knowledge of the signal.
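The idea of comparing prediction error variances against a Fisher quantile can be sketched as a generic partial F-test. This is an assumption-laden illustration rather than the estimator of Kaukėnas (1983): the decision rule (keep the largest order whose added parameter is significant), the function name `select_order`, the example variances and the critical value are all hypothetical.

```python
def select_order(err_var, n, f_crit, m_max):
    """Pick a model order from a list of prediction error variance estimates.

    err_var[i] is the estimate at order i, with err_var[0] equal to the
    estimated signal variance (order zero). n is the signal length and
    f_crit a user-supplied critical value standing in for F_cr(1, n - i).
    """
    order = 0
    for i in range(1, min(m_max, len(err_var) - 1) + 1):
        # Partial F statistic for adding the i-th parameter.
        f_stat = (err_var[i - 1] - err_var[i]) / (err_var[i] / (n - i))
        if f_stat > f_crit:
            order = i  # order i improves significantly on order i - 1
    return order


# Hypothetical variances: large gains up to order 2, negligible gains afterwards.
err_var = [1.45, 0.60, 0.25, 0.2499, 0.2498]
# 3.85 is roughly the 0.95 quantile of F with 1 and about 1000 degrees of freedom.
chosen = select_order(err_var, n=1000, f_crit=3.85, m_max=20)
```

Here the third and fourth parameters reduce the error variance too little to pass the test, so the sketch settles on order 2.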
In this study, we have chosen ${M_{\max }}=20$ for the vocal tract model. A filter of order up to 20 can model up to 10 resonant frequencies (formants), because each resonance requires a complex-conjugate pole pair; this is sufficient to describe a speaker's individual articulation (vocal tract) properties.
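The relation between filter order and the number of formants can be checked with a short Python sketch. It builds the denominator polynomial of an all-pole filter by multiplying one second-order section per formant, using the textbook mapping from a formant frequency and bandwidth to a pole radius and angle. The formant values, the sampling rate and the function name are hypothetical and serve only to count the order.

```python
import math


def formants_to_ar_polynomial(formants, fs):
    """Multiply one second-order section per (frequency_hz, bandwidth_hz) pair."""
    poly = [1.0]
    for freq, bw in formants:
        radius = math.exp(-math.pi * bw / fs)   # pole radius from bandwidth
        theta = 2.0 * math.pi * freq / fs       # pole angle from frequency
        # Section for one conjugate pole pair: 1 - 2 r cos(theta) z^-1 + r^2 z^-2
        section = [1.0, -2.0 * radius * math.cos(theta), radius * radius]
        out = [0.0] * (len(poly) + 2)
        for i, p in enumerate(poly):
            for j, q in enumerate(section):
                out[i + j] += p * q
        poly = out
    return poly


# Ten hypothetical formants below the Nyquist frequency of a 22.05 kHz signal.
formants = [(500.0 + 1000.0 * k, 100.0) for k in range(10)]
poly = formants_to_ar_polynomial(formants, fs=22050.0)
order = len(poly) - 1
```

Each formant contributes two to the polynomial degree, so ten formants yield an order of 20.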
For modelling the glottal flow we have chosen ${M_{\max }}=200$. The decision is based on the results obtained in Tamulevičius and Kaukėnas (2017), where describing individual speaker qualities required AR model orders of up to 170.
The quality of the estimated glottal flow was assessed using the ratio of the estimated prediction error variance to the estimated signal variance, ${\hat{b}^{2}}/\hat{D}$. This ratio indicates the fraction of the signal left unmodelled: the higher the ratio, the larger the prediction error. Therefore, we can expect a low ${\hat{b}^{2}}/\hat{D}$ value for normal glottal flow and high values for pathological voices (with paralysis).
In this study, we express this ratio as a percentage and call it the estimated error of glottal flow. We expect this ratio to approach zero for healthy, normal voices and to approach 100% for pathological voices (in the case of full paralysis or dysfunction of the vocal folds).
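The two extremes of this measure can be illustrated with a minimal Python sketch. It assumes the conventional predictor $\hat{S}_t = \sum_i a_i S_{t-i}$ and computes the residual variance directly from given coefficients. The function name `estimated_error_percent` and both test signals are hypothetical stand-ins, not voice recordings.

```python
import math
import random


def estimated_error_percent(signal, a):
    """100 * b^2 / D for a zero-mean signal and AR coefficients a."""
    m = len(a)
    residual = [signal[t] - sum(a[k] * signal[t - 1 - k] for k in range(m))
                for t in range(m, len(signal))]
    b2 = sum(e * e for e in residual) / len(residual)   # prediction error variance
    d = sum(x * x for x in signal) / len(signal)        # signal variance (zero mean)
    return 100.0 * b2 / d


# Fully predictable signal: a sinusoid obeys s[t] = 2 cos(w) s[t-1] - s[t-2].
w = 0.3
sine = [math.sin(w * t) for t in range(200)]
sine_error = estimated_error_percent(sine, [2.0 * math.cos(w), -1.0])

# Unpredictable signal: white noise, for which no past sample helps, so the
# order-zero model (no coefficients) is used as the stand-in predictor.
random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(200)]
noise_error = estimated_error_percent(noise, [])
```

The sinusoid gives an error of practically 0%, while the unmodelled noise gives 100%, which matches the two limits described above.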