2.1 Lithuanian Graphemes, Phonemes and Allophones
Traditional Lithuanian spelling is based on a set of 32 graphemes: a, ą, b, c, č, d, e, ę, ė, f, g, h, i, į, y, j, k, l, m, n, o, p, r, s, š, t, u, ū, ų, v, z, ž, of which 9 carry diacritics. Lithuanian orthography is essentially phonological, i.e. standardized spelling reflects the essential phonological changes but also tolerates phonological inaccuracies. The definition of the Lithuanian phoneme is subject to debate among linguists. Girdenis (2014) describes Lithuanian as having 58 phonemes (13 vowels and 45 consonants), whereas Pakerys (2003) talks about 49 phonemes (12 vowels and 37 consonants). This study is not concerned with the differences between phoneme definitions, because it focuses on allophones and their sets. The following considerations summarize the essence of the relationship among graphemes, phonemes and allophones and illustrate the main difficulties of Lithuanian G2P conversion:
• Lithuanian consonants are either palatalized or non-palatalized. The palatalization of a consonant is not exposed by its grapheme, but it can be inferred from the right context. One right-standing grapheme is often enough, as consonants are always palatalized before the graphemes e, ę, ė, i, į, y, j. However, in rare cases four right-standing graphemes are required to infer this property correctly, e.g. perskrido [1ˈpæ:rjsjkrjɪdo:] (flew over).
• Lithuanian vowels are either short (lax) or long (tense). Vowel duration is not exposed by the graphemes a, e, o (see Table 1).
• Grapheme pairs ie, uo, ai, au, ei, ui make up a diphthong (e.g. paukštis [2ˈpɐu˙kʃjtjɪs] (bird)) if they are within the same syllable, or a hiatus (e.g. paupys [pɐ.ʊ2ˈpji:s] (riverside)) if they span a syllable boundary.
• Grapheme pairs al, am, an, ar, el, em, en, er, il, im, in, ir, ul, um, un, ur make up a mixed diphthong if they are within the same syllable.
• Syllable boundaries are not exposed by standard spelling.
• Lithuanian syllables are either stressed or unstressed. Stress falls on the nucleus of the syllable, where the nucleus may be a vowel, a diphthong or a mixed diphthong. Lithuanian phonetics distinguishes between two syllable accents: acute and circumflex. If a diphthong or a mixed diphthong is stressed, the acute makes its first component (a vowel) more prominent, whereas the circumflex makes its second component (a vowel or a consonant) more prominent. Syllable accent is not exposed by standard spelling.
• Traditional Lithuanian spelling encodes affricates irregularly: either by single graphemes, c ([t͡s, t͡sj]) and č ([t͡ʃ, t͡ʃj]), or by digraphs, dz ([d͡z, d͡zj]) and dž ([d͡ʒ, d͡ʒj]).
• The digraph ch encodes the sounds [x] and [xj].
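The simplest of the rules listed above can be illustrated with a short sketch. It is not the converter used in this study: it only splits a word into graphemes, treating ch, dz and dž as single units, and marks a consonant as palatalized when the very next grapheme is one of e, ę, ė, i, į, y, j. The function names and the boolean output format are assumptions made for illustration. The sketch deliberately ignores the rare cases where a longer right context is needed, as well as stress, vowel duration and syllable boundaries, which spelling alone does not expose.

```python
import unicodedata

# Graphemes that palatalize the consonant standing immediately before them
# (taken from the list in the text).
SOFTENING = set("eęėiįyj")
# Vowel graphemes of the Lithuanian alphabet.
VOWELS = set("aąeęėiįyouūų")
# Digraphs that encode a single sound (affricates dz, dž and the digraph ch).
DIGRAPHS = ("ch", "dz", "dž")


def tokenize(word):
    """Split a word into graphemes, keeping ch, dz and dž together."""
    # NFC makes sure letters such as 'ž' are single code points.
    word = unicodedata.normalize("NFC", word.lower())
    tokens = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def mark_palatalization(word):
    """Return (grapheme, palatalized) pairs using one grapheme of right context.

    Simplification (assumption): only the immediately following grapheme is
    inspected, so consonant clusters that need a longer lookahead are missed.
    """
    tokens = tokenize(word)
    marked = []
    for idx, tok in enumerate(tokens):
        is_consonant = tok[0] not in VOWELS
        nxt = tokens[idx + 1] if idx + 1 < len(tokens) else ""
        soft = is_consonant and nxt != "" and nxt[0] in SOFTENING
        marked.append((tok, soft))
    return marked


if __name__ == "__main__":
    print(tokenize("chemija"))
    print(mark_palatalization("žemė"))
```

A rule of this kind is what a shallow, spelling-only approach can offer; the remaining properties in the list require additional knowledge sources.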
Table 1
The relationship of Lithuanian graphemes and vowels. Graphemes a, e, o represent both short and long vowels.
Grapheme | a | ą | e | ę | ė | i | į, y | o | u | ų, ū
Phoneme | [ɐ], [ɑ:] | [ɑ:] | [ɛ], [æ:] | [æ:] | [e:] | [ɪ] | [i:] | [ɔ], [o:] | [ʊ] | [u:]
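The ambiguity summarized in Table 1 can be made concrete with a small lookup sketch. The dictionary below copies the grapheme-to-vowel correspondences from the table; the function names are hypothetical. The sketch treats every vowel grapheme in isolation, which is a simplifying assumption: it ignores diphthongs, hiatus and stress, so the number it reports only conveys how quickly spelling-only candidates multiply.

```python
import math
import unicodedata

# Grapheme -> candidate vowels, copied from Table 1.
VOWEL_CANDIDATES = {
    "a": ["ɐ", "ɑ:"],
    "ą": ["ɑ:"],
    "e": ["ɛ", "æ:"],
    "ę": ["æ:"],
    "ė": ["e:"],
    "i": ["ɪ"],
    "į": ["i:"],
    "y": ["i:"],
    "o": ["ɔ", "o:"],
    "u": ["ʊ"],
    "ų": ["u:"],
    "ū": ["u:"],
}


def ambiguous_vowels(word):
    """List (position, grapheme, candidates) for graphemes with more than one reading."""
    word = unicodedata.normalize("NFC", word.lower())
    return [
        (pos, g, VOWEL_CANDIDATES[g])
        for pos, g in enumerate(word)
        if len(VOWEL_CANDIDATES.get(g, [])) > 1
    ]


def count_duration_readings(word):
    """Number of vowel-duration combinations if each grapheme is read independently."""
    word = unicodedata.normalize("NFC", word.lower())
    return math.prod(len(VOWEL_CANDIDATES.get(g, [None])) for g in word)


if __name__ == "__main__":
    print(ambiguous_vowels("namas"))
    print(count_duration_readings("namas"))
```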
The considerations above imply that G2P conversion of Lithuanian is quite complex. A G2P converter that relies on word spelling and grapheme rewrite rules (Greibus et al., 2017; Lileikytė et al., 2018), henceforth referred to as a shallow G2P converter, cannot resolve ambiguities related to vowel duration, syllable stress, and syllable boundaries, and consequently cannot produce detailed and consistent allophone sequences. Only a G2P converter that makes use of supplementary pronunciation dictionaries (Skripkauskas and Telksnys, 2006) or of accentuation algorithms (Norkevičius et al., 2005; Kazlauskienė et al., 2010), henceforth referred to as a knowledge-rich G2P converter, might be capable of disambiguating and modelling these phonological properties correctly.
2.2 Related Work
The problem of finding the best word to sub-word unit mapping for the applications of Lithuanian ASR was first addressed by Raškinis and Raškinienė (2003), followed by Šilingas (2005), Laurinčiukaitė and Lipeika (2007), Gales et al. (2015), Greibus et al. (2017), Lileikytė et al. (2018), and Ratkevicius et al. (2018).
All the abovementioned studies used very different ASR setups (see Table 2). First, different proprietary speech corpora were used for ASR system training and evaluation (Laurinčiukaitė et al., 2006; Harper, 2016; Laurinčiukaitė et al., 2018). Second, ASR setups were based on different acoustic modelling techniques, such as monophone HMM systems (Šilingas, 2005; Ratkevicius et al., 2018), triphone HMM systems (Raškinis and Raškinienė, 2003; Šilingas, 2005; Laurinčiukaitė, 2008; Greibus et al., 2017), or hybrid HMM – neural network models (Gales et al., 2015; Lileikytė et al., 2018). Third, different evaluation methodologies were used. Raškinis and Raškinienė (2003), Laurinčiukaitė and Lipeika (2007), Ratkevicius et al. (2018) and this study prefer accuracy estimation through cross-validation, whereas the other studies estimate recognition accuracy on held-out data, an approach that is less computationally intensive. Fourth, different evaluation criteria were used. Studies differ in comparing PER (Šilingas, 2005), WER (Raškinis and Raškinienė, 2003; Šilingas, 2005; Laurinčiukaitė and Lipeika, 2007; Gales et al., 2015; Lileikytė et al., 2018; Ratkevicius et al., 2018), or sentence error rate (Greibus et al., 2017). Fifth, ASR setups incorporated different language models, such as word loops (Raškinis and Raškinienė, 2003; Šilingas, 2005; Laurinčiukaitė and Lipeika, 2007; Ratkevicius et al., 2018), word n-grams (Gales et al., 2015; Lileikytė et al., 2018), command lists (Greibus et al., 2017), and phone n-grams (this study).
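To make the difference between the evaluation criteria concrete, the sketch below computes them in the conventional way: WER and PER as the edit distance (substitutions, deletions and insertions) between reference and hypothesis token sequences divided by the reference length, and SER as the share of utterances containing at least one error. The function names are hypothetical, and the scoring tools and text normalization used by the individual studies may differ in detail.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """WER if the tokens are words, PER if the tokens are phones."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def sentence_error_rate(refs, hyps):
    """Share of utterances whose hypothesis differs from the reference."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)


if __name__ == "__main__":
    refs = [["labas", "rytas", "visiems"], ["labas", "vakaras"]]
    hyps = [["labas", "rytas", "visiem"], ["labas", "vakaras"]]
    print(error_rate(refs, hyps))           # one substitution in five words
    print(sentence_error_rate(refs, hyps))  # one of two utterances is wrong
```

The same utterances thus yield different numbers under different criteria, which is one reason why figures reported by the studies cannot be compared directly.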
Table 2
Comparison of experimental setups used to compare phonemic, graphemic and syllabic lexicons in various studies (WER – Word Error Rate; PER – Phone Error Rate; ATWV/MTWV – Actual/Maximum Term-Weighted Value; SER – Sentence Error Rate).
Study | Corpus | Evaluation type | Comparison criteria | Language model | Acoustic modelling technique
Raškinis and Raškinienė, 2003 | 1 h of isolated words, 4 speakers | 4-fold cross-validation, 15 min per round | WER | Word-loop | Triphone HMM
Šilingas, 2005 | 9 h of broadcast speech | Held out data, 14 min | WER, PER | Word-loop | Monophone HMM, Triphone HMM
Laurinčiukaitė and Lipeika 2007 | 23 speakers | 10-fold cross-validation, 1 h per round | WER | Word-loop | Triphone HMM
Gales et al., 2015 | 3–40 h of convers. telephone speech | Held out data, 10 hours | WER | Word n-gram | Triphone HMM, Hybrid HMM-DNN system
Lileikytė et al., 2018 | 3–40 h of convers. telephone speech | | WER, ATWV/MTWV | Word 3-gram | Triphone HMM, Hybrid HMM-DNN system
Greibus et al., 2017 | 46.5 h of read speech, 348 speakers | Held out data, 6.78 hours | SER | Command list | Triphone HMM
Ratkevičius et al., 2018 | 2.5 h of isolated words | 5, 10-fold cross-validation | WER | Word-loop | Monophone HMM
This study | 50 h of read speech, 50 speakers | 10-fold cross-validation, 1 hour per round | PER | Fully interconnected triphones; phone 3-gram, 4-gram | Triphone HMM, LDA+MLLT Triphone HMM, SAT-HMM, SGMM, Hybrid HMM-TDNN, BLSTM (recurrent DNN)
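The two evaluation types in Table 2 differ mainly in cost: a held-out evaluation trains one system, whereas k-fold cross-validation trains k systems, each tested on a different part of the corpus. The sketch below shows a generic k-fold partition of utterance identifiers. The round-robin assignment and the function name are assumptions for illustration; how the cited studies actually partitioned their corpora (for example, by speaker) is not described here.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; every item is used for testing exactly once."""
    # Round-robin assignment of items to k folds (an illustrative choice).
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    utterances = [f"utt{n:02d}" for n in range(20)]  # hypothetical utterance IDs
    for round_no, (train, test) in enumerate(k_fold_splits(utterances, 10), 1):
        print(round_no, len(train), test)
```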
Though word to grapheme mappings investigated by different studies are quite similar, word to phoneme mappings are different and mostly incompatible across studies. Each study makes its own choices about whether to and how to represent stress, duration, palatalization, affricates, diphthongs and mixed diphthongs in a phonemic lexicon (see Table 3). Laurinčiukaitė and Lipeika (2007) go beyond word to phoneme mappings and investigate word to sub-word unit mappings, where sub-words may be phonemes, syllables and pseudo-syllables.
Table 3
Comparison of phonemic lexicons that were investigated by various studies. Symbols in the table denote fine-grained (✚), partial (✓), and absent (O) modelling of some phonetic property.
Study | Phonemic lexicon as referenced by authors | Syllable stress (vowels & diphth.) | Vowel duration | Fronting of back vowels | Syllable stress (consonants) | Consonant palatalization | Affricate modelling | Diphthong modelling | Mixed diphthong modelling | Number of phonetic units
Raškinis et al. 2003 | A | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | O | 115
Raškinis et al. 2003 | AB | ✓ | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | O | 101
Raškinis et al. 2003 | ABC | O | ✚ | ✚ | O | ✚ | ✚ | ✚ | O | 73
Raškinis et al. 2003 | ABD | ✓ | ✚ | ✚ | ✚ | O | ✚ | ✚ | O | 76
Raškinis et al. 2003 | ABCD | O | ✚ | ✚ | O | O | ✚ | ✚ | O | 50
Šilingas, 2005 | BFR1 | ✚ | ✚ | O | ✚ | ✚ | ✚ | ✚ | ✚ | 229
Šilingas, 2005 | BFR2 | ✚ | ✚ | O | ✚ | O | ✚ | ✚ | ✚ | 140
Šilingas, 2005 | BFR3 | O | ✚ | O | O | O | ✚ | ✚ | ✚ | 86
Šilingas, 2005 | BFR4 | O | ✚ | O | O | ✚ | ✚ | ✚ | ✚ | 139
Šilingas, 2005 | BFR5 | ✚ | ✚ | O | O | ✚ | O | ✚ | O | 87
Šilingas, 2005 | BFR6 | ✚ | ✚ | O | O | O | O | ✚ | O | 71
Šilingas, 2005 | BFR7 | O | ✚ | O | O | O | O | ✚ | O | 41
Greibus et al., 2017 | FZ1.3 | O | ✓ | ✚ | O | O | ✚ | O | O | 36
Greibus et al., 2017 | FZ15.5 | O | ✓ | O | O | ✚ | ✚ | O | O | 61
Greibus et al., 2017 | FPK1 | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | ✓ | O | 93
Lileikytė et al., 2016 | FLP-32 | O | ✓ | O | O | O | O | O | O | 29
Lileikytė et al., 2016 | FLP-36 | O | ✓ | O | O | O | ✚ | O | O | 33
Lileikytė et al., 2016 | FLP-38 | O | ✓ | O | O | O | O | ✚ | O | 35
Lileikytė et al., 2016 | FLP-48 | O | ✓ | O | O | ✚ | O | O | O | 45
This study | detailed | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | ✓ | 130
This study | no stress | O | ✚ | ✚ | O | ✚ | ✚ | ✚ | ✓ | 79
This study | no palatalization | ✚ | ✚ | ✚ | ✚ | O | ✚ | ✚ | ✓ | 98
This study | no mixed diphthongs | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | O | 122
This study | no diphthongs | ✚ | ✚ | ✚ | ✚ | ✚ | ✚ | ✓ | ✓ | 112
This study | no affricates | ✚ | ✚ | ✚ | ✚ | ✚ | O | ✚ | ✓ | 122
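The last column of Table 3 shrinks whenever a phonetic property is left unmodelled, because units that differ only in that property collapse into one. The sketch below illustrates the mechanism on a toy inventory. The `Unit` representation, the boolean stress flag and the eight toy units are assumptions made for illustration; they do not reproduce the notation or the unit counts of any lexicon in the table.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    base: str          # underlying sound, e.g. "t"
    stressed: bool     # hypothetical single stress flag (no acute/circumflex split)
    palatalized: bool  # meaningful for consonants only


def project(units, keep_stress=True, keep_palatalization=True):
    """Collapse units that differ only in the properties being dropped."""
    return {
        Unit(u.base,
             u.stressed if keep_stress else False,
             u.palatalized if keep_palatalization else False)
        for u in units
    }


# Toy inventory: one plain consonant, one vowel, and one sonorant that can
# carry stress as the second component of a mixed diphthong.
TOY_INVENTORY = [
    Unit("t", False, False), Unit("t", False, True),
    Unit("ɐ", False, False), Unit("ɐ", True, False),
    Unit("l", False, False), Unit("l", True, False),
    Unit("l", False, True), Unit("l", True, True),
]

if __name__ == "__main__":
    print(len(project(TOY_INVENTORY)))                             # all properties kept
    print(len(project(TOY_INVENTORY, keep_stress=False)))          # stress dropped
    print(len(project(TOY_INVENTORY, keep_palatalization=False)))  # palatalization dropped
    print(len(project(TOY_INVENTORY, keep_stress=False,
                      keep_palatalization=False)))                 # both dropped
```

In this toy case the inventory falls from 8 units to 5 when either property is dropped and to 3 when both are dropped.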
Given such a variety of experimental setups, it is not surprising that different studies came to different and even opposite conclusions. For instance, Raškinis and Raškinienė (2003) achieved the best WER with a word to phoneme mapping that ignored stress and preserved palatalization (see Table 3, ABC phonemic lexicon), whereas Šilingas (2005) achieved the best WER by preserving stress and ignoring palatalization (see Table 3, BFR6 phonemic lexicon). Greibus et al. (2017) achieved the best SER by ignoring both stress and palatalization. Gales et al. (2015) found that a grapheme-based system outperforms a phoneme-based system, whereas Šilingas (2005), Greibus et al. (2017) and Lileikytė et al. (2018) came to the opposite result. Laurinčiukaitė and Lipeika (2007) found that mapping into a mixture of phonemes and syllable-like units improves WER.
Incompatible conclusions are partially due to the limitations of the experimental setups. Some findings are based on a small training corpus (Raškinis and Raškinienė, 2003; Ratkevicius et al., 2018) or on limited, carefully selected held-out data (Šilingas, 2005). Other studies (Greibus et al., 2017; Lileikytė et al., 2018) test only limited word-to-phoneme mappings because they use a shallow G2P converter, which is unable to produce allophone-rich phonemic transcriptions. The conclusions of many studies depend on a single acoustic modelling technique (though generally one that was state-of-the-art at the time of investigation). Finally, recognition accuracies obtained by the majority of studies are not “pure” indicators of the performance of different word to sub-word mappings, as they are strongly influenced by the different amounts of linguistic constraints embedded into the ASR setups. For instance, Greibus et al. (2017) restrict their language model (LM) to a command list, where commands share 271 unique word types, and Ratkevicius et al. (2018) restrict their LM to a 10-digit word loop.