Voice Activation Systems for Embedded Devices: Systematic Literature Review
Volume 31, Issue 1 (2020), pp. 65–88
Pub. online: 23 March 2020
Type: Research Article
Open Access
Received: 1 January 2019
Accepted: 1 November 2019
Published: 23 March 2020
Abstract
Many modern mobile, embedded and smart home devices are equipped with voice control. Automatic recognition of the entire audio stream, however, is undesirable for reasons of resource consumption and privacy. Therefore, most of these devices use a voice activation system, whose task is to detect a pre-specified word or phrase (for example, "Ok, Google") in the audio stream and to activate the voice request processing system when it is found. A voice activation system must have the following properties: high accuracy, the ability to work entirely on the device (without using remote servers), low resource consumption (primarily CPU and RAM), robustness to noise and to the variability of speech, and a small delay between the pronunciation of the key phrase and the activation of the system. This work is a systematic literature review of voice activation systems that satisfy the above properties. We describe the operating principles of various voice activation systems and the typical representations of sound in such systems, consider acoustic modelling in detail and, finally, describe the approaches used to assess model quality. In addition, we point out a number of open questions in this problem area.
Biographies
Kolesau Aliaksei
A. Kolesau is a PhD student at the Department of Information Technologies, Vilnius Gediminas Technical University. His research interests include machine learning and speech recognition.
Šešok Dmitrij
D. Šešok is a professor at the Department of Information Technologies, Vilnius Gediminas Technical University. His fields of interest are global optimization and machine learning. He has authored or co-authored around 40 papers.