Voice Activation Systems for Embedded Devices: Systematic Literature Review
Volume 31, Issue 1 (2020), pp. 65–88
Pub. online: 23 March 2020
Type: Research Article
Open Access
Received: 1 January 2019
Accepted: 1 November 2019
Published: 23 March 2020
Abstract
Many modern mobile, embedded and smart home devices are equipped with voice control. Automatic recognition of the entire audio stream, however, is undesirable for reasons of resource consumption and privacy. Therefore, most of these devices use a voice activation system, whose task is to detect a pre-specified word or phrase (for example, "Ok, Google") in the audio stream and to activate the voice request processing system when it is found. A voice activation system must have the following properties: high accuracy, the ability to work entirely on the device (without using remote servers), low resource consumption (primarily CPU and RAM), robustness to noise and to the variability of speech, and a small delay between the pronunciation of the key phrase and the activation of the system. This work is a systematic literature review of voice activation systems that satisfy the above properties. We describe the operating principles of various voice activation systems and the typical representations of sound in such systems, consider acoustic modelling in detail and, finally, describe the approaches used to assess model quality. In addition, we point out a number of open questions in this problem area.
Biographies
Kolesau Aliaksei
A. Kolesau is a PhD student at the Department of Information Technologies, Vilnius Gediminas Technical University. His research interests include machine learning and speech recognition.
Šešok Dmitrij
D. Šešok is a professor at the Department of Information Technologies, Vilnius Gediminas Technical University. His fields of interest are global optimization and machine learning. He has authored or co-authored around 40 papers.