
Automatic estimation of the number of minimal language units by articulation
V.O. Yachnaya 1,2, V.R. Lutsiv 1

Saint Petersburg State University of Aerospace Instrumentation,
190000, Saint-Petersburg, Russia, Bolshaya Morskaya 67;
Pavlov Institute of Physiology, Russian Academy of Sciences,
199034, Saint-Petersburg, Russia, Naberezhnaya Makarova 6


DOI: 10.18287/2412-6179-CO-1451

Pages: 956-962.

Full text of the article is in Russian.

Abstract:
The presented work is dedicated to the automatic analysis of the paraverbal component of human communication. The article describes systems that determine the number of minimal linguistic units (syllables and phonemes) in spoken language from video data. Such systems can be used to estimate a subject's speech rate, which is applicable to the preclinical diagnosis of certain pathological conditions and to the assessment of emotional status. To conduct the research, an existing database of English words was modified, and annotations containing the number of syllables and phonemes in each word were obtained. In the course of the study, a word recognition system was adapted to the stated task, and a new neural network architecture for determining the number of syllables and phonemes in a word was designed. The effectiveness of the developed systems was assessed both on words previously known to the systems and on new words. As a result of the research, a system was obtained that determines the number of minimal language units in a spoken word, making it possible to subsequently estimate the subject's articulation rate.
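As an illustration of the annotation step described above, the following Python sketch shows one possible way to derive per-word syllable and phoneme counts, assuming an ARPAbet pronouncing dictionary (e.g., CMUdict) is available; the actual annotation sources and procedure used by the authors may differ.

    # Hypothetical sketch: per-word syllable and phoneme counts from an
    # ARPAbet transcription. Vowel phonemes carry a stress digit (0/1/2),
    # so syllables can be approximated by counting stressed vowel symbols.
    def count_units(arpabet_pronunciation):
        n_phonemes = len(arpabet_pronunciation)
        n_syllables = sum(1 for p in arpabet_pronunciation if p[-1].isdigit())
        return n_phonemes, n_syllables

    # Example: "ABOUT" -> AH0 B AW1 T  => 4 phonemes, 2 syllables
    print(count_units(["AH0", "B", "AW1", "T"]))  # prints (4, 2)

Counts obtained this way can then serve as training targets when a word-level recognition network is retrained to predict the number of minimal language units instead of the word identity.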

Keywords:
visual speech recognition, articulation, computer vision, neural networks.

Citation:
Yachnaya VO, Lutsiv VR. Automatic estimation of the number of minimal language units by articulation. Computer Optics 2024; 48(6): 956-962. DOI: 10.18287/2412-6179-CO-1451.

Acknowledgements:
This study was supported by the State Program 47 GP "Scientific and Technological Development of the Russian Federation" (2019-2030), theme 0134-2019-0006.

