Modern automatic recognition technologies for visual communication tools
V.O. Yachnaya 1,2, V.R. Lutsiv 1, R.O. Malashin 1,2
1 Saint-Petersburg State University of Aerospace Instrumentation,
190000, Saint-Petersburg, Russia, Bolshaya Morskaya 67;
2 Pavlov Institute of Physiology, Russian Academy of Sciences,
199034, Saint-Petersburg, Russia, Naberezhnaya Makarova 6
DOI: 10.18287/2412-6179-CO-1154
Pages: 287-305.
Full text of the article: in Russian.
Abstract:
Communication encompasses a wide range of behaviors and activities aimed at conveying information. The communication process includes verbal, paraverbal and non-verbal components, which carry the informational part of a message and its emotional part, respectively. A comprehensive analysis of all communication components makes it possible to evaluate not only the content but also the situational context of what is being said, as well as to identify additional factors reflecting the mental and somatic state of the speaker. A verbal message can be conveyed in several ways, including oral and gestural speech (such as sign language and fingerspelling). Various forms of communication can be carried over multiple data transmission channels, such as audio or video. This review focuses on video data analysis systems, since the audio channel cannot transmit the non-verbal components that contribute supplemental details. The article analyzes databases of static and dynamic images, systems developed to recognize the verbal component conveyed by oral and gestural speech, and systems that evaluate the paraverbal and non-verbal components of communication. Challenges in designing such databases and systems are identified. Promising directions for a comprehensive analysis of all communication components and their combinations, enabling the most complete evaluation of a message, are outlined.
Keywords:
visual speech recognition, sign language recognition, affective computing, computer vision, neural networks.
Citation:
Yachnaya VO, Lutsiv VR, Malashin RO. Modern automatic recognition technologies for visual communication tools. Computer Optics 2023; 47(2): 287-305. DOI: 10.18287/2412-6179-CO-1154.
Acknowledgements:
The work was supported by State Program 47 "Scientific and Technological Development of the Russian Federation" (2019-2030), topic 0134-2019-0006.