(46-6) 14 * << * >> * Russian * English * Content * All Issues
Method for visual analysis of driver's face for automatic lip-reading in the wild
A.A. Axyonov 1, D.A. Ryumin 1, A.M. Kashevnik 1, D.V. Ivanko 1, A.A. Karpov 1
1 St. Petersburg Federal Research Center of the RAS (SPC RAS),
199178, St. Petersburg, Russia, 14th Line V.O. 39
PDF,13 MB
DOI: 10.18287/2412-6179-CO-1092
Pages: 955-962.
Full text of article: Russian language.
Abstract:
The paper proposes a method of visual analysis for automatic speech recognition of the vehicle driver. Speech recognition in acoustically noisy conditions is one of big challenges of artificial intelligence. The problem of effective automatic lip-reading in vehicle environment has not yet been resolved due to the presence of various kinds of interference (frequent turns of driver's head, vibration, varying lighting conditions, etc.). In addition, the problem is aggravated by the lack of available databases on this topic. A MediaPipe Face Mesh is used to find and extract the region-of-interest (ROI). We have developed End-to-End neural network architecture for the analysis of visual speech. Visual features are extracted from a single image using a convolutional neural network (CNN) in conjunction with a fully connected layer. The extracted features are input to a Long Short-Term Memory (LSTM) neural network. Due to a small amount of training data we proposed that a Transfer Learning method should be applied. Experiments on visual analysis and speech recognition present great opportunities for solving the problem of automatic lip-reading. The experiments were performed on an in-house multi-speaker audio-visual dataset RUSAVIC. The maximum recognition accuracy of 62 commands is 64.09 %. The results can be used in various automatic speech recognition systems, especially in acoustically noisy conditions (high speed, open windows or a sunroof in a vehicle, backgoround music, poor noise insulation, etc.) on the road.
Keywords:
vehicle, driver, visual speech recognition, automated lip-reading, machine learning, End-to-End, CNN, LSTM.
Citation:
Axyonov AA, Ryumin DA, Kashevnik AM, Ivanko DV, Karpov AA. Method for visual analysis of driver's face for automatic lip-reading in the wild. Computer Optics 2022; 46(6): 955-962. DOI: 10.18287/2412-6179-CO-1092.
Acknowledgements:
This work was partly funded by the Russian Foundation for Basic Research under grant No. 19-29-09081 and the state research project No. 0073-2019-0005.
References:
- Road traffic injuries. Source: <https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries>.
- Indicators of road safety. Source: <http://stat.gibdd.ru>.
- Ivanko D, Ryumin D. A novel task-oriented approach toward automated lip-reading system implementation. Int Arch Photogramm Remote Sens Spatial Inf Sci 2021; XLIV-2/W1-2021: 85-89. DOI: 10.5194/isprs-archives-XLIV-2-W1-2021-85-2021.
- McGurk H, MacDonald J. Hearing lips and seeing voices. Nature 1976; 264: 746-748.
- Chung JS, Zisserman A. Lip reading in the wild. Asian Conf on Computer Vision (ACCV) 2016: 87-103. DOI: 10.1007/978-3-319-54184-6_6.
- Yang S, Zhang Y, Feng D, Yang M, Wang C, Xiao J, Chen X. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. Int Conf on Automatic Face and Gesture Recognition (FG) 2019: 1-8. DOI: 10.1109/FG.2019.8756582.
- Chen X, Du J, Zhang H. Lipreading with DenseNet and resBi-LSTM. Signal Image Video Process 2020; 14: 981-989. DOI: 10.1007/s11760-019-01630-1.
- Feng D, Yang S, Shan S. An efficient software for building LIP reading models without pains. Int Conf on Multimedia and Expo Workshops (ICMEW) 2021: 1-2. DOI: 10.1109/ICMEW53276.2021.9456014.
- Martinez B, Ma P, Petridis S, Pantic M. Lipreading using temporal convolutional networks. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2020: 6319-6323. DOI: 10.1109/ICASSP40776.2020.9053841.
- Zhang Y, Yang S, Xiao J, Shan S, Chen X. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition. Int Conf on Automatic Face and Gesture Recognition (FG) 2020: 356-363. DOI: 10.1109/FG47880.2020.00134.
- Ma P, Martinez B, Petridis S, Pantic M. Towards practical lipreading with distilled and efficient models. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2021: 7608-7612. DOI: 10.1109/ICASSP39728.2021.9415063.
- Sui C, Bennamoun M, Togneri R. Listening with your eyes: Towards a practical visual speech recognition system using deep Boltzmann machines. Proc Int Conf on Computer Vision (ICCV) 2015: 154-162.
- Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. Interspeech 2017: 3652-3656.
- Hlaváč M, Gruber I, Železný M, Karpov A. Lipreading with LipsID. Int Conf on Speech and Computer (SPECOM) 2020: 176-183. DOI: 10.1007/978-3-030-60276-5_18.
- Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. Proc Computer Society Conf on Computer Vision and Pattern Recognition (CVPR) 2001; 1: 511-518. DOI: 10.1109/CVPR.2001.990517.
- Cootes TF, Edwards GJ, Taylor CJ. Active appearance models. IEEE Trans Pattern Anal Mach Intell 2001; 23(6): 681-685. DOI: 10.1109/34.927467.
- Xu B, Wang J, Lu C, Guo Y. Watch to listen clearly: Visual speech enhancement driven multi-modality speech recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision 2020: 1637-1646.
- Ryumina E, Ryumin D, Ivanko D, Karpov A. A novel method for protective face mask detection using convolutional neural networks and image histograms. Int Arch Photogramm Remote Sens Spatial Inf Sci 2021; XLIV-2/W1-2021: 177-182. DOI: 10.5194/isprs-archives-XLIV-2-W1-2021-177-2021.
- Ryumina E, Karpov A. Facial expression recognition using distance importance scores between facial landmarks. Graphicon, CEUR Workshop Proceedings 2020; 2744: 1-10.
- Ivanko D, Ryumin D, Axyonov A, Kashevnik A. Speaker-dependent visual command recognition in vehicle cabin: Methodology and evaluation. In Book: Karpov A, Potapova R, eds. Speech and Computer (SPECOM). Lecture Notes in Computer Science 2021; 12997: 291-302. DOI: 10.1007/978-3-030-87802-3_27.
- Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T. Lipreading using convolutional neural network. Proc Annual Conf of the Int Speech Communication Association (INTERSPEECH) 2014: 1149-1153.
- Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997; 9(8): 1735-1780.
- Petridis S, Stafylakis T, Ma P, Cai F, Tzimiropoulos G, Pantic М. End-to-end audiovisual speech recognition. 2018 IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2018; 6548-6552.
- Kashevnik A, Lashkov I, Axyonov A, Ivanko D, Ryumin D, Kolchin A, Karpov A. Multimodal corpus design for audio-visual speech recognition in vehicle cabin. IEEE Access 2021; 9: 34986-35003. DOI: 10.1109/ACCESS.2021.3062752.
- Lashkov I, Axyonov A, Ivanko D, Ryumin D, Karpov A, Kashevnik A. Multimodal Russian Driver Multimodal database of Russian speech of drivers in the cab of vehicles (RUSAVIC – RUSsian Audio-Visual speech in Cars) [In Russian]. Database State Registration Certificate N2020622063 of October 27, 2020.
- Ivanko D, Axyonov A, Ryumin D, Kashevnik A, Karpov A. RUSAVIC Corpus: Russian audio-visual speech in cars. Proc Thirteenth Language Resources and Evaluation Conference (LREC'22) 2022: 1555-1559.
- Kashevnik A, Lashkov I, Gurtov A. Methodology and mobile application for driver behavior analysis and accident prevention. IEEE Trans Intell Transp Syst 2019; 21(6): 2427-2436.
- Kashevnik A, Lashkov I, Ponomarev A, Teslya N, Gurtov A. Cloud-based driver monitoring system using a smartphone. Sensors 2020; 20(12): 6701-6715.
- The multi-speaker audiovisual corpus RUSAVIC. Source: <https://mobiledrivesafely.com/corpus-rusavic>.
- Fung I, Mak B. End-to-End Low-resource lip-reading with Maxout CNN and LSTM. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2018: 2511-2515. DOI: 10.1109/ICASSP.2018.8462280.
- Xu K, Li D, Cassimatis N, Wang X. LCANet: End-to-end lipreading with cascaded attention-CTC. Int Conf on Automatic Face and Gesture Recognition (FG) 2018: 548-555. DOI: 10.1109/FG.2018.00088.
- Ma P, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2021: 7613-7617. DOI: 10.1109/ICASSP39728.2021.9414567.
- Lugaresi C, Tang J, Nash H, McClanahan C, Uboweja E, Hays M, Zhang F, Chang CL, Yong M, Lee J, Chang WT, Hua W, Georg M, Grundmann M. Mediapipe: A framework for building perception pipelines. arXiv Preprint. 2019. Source: <https://arxiv.org/abs/1906.08172>.
- Shin H, Roth H, Gao M, Lu L, Xu Z, Nogues I, Summers RM. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 2016; 35(5): 1285-31298. DOI: 10.1109/TMI.2016.2528162.
- Torchvision. Transforms. Source: <https://pytorch.org/vision/stable/transforms.html?highlight=randomequalize#torchvision.transforms.RandomEqualize>.
- Label smoothing. Source: <https://paperswithcode.com/method/label-smoothing>.
- 3D ResNet. Source: <https://pytorch.org/hub/facebookresearch_pytorchvideo_resnet/>.
- Zhong Z, Lin ZQ, Bidart R, Hu X, Ben Daya I, Li Z, Zheng W, Li J, Wong A. Squeeze-and-attention networks for semantic segmentation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition 2020; 13065-13074.
- Cosine annealing warm restarts. Source: <https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.html>.
© 2009, IPSI RAS
151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: journal@computeroptics.ru ; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor), Fax: +7 (846) 332-56-20