LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES

Fatemeh Vakhshiteh; Farshad Almasganj; Ahmad Nickabadi

doi:10.5566/ias.1859

Authors

Fatemeh Vakhshiteh, Amirkabir University of Technology - Tehran Polytechnic
Farshad Almasganj, Amirkabir University of Technology - Tehran Polytechnic
Ahmad Nickabadi, Amirkabir University of Technology - Tehran Polytechnic

Keywords:

Deep Belief Networks, Hidden Markov Model, lip-reading, Restricted Boltzmann Machine

Abstract

Lip-reading is the visual interpretation of a speaker's lip movements during speech. Experiments over many years have shown that speech intelligibility increases when visual facial information is available, and this effect is more pronounced in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual unit type, the diversity of features, and their inter-speaker dependency. Although efforts have been made to overcome these challenges, a flawless lip-reading system is still under investigation. This paper searches for a lip-reading model in which the processing blocks are combined and arranged efficiently to extract highly discriminative visual features. In particular, the application of a properly structured Deep Belief Network (DBN)-based recognizer is highlighted. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, achieving phone recognition rates (PRRs) of 77.65% and 73.40%, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition works.

Author Biographies

Fatemeh Vakhshiteh, Amirkabir University of Technology - Tehran Polytechnic

Biomedical Engineering Department

Farshad Almasganj, Amirkabir University of Technology - Tehran Polytechnic

Biomedical Engineering Department

Ahmad Nickabadi, Amirkabir University of Technology - Tehran Polytechnic

Computer Engineering and Information Technology Department
