Comparison of Deep Learning Models for Voice Disorder Classification Using Phonovibrographic Images

Authors

B Panchami, S Pravin Kumar

DOI:

https://doi.org/10.5566/ias.3741

Keywords:

Deep Learning Models, Functional Voice Disorders, High-speed video endoscopy, Phonovibrogram, Voice Disorder, Classification

Abstract

Accurate diagnosis of vocal fold disorders is difficult because of subtle variations between pathological conditions. Phonovibrography (PVG), generated from high-speed videoendoscopy (HSV), documents glottal vibration patterns as static images, allowing systematic analysis. In this study, we propose PVGNet, a hybrid deep learning model combining multiscale feature extraction and channel attention, designed specifically for PVG-based classification. We benchmark PVGNet against InceptionResNetV2, VGG19, DenseNet169, and X-ViT across binary, three-class, and multi-class tasks. PVGNet consistently outperforms these baselines in accuracy, F1-score, and AUC while minimizing false negatives, which is critical for reliable diagnosis. These results show PVG’s potential as a diagnostic imaging modality and PVGNet’s effectiveness in automated voice disorder classification.
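PVGNet's layer-level details are not given on this page; as an illustrative sketch only, the two ingredients named in the abstract, channel attention (in the squeeze-and-excitation style of Hu et al., 2020) and multiscale feature extraction, can be written out in NumPy. The layer sizes, weights, and the `multiscale_pool` grid scheme below are hypothetical illustrations, not PVGNet's actual design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feature_map, w_reduce, w_expand):
    """Squeeze-and-excitation style channel attention (cf. Hu et al., 2020).

    feature_map: (H, W, C) activations from a convolutional block.
    w_reduce:    (C, C // r) bottleneck weights (r = reduction ratio).
    w_expand:    (C // r, C) expansion weights.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_map.mean(axis=(0, 1))
    # Excitation: bottleneck MLP with a sigmoid gate -> per-channel weights in (0, 1)
    s = sigmoid(np.maximum(z @ w_reduce, 0.0) @ w_expand)
    # Scale: reweight each channel of the original feature map
    return feature_map * s[None, None, :]

def multiscale_pool(feature_map, grids=(1, 2, 4)):
    """Toy multiscale descriptor: average-pool the map into grids of several
    sizes and concatenate, so coarse and fine vibration structure both survive."""
    h, w, _ = feature_map.shape
    parts = []
    for g in grids:
        for i in range(g):
            for j in range(g):
                patch = feature_map[i * h // g:(i + 1) * h // g,
                                    j * w // g:(j + 1) * w // g]
                parts.append(patch.mean(axis=(0, 1)))
    # Length = C * (1 + 4 + 16) for the default grids
    return np.concatenate(parts)
```

In a trained network these weights would be learned; here they only demonstrate the data flow: squeeze each PVG feature map to a channel descriptor, gate the channels, and retain features at several spatial scales before classification.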

 

Author Biographies

  • B Panchami, Sri Sivasubramaniya Nadar College of Engineering

    Ms. B. Panchami is a PhD Scholar in the Department of Biomedical Engineering at Sri Sivasubramaniya Nadar College of Engineering, Chennai, India. Her research work focuses on medical image analysis, machine learning, and deep learning.

  • S Pravin Kumar, Sri Sivasubramaniya Nadar College of Engineering

    Dr. S. Pravin Kumar is an Associate Professor in the Department of Biomedical Engineering at Sri Sivasubramaniya Nadar College of Engineering, Chennai, India. He has more than two decades of teaching and research experience, including post-doctoral work at Palacký University Olomouc, Czech Republic. His research interests include biomedical signal and image processing, machine learning, deep learning, medical instrumentation, and 3D/AR-based visualization for healthcare applications.

References

Andrade-Miranda G, Stylianou Y, Deliyski DD, Godino-Llorente JI, Henrich Bernardoni N (2020). Laryngeal image processing of vocal folds motion. Appl Sci 10:1556.

Bishop CM (1995). Neural networks for pattern recognition. Oxford Univ Press.

Bohr C, Kraeck A, Eysholdt U, Ziethe A, Dollinger M (2013). Quantitative analysis of organic vocal fold pathologies in females by high-speed endoscopy. Laryngoscope 123:1686–93.

Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40:834–48.

Chollet F (2017). Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition; 1251–58.

Deliyski DD, Petrushev PP, Bonilha HS, Gerlach TT, Martin-Harris B, Hillman RE (2008). Clinical implementation of laryngeal high-speed videoendoscopy: Challenges and evolution. Folia Phoniatr Logop 60:33–44.

Deliyski DD, Hillman RE (2010). State of the art laryngeal imaging: Research and clinical implications. Curr Opin Otolaryngol Head Neck Surg 18:147–52.

Doellinger M, Berry DA (2006). Visualization and quantification of the medial surface dynamics of an excised human vocal fold during phonation. J Voice 20:401–13.

Švec JG, Schutte HK, Svec H (2007). Phonovibrography: The fingerprint of vocal fold vibrations. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing; 949–52.

Döllinger M, Lohscheller J, Svec J, McWhorter A, Kunduk M (2011). Support vector machine classification of vocal fold vibrations based on phonovibrogram features. J Voice 25:435–56.

Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv abs/2010.11929.

Fehling MK, Grosch F, Schuster ME, Schick B, Lohscheller J (2020). Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep convolutional lstm network. PLoS One 15:e0227791.

Ganaie MA, Hu M, Malik AK, Tanveer M, Suganthan PN (2022). Ensemble deep learning: A review. Eng Appl Artif Intell 115:105151.

Gomez P, Kist AM, Schlegel P, Berry DA, Chhetri DK, Durr S, Echternach M, Johnson AM, Kniesburges S, Kunduk M, Maryn Y, Schutzenberger A, Verguts M, Dollinger M (2020). BAGLS, a multihospital benchmark for automatic glottis segmentation. Sci Data 7:186.

Heaton J (2017). Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep learning. Genet Program Evolvable Mach 19.

Hu J, Shen L, Albanie S, Sun G, Wu E (2020). Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell 42:2011–23.

Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition; 4700–08.

Ioffe S, Szegedy C (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. International conference on machine learning; 448–56.

Jiang J, Lin E, Hanson DG (2000). Vocal fold physiology. Otolaryngol Clin North Am 33:699–718.

Kamiloğlu RG, Sauter DA (2021). Voice production and perception. Oxford University Press.

Kist AM, Gomez P, Dubrovskiy D, Schlegel P, Kunduk M, Echternach M, Patel R, Semmler M, Bohr C, Durr S, Schutzenberger A, Dollinger M (2021). A deep learning enhanced novel software tool for laryngeal dynamics analysis. J Speech Lang Hear Res 64:1889–903.

Krogh A, Hertz JA (1991). A simple weight decay can improve generalization. Proceedings of the 5th International Conference on Neural Information Processing Systems; 950–7.

Kunduk M, Döllinger M, McWhorter AJ, Švec JG, Lohscheller J (2012). Vocal fold vibratory behavior changes following surgical treatment of polyps investigated with high-speed videoendoscopy and phonovibrography. Ann Otol Rhinol Laryngol 121:355–63.

Lohscheller J, Eysholdt U (2008). Phonovibrogram visualization of entire vocal fold dynamics. Laryngoscope 118:753–8.

Lohscheller J (2009). Towards evidence based diagnosis of voice disorders using phonovibrograms. International Symposium on Applied Sciences in Biomedical and Communication Technologies; 1–4.

Patidar M, Agrawal J (2016). Which mathematical and physiological formulas are describing voice pathology: An overview. J Gen Pract 4:1–4.

Malinowski J, Pietruszewska W, Kowalczyk M, Niebudek-Bogusz E (2024). Value of high-speed videoendoscopy as an auxiliary tool in differentiation of benign and malignant unilateral vocal lesions.

Murphy KP (2012). Machine learning: A probabilistic perspective. MIT Press.

Schlegel P, Kniesburges S, Durr S, Schutzenberger A, Dollinger M (2020). Machine learning based identification of relevant parameters for functional voice disorders derived from endoscopic high-speed recordings. Sci Rep 10:10517.

Simonyan K, Zisserman A (2014). Very deep convolutional networks for large-scale image recognition. arXiv abs/1409.1556.

Spina AL, Maunsell R, Sandalo K, Gusmao R, Crespo A (2009). Correlation between voice and life quality and occupation. Braz J Otorhinolaryngol 75:275–9.

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014). Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–58.

Stemple JC, Roy N, Klaben B (2020). Clinical voice pathology: Theory and management. Sixth edition. San Diego, CA: Plural Publishing, Inc.

Szegedy C, Ioffe S, Vanhoucke V, Alemi A (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. Proceedings of the AAAI Conference on Artificial Intelligence 31.

Voigt D, Dollinger M, Braunschweig T, Yang A, Eysholdt U, Lohscheller J (2010a). Classification of functional voice disorders based on phonovibrograms. Artif Intell Med 49:51–9.

Voigt D, Dollinger M, Yang A, Eysholdt U, Lohscheller J (2010b). Automatic diagnosis of vocal fold paresis by employing phonovibrogram features and machine learning methods. Comput Methods Programs Biomed 99:275–88.

Wang X, Girshick RB, Gupta A, He K (2018). Non-local neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition; 7794–7803.

Published

2025-11-28

Data Availability Statement

This study utilizes the publicly available BAGLS (Benchmark for Automatic Glottis Segmentation) dataset. 

Gomez P, Kist AM, Schlegel P, Berry DA, Chhetri DK, Durr S, Echternach M, Johnson AM, Kniesburges S, Kunduk M, Maryn Y, Schutzenberger A, Verguts M, Dollinger M (2020). BAGLS, a multihospital benchmark for automatic glottis segmentation. Sci Data 7:186.

Section

Original Research Paper

How to Cite

B Panchami, & S Pravin Kumar. (2025). Comparison of Deep Learning Models for Voice Disorder Classification Using Phonovibrographic Images. Image Analysis and Stereology, 44(3), 183-196. https://doi.org/10.5566/ias.3741