Evaluating the Impact of Audio Segment Duration on Transformer-Based Stuttering Detection Using Wav2Vec2

Authors

R. Singh and D. Gupta

DOI:

https://doi.org/10.26438/ijcse/v13i7.5157

Keywords:

Stuttering detection, Speech processing

Abstract

Stuttering is a speech disorder that disrupts the fluency of verbal communication. Traditional assessment methods are subjective and labor-intensive, prompting the need for scalable, automated alternatives. Recent advances in self-supervised learning and transformer-based models such as Wav2Vec2 offer promising capabilities for automated stuttering detection. This study investigates how audio clip length affects the accuracy of stuttering classification with Wav2Vec2 models. Experiments were conducted on three benchmark datasets (SEP-28k, FluencyBank, and KSoF) across clip durations ranging from 3 to 11 seconds. Results show that shorter audio segments (3–5 seconds) consistently achieve better classification accuracy, with a peak of 65.13% observed for 3-second segments on SEP-28k. Longer durations introduce greater performance variability, especially in cross-dataset evaluations. The findings support the design of efficient, real-time stuttering detection systems and inform the choice of segment length for future speech analysis models.
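To make the segment-and-classify pipeline described in the abstract concrete, the Python sketch below slices a recording into fixed-length clips and scores each clip with a Wav2Vec2 sequence classifier. It is a minimal illustration, not the paper's exact pipeline: the checkpoint name, the input file path, and the non-overlapping segmentation strategy are assumptions for demonstration purposes.

    # Minimal sketch of fixed-length segment classification with Wav2Vec2.
    # Checkpoint name and file path below are hypothetical placeholders.
    import torch
    import torchaudio
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

    SAMPLE_RATE = 16_000      # Wav2Vec2 models expect 16 kHz mono audio
    SEGMENT_SECONDS = 3       # 3 s segments performed best in this study

    # Hypothetical fine-tuned stuttering classifier; substitute your own.
    model_name = "your-org/wav2vec2-stuttering-classifier"
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name).eval()

    def classify_segments(path: str, seconds: int = SEGMENT_SECONDS):
        """Split a recording into fixed-length clips and classify each one."""
        waveform, sr = torchaudio.load(path)
        waveform = waveform.mean(dim=0)              # downmix to mono
        if sr != SAMPLE_RATE:
            waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
        seg_len = seconds * SAMPLE_RATE
        predictions = []
        # Step through the waveform in non-overlapping fixed-length windows.
        for start in range(0, waveform.numel() - seg_len + 1, seg_len):
            clip = waveform[start:start + seg_len]
            inputs = extractor(clip.numpy(), sampling_rate=SAMPLE_RATE,
                               return_tensors="pt")
            with torch.no_grad():
                logits = model(**inputs).logits
            predictions.append(int(logits.argmax(dim=-1)))
        return predictions

    print(classify_segments("speech_sample.wav"))  # illustrative input file

Varying SEGMENT_SECONDS from 3 to 11 reproduces the kind of duration sweep the study reports; shorter windows also reduce per-clip memory and latency, which matters for the real-time systems the abstract motivates.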

References

[1] S. A. Sheikh, M. Sahidullah, F. Hirsch, and S. Ouni, “Machine learning for stuttering identification: Review, challenges and future directions”, Neurocomputing, Vol.514, pp.385-402, 2022. doi: 10.1016/j.neucom.2022.10.015.

[2] V. Changawala and F. Rudzicz, “Whister: Using Whisper’s representations for Stuttering detection”, in Proc. Interspeech 2024.

[3] S. A. Sheikh, M. Sahidullah, F. Hirsch, and S. Ouni, “End-to-End and Self-Supervised Learning for ComParE 2022 Stuttering Sub-Challenge”, arXiv preprint, arXiv:2207.10817, 2022. doi: 10.48550/arXiv.2207.10817

[4] S. P. Bayerl, D. Wagner, E. Nöth, and K. Riedhammer, “Self-supervised learning for stuttering detection: Challenges and opportunities”, arXiv preprint, arXiv:2204.03417, 2022. doi: 10.48550/arXiv.2204.03417.

[5] S. A. Sheikh, M. Sahidullah, F. Hirsch, and S. Ouni, “Advances in Stuttering Detection: Exploring Self-Supervised and End-to-End Learning Approaches”, arXiv preprint, arXiv:2204.01564, 2022. doi: 10.48550/arXiv.2204.01564

[6] S. A. Sheikh, M. Sahidullah, F. Hirsch, and S. Ouni, “Robust Stuttering Detection via Multi-task and Adversarial Learning”, arXiv preprint, arXiv:2204.01735, 2022. doi: 10.48550/arXiv.2204.01735.

[7] R. Alnashwan, N. Alhakbani, A. Al-Nafjan, A. Almudhi, and W. Al-Nuwaiser, “Computational Intelligence-Based Stuttering Detection: A Systematic Review”, Diagnostics, Vol.13, No.23, Article 3537, 2023. doi: 10.3390/diagnostics13233537.

[8] C. Lea, V. Mitra, A. Joshi, S. Kajarekar and J. P. Bigham, “SEP-28k: A Dataset for Stuttering Event Detection from Podcasts with People Who Stutter”, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, pp.6798-6802, 2021. doi: 10.1109/ICASSP39728.2021.9413520

[9] P. Filipowicz and B. Kostek, “Rediscovering Automatic Detection of Stuttering and Its Subclasses through Machine Learning—The Impact of Changing Deep Model Architecture and Amount of Data in the Training Set”, Applied Sciences, Vol.13, No.10, pp.6192, 2023. doi: 10.3390/app13106192.

[10] K. Basak, N. Mishra, and H.-T. Chang, “TranStutter: A Convolution-Free Transformer-Based Deep Learning Method to Classify Stuttered Speech Using 2D Mel-Spectrogram Visualization and Attention-Based Feature Representation”, Sensors, Vol.23, No.19, pp.8033, 2023. doi: 10.3390/s23198033.

[11] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh and K. Shaalan, “Speech Recognition Using Deep Neural Networks: A Systematic Review”, in IEEE Access, Vol.7, pp.19143-19165, 2019. doi: 10.1109/ACCESS.2019.2896880.

[12] T. Kourkounakis, A. Hajavi, and A. Etemad, “Detecting Multiple Speech Disfluencies Using a Deep Residual Network with Bidirectional Long Short-Term Memory”, arXiv preprint, arXiv:1910.12590, 2019. doi: 10.48550/arXiv.1910.12590

[13] T. Kourkounakis, A. Hajavi, and A. Etemad, “FluentNet: End-to-End Detection of Stuttered Speech Disfluencies With Deep Learning”, arXiv preprint, arXiv:2009.11394, 2020. doi: 10.48550/arXiv.2009.11394

[14] S. Khara, S. Singh and D. Vir, “A Comparative Study of the Techniques for Feature Extraction and Classification in Stuttering”, 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, pp.887-893, 2018. doi: 10.1109/ICICCT.2018.8473099

[15] S. A. Sheikh, M. Sahidullah, F. Hirsch, and S. Ouni, “StutterNet: Stuttering Detection Using Time Delay Neural Network”, arXiv preprint, arXiv:2105.05599, 2021. doi: 10.48550/arXiv.2105.05599

[16] M. Jouaiti and K. Dautenhahn, “Dysfluency Classification in Stuttered Speech Using Deep Learning for Real-Time Applications”, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022. doi: 10.1109/ICASSP43922.2022.9746638.

[17] S. P. Bayerl, A. Wolff von Gudenberg, F. Hönig, E. Nöth, and K. Riedhammer, “KSoF: The Kassel State of Fluency Dataset – A Therapy Centered Dataset of Stuttering”, in Proceedings of the Language Resources and Evaluation Conference, Marseille, France: European Language Resources Association, pp.1780–1787, 2022.

[18] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, arXiv preprint, arXiv:2006.11477, 2020. doi: 10.48550/arXiv.2006.11477.

Published

2025-07-31

How to Cite

[1] R. Singh and D. Gupta, “Evaluating the Impact of Audio Segment Duration on Transformer-Based Stuttering Detection Using Wav2Vec2”, Int. J. Comp. Sci. Eng., vol. 13, no. 7, pp. 51–57, Jul. 2025.

Issue

Vol. 13, No. 7 (July 2025)

Section

Research Article