Adapting Whisper for phoneme recognition on stroke-impaired speech


Overview

Bibliographic Details
Main Author: Ong, Hai Xiang
Other Authors: -
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Subjects:
Online Access: https://hdl.handle.net/10356/181493
Institution: Nanyang Technological University
Physical Description
Summary: Phoneme recognition (PR) for impaired speech, such as that affected by stroke-related impairments, presents unique challenges due to phonetic vulnerability and articulation issues. This study investigates the adaptation of Whisper, a large-scale sequence-to-sequence audio model, for PR tasks in this domain. The research applies the Self-Supervised Contrastive Recalibration for Robust Encoding (SCORE) methodology, leveraging its encoder for robust latent representation alignment. The results are compared against state-of-the-art self-supervised learning models, WavLM and HuBERT, which have demonstrated strong performance on clean and noisy speech tasks. Despite the significant relative improvements in phoneme error rate (PER) that WavLM and HuBERT achieve with SCORE, as reported in prior work, we found that Whisper consistently outperformed these models in PR for impaired speech. Whisper achieved a PER of 26.49%, surpassing the adjusted performance of WavLM and HuBERT even when accounting for hypothetical SCORE-induced gains. These findings suggest that Whisper's architecture and extensive training on diverse data give it superior adaptability to speech variability and dysfluencies, highlighting its potential in clinical applications such as speech therapy and rehabilitation. The study further explores the impact of layer-freezing strategies on model performance, revealing that unfreezing the top 8 layers of Whisper yields the best PR results. While the exploration of layer-freezing strategies is by no means exhaustive, these insights underscore the importance of architectural suitability, training diversity, and task-specific fine-tuning techniques in advancing PR for impaired speech.
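The summary reports results as phoneme error rate (PER). For reference, PER is conventionally computed as the Levenshtein (edit) distance between the hypothesis and reference phoneme sequences, divided by the reference length. A minimal sketch (the phoneme labels and function name below are illustrative, not from the study):

```python
def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + insertions + deletions) / len(ref),
    i.e. edit distance between phoneme sequences over reference length."""
    # Standard dynamic-programming Levenshtein distance with a rolling row:
    # d[j] holds the distance between ref[:i] and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev = d[0]          # dp[i-1][j-1] for the inner loop
        d[0] = i
        for j, h in enumerate(hyp, start=1):
            cur = d[j]       # dp[i-1][j], saved before overwriting
            d[j] = min(
                d[j] + 1,          # deletion
                d[j - 1] + 1,      # insertion
                prev + (r != h),   # substitution (free if phonemes match)
            )
            prev = cur
    return d[-1] / len(ref)

# One substitution ("ae" -> "ah") out of three reference phonemes:
per = phoneme_error_rate(["k", "ae", "t"], ["k", "ah", "t"])
print(round(per, 4))  # → 0.3333
```

A PER of 26.49% thus means roughly one phoneme edit per four reference phonemes.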