Adapting Whisper for phoneme recognition on stroke-impaired speech
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/181493 |
Institution: | Nanyang Technological University |
Summary:

Phoneme recognition (PR) for impaired speech, such as speech affected by stroke-related impairments, presents unique challenges due to phonetic vulnerability and articulation issues. This study investigates the adaptation of Whisper, a large-scale sequence-to-sequence audio model, for PR tasks in this domain.
This research applies the Self-Supervised Contrastive Recalibration for Robust Encoding (SCORE) methodology, leveraging its encoder for robust latent representation alignment. The results are compared against state-of-the-art self-supervised learning models, WavLM and HuBERT, which have demonstrated strong performance on clean and noisy speech tasks.
Although prior work reports significant relative improvements in phoneme error rate (PER) for WavLM and HuBERT when using SCORE, we found that Whisper consistently outperformed these models in PR for impaired speech. Whisper achieved a PER of 26.49%, surpassing the adjusted performance of WavLM and HuBERT even when accounting for hypothetical SCORE-induced gains.
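For context, PER is typically computed as the Levenshtein (edit) distance between the predicted and reference phoneme sequences, normalised by the reference length. The minimal Python sketch below illustrates the metric on made-up phoneme sequences; it is not the evaluation code used in this study.

```python
# Minimal sketch of phoneme error rate (PER): edit distance between a
# hypothesis and a reference phoneme sequence, divided by the reference length.
# The phoneme sequences below are illustrative, not taken from the study's data.

def levenshtein(ref, hyp):
    """Edit distance between two phoneme sequences (lists of symbols)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,            # deletion
                d[j - 1] + 1,        # insertion
                prev + (r != h),     # substitution (0 cost if phonemes match)
            )
    return d[-1]

def per(ref, hyp):
    return levenshtein(ref, hyp) / len(ref)

reference = ["hh", "ah", "l", "ow"]        # ARPAbet-style, illustrative only
hypothesis = ["hh", "ah", "l", "ax", "w"]
print(f"PER = {per(reference, hypothesis):.2%}")  # 2 edits / 4 phonemes = 50.00%
```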
These findings suggest that Whisper’s architecture and extensive training on diverse data provide it with superior adaptability for handling speech variability and dysfluencies, highlighting its potential in clinical applications like speech therapy and rehabilitation.
This study further explores the impact of layer-freezing strategies on model performance, revealing that unfreezing the top 8 layers of Whisper yields the best PR results. While the exploration of layer-freezing strategies is by no means exhaustive, these insights underscore the importance of architectural suitability, training diversity and task-specific fine-tuning techniques in advancing PR for impaired speech.
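As a hedged illustration of how such a layer-freezing scheme might be set up, the sketch below freezes every parameter of a Hugging Face Whisper checkpoint and then re-enables gradients for the top 8 encoder layers only. The checkpoint name, the use of the `transformers` library and the choice of encoder (rather than decoder) layers are assumptions made for illustration; the thesis's exact configuration may differ.

```python
# Hedged sketch: prepare Whisper for fine-tuning with only the top 8 encoder
# layers unfrozen. The "openai/whisper-small" checkpoint (12 encoder layers)
# and the Hugging Face `transformers` implementation are illustrative
# assumptions, not necessarily the study's setup.
from transformers import WhisperModel

model = WhisperModel.from_pretrained("openai/whisper-small")

# Freeze all parameters first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the top 8 encoder layers (those closest to the encoder output).
for layer in model.encoder.layers[-8:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")
```

Only the freezing pattern is shown here; a phoneme-level output head and a fine-tuning loop would still need to be attached on top of the encoder.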