Adapting whisper for phoneme recognition on stroke-impaired speech

Phoneme recognition (PR) for impaired speech, such as speech affected by stroke, presents unique challenges due to phonetic vulnerability and articulation issues. This study investigates the adaptation of Whisper, a large-scale sequence-to-sequence audio model, for PR tasks in this doma...

Bibliographic Details
Main Author: Ong, Hai Xiang
Other Authors: -
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access: https://hdl.handle.net/10356/181493
Institution: Nanyang Technological University
id sg-ntu-dr.10356-181493
record_format dspace
spelling sg-ntu-dr.10356-181493 2024-12-05T05:47:40Z Adapting whisper for phoneme recognition on stroke-impaired speech Ong, Hai Xiang - College of Computing and Data Science Imperial College London Patrick Naylor p.naylor@imperial.ac.uk Computer and Information Science
Bachelor's degree 2024-12-05T05:47:40Z 2024 Final Year Project (FYP) Ong, H. X. (2024). Adapting whisper for phoneme recognition on stroke-impaired speech. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181493 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
description Phoneme recognition (PR) for impaired speech, such as speech affected by stroke, presents unique challenges due to phonetic vulnerability and articulation issues. This study investigates the adaptation of Whisper, a large-scale sequence-to-sequence audio model, for PR in this domain. It applies the Self-Supervised Contrastive Recalibration for Robust Encoding (SCORE) methodology, leveraging Whisper’s encoder for robust latent representation alignment. The results are compared against two state-of-the-art (SOTA) self-supervised learning models, WavLM and HuBERT, which have demonstrated strong performance on clean and noisy speech tasks. Although prior work reports significant relative improvements in phoneme error rate (PER) for WavLM and HuBERT when using SCORE, we found that Whisper consistently outperformed these models in PR for impaired speech. Whisper achieved a PER of 26.49%, surpassing the adjusted performance of WavLM and HuBERT even when accounting for hypothetical SCORE-induced gains. These findings suggest that Whisper’s architecture and extensive training on diverse data give it superior adaptability for handling speech variability and dysfluencies, highlighting its potential in clinical applications such as speech therapy and rehabilitation. This study further explores the impact of layer-freezing strategies on model performance, finding that unfreezing the top 8 layers in Whisper yields the best PR results among the strategies tested. Although this exploration of layer-freezing strategies is by no means exhaustive, these insights underscore the importance of architectural suitability, training diversity, and task-specific fine-tuning techniques in advancing PR for impaired speech.
author2 -
format Final Year Project
author Ong, Hai Xiang
title Adapting whisper for phoneme recognition on stroke-impaired speech
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/181493
_version_ 1819112942952513536
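
The phoneme error rate (PER) quoted in the description is conventionally computed as the Levenshtein edit distance between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The following is a minimal sketch under that assumption. The record does not state the project's exact scoring procedure, and the function names and the toy ARPAbet-style sequences are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences.

    Counts the minimum number of substitutions, deletions and insertions
    needed to turn `hyp` into `ref`.
    """
    # prev[j] holds the distance between the first i-1 reference phonemes
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference phoneme missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis phoneme)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER: total edit errors over total reference phonemes."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the project.
    refs = [["s", "t", "r", "ow", "k"], ["s", "p", "iy", "ch"]]
    hyps = [["s", "t", "ow", "k"], ["s", "b", "iy", "ch"]]
    print(f"PER = {phoneme_error_rate(refs, hyps):.2%}")
```

In this toy example one phoneme is deleted in the first utterance and one is substituted in the second, so two errors are counted against nine reference phonemes.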