Transcription software with language model integration
With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transforming...
Main Author: | Najah Ismail |
---|---|
Other Authors: | Liu Siyuan |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Automatic speech recognition; Large language models |
Online Access: | https://hdl.handle.net/10356/181269 |
Institution: | Nanyang Technological University |
Language: | English |
id | sg-ntu-dr.10356-181269 |
---|---|
record_format | dspace |
spelling | sg-ntu-dr.10356-181269 2024-11-20T07:50:22Z; Transcription software with language model integration; Najah Ismail; Liu Siyuan, College of Computing and Data Science, SYLiu@ntu.edu.sg; Computer and Information Science; Automatic speech recognition; Large language models; Bachelor's degree; 2024-11-20T07:50:22Z; 2024; Final Year Project (FYP); Najah Ismail (2024). Transcription software with language model integration. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181269; en; SCSE23-1135; application/pdf; Nanyang Technological University |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Computer and Information Science; Automatic speech recognition; Large language models |
description |
With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transforming Natural Language Processing (NLP) tasks and augmenting ASR outputs. Existing solutions, although effective, underscore the need for locally-hosted software with on-device processing capabilities to fully harness these advancements in language processing and optimize clinical workflows. This paper introduces a locally-hosted transcription solution built on a technological stack featuring Streamlit for the user interface, Pyannote for speaker diarization, Whisper for transcription, and Llama3 for feature extraction, structured within a Human-In-The-Loop framework. The system's generated features include keywords, speaker profiles, topic summaries, reflections and conclusions.
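The record does not include the project's implementation, but a minimal Python sketch of how such a diarize, transcribe, and extract stack could be wired together is shown below; the model names, the Ollama client for Llama3, the placeholder token and file names, and the alignment helper are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of a local diarize -> transcribe -> extract pipeline.
# Assumes pyannote.audio, openai-whisper, and Llama3 served locally via Ollama;
# model sizes, prompts, and file names below are illustrative placeholders.
import whisper
import ollama
from pyannote.audio import Pipeline

AUDIO = "interview.wav"  # hypothetical input file

# 1. Speaker diarization (who spoke when)
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"  # placeholder token
)
diarization = diarizer(AUDIO)

# 2. Transcription with Whisper
asr = whisper.load_model("medium")
result = asr.transcribe(AUDIO)

# 3. Naive alignment: label each Whisper segment with the speaker whose
#    diarization turn covers the segment's midpoint.
def speaker_at(t):
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

transcript = "\n".join(
    f"[{speaker_at((seg['start'] + seg['end']) / 2)}] {seg['text'].strip()}"
    for seg in result["segments"]
)

# 4. Feature extraction with Llama3 (keywords, speaker profiles, topic summary, ...)
features = ollama.chat(
    model="llama3",
    messages=[{
        "role": "user",
        "content": "Extract keywords, speaker profiles, and a topic summary "
                   "from this transcript:\n\n" + transcript,
    }],
)
print(features["message"]["content"])
```

In a Human-In-The-Loop setup, the transcript and generated features would be surfaced in the Streamlit interface for a user to review and edit before export.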
In medical ASR, challenges such as accurately interpreting diverse accents, dialects, and specialized medical terminology persist. Domain-specific fine-tuning of models for niche applications can prove costly and inefficient. Accordingly, parameter adjustments in the Whisper transcription model, combined with LLMs, are explored to derive linguistically and contextually relevant information. Preliminary findings suggest that ‘initial prompts’ aid in adapting to linguistic and contextual nuances, particularly when paired with more advanced Whisper models. The best-case outcomes show a 38% reduction in Word Error Rate (WER) when addressing audio with both medical jargon and regional Singaporean terminology. Furthermore, a feature extraction pipeline was developed to automatically generate key qualitative elements from interviews. Although the pipeline provides valuable insights, it demonstrates limitations in comparison to direct retrieval methods, revealing constraints in the cognitive capabilities of the Llama3 model.
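A short sketch of the kind of parameter adjustment described above, assuming openai-whisper's `initial_prompt` argument and the `jiwer` package for WER; the audio file, reference transcript, and terminology list are placeholders, not the study's data.

```python
# Sketch: biasing Whisper with an initial prompt and measuring Word Error Rate.
# The domain terms, audio file, and reference transcript are placeholders.
import whisper
from jiwer import wer

reference = "the patient presented with dyspnoea after visiting the polyclinic"

model = whisper.load_model("medium")

# Baseline decode
baseline = model.transcribe("clinic_note.wav")["text"]

# Decode with an initial prompt carrying medical and local Singaporean terms,
# nudging the decoder toward in-domain vocabulary.
prompted = model.transcribe(
    "clinic_note.wav",
    initial_prompt="Clinical dictation. Terms: dyspnoea, polyclinic, MC, HDB, kopitiam.",
)["text"]

print("baseline WER:", wer(reference, baseline.lower()))
print("prompted WER:", wer(reference, prompted.lower()))
```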
The study also examines transcription correction through generative error correction techniques, integrating Llama3 with multiple Whisper models. Three distinct processing pipelines—Diarization, Correction (with keywords), and Correction (without keywords)—were evaluated. Despite encountering issues with overcorrection, minimal increases in WER and notable time savings underscore the potential of the Diarization pipeline as a time-efficient alternative to traditional diarization approaches.
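As a rough illustration of a generative error-correction step, assuming Llama3 is served locally through Ollama; the prompt wording, keyword handling, and example input below are assumptions rather than the pipelines evaluated in the project.

```python
# Sketch: LLM-based post-editing of an ASR segment, with optional keyword hints.
# Prompt design and keyword handling are illustrative assumptions.
import ollama

def correct_segment(segment, keywords=None):
    # Optional domain terms can be appended to steer the correction,
    # mirroring a "with keywords" vs "without keywords" variant.
    hint = f" Likely domain terms: {', '.join(keywords)}." if keywords else ""
    prompt = (
        "Fix only clear transcription errors in the following ASR output. "
        "Do not paraphrase or add content." + hint + "\n\n" + segment
    )
    reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"].strip()

# Hypothetical usage with a misrecognized local term
raw = "the patient complained of chest pain after climbing the HTB staircase"
print(correct_segment(raw, keywords=["HDB", "polyclinic"]))
```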
author2 | Liu Siyuan |
format | Final Year Project |
author | Najah Ismail |
title | Transcription software with language model integration |
publisher | Nanyang Technological University |
publishDate | 2024 |
url | https://hdl.handle.net/10356/181269 |
_version_ | 1816859029366898688 |