Transcription software with language model integration

With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transfor...

Bibliographic Details
Main Author: Najah Ismail
Other Authors: Liu Siyuan
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Automatic speech recognition; Large language models
Online Access: https://hdl.handle.net/10356/181269
Institution: Nanyang Technological University
id sg-ntu-dr.10356-181269
record_format dspace
spelling sg-ntu-dr.10356-181269 2024-11-20T07:50:22Z Transcription software with language model integration Najah Ismail Liu Siyuan College of Computing and Data Science SYLiu@ntu.edu.sg Computer and Information Science Automatic speech recognition Large language models Bachelor's degree 2024-11-20T07:50:22Z 2024-11-20T07:50:22Z 2024 Final Year Project (FYP) Najah Ismail (2024). Transcription software with language model integration. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181269 https://hdl.handle.net/10356/181269 en SCSE23-1135 application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Automatic speech recognition
Large language models
spellingShingle Computer and Information Science
Automatic speech recognition
Large language models
Najah Ismail
Transcription software with language model integration
description With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transforming Natural Language Processing (NLP) tasks and augmenting ASR outputs. Existing solutions, although effective, underscore the need for locally hosted software with on-device processing capabilities to fully harness these advancements in language processing and optimize clinical workflows. This paper introduces a locally hosted transcription solution built on a technological stack featuring Streamlit for the user interface, Pyannote for speaker diarization, Whisper for transcription, and Llama3 for feature extraction, structured within a Human-In-The-Loop framework. The system’s generated features include keywords, speaker profiles, topic summaries, reflections, and conclusions. In medical ASR, challenges such as accurately interpreting diverse accents, dialects, and specialized medical terminology persist. Domain-specific fine-tuning of models for niche applications can prove costly and inefficient. Accordingly, parameter adjustments in the Whisper transcription model, combined with LLMs, are explored to derive linguistically and contextually relevant information. Preliminary findings suggest that ‘initial prompts’ aid in adapting to linguistic and contextual nuances, particularly when paired with more advanced Whisper models. The best-case outcomes show a 38% reduction in Word Error Rate (WER) when addressing audio with both medical jargon and regional Singaporean terminology.
Furthermore, a feature extraction pipeline was developed to automatically generate key qualitative elements from interviews. Although the pipeline provides valuable insights, it demonstrates limitations in comparison to direct retrieval methods, revealing constraints in the cognitive capabilities of the Llama3 model. The study also examines transcription correction through generative error correction techniques, integrating Llama3 with multiple Whisper models. Three distinct processing pipelines were evaluated: Diarization, Correction (with keywords), and Correction (without keywords). Despite encountering issues with overcorrection, minimal increases in WER and notable time savings underscore the potential of the Diarization step as a time-efficient alternative to traditional diarization approaches.
author2 Liu Siyuan
author_facet Liu Siyuan
Najah Ismail
format Final Year Project
author Najah Ismail
author_sort Najah Ismail
title Transcription software with language model integration
title_short Transcription software with language model integration
title_full Transcription software with language model integration
title_fullStr Transcription software with language model integration
title_full_unstemmed Transcription software with language model integration
title_sort transcription software with language model integration
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/181269
_version_ 1816859029366898688
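
The description above reports results as a reduction in Word Error Rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. It is not the project's evaluation code. The function names, the lowercase whitespace tokenization, and the sample sentences are assumptions made for illustration, and the record does not state whether the reported 38% is a relative or an absolute reduction; the helper below assumes a relative one.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Tokenization (lowercase, split on whitespace) is an illustrative
    assumption, not the normalization used in the project.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the project's data.
    reference = "the patient has dengue fever"
    hypothesis = "the patient has dang a fever"
    print(word_error_rate(reference, hypothesis))

    # Hypothetical WER values chosen only to illustrate the arithmetic.
    print(relative_wer_reduction(0.50, 0.31))
```

Under the relative definition, a system whose WER falls from 0.50 to 0.31 has a reduction of about 38%; these two values are made up for the arithmetic and do not come from the project.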