Transcription software with language model integration
With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transforming...
Main Author: | Najah Ismail |
---|---|
Other Authors: | Liu Siyuan |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Automatic speech recognition; Large language models |
Online Access: | https://hdl.handle.net/10356/181269 |
Institution: | Nanyang Technological University |
Language: | English |
id | sg-ntu-dr.10356-181269 |
---|---|
record_format | dspace |
spelling | sg-ntu-dr.10356-181269 2024-11-20T07:50:22Z; Transcription software with language model integration; Najah Ismail; Liu Siyuan, College of Computing and Data Science, SYLiu@ntu.edu.sg; Computer and Information Science; Automatic speech recognition; Large language models; Bachelor's degree; 2024-11-20T07:50:22Z; 2024; Final Year Project (FYP); Najah Ismail (2024). Transcription software with language model integration. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181269; en; SCSE23-1135; application/pdf; Nanyang Technological University |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Computer and Information Science; Automatic speech recognition; Large language models |
description |
With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transforming Natural Language Processing (NLP) tasks and augmenting ASR outputs. Existing solutions, although effective, underscore the need for locally-hosted software with on-device processing capabilities to fully harness these advancements in language processing and optimize clinical workflows. This paper introduces a locally-hosted transcription solution built on a technological stack featuring Streamlit for the user interface, Pyannote for speaker diarization, Whisper for transcription, and Llama3 for feature extraction, structured within a Human-In-The-Loop framework. The system's generated features include keywords, speaker profiles, topic summaries, reflections and conclusions.
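The record does not include the project's implementation, but a minimal Python sketch of how such a diarize, transcribe, and extract stack could be wired together is shown below; the model names, the Ollama client for Llama3, the placeholder token and file names, and the alignment helper are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of a local diarize -> transcribe -> extract pipeline.
# Assumes pyannote.audio, openai-whisper, and Llama3 served locally via Ollama;
# model sizes, prompts, and file names below are illustrative placeholders.
import whisper
import ollama
from pyannote.audio import Pipeline

AUDIO = "interview.wav"  # hypothetical input file

# 1. Speaker diarization (who spoke when)
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"  # placeholder token
)
diarization = diarizer(AUDIO)

# 2. Transcription with Whisper
asr = whisper.load_model("medium")
result = asr.transcribe(AUDIO)

# 3. Naive alignment: label each Whisper segment with the speaker whose
#    diarization turn covers the segment's midpoint.
def speaker_at(t):
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

transcript = "\n".join(
    f"[{speaker_at((seg['start'] + seg['end']) / 2)}] {seg['text'].strip()}"
    for seg in result["segments"]
)

# 4. Feature extraction with Llama3 (keywords, speaker profiles, topic summary, ...)
features = ollama.chat(
    model="llama3",
    messages=[{
        "role": "user",
        "content": "Extract keywords, speaker profiles, and a topic summary "
                   "from this transcript:\n\n" + transcript,
    }],
)
print(features["message"]["content"])
```

In a Human-In-The-Loop setup, the transcript and generated features would be surfaced in the Streamlit interface for a user to review and edit before export.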
In medical ASR, challenges such as accurately interpreting diverse accents, dialects, and specialized medical terminology persist. Domain-specific fine-tuning of models for niche applications can prove costly and inefficient. Accordingly, parameter adjustments in the Whisper transcription model, combined with LLMs, are explored to derive linguistically and contextually relevant information. Preliminary findings suggest that ‘initial prompts’ aid in adapting to linguistic and contextual nuances, particularly when paired with more advanced Whisper models. The best-case outcomes show a 38% reduction in Word Error Rate (WER) when addressing audio with both medical jargon and regional Singaporean terminology. Furthermore, a feature extraction pipeline was developed to automatically generate key qualitative elements from interviews. Although the pipeline provides valuable insights, it demonstrates limitations in comparison to direct retrieval methods, revealing constraints in the cognitive capabilities of the Llama3 model.
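A short sketch of the kind of parameter adjustment described above, assuming openai-whisper's `initial_prompt` argument and the `jiwer` package for WER; the audio file, reference transcript, and terminology list are placeholders, not the study's data.

```python
# Sketch: biasing Whisper with an initial prompt and measuring Word Error Rate.
# The domain terms, audio file, and reference transcript are placeholders.
import whisper
from jiwer import wer

reference = "the patient presented with dyspnoea after visiting the polyclinic"

model = whisper.load_model("medium")

# Baseline decode
baseline = model.transcribe("clinic_note.wav")["text"]

# Decode with an initial prompt carrying medical and local Singaporean terms,
# nudging the decoder toward in-domain vocabulary.
prompted = model.transcribe(
    "clinic_note.wav",
    initial_prompt="Clinical dictation. Terms: dyspnoea, polyclinic, MC, HDB, kopitiam.",
)["text"]

print("baseline WER:", wer(reference, baseline.lower()))
print("prompted WER:", wer(reference, prompted.lower()))
```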
The study also examines transcription correction through generative error correction techniques, integrating Llama3 with multiple Whisper models. Three distinct processing pipelines—Diarization, Correction (with keywords), and Correction (without keywords)—were evaluated. Despite encountering issues with overcorrection, minimal increases in WER and notable time savings underscore the potential of the Diarization pipeline as a time-efficient alternative to traditional diarization approaches.
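As a rough illustration of a generative error-correction step, assuming Llama3 is served locally through Ollama; the prompt wording, keyword handling, and example input below are assumptions rather than the pipelines evaluated in the project.

```python
# Sketch: LLM-based post-editing of an ASR segment, with optional keyword hints.
# Prompt design and keyword handling are illustrative assumptions.
import ollama

def correct_segment(segment, keywords=None):
    # Optional domain terms can be appended to steer the correction,
    # mirroring a "with keywords" vs "without keywords" variant.
    hint = f" Likely domain terms: {', '.join(keywords)}." if keywords else ""
    prompt = (
        "Fix only clear transcription errors in the following ASR output. "
        "Do not paraphrase or add content." + hint + "\n\n" + segment
    )
    reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"].strip()

# Hypothetical usage with a misrecognized local term
raw = "the patient complained of chest pain after climbing the HTB staircase"
print(correct_segment(raw, keywords=["HDB", "polyclinic"]))
```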
author2 | Liu Siyuan |
format | Final Year Project |
author | Najah Ismail |
title | Transcription software with language model integration |
publisher | Nanyang Technological University |
publishDate | 2024 |
url | https://hdl.handle.net/10356/181269 |
_version_ | 1816859029366898688 |