Transcription software with language model integration

Bibliographic Details
Main Author: Najah Ismail
Other Authors: Liu Siyuan
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/181269
Institution: Nanyang Technological University
Description
Summary: With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transforming Natural Language Processing (NLP) tasks and augmenting ASR outputs. Existing solutions, although effective, underscore the need for locally hosted software with on-device processing capabilities to fully harness these advancements in language processing and optimize clinical workflows. This paper introduces a locally hosted transcription solution built on a technology stack featuring Streamlit for the user interface, Pyannote for speaker diarization, Whisper for transcription, and Llama3 for feature extraction, structured within a Human-in-the-Loop framework. The system's generated features include keywords, speaker profiles, topic summaries, reflections, and conclusions.

In medical ASR, challenges such as accurately interpreting diverse accents, dialects, and specialized medical terminology persist, and domain-specific fine-tuning of models for niche applications can prove costly and inefficient. Accordingly, parameter adjustments in the Whisper transcription model, combined with LLMs, are explored to derive linguistically and contextually relevant information. Preliminary findings suggest that 'initial prompts' aid in adapting to linguistic and contextual nuances, particularly when paired with more advanced Whisper models. The best-case outcomes show a 38% reduction in Word Error Rate (WER) on audio containing both medical jargon and regional Singaporean terminology. Furthermore, a feature extraction pipeline was developed to automatically generate key qualitative elements from interviews.
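The 'initial prompt' technique mentioned above could be sketched as follows, assuming the open-source `whisper` package (whose `transcribe()` does accept an `initial_prompt` argument); the term lists, prompt wording, and file name are illustrative assumptions, not taken from the thesis:

```python
# Sketch: biasing Whisper toward domain terminology via `initial_prompt`.
# The term lists below are hypothetical examples, not from the study.

MEDICAL_TERMS = ["myocardial infarction", "hypertension", "metformin"]
LOCAL_TERMS = ["polyclinic", "kopi", "void deck"]  # regional Singaporean usage

def build_initial_prompt(*term_groups):
    """Join domain terms into a short priming string for Whisper."""
    terms = [t for group in term_groups for t in group]
    return "Glossary: " + ", ".join(terms) + "."

prompt = build_initial_prompt(MEDICAL_TERMS, LOCAL_TERMS)

# With the whisper package installed, the prompt would be passed as:
# import whisper
# model = whisper.load_model("medium")
# result = model.transcribe("interview.wav", initial_prompt=prompt)
```

Because the prompt conditions the decoder rather than retraining it, this approach avoids the cost of domain-specific fine-tuning noted in the abstract.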
Although the pipeline provides valuable insights, it demonstrates limitations in comparison to direct retrieval methods, revealing constraints in the cognitive capabilities of the Llama3 model. The study also examines transcription correction through generative error correction techniques, integrating Llama3 with multiple Whisper models. Three distinct processing pipelines were evaluated: Diarization, Correction (with keywords), and Correction (without keywords). Despite issues with overcorrection, the minimal increases in WER and notable time savings underscore the potential of the Diarization pipeline as a time-efficient alternative to traditional diarization approaches.
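The keyword-conditioned generative error correction step could be sketched roughly as below; the prompt wording, the example transcript, and the use of the `ollama` Python client for a local Llama3 model are all assumptions for illustration, not the thesis's actual implementation:

```python
# Sketch of "Correction (with keywords)": wrap a raw ASR hypothesis and
# extracted keywords into an instruction prompt for an LLM such as Llama3.
# Prompt text and example content are hypothetical.

def build_correction_prompt(transcript: str, keywords=None) -> str:
    """Ask the LLM to fix likely ASR errors without rewriting content."""
    lines = [
        "Correct only clear transcription errors in the text below.",
        "Do not add, remove, or rephrase content.",  # guard against overcorrection
    ]
    if keywords:
        lines.append("Likely domain terms: " + ", ".join(keywords))
    lines.append("Transcript:\n" + transcript)
    return "\n".join(lines)

prompt = build_correction_prompt(
    "patient was given met forming for diabetes",
    keywords=["metformin"],
)

# With a locally hosted Llama3 (e.g. via the ollama client), the call
# might look like:
# import ollama
# reply = ollama.chat(model="llama3",
#                     messages=[{"role": "user", "content": prompt}])
```

The explicit "do not rephrase" instruction reflects the overcorrection problem the study reports: without such a guard, generative correction tends to rewrite fluent but accurate passages.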