Transcription software with language model integration
Main Author:
Other Authors:
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Subjects:
Online Access: https://hdl.handle.net/10356/181269
Institution: Nanyang Technological University
Summary: With rapid technological advancements in Artificial Intelligence (AI), the healthcare sector is increasingly interested in integrating Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to enhance medical transcription for diagnostic, clinical, and operational purposes, transforming Natural Language Processing (NLP) tasks and augmenting ASR outputs. Existing solutions, although effective, underscore the need for locally hosted software with on-device processing capabilities to fully harness these advancements in language processing and optimize clinical workflows. This paper introduces a locally hosted transcription solution built on a technological stack featuring Streamlit for the user interface, Pyannote for speaker diarization, Whisper for transcription, and Llama3 for feature extraction, structured within a Human-In-The-Loop framework. The system's generated features include keywords, speaker profiles, topic summaries, reflections, and conclusions.
In medical ASR, challenges persist in accurately interpreting diverse accents, dialects, and specialized medical terminology. Domain-specific fine-tuning of models for niche applications can prove costly and inefficient. Accordingly, parameter adjustments in the Whisper transcription model, combined with LLMs, are explored to derive linguistically and contextually relevant information. Preliminary findings suggest that 'initial prompts' aid in adapting to linguistic and contextual nuances, particularly when paired with more advanced Whisper models. The best-case outcomes show a 38% reduction in Word Error Rate (WER) on audio containing both medical jargon and regional Singaporean terminology. Furthermore, a feature extraction pipeline was developed to automatically generate key qualitative elements from interviews. Although the pipeline provides valuable insights, it demonstrates limitations compared with direct retrieval methods, revealing constraints in the cognitive capabilities of the Llama3 model.
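The 'initial prompts' referred to here correspond to the initial_prompt argument of Whisper's transcribe() call, which biases decoding toward supplied vocabulary. A minimal sketch follows, with an illustrative domain prompt and the jiwer package for WER scoring; file names and prompt terms are assumptions.

```python
# Sketch of initial-prompt adaptation and WER measurement.
import whisper
from jiwer import wer  # standard Word Error Rate metric

model = whisper.load_model("medium")

# Illustrative in-domain vocabulary mixing medical and Singaporean terms.
domain_prompt = (
    "Clinical interview in Singapore. Terms: hypertension, metformin, "
    "polyclinic, MC (medical certificate), giddy, void deck."
)

baseline = model.transcribe("consult.wav")
prompted = model.transcribe("consult.wav", initial_prompt=domain_prompt)

reference = open("consult_reference.txt").read()
print("WER baseline:", wer(reference, baseline["text"]))
print("WER prompted:", wer(reference, prompted["text"]))
```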
The study also examines transcription correction through generative error correction techniques, integrating Llama3 with multiple Whisper models. Three distinct processing pipelines were evaluated: Diarization, Correction (with keywords), and Correction (without keywords). Despite issues with overcorrection, minimal increases in WER and notable time savings underscore the potential of the Diarization step as a time-efficient alternative to traditional diarization approaches.
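A hedged sketch of the generative error correction step described above: raw Whisper output is passed to a locally hosted Llama3 with an optional keyword hint, mirroring the Correction (with/without keywords) pipelines, and the prompt explicitly discourages rephrasing to limit the overcorrection noted in the findings. The function and prompt wording are illustrative assumptions.

```python
# Sketch of generative error correction with an optional keyword hint.
import ollama

def correct_transcript(raw_text: str, keywords: list[str] | None = None) -> str:
    hint = f"Known domain terms: {', '.join(keywords)}.\n" if keywords else ""
    prompt = (
        "The following is an automatic speech recognition transcript that "
        "may contain recognition errors. Correct obvious errors only; do "
        "not rephrase or add content.\n" + hint + "Transcript:\n" + raw_text
    )
    reply = ollama.chat(model="llama3",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

# Usage: a deliberately garbled ASR string with matching keyword hints.
print(correct_transcript(
    "patient reports feeling gid he after taking met form in",
    keywords=["giddy", "metformin"],
))
```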