Punctuation restoration for speech transcripts using large language models

Bibliographic Details
Main Author: Liu, Changsong
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Subjects:
Online Access: https://hdl.handle.net/10356/175306
Institution: Nanyang Technological University
Description
Summary: This thesis explores punctuation restoration in speech transcripts using Large Language Models (LLMs) to enhance text readability and comprehension. We focus on the efficacy of two LLMs, XLM-RoBERTa and Llama-2. The primary contributions are the refinement of an existing XLM-RoBERTa model and the fine-tuning of Llama-2, a 13-billion-parameter model, using several advanced techniques. For the XLM-RoBERTa model, we implement an evaluation pipeline and apply a model checkpoint ensemble technique that improves its F1-score by 3%. The fine-tuned Llama-2 model combines prompt engineering with Low-Rank Adaptation (LoRA) and achieves an F1-score of 0.73, matching or exceeding Google's state-of-the-art Gemini model across all punctuation classes. In addition, this project develops and documents the fine-tuning procedures, data processing strategies, and standardized evaluation methodologies for different LLMs, and our experimental analysis provides a thorough evaluation of model performance from which we draw meaningful conclusions. Based on the refined model architecture and the research conducted in this project, two papers have been accepted for publication at the ACIIDS and ICAICTA conferences. Future work will extend the LLM evaluations to additional datasets and further refine and fine-tune the models to address challenges that emerged during our experiments.
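
The core technique named in the abstract, fine-tuning Llama-2 with prompt engineering and Low-Rank Adaptation (LoRA), can be sketched with the Hugging Face transformers and peft libraries. This is a minimal illustrative sketch only: the model checkpoint, prompt template, and LoRA hyperparameters (r, lora_alpha, target modules) are assumptions for the example and are not drawn from the thesis.

# Minimal LoRA fine-tuning sketch for punctuation restoration.
# All identifiers and hyperparameters below are illustrative assumptions,
# not the configuration used in the thesis.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-13b-hf"  # assumed checkpoint; gated, requires access approval

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

# Low-Rank Adaptation: train small rank-decomposition matrices on the attention
# projections instead of updating all 13 billion base parameters.
lora_config = LoraConfig(
    r=16,                                # rank of the update matrices (assumed)
    lora_alpha=32,                       # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], # which projections receive adapters (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# A prompt-engineered training instance: unpunctuated ASR output in,
# punctuated text out (template is a hypothetical example).
PROMPT_TEMPLATE = (
    "Restore punctuation in the following transcript.\n"
    "Transcript: {source}\n"
    "Punctuated: {target}"
)
example = PROMPT_TEMPLATE.format(
    source="hello how are you today i am fine thanks",
    target="Hello, how are you today? I am fine, thanks.",
)

Under this setup only the low-rank adapter weights are trainable, which is what makes fine-tuning a 13-billion-parameter model tractable on modest hardware while leaving the base weights frozen.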