DEVELOPMENT OF A SPEECH TO TEXT INTERVIEW SUMMARIZATION SYSTEM BASED ON MACHINE LEARNING

Bibliographic Details
Main Author: Hanif Raharjanto, Dwianditya
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78150
Institution: Institut Teknologi Bandung
Description
Summary: Regenerating human resources is crucial for a company to sustain its operations and achieve its vision and mission, and it is accomplished by recruiting employees. Recruitment, however, consumes a significant amount of time and resources in the search for suitable candidates. This final project proposes combining machine and human effort to reduce the time and cost of recruitment, particularly during the interview process. It focuses on producing interview transcripts with a speech-to-text model and on selecting the appropriate model for this use case, either Wav2Vec2 (Wav2Vec2-XLSR-53) or Whisper (Whisper-small and Whisper-large). According to prior research, Whisper performs better than Wav2Vec2. The reasons given are that Whisper is a weakly supervised model whereas Wav2Vec2 is trained using semi-supervised methods, that the training corpus used for Whisper is larger than that of Wav2Vec2, and that Whisper has more parameters: 1.55 billion (Whisper-large) compared with Wav2Vec2's 300 million. The experimental results confirm that Whisper, and Whisper-large in particular, outperforms Wav2Vec2. Whisper-large achieves a Word Error Rate (WER) of 10.9% with an average processing time of 5 minutes and 23 seconds for audio lasting 5 to 7 minutes, whereas Wav2Vec2-XLSR-53 has a WER of 22.2% with a processing time of 13 minutes and 20 seconds. Whisper-large was therefore chosen to assist in the job interview process, because it delivers the required performance in both accuracy and speed.
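The WER figures above count the word-level substitutions, deletions, and insertions needed to turn a model transcript into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation, not the project's actual evaluation code: the function name `word_error_rate`, the lowercasing and whitespace tokenization, and the sample sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; a real evaluation may also strip punctuation or numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the transcript drops one of seven reference words.
reference = "i have three years of work experience"
hypothesis = "i have three years work experience"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

Under this definition, a WER of 10.9% corresponds to roughly 11 word errors per 100 reference words, and the value can exceed 100% when the transcript contains many inserted words.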