DEVELOPMENT OF A SPEECH TO TEXT INTERVIEW SUMMARIZATION SYSTEM BASED ON MACHINE LEARNING

Bibliographic Details
Main Author: Hanif Raharjanto, Dwianditya
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78150
Institution: Institut Teknologi Bandung
Record ID: id-itb.:78150
Keywords: Automatic Speech Recognition, Word Error Rate, Whisper, Wav2Vec2, Transformer
Record timestamp: 2023-09-18T09:53:45Z
Building: Institut Teknologi Bandung Library
Continent: Asia
Country: Indonesia
Collection: Digital ITB
Description: Regenerating a company's human resources is crucial for sustaining its operations and achieving its vision and mission, and this regeneration can be achieved through employee recruitment. However, recruitment consumes significant time and resources in the search for suitable candidates. This final project proposes a solution that combines machine and human effort to reduce the company's time and cost, particularly during the interview process. It focuses on producing interview transcripts with a speech-to-text model and on selecting the appropriate model for this use case: Wav2Vec2 (Wav2Vec2-XLSR-53) or Whisper (Whisper-small and Whisper-large). Prior research indicates that Whisper outperforms Wav2Vec2. This advantage is attributed to Whisper being trained with weak supervision, whereas Wav2Vec2 is trained with semi-supervised methods. Whisper is also trained on a larger corpus, and Whisper-large has more parameters: 1.55 billion, compared with about 300 million for Wav2Vec2-XLSR-53. The experimental results confirm that Whisper, especially Whisper-large, outperforms Wav2Vec2. Whisper-large achieved a word error rate (WER) of 10.9% with an average processing time of 5 minutes 23 seconds on audio recordings of 5 to 7 minutes, whereas Wav2Vec2-XLSR-53 reached a WER of 22.2% with a processing time of 13 minutes 20 seconds. Whisper-large was therefore chosen to support the job interview process, as it delivers the required combination of accuracy and speed.
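The WER figures above follow the standard definition of word error rate: the word-level edit distance between a model transcript and a reference transcript, counted as substitutions S, deletions D and insertions I, divided by the number of reference words N, i.e. WER = (S + D + I) / N. The sketch below illustrates that definition in Python. It is not the project's own evaluation code; the function name `word_error_rate` is hypothetical, and the text normalization (lowercasing and splitting on whitespace only) is an assumption, since evaluation toolkits may normalize punctuation and casing differently. The example sentence is a hypothetical Indonesian interview answer meaning "I have worked as a data analyst".

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (S + D + I) / N using a word-level Levenshtein alignment.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; the project's actual normalization is not stated.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = minimum edits turning the reference words seen so far
    # into the first j hypothesis words (previous row of the DP table).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of ref[i - 1]
                curr[j - 1] + 1,     # insertion of hyp[j - 1]
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical reference and model output for one interview answer.
    reference = "saya pernah bekerja sebagai analis data"
    hypothesis = "saya pernah kerja sebagai analis"
    # One substitution (bekerja -> kerja) and one deletion (data) over 6 words.
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")  # prints WER = 0.333
```

Because the denominator is the reference length, WER can exceed 100% when a transcript contains many insertions. Applied to the reported figures, Whisper-large's WER (10.9%) is about 51% lower than Wav2Vec2-XLSR-53's (22.2%), and Wav2Vec2-XLSR-53 took about 2.5 times as long to process (800 s versus 323 s).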