COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS

Globalization has influenced English to become one of the languages that continue to develop as an international communication medium. However, learning active language skills like speaking remains difficult to do independently. The development of AI can help address this issue by providing v...

Bibliographic Details
Main Author: Angelina Eunike Leman, Gresya
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/84976
Institution: Institut Teknologi Bandung
id id-itb.:84976
Keywords: ASR, LLM, multimodal model, speech recognition
Record timestamp: 2024-08-19T11:53:05Z
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Globalization has made English one of the languages that continue to develop as a medium of international communication. However, active language skills such as speaking remain difficult to learn independently. Advances in AI can help address this problem by providing voice chat technology that understands the speech of non-native speakers from various backgrounds. This study compares the capabilities of a multimodal model with combinations of open-source and closed-source ASR and LLM models in understanding English spoken by Indonesian non-native speakers. The evaluation follows the CRISP-DM framework, which consists of business understanding, data understanding, data preparation, modeling, and evaluation. The dataset was collected from 26 evaluation subjects asking 20 questions on general knowledge and mathematics topics in two versions: one using each subject's own phrasing and one scripted. The evaluation calculated the WER of the ASR transcriptions, and the accuracy and cosine similarity of the LLM answers generated from those transcriptions. The results show that the multimodal model GPT-4o has superior transcription and question-answering capabilities compared to the combinations of ASR with open-source and closed-source LLM models, with WER 0.0967, accuracy 0.8269, and cosine similarity 0.8964. It was followed by the best-performing closed-source ASR and LLM combination, Claude 3.5 Sonnet and Amazon Transcribe, with WER 0.1045, accuracy 0.7856, and cosine similarity 0.9194. Lastly, the best-performing open-source ASR and LLM combination, DeepSeek-V2 and Canary-1B, achieved WER 0.1665, accuracy 0.7471, and cosine similarity 0.9093.
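The three metrics named in the description can be illustrated with a minimal Python sketch. This is not the study's implementation: the record does not say how text was normalized, how answers were judged correct, or which vector representation was used for cosine similarity. The sketch assumes lowercase whitespace tokenization, a simple bag-of-words vector for cosine similarity (the study may well have used embeddings), and a list of per-question correctness flags for accuracy. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: lowercase whitespace tokenization, no punctuation handling.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of bag-of-words count vectors (an assumed stand-in
    for whatever text representation the study actually used)."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def accuracy(correct_flags: Iterable[bool]) -> float:
    """Fraction of answers marked correct."""
    flags = list(correct_flags)
    if not flags:
        raise ValueError("at least one answer is required")
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Made-up example strings, not data from the study.
    reference = "what is the capital of indonesia"
    transcript = "what is capital of indonesia"
    print(word_error_rate(reference, transcript))
    print(cosine_similarity("the capital is jakarta", "jakarta is the capital"))
    print(accuracy([True, False, True, True]))
```

Under these assumptions a lower WER means a closer transcription, while higher accuracy and cosine similarity mean answers closer to the reference answers, which matches how the reported figures are ranked.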