COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATIONS OF OPEN-SOURCE AND CLOSED-SOURCE ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS


Bibliographic Details
Main Author: Angelina Eunike Leman, Gresya
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/84976
Institution: Institut Teknologi Bandung
Description
Summary: Globalization has made English one of the languages that continue to develop as a medium of international communication. However, active language skills such as speaking remain difficult to learn independently. The development of AI can help address this issue by providing voice-chat technology that can understand the speech of non-native speakers from various backgrounds. This study compares the capabilities of a multimodal model against combinations of open-source and closed-source ASR and LLM models in understanding English spoken by Indonesian non-native speakers. The evaluation follows the CRISP-DM framework, which consists of business understanding, data understanding, data preparation, modeling, and evaluation. A dataset was collected from 26 evaluation subjects, each asking 20 questions on general knowledge and mathematics topics in two versions: one using the subject's own phrasing and one scripted. The evaluation was conducted by calculating the WER of the ASR transcriptions and the accuracy and cosine similarity of the LLM answers produced from those transcriptions. The results show that a multimodal model, GPT-4o, has superior transcription and question-answering capabilities compared to combinations of ASR models with both open-source and closed-source LLMs, with a WER of 0.0967, accuracy of 0.8269, and cosine similarity of 0.8964. It was followed by the best-performing closed-source ASR and LLM combination, Amazon Transcribe and Claude 3.5 Sonnet, with a WER of 0.1045, accuracy of 0.7856, and cosine similarity of 0.9194. Lastly, the best-performing open-source ASR and LLM combination, Canary-1B and DeepSeek-V2, achieved a WER of 0.1665, accuracy of 0.7471, and cosine similarity of 0.9093.
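The two core metrics named in the abstract, WER and cosine similarity, can be sketched as follows. This is a minimal illustration only, not the study's actual evaluation code; the function names are our own, WER is computed here as word-level Levenshtein distance divided by reference length, and in practice the answers would first be embedded with a sentence-embedding model before cosine similarity is taken.

```python
import math

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match/substitution
    return d[-1][-1] / len(ref)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two (embedding) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One substituted word out of three reference words -> WER of 1/3.
print(round(wer("the cat sat", "the cat sit"), 4))      # 0.3333
# Identical vectors -> similarity of 1.0.
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))        # 1.0
```

A lower WER indicates a more faithful transcription, while a cosine similarity closer to 1 indicates an answer semantically closer to the reference answer, which matches how the abstract ranks the model combinations.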