COMPARING CHATGPT-4O'S CAPABILITIES TO COMBINATIONS OF OPEN-SOURCE AND CLOSED-SOURCE ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS
Saved in:
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/84976 |
Institution: | Institut Teknologi Bandung |
Summary: | Globalization has made English one of the languages that continue to develop as a medium
of international communication. However, active language skills such as speaking remain
difficult to learn independently. Advances in AI can help address this by providing
voice-chat technology that understands the speech of non-native speakers from diverse
backgrounds. This study compares the capabilities of a multimodal model against
combinations of open-source and closed-source ASR and LLM models in understanding
English spoken by Indonesian non-native speakers. The evaluation follows the CRISP-DM
framework, which consists of business understanding, data understanding, data
preparation, modeling, and evaluation. A dataset was collected from 26 evaluation
subjects, each asking 20 questions on general-knowledge and mathematics topics in two
versions: one using the subject's own phrasing and one scripted. The evaluation
calculated the WER of the ASR transcriptions and the accuracy and cosine similarity of
the LLM answers produced from those transcriptions. The results show that a multimodal
model, GPT-4o, has superior transcription and question-answering capabilities compared
to combinations of ASR models with either open-source or closed-source LLMs, achieving
a WER of 0.0967, an accuracy of 0.8269, and a cosine similarity of 0.8964. It was
followed by the best-performing closed-source combination, Amazon Transcribe with
Claude 3.5 Sonnet, with a WER of 0.1045, an accuracy of 0.7856, and a cosine similarity
of 0.9194. Finally, the best-performing open-source combination, Canary-1B with
DeepSeek-V2, achieved a WER of 0.1665, an accuracy of 0.7471, and a cosine similarity
of 0.9093. |
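The abstract scores ASR transcriptions with WER and compares LLM answers using cosine similarity. As a rough illustration only (not the study's actual code), the sketch below implements the standard word-level edit-distance definition of WER and plain vector cosine similarity; the example strings and vectors are made-up toy inputs.

```python
import math


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy example: one dropped word out of six reference words -> WER = 1/6
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 4))  # 0.1667
```

In practice the study's cosine-similarity scores would be computed over sentence-embedding vectors of the reference and model answers rather than raw word lists; the vector math is the same.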