COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS
Globalization has influenced English to become one of the languages that continue to develop as an international communication medium. However, learning active language skills like speaking remains difficult to do independently. The development of AI can help address this issue by providing v...
Main Author: | Angelina Eunike Leman, Gresya |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/84976 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:84976 |
---|---|
spelling |
id-itb.:84976; 2024-08-19T11:53:05Z; COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS; Angelina Eunike Leman, Gresya; Indonesia; Final Project; ASR, LLM, multimodal model, speech recognition; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/84976; text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Globalization has made English an international medium of communication that
continues to evolve. However, active language skills such as speaking remain
difficult to learn independently. Advances in AI can help address this problem
through voice chat technology that understands the speech of non-native speakers
from various backgrounds. This study compares the capabilities of a multimodal
model with combinations of open-source and closed-source ASR and LLM models in
understanding English spoken by Indonesian non-native speakers. The evaluation
follows the CRISP-DM framework, which consists of business understanding, data
understanding, data preparation, modeling, and evaluation. A dataset was
collected from 26 evaluation subjects, each asking 20 questions on general
knowledge and mathematics topics in 2 versions: one in the subject’s own
phrasing and one scripted. The evaluation calculated WER, as well as the
accuracy and cosine similarity of the LLM’s answers based on the ASR
transcriptions. The results show that multimodal models like GPT-4o have
superior transcription and question-answering capabilities compared to both
open-source and closed-source combinations of ASR and LLM models, with WER
0.0967, accuracy 0.8269, and cosine similarity 0.8964. This was followed by the
best-performing closed-source ASR and LLM combination, Amazon Transcribe and
Claude 3.5 Sonnet, with WER 0.1045, accuracy 0.7856, and cosine similarity
0.9194. Lastly, the best-performing open-source ASR and LLM combination,
Canary-1B and DeepSeek-V2, achieved WER 0.1665, accuracy 0.7471, and cosine
similarity 0.9093. |
format |
Final Project |
author |
Angelina Eunike Leman, Gresya |
title |
COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS |
url |
https://digilib.itb.ac.id/gdl/view/84976 |
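
The abstract in this record reports three metrics: WER, accuracy, and cosine similarity. The record does not say how they were implemented, so the following is only a minimal sketch of how such metrics are commonly computed. The function names are hypothetical, WER is taken as word-level edit distance divided by reference length, and cosine similarity is computed over simple bag-of-words counts. That last choice is an assumption made to keep the example dependency-free; the study may well have used sentence embeddings instead.

```python
import math
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (assumed representation)."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def accuracy(is_correct: list[bool]) -> float:
    """Fraction of answers judged correct."""
    if not is_correct:
        raise ValueError("need at least one judged answer")
    return sum(is_correct) / len(is_correct)
```

For example, `word_error_rate("what is two plus two", "what is to plus two")` returns 0.2: one substituted word out of five reference words. A WER of 0.0967, as reported for GPT-4o, corresponds under this definition to roughly one word error per ten reference words.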