COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS

Globalization has influenced English to become one of the languages that continue to develop as an international communication medium. However, learning active language skills like speaking remains difficult to do independently. The development of AI can help address this issue by providing v...

Bibliographic Details
Main Author:	Angelina Eunike Leman, Gresya
Format:	Final Project
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/84976
Institution:	Institut Teknologi Bandung

id	id-itb.:84976
spelling	id-itb.:84976 2024-08-19T11:53:05Z COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS Angelina Eunike Leman, Gresya Indonesia Final Project ASR, LLM, multimodal model, speech recognition INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/84976 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesian
description	Globalization has helped make English one of the languages that continue to develop as a medium of international communication. However, active language skills such as speaking remain difficult to learn independently. Advances in AI can help address this issue by providing voice chat technology that understands the speech of non-native speakers from various backgrounds. This study compares the capabilities of a multimodal model with combinations of open-source and closed-source ASR and LLM models in understanding English spoken by Indonesian non-native speakers. The evaluation follows the CRISP-DM framework, which consists of business understanding, data understanding, data preparation, modeling, and evaluation. A dataset was collected from 26 evaluation subjects asking 20 questions on general knowledge and mathematics topics in two versions: one in each subject’s own phrasing and one scripted. The evaluation calculated the word error rate (WER) of the ASR transcriptions, and the accuracy and cosine similarity of the LLM answers generated from those transcriptions. The results show that a multimodal model such as GPT-4o transcribes and answers questions better than the ASR and LLM combinations, both open-source and closed-source, with the lowest WER (0.0967) and the highest accuracy (0.8269), and a cosine similarity of 0.8964. It was followed by the best-performing closed-source ASR and LLM combination, Amazon Transcribe and Claude 3.5 Sonnet, with WER 0.1045, accuracy 0.7856, and cosine similarity 0.9194. Lastly, the best-performing open-source ASR and LLM combination, Canary-1B and DeepSeek-V2, achieved WER 0.1665, accuracy 0.7471, and cosine similarity 0.9093.
format	Final Project
author	Angelina Eunike Leman, Gresya
title	COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS
url	https://digilib.itb.ac.id/gdl/view/84976
_version_	1822998856355282944
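
The record's description reports WER and cosine similarity scores without showing how such metrics are computed. The sketch below is a minimal illustration only, not the study's code: it assumes WER is the word-level edit distance divided by the number of reference words, and it computes cosine similarity over simple bag-of-words counts, whereas the study may have used a different text representation (for example, embeddings). The function names `word_error_rate` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of bag-of-words count vectors (an assumption)."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For instance, `word_error_rate("what is the capital of indonesia", "what is capital of indonesian")` counts one deletion and one substitution against six reference words. Lower WER is better, while higher cosine similarity indicates an answer closer to the reference answer.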

Similar Items

COMPARING CHATGPT-4O’S CAPABILITIES TO COMBINATION OF OPEN-SOURCED ALONG WITH CLOSED-SOURCED ASRS AND LLMS IN UNDERSTANDING SPOKEN ENGLISH FROM INDONESIAN NON-NATIVE SPEAKERS