RELIGIOUS DOMAIN-INDONESIAN SIRAH NABAWIYAH QUESTION ANSWERING USING GENERATIVE-LLM

Indonesia is the country with the largest Muslim population in the world. Islam has two main sources of information, the Holy Qur'an and the Book of Hadith; in addition, the Sirah Nabawiyah is another important body of literature. The Sirah Nabawiyah is a historical literature on the prophetic...

Bibliographic Details
Main Author: Razif Rizqullah, Muhammad
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/80969
Institution: Institut Teknologi Bandung
id id-itb.:80969
spelling id-itb.:80969 2024-03-16T12:20:44Z
keywords QASiNa, reading comprehension, multiple choices, Masked-LM, Generative-LLM
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Indonesia is the country with the largest Muslim population in the world. Islam has two main sources of information, the Holy Qur'an and the Book of Hadith; in addition, the Sirah Nabawiyah is another important body of literature. The Sirah Nabawiyah is a historical literature on the prophetic journey and biography in Islam that refers to the two main sources. Current Question Answering (QA) research includes studies on the Qur'an and the Hadith, but none has used the Sirah Nabawiyah, especially for the Indonesian language. We use Sirah Nabawiyah literature to build a novel dataset for QA. Building a new dataset manually requires considerable human effort and cost, so a Generative-LLM was used to assist in some parts of the process. The result is the Question Answering Sirah Nabawiyah (QASiNa) dataset, which consists of a reading comprehension set (QASiNa-RC), a multiple-choice set (QASiNa-MC), and a Sirah Nabawiyah corpus (SiNaCorpus). QASiNa-RC was tested on the reading comprehension task using mBERT, XLM-RoBERTa, and IndoBERT. QASiNa-MC was tested on the multiple-choice QA task using open-source Generative-LLMs, namely mGPT, XGLM, BLOOM, and BLOOMZ. In addition, GPT-3.5 and GPT-4 were tested on both datasets. On QASiNa-RC, XLM-RoBERTa was the best model with an Exact Match (EM) score of 58.40%, while the GPT-3.5 and GPT-4 models made excessive interpretations. On QASiNa-MC, BLOOMZ 1.7B was the best open-source model with an accuracy of 27.76%, which increased to 28.62% after corpus-tuning. The GPT-3.5 and GPT-4 models achieved better results, with accuracies of 56.60% and 72.40% respectively.
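The description reports an Exact Match (EM) score for QASiNa-RC and accuracy for QASiNa-MC. The sketch below illustrates how these two metrics are commonly computed. It is not taken from the thesis: the function names, the normalization steps (lowercasing, stripping punctuation, collapsing whitespace), and the sample answers are all assumptions made for illustration.

```python
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def accuracy(predicted_labels: list[str], gold_labels: list[str]) -> float:
    """Percentage of multiple-choice questions where the chosen option is correct."""
    if len(predicted_labels) != len(gold_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return 100.0 * hits / len(gold_labels)


if __name__ == "__main__":
    # Hypothetical answers, not drawn from QASiNa.
    print(exact_match(["Makkah", "Beliau lahir di Makkah"], ["makkah", "Makkah"]))
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
```

Under this definition, a longer answer that paraphrases or elaborates on the reference counts as a miss even when it contains the reference span, so EM is a strict measure for models that generate free-form answers.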