PHYSICS PROBLEM GENERATION THROUGH PATTERN MATCHING AND LARGE LANGUAGE MODELS

Bibliographic Details
Main Author: Marchotridyo
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/82425
Institution: Institut Teknologi Bandung
Description
Summary: Question generation is a frequently researched area in AI for academic needs, aimed at creating natural language questions that are semantically accurate and syntactically cohesive. Generated question variants can be used to reduce cheating committed by students. This thesis investigates how to generate physics questions, which were chosen because previous research has not addressed them; moreover, generating physics questions involves generating not only numbers but also the question text. There are two main processes involved: generating the variables in the questions (in the form of numbers) and paraphrasing the generated questions. The question generation process begins by creating a data structure to represent the content of a question, including its text, variables, answers, and explanations. Variables in questions are identified using regular expression-based pattern matching and then filled with random values when a question is generated; random value assignment follows rules defined for these variables. Once the questions are generated, they are paraphrased using various large language models (LLMs): Pegasus and T5 as fine-tuned models, and ChatGPT-3.5 Turbo and Mistral 7B as directly prompted models. The paraphrasing performance of each model is compared using several automatic paraphrase evaluation metrics, including n-gram based metrics such as BLEU, METEOR, and ROUGE, a language model-based evaluation method called ParaScore, and human evaluation. The results indicate that the prompted LLMs, ChatGPT-3.5 Turbo and Mistral 7B, are highly effective at paraphrasing questions according to human evaluation. The research also shows that n-gram based metrics such as BLEU, METEOR, and ROUGE are insufficient for evaluating the complexity of paraphrasing results, whereas the language model-based metric ParaScore aligns well with human evaluation.
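
As an illustration of the generation process described above, the following is a minimal sketch of a question template with regular-expression-based variable identification and rule-constrained random value assignment. The placeholder syntax ({{name}}), the (min, max, step) rule format, and the example kinematics question are assumptions for illustration, not details taken from the thesis.

```python
import random
import re
from dataclasses import dataclass, field

# Hypothetical template: variables are written as {{name}} placeholders, and each
# variable has a (min, max, step) rule that constrains the random value it receives.
@dataclass
class QuestionTemplate:
    text: str                                  # question text with placeholders
    rules: dict = field(default_factory=dict)  # variable name -> (min, max, step)
    answer: str = ""                           # answer expression over the variables
    explanation: str = ""                      # worked explanation with placeholders

    def generate(self):
        # Identify variables in the text with regular-expression pattern matching.
        names = re.findall(r"\{\{(\w+)\}\}", self.text)
        # Assign a random value to each variable according to its rule.
        values = {}
        for name in names:
            lo, hi, step = self.rules.get(name, (1, 10, 1))
            values[name] = random.randrange(lo, hi + 1, step)
        # Fill the placeholders with the generated values.
        filled = re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), self.text)
        return filled, values

template = QuestionTemplate(
    text="A car accelerates uniformly from rest at {{a}} m/s^2 for {{t}} s. "
         "How far does it travel?",
    rules={"a": (1, 5, 1), "t": (2, 10, 2)},
    answer="0.5 * a * t**2",
)
question, values = template.generate()
print(question)
print("answer:", 0.5 * values["a"] * values["t"] ** 2)
```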
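The paraphrasing step with a fine-tuned model could look roughly like the sketch below, using the Hugging Face transformers API; the Pegasus checkpoint name is an assumed public paraphrase model, not the model fine-tuned in the thesis.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed public Pegasus paraphrasing checkpoint, used here only for illustration.
model_name = "tuner007/pegasus_paraphrase"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def paraphrase(question: str, num_return_sequences: int = 3):
    # Encode the generated question and decode several beam-search paraphrases.
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=5,
        num_return_sequences=num_return_sequences,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(paraphrase("A car accelerates uniformly from rest at 3 m/s^2 for 4 s. "
                 "How far does it travel?"))
```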
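The n-gram based part of the evaluation can be sketched with off-the-shelf implementations of BLEU, METEOR, and ROUGE; the packages and the example reference/candidate pair are assumptions, and ParaScore and human evaluation are not reproduced here.

```python
# Requires the nltk and rouge-score packages, plus nltk's wordnet data for METEOR
# (nltk.download("wordnet")).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = ("A car starts from rest and accelerates at 3 m/s^2 for 4 s. "
             "Find the distance covered.")
candidate = ("Starting from rest, a car accelerates at 3 m/s^2 for 4 s. "
             "How far does it go?")

# BLEU and METEOR operate on token lists; ROUGE-L is computed on raw strings.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference.split()], candidate.split())
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}  ROUGE-L: {rouge_l:.3f}")
```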