PHYSICS PROBLEM GENERATION THROUGH PATTERN MATCHING AND LARGE LANGUAGE MODELS
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/82425
Institution: Institut Teknologi Bandung
Summary: Question generation is an actively researched area of AI for education, aimed at creating natural-language questions that are semantically accurate and syntactically cohesive. Generating many variants of a question can help reduce cheating among students. This thesis investigates the generation of physics questions, which previous research has not addressed; moreover, generating physics questions involves producing not only numerical values but also the question text. Two main processes are involved: generating the variables in a question (in the form of numbers) and paraphrasing the generated question. The generation process begins with a data structure that represents the content of a question, including its text, variables, answers, and explanations. Variables in a question are identified with regular-expression-based pattern matching and filled with random values when the question is generated; the random values follow rules defined for each variable. The generated questions are then paraphrased with several large language models (LLMs): Pegasus and T5 as fine-tuned models, and ChatGPT-3.5 Turbo and Mistral 7B as directly prompted models. The paraphrasing performance of each model is compared using automatic paraphrase evaluation metrics, namely the n-gram-based metrics BLEU, METEOR, and ROUGE and the language-model-based metric ParaScore, as well as human evaluation. The results indicate that the directly prompted LLMs, ChatGPT-3.5 Turbo and Mistral 7B, are highly effective at paraphrasing questions according to human evaluation. The research also shows that n-gram-based metrics such as BLEU, METEOR, and ROUGE are insufficient for evaluating complex paraphrases, whereas the language-model-based metric ParaScore aligns well with human evaluation results.
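
The abstract states that variables in a question template are located with regular-expression pattern matching and then filled with random values governed by per-variable rules. The Python sketch below illustrates one way such a step could look; the `{{name}}` placeholder syntax, the `generate_question` helper, and the range-based rules are illustrative assumptions, not the thesis's actual design.

```python
import random
import re

# Hypothetical placeholder syntax: variables are written as {{name}} in the template.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def generate_question(template, rules):
    """Fill each placeholder with a random value drawn from its rule (an inclusive range)."""
    values = {}

    def fill(match):
        name = match.group(1)
        low, high = rules[name]                 # rule for this variable: allowed numeric range
        values[name] = random.randint(low, high)
        return str(values[name])

    return PLACEHOLDER.sub(fill, template), values

# Invented example template and rules (not taken from the thesis):
text, vals = generate_question(
    "A car of mass {{m}} kg accelerates at {{a}} m/s^2. Find the net force.",
    {"m": (500, 2000), "a": (1, 10)},
)
print(text)
print("Answer: F = m * a =", vals["m"] * vals["a"], "N")
```

Because the filled-in values are kept alongside the text, the numeric answer and its explanation can be computed from the same data structure that stores the question.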
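
The abstract compares paraphrases with n-gram-based metrics (BLEU, METEOR, ROUGE), the model-based metric ParaScore, and human judgment. As a rough illustration of the n-gram-based scoring only, the sketch below computes BLEU and ROUGE-L for an invented reference/candidate pair using the nltk and rouge-score packages; it is not the thesis's evaluation pipeline, and ParaScore is omitted.

```python
# Illustrative n-gram-based paraphrase scoring with BLEU and ROUGE-L.
# Requires: pip install nltk rouge-score. The sentences are invented examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "A car of mass 1200 kg accelerates at 3 m/s^2. Find the net force."
candidate = "What net force acts on a 1200 kg car accelerating at 3 m/s^2?"

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence overlap between the two sentences.
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)

print(f"BLEU:    {bleu:.3f}")
print(f"ROUGE-L: {rouge['rougeL'].fmeasure:.3f}")
```

A fluent paraphrase that reuses few of the reference's words scores low on both metrics, which is the limitation of surface-overlap evaluation that the abstract points out.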