Data cleaning and refinement for code related AI systems
This final-year project will cover data cleaning and refinement to improve the quality of data for various natural language processing (NLP) projects, such as code clone detection and code-to-text conversion. The project will focus on using the PyTorch library to train a masked language model to det...
Main Author: | Tay, Arron Hong Yi |
---|---|
Other Authors: | Liu Yang |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/166197 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-166197 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-166197 2023-04-28T15:39:31Z Data cleaning and refinement for code related AI systems Tay, Arron Hong Yi Liu Yang School of Computer Science and Engineering yangliu@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence 
Bachelor of Engineering (Computer Science) 2023-04-24T05:46:51Z 2023-04-24T05:46:51Z 2023 Final Year Project (FYP) Tay, A. H. Y. (2023). Data cleaning and refinement for code related AI systems. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/166197 https://hdl.handle.net/10356/166197 en application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
spellingShingle |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Tay, Arron Hong Yi Data cleaning and refinement for code related AI systems |
description |
This final-year project covers data cleaning and refinement to improve the quality of data for natural language processing (NLP) projects on code, such as code clone detection and code-to-text conversion. The project uses the PyTorch library to train a masked language model to detect code clones in the POJ104 dataset and to generate text from code for the code-text dataset. Training starts from one of several pre-trained models: OpenAI-GPT, BERT, RoBERTa, or DistilBERT. The refinement process involves detailed data analysis and scripts that preprocess the data. Each project selected for refinement typically includes scripts for loading and preprocessing the code data, creating and training the model, evaluating it, and finally testing its output against the reference answers. During training, the transformers library provides the pre-trained models and tokenizers, and tensorboard provides visualizations of the training process. The model is evaluated at every epoch to check its performance. After training, the resulting model is tested against a set of reference answers to measure its performance. Mean Average Precision and smoothed BLEU-4 score are used to measure performance for the code clone detection project and the code-to-text project, respectively. This report covers the methods used to improve data quality and discusses the refinement techniques, which aim to improve the accuracy and efficiency of NLP projects for code and could benefit a wider range of applications in different industries. |
author2 |
Liu Yang |
author_facet |
Liu Yang Tay, Arron Hong Yi |
format |
Final Year Project |
author |
Tay, Arron Hong Yi |
author_sort |
Tay, Arron Hong Yi |
title |
Data cleaning and refinement for code related AI systems |
title_short |
Data cleaning and refinement for code related AI systems |
title_full |
Data cleaning and refinement for code related AI systems |
title_fullStr |
Data cleaning and refinement for code related AI systems |
title_full_unstemmed |
Data cleaning and refinement for code related AI systems |
title_sort |
data cleaning and refinement for code related ai systems |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/166197 |
_version_ |
1765213812908097536 |
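
The description above names Mean Average Precision as the metric for the code clone detection project. The record does not give the report's exact formulation, so the following is only a minimal sketch of the standard definition. It assumes each query's ranked list contains all of its relevant items; the function names and the toy data are hypothetical and are not taken from the report.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    `ranked_relevance` holds one flag per retrieved item, in ranked order,
    where True marks a relevant item (for clone detection, a true clone).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    # Assumption: every relevant item appears somewhere in the list,
    # so the number of hits equals the number of relevant items.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not all_rankings:
        return 0.0
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


# Hypothetical toy data: two queries with ranked relevance flags.
rankings = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
map_score = mean_average_precision(rankings)
print(f"MAP = {map_score:.4f}")
```

A benchmark may use a truncated variant (for example, scoring only the top R results per query), in which case the ranked lists would be cut to that length before calling `average_precision`.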