PERFORMANCE COMPARISON OF SERIAL, RANDOM PARALLEL, AND LSH PARALLEL SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE (SMOTE) IN FLIGHT DELAY PREDICTION

Bibliographic Details
Main Author: Rizal Alifio, Ahmad
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/58043
Institution: Institut Teknologi Bandung
Description
Summary: When working with big data, it is often difficult to ensure that all of the data is immediately usable. Class imbalance is one of the obstacles that must be resolved to obtain more representative data. In the aviation industry, airlines that can anticipate delayed flights are able to act quickly before one flight delay results in another. Even so, data on delayed flights is not as readily available as data on flights that arrive on time. This problem can be addressed by applying an oversampling method. Execution time also matters when working with big data, so a parallel implementation was chosen in the hope of obtaining better results without drastically increasing the running time of the algorithm. This final project studies the synthetic minority oversampling technique (SMOTE), a commonly used form of oversampling. The parallel implementation uses two partitioning mechanisms: random partitioning and locality-sensitive hashing (LSH). The methods are evaluated by comparing performance metrics of random forest and k-nearest neighbor classification models. The experiments show that partitioning the dataset with locality-sensitive hashing slows the growth in execution time as the amount of data increases, although it yields a lower recall than serial SMOTE. Random partitioning of the dataset, by contrast, gives unsatisfactory results in terms of both duration and accuracy.
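The partitioned oversampling described in the summary can be illustrated with a minimal sketch. This is not the author's implementation: the function names (`smote`, `random_partition`, `lsh_partition`, `partitioned_smote`), the random-hyperplane form of LSH, and the plain loop standing in for parallel workers are all assumptions made for illustration, using only NumPy.

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(minority)
    if n < 2 or n_new <= 0:
        return np.empty((0, minority.shape[1]))
    k = min(k, n - 1)
    # Brute-force pairwise squared distances (O(n^2), acceptable for a sketch).
    d = ((minority[:, None, :] - minority[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])

def random_partition(minority, n_parts, rng=None):
    """Assign each sample to a partition uniformly at random."""
    rng = np.random.default_rng(rng)
    return rng.integers(0, n_parts, size=len(minority))

def lsh_partition(minority, n_bits, rng=None):
    """Assumed LSH variant: random hyperplanes, so nearby points tend to
    share a bucket. Returns a bucket id in [0, 2**n_bits)."""
    rng = np.random.default_rng(rng)
    planes = rng.normal(size=(minority.shape[1], n_bits))
    bits = (minority - minority.mean(axis=0)) @ planes > 0
    return bits @ (1 << np.arange(n_bits))

def partitioned_smote(minority, labels, ratio=1.0, k=5, rng=None):
    """Run SMOTE independently inside each partition. The loop stands in
    for parallel workers; partitions with one sample produce nothing."""
    rng = np.random.default_rng(rng)
    parts = []
    for label in np.unique(labels):
        chunk = minority[labels == label]
        parts.append(smote(chunk, int(round(ratio * len(chunk))), k, rng))
    return np.vstack(parts)
```

Calling `partitioned_smote(X_min, lsh_partition(X_min, 3))` or `partitioned_smote(X_min, random_partition(X_min, 8))` on a minority-class array `X_min` (a hypothetical name) contrasts the two schemes. Because each partition only searches for neighbours among its own samples, random partitioning can pair distant points, while a locality-preserving hash keeps neighbours together more often.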