PERFORMANCE COMPARISON OF SERIAL, RANDOM PARALLEL, AND LSH PARALLEL SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE (SMOTE) IN FLIGHT DELAY PREDICTION
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/58043 |
Institution: | Institut Teknologi Bandung |
Summary: | When working with big data, it is often difficult to ensure that all of the data is
immediately usable. Class imbalance is one of the obstacles that must be resolved in order
to obtain more representative data. In the aviation industry, airlines that can anticipate
delayed flights are able to act quickly before one delay causes another. However, data on
delayed flights is not as plentiful as data on flights that arrive on time. This problem can
be overcome by applying an oversampling method. Execution time is also an important
concern when working with big data. Therefore, a parallel implementation was chosen in
the hope of giving better results without drastically increasing the running time of
the algorithm.
This final project studies the synthetic minority oversampling technique (SMOTE), a
commonly used form of oversampling. The parallel implementation uses two partitioning
mechanisms: random partitioning and locality-sensitive hashing (LSH). The variants are
evaluated by comparing performance metrics of random forest and K-Nearest Neighbor
classification models.
The experiments show that partitioning the dataset with locality-sensitive hashing slows
the growth of execution time as the amount of data increases, although the evaluation
results show a lower recall than the serial SMOTE method. Random partitioning of the
dataset, on the other hand, gives unsatisfactory results in terms of both running time
and accuracy. |
---|
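
The summary names three variants (serial SMOTE, SMOTE over random partitions, and SMOTE over LSH partitions) without showing how they differ. The following is a minimal sketch of that idea and is not the thesis implementation. It assumes NumPy, a random-hyperplane form of LSH, and a sequential loop standing in for the parallel workers. All function names and parameters (`smote`, `random_partition`, `lsh_partition`, `partitioned_smote`, `n_bits`) are hypothetical.

```python
import numpy as np


def smote(minority, n_new, k=5, rng=None):
    """Serial SMOTE: interpolate between a minority point and one of its k nearest neighbours."""
    rng = np.random.default_rng(rng)
    n = len(minority)
    if n < 2 or n_new <= 0:
        # Nothing to interpolate with: return an empty array of the right width.
        return np.empty((0, minority.shape[1]))
    k = min(k, n - 1)
    # Brute-force pairwise distances (fine for a sketch, O(n^2) memory).
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])


def random_partition(minority, n_parts, rng=None):
    """Assumed random mechanism: shuffle the rows and cut them into n_parts chunks."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(minority))
    return [minority[chunk] for chunk in np.array_split(order, n_parts)]


def lsh_partition(minority, n_bits, rng=None):
    """Assumed LSH mechanism: bucket rows by the sign pattern of random projections."""
    rng = np.random.default_rng(rng)
    planes = rng.normal(size=(minority.shape[1], n_bits))
    centered = minority - minority.mean(axis=0)
    bits = (centered @ planes > 0).astype(int)
    keys = bits @ (1 << np.arange(n_bits))  # pack the bits into one bucket id
    return [minority[keys == key] for key in np.unique(keys)]


def partitioned_smote(parts, n_new, k=5, rng=None):
    """Run SMOTE per partition; in a parallel setting each partition would go to a worker."""
    rng = np.random.default_rng(rng)
    total = sum(len(p) for p in parts)
    # Each partition gets a share proportional to its size (rounded, so the
    # total can differ slightly from n_new).
    pieces = [smote(p, round(n_new * len(p) / total), k, rng) for p in parts]
    return np.vstack(pieces)
```

In this sketch, nearby points tend to share an LSH bucket, so neighbour searches inside a bucket stay close to what serial SMOTE would find, while random partitions scatter neighbours across chunks. Whether this explains the results reported in the summary is not something the sketch can establish.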