HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM


Bibliographic Details
Main Author: Santoso, Nathanael
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/85174
Institution: Institut Teknologi Bandung
Description
Summary: The utilization of an in-memory cache is a crucial technique in distributed systems, particularly in large-scale data processing that requires the storage of intermediate computation results. Apache Spark is one such distributed system for large-scale data processing; it uses an in-memory cache in the form of Resilient Distributed Datasets (RDDs). By storing the lineage of application execution as a Directed Acyclic Graph (DAG), Apache Spark can ensure the correctness and robustness of the system. In its operation, Apache Spark relies on the Least Recently Used (LRU) cache eviction algorithm to periodically clean the cache, but LRU has weaknesses in the context of Spark: it does not consult the available computation DAG, often discarding RDDs that are known to be reused. Moreover, LRU does not consider the size or computation time of an RDD, leading to inefficient eviction. In this research, a new cache eviction algorithm, Normalized Heuristic Weight (NHW), was designed with a heuristic approach to improve the cache management performance of Apache Spark. The NHW algorithm and a profiler were implemented by adding a new module to the Spark source code and were tested against several common test cases. In the tests, NHW improved application performance by 13.37 percent compared to LRU and by an average of 9.64 percent compared to other benchmark algorithms. NHW also reduced the duration of the pause phase in the garbage collection process by 8.73 percent compared to LRU.
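The abstract does not give the NHW weight formula, but the idea it describes (scoring each cached RDD by normalized size, recomputation cost, and DAG-derived reuse count, then evicting the lowest-scoring one) can be sketched as follows. All names, fields, and the exact scoring formula here are illustrative assumptions, not the thesis's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class CachedRDD:
    rdd_id: int
    size_bytes: int          # memory footprint of the cached RDD
    compute_time_ms: float   # profiled time to (re)compute the RDD
    future_refs: int         # remaining uses found by walking the DAG

def nhw_score(rdd: CachedRDD, max_size: int, max_time: float, max_refs: int) -> float:
    """Hypothetical normalized heuristic weight: higher means 'keep'."""
    # Normalize each factor to [0, 1] so no single metric dominates.
    size = rdd.size_bytes / max_size if max_size else 0.0
    time = rdd.compute_time_ms / max_time if max_time else 0.0
    refs = rdd.future_refs / max_refs if max_refs else 0.0
    # Expensive-to-recompute and frequently-reused RDDs score high;
    # large RDDs score lower because evicting them frees more memory.
    return time + refs - size

def choose_victim(cache: list[CachedRDD]) -> CachedRDD:
    """Pick the lowest-weight RDD as the eviction victim."""
    max_size = max(r.size_bytes for r in cache)
    max_time = max(r.compute_time_ms for r in cache)
    max_refs = max(r.future_refs for r in cache)
    return min(cache, key=lambda r: nhw_score(r, max_size, max_time, max_refs))
```

Unlike LRU, this kind of score lets a large, cheap-to-recompute RDD with no remaining DAG references be evicted even if it was touched recently.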