HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM

he utilization of in-memory cache is a crucial technique in distributed systems, particularly in large-scale data processing that requires storage of intermediate computation results. Apache Spark is one of the distributed systems for large-scale data processing that uses in-memory cache in the form...

Full description

Saved in:
Bibliographic Details
Main Author: Santoso, Nathanael
Format: Final Project
Language:Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/85174
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Institut Teknologi Bandung
Language: Indonesia
id id-itb.:85174
spelling id-itb.:851742024-08-19T20:04:59ZHEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM Santoso, Nathanael Indonesia Final Project Cache Eviction Algorithm; Apache Spark; Heuristic; Normalized Heuristic Weight; in-memory cache INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/85174 he utilization of in-memory cache is a crucial technique in distributed systems, particularly in large-scale data processing that requires storage of intermediate computation results. Apache Spark is one of the distributed systems for large-scale data processing that uses in-memory cache in the form of Resilient Distributed Datasets (RDD). By storing the lineage of application execution in the form of a Directed Acyclic Graph (DAG), Apache Spark can ensure the correctness and robustness of the system. In its operation, Apache Spark relies on the Least Recently Used (LRU) cache eviction algorithm to periodically clean the cache, but LRU has weaknesses in the context of Spark as it does not consider the available DAG of computation, often discarding RDDs known to be reused. Moreover, LRU does not consider the size or computation time of an RDD, leading to inefficient eviction. In this research, a new cache eviction algorithm, Normalized Heuristic Weight (NHW), was designed with a heuristic approach to improve the cache management performance of Apache Spark. The NHW algorithm and a profiler were implemented by adding a new module to the source code and tested against several common test cases. From the test results, NHW improved application performance by 13.37 percent compared to LRU and an average of 9.64 percent compared to other benchmark algorithms. Also, NHW reduced the duration of the pause phase in the garbage collection process by 8.73 percent compared to LRU. text
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description he utilization of in-memory cache is a crucial technique in distributed systems, particularly in large-scale data processing that requires storage of intermediate computation results. Apache Spark is one of the distributed systems for large-scale data processing that uses in-memory cache in the form of Resilient Distributed Datasets (RDD). By storing the lineage of application execution in the form of a Directed Acyclic Graph (DAG), Apache Spark can ensure the correctness and robustness of the system. In its operation, Apache Spark relies on the Least Recently Used (LRU) cache eviction algorithm to periodically clean the cache, but LRU has weaknesses in the context of Spark as it does not consider the available DAG of computation, often discarding RDDs known to be reused. Moreover, LRU does not consider the size or computation time of an RDD, leading to inefficient eviction. In this research, a new cache eviction algorithm, Normalized Heuristic Weight (NHW), was designed with a heuristic approach to improve the cache management performance of Apache Spark. The NHW algorithm and a profiler were implemented by adding a new module to the source code and tested against several common test cases. From the test results, NHW improved application performance by 13.37 percent compared to LRU and an average of 9.64 percent compared to other benchmark algorithms. Also, NHW reduced the duration of the pause phase in the garbage collection process by 8.73 percent compared to LRU.
format Final Project
author Santoso, Nathanael
spellingShingle Santoso, Nathanael
HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM
author_facet Santoso, Nathanael
author_sort Santoso, Nathanael
title HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM
title_short HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM
title_full HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM
title_fullStr HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM
title_full_unstemmed HEURISTIC BASED CACHE EVICTION ALGORITHM FOR THE SPARK DISTRIBUTED SYSTEM
title_sort heuristic based cache eviction algorithm for the spark distributed system
url https://digilib.itb.ac.id/gdl/view/85174
_version_ 1822283049543401472