DEVELOPMENT OF JOIN OPERATIONS FOR DISTRIBUTED DATA IN CASSANDRA DATABASE MANAGEMENT SYSTEM


Bibliographic Details
Main Author: Anugrah Putra, Widya
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/76529
Institution: Institut Teknologi Bandung
Description
Summary: The previous work by Adyatma (2022) successfully developed a library for performing join operations in the Cassandra database. However, several limitations remained: the library performed its operations on a single machine and supported only join operations, whereas in typical use joins are combined with other operations such as selection. To address these issues, an analysis was conducted of how Cassandra communicates the state of its machines within the cluster, how data is stored in a distributed environment, and how data is retrieved in such a setting. Following this analysis, data retrieval from Cassandra was improved by utilizing Cassandra's existing token ranges. Further analysis was performed on alternatives for selecting the machines that perform tasks: using dedicated machines, employing load balancers, and utilizing multiple worker machines. The last alternative, utilizing multiple worker machines, was chosen for its efficiency. The selected solutions were then implemented as a library. Testing showed that the developed library provided the intended functionality and performed better than the work by Adyatma (2022). Compared with DataStax's Spark Cassandra Connector, the developed library performed better on small and medium-sized datasets but worse on large datasets. This discrepancy is attributed to memory usage not being fully optimized, which leads to significant overhead when handling large datasets.
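
As a rough illustration of the ideas in the summary, the following Python sketch shows one way token-range retrieval, multiple workers, and a join combined with a selection could fit together. It is not the thesis's implementation. The function names (`split_token_ring`, `range_query`, `assign_ranges`, `hash_join`), the round-robin assignment, and the table and column names are all hypothetical, and the token bounds assume Cassandra's default Murmur3 partitioner with 64-bit signed tokens. The only Cassandra feature relied on is the CQL `token()` function, and the sketch builds query strings without contacting a cluster.

```python
MIN_TOKEN = -(2**63)   # assumed lower bound of the Murmur3 token ring
MAX_TOKEN = 2**63 - 1  # assumed upper bound of the Murmur3 token ring


def split_token_ring(num_splits):
    """Split the whole token ring into contiguous (start, end) sub-ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_splits
              for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(table, partition_key, start, end, include_start=False):
    """Build a CQL query that reads only the rows whose token is in the range."""
    op = ">=" if include_start else ">"
    return (f"SELECT * FROM {table} "
            f"WHERE token({partition_key}) {op} {start} "
            f"AND token({partition_key}) <= {end}")


def assign_ranges(ranges, workers):
    """Hand out token ranges to workers round-robin (a hypothetical policy)."""
    plan = {worker: [] for worker in workers}
    for i, token_range in enumerate(ranges):
        plan[workers[i % len(workers)]].append(token_range)
    return plan


def hash_join(left_rows, right_rows, key, predicate=None):
    """Equi-join two lists of dict rows on `key`.

    If `predicate` is given, it is applied to left rows as a selection
    before they are matched, so filtered rows never reach the join.
    """
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        if predicate is not None and not predicate(left):
            continue
        for right in index.get(left[key], []):
            joined.append({**right, **left})  # left values win on name clashes
    return joined


if __name__ == "__main__":
    ranges = split_token_ring(4)
    plan = assign_ranges(ranges, ["worker-1", "worker-2"])
    for worker, assigned in plan.items():
        for start, end in assigned:
            first = start == MIN_TOKEN
            print(worker, range_query("orders", "user_id", start, end, first))
```

Each worker would run its own range queries and join the rows it receives, so no single machine has to scan the full tables. How the thesis actually partitions work among machines and merges partial results is not stated in the summary.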