DEVELOPMENT OF JOIN OPERATIONS FOR DISTRIBUTED DATA IN CASSANDRA DATABASE MANAGEMENT SYSTEM
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/76529 |
Institution: | Institut Teknologi Bandung |
Summary: | The previous work by Adyatma (2022) successfully developed a library for performing join operations in the Cassandra database. However, several limitations remained: the library performed its operations on a single machine and supported only join operations, whereas in typical scenarios join operations are combined with other operations such as selection.
To address these issues, an analysis was conducted on how Cassandra communicates the state of its machines within the cluster, how data is stored in a distributed environment, and how data is retrieved in such a setting. After the analysis, data retrieval from Cassandra was improved by utilizing the existing token ranges in Cassandra.
Further analysis was performed on various solution alternatives for selecting machines to perform tasks. Options included using dedicated machines, employing load balancers, and utilizing multiple worker machines. The last alternative, utilizing multiple worker machines, was chosen for its efficiency.
Subsequently, the selected solutions were implemented in the form of a library. Testing showed that the developed library provided the intended functionality and performed better than the work by Adyatma (2022). Compared with DataStax’s Spark Cassandra Connector, the developed library outperformed it on small and medium-sized datasets but underperformed on large datasets. This discrepancy is attributed to memory usage that is not yet fully optimized, which leads to significant overhead when handling large datasets. |
---|
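
The summary mentions retrieving data through Cassandra's token ranges and spreading the work over multiple worker machines. The sketch below illustrates that general idea only; it is not the thesis library. It assumes the default Murmur3Partitioner (tokens are signed 64-bit integers) and, for simplicity, splits the ring evenly instead of reading the cluster's actual token ranges from its metadata. The function names, the table name `orders`, the key `order_id`, and the worker names are all hypothetical.

```python
# Illustrative sketch only: even token-range splitting, not the thesis code.
# Assumption: Murmur3Partitioner, whose tokens are signed 64-bit integers.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Split the full token ring into contiguous (start, end] sub-ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_splits
              for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def assign_ranges(ranges, workers):
    """Distribute sub-ranges over workers in round-robin order."""
    assignment = {worker: [] for worker in workers}
    for i, token_range in enumerate(ranges):
        assignment[workers[i % len(workers)]].append(token_range)
    return assignment


def range_query(table, partition_key, start, end):
    """Build a CQL query that scans one token sub-range of a table."""
    # The first sub-range includes its lower bound so no token is skipped.
    op = ">=" if start == MIN_TOKEN else ">"
    return (f"SELECT * FROM {table} "
            f"WHERE token({partition_key}) {op} {start} "
            f"AND token({partition_key}) <= {end}")


if __name__ == "__main__":
    ranges = split_token_ring(4)
    plan = assign_ranges(ranges, ["worker-a", "worker-b"])  # hypothetical names
    for worker, owned in plan.items():
        for start, end in owned:
            print(worker, range_query("orders", "order_id", start, end))
```

Each worker would run only the queries for its own sub-ranges, so the table is read in parallel without any row being fetched twice. A real deployment would more likely align the sub-ranges with the token ranges owned by each node, which this sketch does not attempt.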