IMPLEMENTATION OF CLOUD NATIVE INFRASTRUCTURE DESIGN FOR AI/ML EXPERIMENTS USING KUBERNETES

Bibliographic Details
Main Author: Anindhita Chandra, Indira
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/82270
Institution: Institut Teknologi Bandung
Description
Summary: This research focuses on the design and implementation of cloud-native infrastructure for AI/ML experiments using Kubernetes. The main goal is to build an infrastructure that supports AI/ML experiments with high flexibility and scalability. Kubernetes was chosen as the container orchestration platform for its ability to manage dynamic and complex workloads.

The implementation involves configuring a Kubernetes cluster of several nodes, including master and worker nodes equipped with GPUs and CPUs. Additional configuration includes a storage class for storage management, load-balancer settings for distributing network traffic, and a monitoring platform built on Prometheus and Grafana to observe system performance. Kubeflow is also integrated as the main framework for managing AI/ML experiments. This process ensures that the infrastructure can be operated and optimized according to user needs.

Testing was conducted to evaluate the performance and efficiency of the resulting infrastructure. Accessibility testing covered several usage scenarios across devices, including PCs, laptops, and phones. Resource-usage testing was carried out with multiple users accessing and running AI/ML workloads under different configurations. Analysis of the test results shows that the cloud-native infrastructure has several key advantages: the system supports dynamic scaling and improves resource-usage efficiency. Container technology and Kubernetes orchestration allow resources to be added or removed in real time, which is crucial for computation-heavy AI/ML experiments. In addition, the monitoring platform enables continuous performance observation, facilitating the identification and resolution of potential issues.
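The abstract does not reproduce any of the thesis's manifests, but the kind of GPU worker configuration it describes can be sketched as a Kubernetes Pod manifest. Below is a minimal, illustrative sketch built as plain Python dicts (the structure Kubernetes accepts once serialized to YAML/JSON); the pod name, container image, and node label are assumptions, while `nvidia.com/gpu` is the standard extended resource exposed by the NVIDIA device plugin on GPU nodes.

```python
# Illustrative sketch (not from the thesis): a Pod manifest requesting one
# GPU for an AI/ML training workload, expressed as Python dicts.

gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ml-experiment",  # hypothetical name
                 "labels": {"app": "ml-training"}},
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "image": "example/train:latest",  # hypothetical image
                "resources": {
                    # "nvidia.com/gpu" is the extended resource name
                    # registered by the NVIDIA device plugin.
                    "limits": {"nvidia.com/gpu": 1,
                               "cpu": "4",
                               "memory": "16Gi"},
                },
            }
        ],
        # Schedule only onto GPU worker nodes; the label key/value here
        # is an assumption -- clusters define their own node labels.
        "nodeSelector": {"gpu": "true"},
    },
}
```

In a real cluster this dict would be serialized to YAML and applied with `kubectl apply`, or submitted through a client library; Kubeflow generates comparable specs on the user's behalf when launching experiments.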
This research successfully demonstrates that the design and implementation of cloud-native infrastructure using Kubernetes can significantly improve efficiency and effectiveness in managing AI/ML workloads. This infrastructure not only supports various computational needs but also provides the flexibility and scalability required for a dynamic research environment.
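The abstract does not name the scaling mechanism, but the real-time addition and removal of resources it describes is what Kubernetes' Horizontal Pod Autoscaler provides. Its documented scaling rule is simple enough to sketch directly:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Kubernetes HPA scaling rule:
    desired = ceil(current_replicas * current_metric / target_metric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 replicas averaging 90% CPU against a 60% target scale out to 6;
# 6 replicas averaging 30% against the same target scale in to 3.
print(desired_replicas(4, 90.0, 60.0))  # -> 6
print(desired_replicas(6, 30.0, 60.0))  # -> 3
```

The same rule applies to custom metrics scraped by Prometheus, which is one reason a Prometheus/Grafana monitoring stack pairs naturally with autoscaled AI/ML workloads.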