Epsilon, a cluster scheduler for Kubernetes clusters

Bibliographic Details
Main Author: Neo, Alex Jing Hui
Other Authors: Lee Bu Sung, Francis
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/147629
Institution: Nanyang Technological University
Description
Summary: The adoption of shared computer clusters for executing high-performance computing workloads has allowed many organizations with limited financial capabilities to access computing power that might otherwise be too costly for them to build. However, to support these heterogeneous workloads, cluster schedulers are becoming increasingly complex to develop and maintain as features are added for different workload requirements. Kubernetes is a container orchestration platform designed to simplify the deployment of containers in a computer cluster. Kubernetes provides a monolithic cluster scheduler, which is responsible for allocating resources to containers. The author developed Epsilon as a cluster scheduler for Kubernetes using the microservices model as a foundation. In partnership with AsiaConnect, Epsilon is intended to act both as a scheduler and as a starting point for examining the viability of using microservices to create a scheduler that helps developers implement updates more quickly and support heterogeneous workloads. As a microservices-based scheduler, Epsilon is built from multiple microservices that have strict service boundaries and are kept simple in design, to keep modifications and feature updates from increasing code complexity. Epsilon contains multiple microservices for different functionalities, with some making up the core system and the rest providing supporting features. The core microservices are the Coordinator, Scheduler, and Queue microservices, which are responsible for monitoring new pods, scheduling new pods, and communication between microservices, respectively. The support microservices include the Autoscaler and Retry microservices, which are responsible for automated scaling of the scheduler services and rescheduling of failed pods, respectively. One of Epsilon's goals is to allow developers to commit changes quickly. Epsilon achieves this by splitting the scheduler code into multiple smaller microservices.
Because the scheduler code is spread out in this way, developers can develop or update different components of the scheduler concurrently, reducing the time taken to commit changes. Epsilon's distributed nature also provides an opportunity to scale the scheduler microservices in or out, improving performance and resiliency by running multiple identical copies of the scheduler microservices concurrently. Epsilon was deployed and tested on a 54-node Kubernetes cluster using Amazon Web Services EC2 instances. To improve the scheduler's resiliency, Epsilon was deployed with 3 scheduler microservices. Because multiple schedulers make scheduling decisions concurrently, randomization is used as a mitigation technique to reduce the occurrence of scheduling conflicts among the 3 scheduler microservices. The experiments included tests of the scheduler's performance, load balancing, and support for heterogeneous workloads. In one experiment, the scheduling performance of Epsilon was compared with that of the default Kubernetes scheduler by recording the time each scheduler took to schedule different numbers of pods.
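
The summary says randomization is used to reduce conflicts between concurrent schedulers but does not say how it is applied. The following is a minimal Python sketch of one common approach, not Epsilon's actual implementation: each scheduler picks uniformly at random among the best few feasible nodes instead of always taking the single best one. The `Node` class, the `free_cpu` metric, the `top_k` parameter, and both function names are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float  # free CPU cores (hypothetical metric, not from the source)


def pick_node(nodes, cpu_request, top_k=3, rng=random):
    """Pick a node for a pod that needs cpu_request cores.

    Assumption: instead of always taking the single best node, choose
    uniformly at random among the top_k feasible nodes, so schedulers
    working from the same snapshot are less likely to pick the same node.
    """
    feasible = [n for n in nodes if n.free_cpu >= cpu_request]
    if not feasible:
        return None
    # Rank feasible nodes by free CPU, most free first.
    ranked = sorted(feasible, key=lambda n: n.free_cpu, reverse=True)
    return rng.choice(ranked[:top_k])


def count_conflicts(nodes, cpu_request, schedulers=3, top_k=3, seed=0):
    """Simulate several schedulers deciding from one snapshot.

    Returns how many picks landed on a node that another scheduler had
    already picked (a simple proxy for a scheduling conflict).
    """
    rng = random.Random(seed)
    picks = [pick_node(nodes, cpu_request, top_k, rng) for _ in range(schedulers)]
    names = [p.name for p in picks if p is not None]
    return len(names) - len(set(names))
```

With `top_k=1` every scheduler deterministically chooses the same best node, so three schedulers always produce two duplicate picks. Raising `top_k` spreads the choices across more nodes, at the cost of sometimes placing a pod on a slightly less suitable node.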