Physics-informed machine learning for green data center operations

Bibliographic Details
Main Author: Wang, Ruihang
Other Authors: Tan Rui
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/172421
Institution: Nanyang Technological University
Description
Summary: The data center (DC) industry has grown rapidly in recent years to meet ever-increasing demand for cloud computing and storage. The dramatic increase in DC scale brings substantial challenges for DC operations, which aim to maintain business continuity and reduce operating costs. Current DCs are mostly operated reactively: they adopt feedback controllers and rely on empirical best practices. However, existing operating principles focus only on maintaining temperatures within certain ranges and do not take system power usage into account. To achieve low-power DC operations, proactive and intelligent solutions are highly desirable. Machine learning (ML) approaches based on deep neural networks have been considered for developing such solutions. However, applying these advanced ML algorithms to DC operations faces two major challenges. First, ML often requires a large volume of training data, including data from abnormal cases, which are hard to obtain from a stably operated DC. Second, the prevailing risk-averse mindset in the DC industry hinders the wide deployment of ML-based solutions. To unleash the potential of ML for DC operations, this thesis proposes integrating the DC's "physics priors" into the learning and deployment of ML algorithms. The proposed physics-informed ML solutions advance DC operations in the following three stages. Firstly, this thesis builds predictive models to characterize the thermodynamics and power usage of a DC. To improve model accuracy and reduce computation overhead, it proposes a knowledge-based model calibration and reduction approach for optimizing the data hall thermodynamics model. The evaluation shows that the method achieves sub-1 °C temperature prediction error while accelerating the simulations a thousandfold. Secondly, this thesis develops prescriptive models that guide DC cooling control using ML-based techniques.
To address the challenge of enforcing thermal safety constraints during state exploration, this thesis designs a physics-guided learning framework that applies offline imitation learning and online post-hoc rectification to prevent thermally unsafe conditions. In particular, the post-hoc rectification searches for the minimum modification to the ML-recommended action such that the rectified action does not result in a thermally unsafe condition. The rectification is built on the previously calibrated thermodynamics models. The evaluation shows that the proposed approach saves 14% to 26% of power usage compared with conventional feedback control while satisfying safety constraints during ML training. Thirdly, this thesis adapts the ML-based policy to the evolving DC environment. To expedite adaptation while accounting for safety, it develops a physics-informed lifelong learning approach that supervises data collection with the previously identified transition model, fits power usage and residual thermal models, pretrains the agent by interacting with these models, and deploys the agent for further fine-tuning. The approach uses known physical laws to inform the modeling of transitions and power usage, improving extrapolation to unseen states. The evaluation shows that the approach saves 5.7% to 13.8% of power usage compared with conventional feedback control and adapts 8x to 10x faster than native fine-tuning, with at most 0.74 °C of temperature overshoot. In summary, the proposed solutions, which integrate state-of-the-art ML algorithms and physics priors, can accurately simulate a DC and optimize it to achieve intelligent and low-power operations.
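
Illustrative note: the post-hoc rectification described in the summary (finding the minimum modification to an ML-recommended action that keeps the predicted temperature safe) can be sketched as below. This is a minimal sketch and not the thesis's implementation. It assumes a single scalar action (a supply-air setpoint), a discrete candidate grid, and a toy linear thermal model. The names `predict_inlet_temp` and `rectify`, the 0.0625 °C/kW coefficient, and the 27 °C limit are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def predict_inlet_temp(supply_temp_c: float, it_load_kw: float) -> float:
    """Toy stand-in for a calibrated thermal model (hypothetical coefficient)."""
    return supply_temp_c + 0.0625 * it_load_kw


def rectify(
    proposed: float,
    candidates: Sequence[float],
    is_safe: Callable[[float], bool],
) -> Optional[float]:
    """Return the safe candidate closest to the proposed action, or None."""
    safe_actions = [a for a in candidates if is_safe(a)]
    if not safe_actions:
        return None  # no safe action exists on this grid
    return min(safe_actions, key=lambda a: abs(a - proposed))


# Assumed operating point and safety limit, for illustration only.
IT_LOAD_KW = 80.0
LIMIT_C = 27.0

# Candidate supply-air setpoints from 16.0 to 28.0 degC in 0.5 degC steps.
setpoints = [16.0 + 0.5 * i for i in range(25)]


def safe(action: float) -> bool:
    """An action is safe if the predicted inlet temperature stays within the limit."""
    return predict_inlet_temp(action, IT_LOAD_KW) <= LIMIT_C


# The ML policy proposes 25.0 degC, which the toy model predicts to be unsafe.
rectified = rectify(25.0, setpoints, safe)
print(rectified)
```

An action that is already safe is returned unchanged, since its distance to itself is zero. A real controller would likely have multi-dimensional actions and a nonlinear model, where this grid search would be replaced by a constrained optimization.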