Cluster-wide task slowdown detection in cloud system

Slow task detection is a critical problem in cloud operation and maintenance since it is highly related to user experience and can bring substantial liquidated damages. Most anomaly detection methods detect it from a single-task aspect. However, considering millions of concurrent tasks in large-scal...

Bibliographic Details
Main Authors: CHEN, Feiyi, ZHANG, Yingying, FAN, Lunting, LIANG, Yuxuan, PANG, Guansong, WEN, Qingsong, DENG, Shuiguang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
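Subjects: Task slowdown detection; Time series; Unsupervised anomaly detection; AIOps; Anomaly detection; Cloud computing; Slow task detection; Databases and Information Systems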
Online Access: https://ink.library.smu.edu.sg/sis_research/9755
https://ink.library.smu.edu.sg/context/sis_research/article/10755/viewcontent/2408.04236v1.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10755
record_format dspace
spelling sg-smu-ink.sis_research-10755
2024-12-16T03:18:08Z
2024-09-01T07:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/9755
info:doi/10.1145/3637528.3671936
https://ink.library.smu.edu.sg/context/sis_research/article/10755/viewcontent/2408.04236v1.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Task slowdown detection
Time series
Unsupervised anomaly detection
AIOps
Anomaly detection
Cloud computing
Slow task detection
Databases and Information Systems
description Slow task detection is a critical problem in cloud operation and maintenance because it directly affects user experience and can incur substantial liquidated damages. Most anomaly detection methods detect slowdowns at the level of a single task. However, with millions of concurrent tasks in large-scale cloud computing clusters, this approach becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a cluster malfunction, because task durations fluctuate violently in a virtual environment. We therefore shift our attention to cluster-wide task slowdowns and use the distribution of task durations across a cluster, so that the computational complexity is independent of the number of tasks. The task duration distribution often exhibits compound periodicity and local exceptional fluctuations over time. Although transformer-based methods are among the most powerful at capturing such normal variation patterns in time series, we empirically find and theoretically explain a flaw of the standard attention mechanism in reconstructing low-amplitude subperiods when dealing with compound periodicity. To tackle these challenges, we propose SORN (Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism that reconstructs the compound periodicity and a Neural Optimal Transport module that distinguishes cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in practice, we propose a picky loss function that adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets.
format text
author CHEN, Feiyi
ZHANG, Yingying
FAN, Lunting
LIANG, Yuxuan
PANG, Guansong
WEN, Qingsong
DENG, Shuiguang
author_sort CHEN, Feiyi
title Cluster-wide task slowdown detection in cloud system
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9755
https://ink.library.smu.edu.sg/context/sis_research/article/10755/viewcontent/2408.04236v1.pdf