The design of a fault management framework for cloud

High performance computing systems can have high failure rates as they feature a large number of servers and components with intensive workload. The availability of the system can be easily compromised if the failure of these subsystems is not handled correctly. This research proposes a framework of...

Bibliographic Details
Main Authors:	Chalermarrewong, Thanyalak; Achalakul, Tiranee; See, Simon Chong Wee
Other Authors:	School of Mechanical and Aerospace Engineering
Format:	Conference or Workshop Item
Language:	English
Published:	2013
Online Access:	https://hdl.handle.net/10356/97591 http://hdl.handle.net/10220/11862
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-97591
record_format	dspace
spelling	sg-ntu-dr.10356-975912020-03-07T13:26:33Z The design of a fault management framework for cloud Chalermarrewong, Thanyalak Achalakul, Tiranee See, Simon Chong Wee School of Mechanical and Aerospace Engineering International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (9th : 2012 : Phetchaburi, Thailand) High performance computing systems can have high failure rates as they feature a large number of servers and components with intensive workload. The availability of the system can be easily compromised if the failure of these subsystems is not handled correctly. This research proposes a framework of proactive fault tolerance for enterprise cloud computing systems. The main idea is to create an effective prediction model focusing on hardware failure. The proposed framework features two major components: monitoring and availability analysis. For each machine, the availability analysis module tracks historical states, and predicts the machine future state. Depending on the predicted state, the resource manager decides whether the machine requires task migration to prevent possible losses. By using task migration, the framework eliminates the cost of job replication and back up. The framework also includes the adequacy checking function into availability analysis in order to periodically evaluate and adjust the prediction model. The framework can thus be adopted by heterogeneous datacenters. The energy efficiency can be improved as the impact of the failure to the datacenters reduces. 2013-07-18T04:31:11Z 2019-12-06T19:44:24Z 2013-07-18T04:31:11Z 2019-12-06T19:44:24Z 2012 2012 Conference Paper Chalermarrewong, T., Achalakul, T., & See, S. C. W. (2012). The design of a fault management framework for cloud. 2012 9th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON). 
https://hdl.handle.net/10356/97591 http://hdl.handle.net/10220/11862 10.1109/ECTICon.2012.6254358 en © 2012 IEEE.
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
language	English
description	High performance computing systems can have high failure rates because they comprise a large number of servers and components running intensive workloads. System availability can easily be compromised if failures of these subsystems are not handled correctly. This research proposes a proactive fault tolerance framework for enterprise cloud computing systems. The main idea is to create an effective prediction model focusing on hardware failure. The proposed framework features two major components: monitoring and availability analysis. For each machine, the availability analysis module tracks historical states and predicts the machine's future state. Depending on the predicted state, the resource manager decides whether the machine requires task migration to prevent possible losses. By using task migration, the framework eliminates the cost of job replication and backup. The framework also incorporates an adequacy-checking function into the availability analysis to periodically evaluate and adjust the prediction model. The framework can thus be adopted by heterogeneous datacenters. Energy efficiency can also improve as the impact of failures on the datacenters is reduced.
author2	School of Mechanical and Aerospace Engineering
author_facet	School of Mechanical and Aerospace Engineering Chalermarrewong, Thanyalak Achalakul, Tiranee See, Simon Chong Wee
format	Conference or Workshop Item
author	Chalermarrewong, Thanyalak Achalakul, Tiranee See, Simon Chong Wee
spellingShingle	Chalermarrewong, Thanyalak Achalakul, Tiranee See, Simon Chong Wee The design of a fault management framework for cloud
author_sort	Chalermarrewong, Thanyalak
title	The design of a fault management framework for cloud
title_sort	design of a fault management framework for cloud
publishDate	2013
url	https://hdl.handle.net/10356/97591 http://hdl.handle.net/10220/11862
_version_	1681039653635883008
