Fault-tolerant computation meets network coding: optimal scheduling in parallel computing

In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy...

Full description

Saved in:

Bibliographic Details
Main Authors:	Li, Congduan, Zhang, Yiqian, Tan, Chee Wei
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2023
Subjects:	Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing
Online Access:	https://hdl.handle.net/10356/172081
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Nanyang Technological University
Language:	English

id	sg-ntu-dr.10356-172081
record_format	dspace
spelling	sg-ntu-dr.10356-1720812023-11-22T01:21:46Z Fault-tolerant computation meets network coding: optimal scheduling in parallel computing Li, Congduan Zhang, Yiqian Tan, Chee Wei School of Computer Science and Engineering Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy to enable fault-tolerant reliable computation to protect the integrity of computation. Specifically, we determine the optimal redundancy-failure rate tradeoff to incorporate redundancy into parallel computing units running multiple-precision arithmetics, like the Chinese Remainder Theorem, that are useful for applications such as asymmetric cryptography and fast integer multiplication. Inspired by network coding in distributed storage for disk failures, we propose coding matrices to strategically map partial computation to available computing units, so that the central unit can reliably reconstruct the results of any failed machine without recalculations to yield the final correct computation output. We propose optimization-based algorithms to efficiently construct the optimal coding matrices subject to fault tolerance specifications. Performance evaluation demonstrates that the optimal scheduling effectively reduces the overall running time of parallel computing while resisting wide-ranging failure rates. Ministry of Education (MOE) This work was supported in part by the National Natural Science Foundation of China (NSFC) with grant no. 62271514, by the Science, Technology and Innovation Commission of Shenzhen Municipality with grant no. JCYJ20210324120002007, JCYJ20190807155617099, and ZDSYS20210623091807023, by the Research Fund of Guangxi Key Laboratory of Multi-source Information Mining & Security (MIMS20-01) and by the Ministry of Education, Singapore, under its Academic Research Fund (No. 022307 and AcRF RG91/22). 2023-11-22T01:21:46Z 2023-11-22T01:21:46Z 2023 Journal Article Li, C., Zhang, Y. & Tan, C. W. (2023). Fault-tolerant computation meets network coding: optimal scheduling in parallel computing. IEEE Transactions On Communications, 71(7), 3847-3860. https://dx.doi.org/10.1109/TCOMM.2023.3275166 0090-6778 https://hdl.handle.net/10356/172081 10.1109/TCOMM.2023.3275166 2-s2.0-85159794677 7 71 3847 3860 en 022307 RG91/22 IEEE Transactions on Communications © 2023 IEEE. All rights reserved.
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing
spellingShingle	Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing Li, Congduan Zhang, Yiqian Tan, Chee Wei Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
description	In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy to enable fault-tolerant reliable computation to protect the integrity of computation. Specifically, we determine the optimal redundancy-failure rate tradeoff to incorporate redundancy into parallel computing units running multiple-precision arithmetics, like the Chinese Remainder Theorem, that are useful for applications such as asymmetric cryptography and fast integer multiplication. Inspired by network coding in distributed storage for disk failures, we propose coding matrices to strategically map partial computation to available computing units, so that the central unit can reliably reconstruct the results of any failed machine without recalculations to yield the final correct computation output. We propose optimization-based algorithms to efficiently construct the optimal coding matrices subject to fault tolerance specifications. Performance evaluation demonstrates that the optimal scheduling effectively reduces the overall running time of parallel computing while resisting wide-ranging failure rates.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering Li, Congduan Zhang, Yiqian Tan, Chee Wei
format	Article
author	Li, Congduan Zhang, Yiqian Tan, Chee Wei
author_sort	Li, Congduan
title	Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_short	Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_full	Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_fullStr	Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_full_unstemmed	Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_sort	fault-tolerant computation meets network coding: optimal scheduling in parallel computing
publishDate	2023
url	https://hdl.handle.net/10356/172081
_version_	1783955645357096960

Fault-tolerant computation meets network coding: optimal scheduling in parallel computing

Similar Items