Fault-tolerant computation meets network coding: optimal scheduling in parallel computing

In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy...

Full description

Saved in:
Bibliographic Details
Main Authors: Li, Congduan, Zhang, Yiqian, Tan, Chee Wei
Other Authors: School of Computer Science and Engineering
Format: Article
Language:English
Published: 2023
Subjects:
Online Access:https://hdl.handle.net/10356/172081
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-172081
record_format dspace
spelling sg-ntu-dr.10356-1720812023-11-22T01:21:46Z Fault-tolerant computation meets network coding: optimal scheduling in parallel computing Li, Congduan Zhang, Yiqian Tan, Chee Wei School of Computer Science and Engineering Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy to enable fault-tolerant reliable computation to protect the integrity of computation. Specifically, we determine the optimal redundancy-failure rate tradeoff to incorporate redundancy into parallel computing units running multiple-precision arithmetics, like the Chinese Remainder Theorem, that are useful for applications such as asymmetric cryptography and fast integer multiplication. Inspired by network coding in distributed storage for disk failures, we propose coding matrices to strategically map partial computation to available computing units, so that the central unit can reliably reconstruct the results of any failed machine without recalculations to yield the final correct computation output. We propose optimization-based algorithms to efficiently construct the optimal coding matrices subject to fault tolerance specifications. Performance evaluation demonstrates that the optimal scheduling effectively reduces the overall running time of parallel computing while resisting wide-ranging failure rates. Ministry of Education (MOE) This work was supported in part by the National Natural Science Foundation of China (NSFC) with grant no. 62271514, by the Science, Technology and Innovation Commission of Shenzhen Municipality with grant no. JCYJ20210324120002007, JCYJ20190807155617099, and ZDSYS20210623091807023, by the Research Fund of Guangxi Key Laboratory of Multi-source Information Mining & Security (MIMS20-01) and by the Ministry of Education, Singapore, under its Academic Research Fund (No. 022307 and AcRF RG91/22). 2023-11-22T01:21:46Z 2023-11-22T01:21:46Z 2023 Journal Article Li, C., Zhang, Y. & Tan, C. W. (2023). Fault-tolerant computation meets network coding: optimal scheduling in parallel computing. IEEE Transactions On Communications, 71(7), 3847-3860. https://dx.doi.org/10.1109/TCOMM.2023.3275166 0090-6778 https://hdl.handle.net/10356/172081 10.1109/TCOMM.2023.3275166 2-s2.0-85159794677 7 71 3847 3860 en 022307 RG91/22 IEEE Transactions on Communications © 2023 IEEE. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Fault Tolerant Control
Parallel Computing
spellingShingle Engineering::Computer science and engineering
Fault Tolerant Control
Parallel Computing
Li, Congduan
Zhang, Yiqian
Tan, Chee Wei
Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
description In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy to enable fault-tolerant reliable computation to protect the integrity of computation. Specifically, we determine the optimal redundancy-failure rate tradeoff to incorporate redundancy into parallel computing units running multiple-precision arithmetics, like the Chinese Remainder Theorem, that are useful for applications such as asymmetric cryptography and fast integer multiplication. Inspired by network coding in distributed storage for disk failures, we propose coding matrices to strategically map partial computation to available computing units, so that the central unit can reliably reconstruct the results of any failed machine without recalculations to yield the final correct computation output. We propose optimization-based algorithms to efficiently construct the optimal coding matrices subject to fault tolerance specifications. Performance evaluation demonstrates that the optimal scheduling effectively reduces the overall running time of parallel computing while resisting wide-ranging failure rates.
author2 School of Computer Science and Engineering
author_facet School of Computer Science and Engineering
Li, Congduan
Zhang, Yiqian
Tan, Chee Wei
format Article
author Li, Congduan
Zhang, Yiqian
Tan, Chee Wei
author_sort Li, Congduan
title Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_short Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_full Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_fullStr Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_full_unstemmed Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
title_sort fault-tolerant computation meets network coding: optimal scheduling in parallel computing
publishDate 2023
url https://hdl.handle.net/10356/172081
_version_ 1783955645357096960