Fault-tolerant computation meets network coding: optimal scheduling in parallel computing
In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy...
Saved in:
Main Authors: | , , |
---|---|
Other Authors: | |
Format: | Article |
Language: | English |
Published: |
2023
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/172081 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
id |
sg-ntu-dr.10356-172081 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1720812023-11-22T01:21:46Z Fault-tolerant computation meets network coding: optimal scheduling in parallel computing Li, Congduan Zhang, Yiqian Tan, Chee Wei School of Computer Science and Engineering Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy to enable fault-tolerant reliable computation to protect the integrity of computation. Specifically, we determine the optimal redundancy-failure rate tradeoff to incorporate redundancy into parallel computing units running multiple-precision arithmetics, like the Chinese Remainder Theorem, that are useful for applications such as asymmetric cryptography and fast integer multiplication. Inspired by network coding in distributed storage for disk failures, we propose coding matrices to strategically map partial computation to available computing units, so that the central unit can reliably reconstruct the results of any failed machine without recalculations to yield the final correct computation output. We propose optimization-based algorithms to efficiently construct the optimal coding matrices subject to fault tolerance specifications. Performance evaluation demonstrates that the optimal scheduling effectively reduces the overall running time of parallel computing while resisting wide-ranging failure rates. Ministry of Education (MOE) This work was supported in part by the National Natural Science Foundation of China (NSFC) with grant no. 62271514, by the Science, Technology and Innovation Commission of Shenzhen Municipality with grant no. JCYJ20210324120002007, JCYJ20190807155617099, and ZDSYS20210623091807023, by the Research Fund of Guangxi Key Laboratory of Multi-source Information Mining & Security (MIMS20-01) and by the Ministry of Education, Singapore, under its Academic Research Fund (No. 022307 and AcRF RG91/22). 2023-11-22T01:21:46Z 2023-11-22T01:21:46Z 2023 Journal Article Li, C., Zhang, Y. & Tan, C. W. (2023). Fault-tolerant computation meets network coding: optimal scheduling in parallel computing. IEEE Transactions On Communications, 71(7), 3847-3860. https://dx.doi.org/10.1109/TCOMM.2023.3275166 0090-6778 https://hdl.handle.net/10356/172081 10.1109/TCOMM.2023.3275166 2-s2.0-85159794677 7 71 3847 3860 en 022307 RG91/22 IEEE Transactions on Communications © 2023 IEEE. All rights reserved. |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing |
spellingShingle |
Engineering::Computer science and engineering Fault Tolerant Control Parallel Computing Li, Congduan Zhang, Yiqian Tan, Chee Wei Fault-tolerant computation meets network coding: optimal scheduling in parallel computing |
description |
In large-scale parallel computing systems, machines and the network suffer from non-negligible faults, often leading to system crashes. The traditional method to increase reliability is to restart the failed jobs. To avoid unnecessary time wasted on reboots, we propose an optimal scheduling strategy to enable fault-tolerant reliable computation to protect the integrity of computation. Specifically, we determine the optimal redundancy-failure rate tradeoff to incorporate redundancy into parallel computing units running multiple-precision arithmetics, like the Chinese Remainder Theorem, that are useful for applications such as asymmetric cryptography and fast integer multiplication. Inspired by network coding in distributed storage for disk failures, we propose coding matrices to strategically map partial computation to available computing units, so that the central unit can reliably reconstruct the results of any failed machine without recalculations to yield the final correct computation output. We propose optimization-based algorithms to efficiently construct the optimal coding matrices subject to fault tolerance specifications. Performance evaluation demonstrates that the optimal scheduling effectively reduces the overall running time of parallel computing while resisting wide-ranging failure rates. |
author2 |
School of Computer Science and Engineering |
author_facet |
School of Computer Science and Engineering Li, Congduan Zhang, Yiqian Tan, Chee Wei |
format |
Article |
author |
Li, Congduan Zhang, Yiqian Tan, Chee Wei |
author_sort |
Li, Congduan |
title |
Fault-tolerant computation meets network coding: optimal scheduling in parallel computing |
title_short |
Fault-tolerant computation meets network coding: optimal scheduling in parallel computing |
title_full |
Fault-tolerant computation meets network coding: optimal scheduling in parallel computing |
title_fullStr |
Fault-tolerant computation meets network coding: optimal scheduling in parallel computing |
title_full_unstemmed |
Fault-tolerant computation meets network coding: optimal scheduling in parallel computing |
title_sort |
fault-tolerant computation meets network coding: optimal scheduling in parallel computing |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/172081 |
_version_ |
1783955645357096960 |