Soft-error tolerant design for satellite-board computations

High-energy particles in outer space can flip the state of latches in electronic devices. Such an upset is called a soft error because it does not permanently damage the device. However, a soft error may cause faults in a processor and lead to malfunction of the comput...

Bibliographic Details
Main Author: Zhang, Lei
Other Authors: Hsu Wen Jing
Format: Theses and Dissertations
Language: English
Published: 2016
Subjects:
Online Access: https://hdl.handle.net/10356/69206
Institution: Nanyang Technological University
id sg-ntu-dr.10356-69206
record_format dspace
spelling sg-ntu-dr.10356-69206 2023-03-04T00:38:55Z Soft-error tolerant design for satellite-board computations Zhang, Lei Hsu Wen Jing School of Computer Engineering Centre for High Performance Embedded Systems DRNTU::Engineering MASTER OF ENGINEERING (SCE) 2016-11-28T02:50:20Z 2016-11-28T02:50:20Z 2016 Thesis https://hdl.handle.net/10356/69206 10.32657/10356/69206 en 109 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering
spellingShingle DRNTU::Engineering
Zhang, Lei
Soft-error tolerant design for satellite-board computations
description High-energy particles in outer space can flip the state of latches in electronic devices. Such an upset is called a soft error because it does not permanently damage the device. However, a soft error may cause faults in a processor and lead to malfunction of the computer system on a satellite. In this thesis, we implement a novel soft-error fault-tolerant scheme on the LEON2/3 processor to protect it from soft errors. We verify the correctness of the scheme, evaluate its overhead, and determine the critical resources that should be protected.
Our scheme has two parts: a sensor network and a rollback scheme. Each sensor monitors a target register and asserts if that register is flipped by a soft error. The rollback scheme is adapted from a synchronization feature of the LEON2 and LEON3 processors that was originally intended to synchronize the Floating Point Unit (FPU), the cache, and the Integer Unit (IU). We use it to stall the IU when a soft error is detected and to recover from the error by re-executing the current operation. To verify the correctness of the sensor and rollback scheme, we inject errors during the execution of a large number of LEON2/3 instructions. The results show that all instruction rollbacks are correct. To evaluate the overhead, we determine the time and resource penalties of the scheme. The test results show that it incurs a penalty of only one extra clock cycle in about 90% of test cases, and that adding one 32-bit sensor increases resource usage by 0.282% relative to the original processor.
To identify the critical resources for protection, we define the weight of each instruction based on how frequently it is used. We also monitor the number of accesses to the internal registers and status bits in LEON3. We then compute an impact factor (IF) for each internal register and status bit from the register access frequencies and the instruction weights, and use the IF to identify the most critical resources. The results show that there are 233 registers and status bits in total. Of these, 91 have an IF of 100%, which suggests they are the critical resources, and 43 have an IF between 1.9% and 97.6%, which means they are less important. The IF of the remaining 89 is 0 because they are not used in our target application. Our work shows that the sensor network and rollback scheme can protect the processor from soft errors while incurring minimal penalties. By analyzing the vulnerability of registers and status bits, we can selectively deploy limited resources on the most critical registers and bits. Our scheme can therefore effectively improve the robustness of computer systems for satellite applications.
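The abstract does not state the formula for the impact factor, so the following Python sketch is only one plausible reading, not the thesis's method. It assumes that an instruction's weight is its share of the executed instructions and that the IF of a register is the sum of the weights of the instructions that access it. The trace, the access map, and the register names (`pc`, `icc`, `addr`, `fsr`) are hypothetical illustration data, not measurements from LEON3.

```python
from collections import Counter


def instruction_weights(trace):
    """Weight of an instruction = its share of all executed instructions."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def impact_factors(trace, accesses, registers):
    """Assumed formula: IF(r) = sum of weights of instructions accessing r."""
    weights = instruction_weights(trace)
    return {
        reg: sum(w for op, w in weights.items() if reg in accesses.get(op, set()))
        for reg in registers
    }


# Hypothetical instruction trace of a target application.
trace = ["add", "add", "ld", "st", "add", "bne", "ld", "add"]

# Hypothetical map: which registers/status bits each instruction touches.
accesses = {
    "add": {"pc", "icc"},
    "ld": {"pc", "addr"},
    "st": {"pc", "addr"},
    "bne": {"pc", "icc"},
}

registers = ["pc", "icc", "addr", "fsr"]
factors = impact_factors(trace, accesses, registers)

# Rank resources from most to least critical.
for reg in sorted(registers, key=factors.get, reverse=True):
    print(f"{reg}: {factors[reg]:.1%}")
```

Under this reading, a register touched by every instruction reaches an IF of 100%, and a register the application never uses gets 0, which matches the three groups reported in the abstract. A ranking of this kind could then guide which registers receive sensors first when the resource budget is limited.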
author2 Hsu Wen Jing
author_facet Hsu Wen Jing
Zhang, Lei
format Theses and Dissertations
author Zhang, Lei
author_sort Zhang, Lei
title Soft-error tolerant design for satellite-board computations
title_sort soft-error tolerant design for satellite-board computations
publishDate 2016
url https://hdl.handle.net/10356/69206
_version_ 1759856340454342656