Soft-error tolerant design for satellite-board computations

Bibliographic Details
Main Author: Zhang, Lei
Other Authors: Hsu Wen Jing
Format: Theses and Dissertations
Language: English
Published: 2016
Subjects:
Online Access: https://hdl.handle.net/10356/69206
Institution: Nanyang Technological University
Description
Summary: High-energy particles in outer space can flip the state of latches in electronic devices. Such an upset is called a soft error because it does not permanently damage the device. However, a soft error may cause faults in a processor and lead to malfunction of the computer system on a satellite. In this thesis, we implement a novel soft-error fault-tolerant scheme for the LEON2/3 processor to protect it from soft errors. We verify the correctness of the scheme, evaluate its overhead, and determine which critical resources should be protected. The scheme has two parts: a sensor network and a rollback scheme. A sensor monitors a target register and asserts when that register is flipped by a soft error. The rollback scheme is adapted from a synchronization feature of the LEON2 and LEON3 processors, which was originally intended to synchronize the Floating Point Unit (FPU), the cache, and the Integer Unit (IU). We use it to stall the IU when a soft error is detected and to recover from the error by re-executing the current operation. To verify the correctness of the sensor and rollback scheme, we inject errors during the execution of a large number of LEON2/3 instructions. The results show that all instruction rollbacks are correct. To evaluate the overhead, we measure the time and resource penalties of the scheme. The test results show that the scheme incurs a penalty of only one extra clock cycle in about 90% of test cases, and that adding one 32-bit sensor increases the resource usage of the original processor by 0.282%. To identify the critical resources for protection, we define the weight of each instruction based on how frequently it is used, and we monitor the number of accesses to the internal registers and status bits in LEON3. We then compute an impact factor (IF) for each internal register and status bit from the register access frequencies and the instruction weights, and rank the resources by IF to find the most critical ones. The results show that there are 233 registers and status bits in total. Of these, 91 have an IF of 100%, which suggests they are the critical resources; 43 have an IF between 1.9% and 97.6%, which means they are less important; and the remaining 89 have an IF of 0 because they are not used in our target application. Our work shows that the sensor network and rollback scheme can protect the processor from soft errors while incurring minimal penalties. By analyzing the vulnerability of registers and status bits, limited protection resources can be deployed selectively on the most critical registers and bits. Our scheme can therefore effectively improve the robustness of computer systems for satellite applications.
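
The summary does not give the exact formula for the impact factor, so the following is only a minimal sketch under a stated assumption: an instruction's weight is its share of an executed trace, and a register's IF is the sum of the weights of the instructions that access it. Under that assumption, a register touched by every instruction scores 100% and an unused one scores 0, which matches the range reported above. The register names, opcodes, and trace are hypothetical and do not come from the thesis.

```python
from collections import Counter


def instruction_weights(trace):
    """Assumed definition: weight of an opcode = its share of the trace."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def impact_factors(weights, accesses):
    """Assumed definition: IF of a register = sum of the weights of the
    opcodes that access it. `accesses` maps register name -> set of opcodes."""
    return {
        reg: sum(weights.get(op, 0.0) for op in ops)
        for reg, ops in accesses.items()
    }


# Hypothetical four-instruction trace and register access map.
trace = ["add", "add", "ld", "st"]
accesses = {
    "pc": {"add", "ld", "st"},     # touched by every instruction
    "dcache_addr": {"ld", "st"},   # touched by memory instructions only
    "fsr": {"fadd"},               # never used in this trace
}

weights = instruction_weights(trace)
factors = impact_factors(weights, accesses)

# List registers from most to least critical.
for reg, value in sorted(factors.items(), key=lambda kv: -kv[1]):
    print(f"{reg}: {value:.1%}")
```

Sorting by IF in this way gives a priority order for where a limited number of sensors might be placed first.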