Scalable techniques for extending lifetime reliability of manycore systems

Bibliographic Details
Main Author: Rathore, Vijeta
Other Authors: Thambipillai Srikanthan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/144282
Institution: Nanyang Technological University
Description
Summary: Manycore systems are increasingly sought for computing complex applications from diverse domains. The absence of effective load balancing among the many cores accelerates premature failures, thereby reducing lifetime reliability. This is especially so because the reduced device dimensions, increased power density, and elevated chip temperature of the nano-era pose a severe threat to lifetime reliability. In this thesis, several novel and scalable techniques are proposed for extending the lifetime reliability of manycore systems while meeting system requirements, including performance, power, and temperature. A performance-aware dynamic task mapping is proposed first for extending the lifetime reliability of multicore systems under periodic workloads. The proposed process variation-aware mapping methodology relies on sensor-based measurement of core aging due to the negative-bias temperature instability (NBTI) wearout mechanism. It capitalizes on an extensive design space exploration performed at design time and employs a responsive online task mapping technique to alleviate rapid aging. Evaluation of the proposed method for a 16-core system demonstrates up to a 2.5X extension of system lifetime reliability over performance-greedy and temperature-lowering task mapping techniques while meeting the performance requirements. Next, a hierarchical task mapping is proposed to overcome the scalability limitations. A locality-based core grouping is introduced to reduce the size of the design space. The mapping spread is controlled to support scalability, and the sleeping cores are exploited for temperature mitigation towards lifetime reliability gains. The effects of process variation on core frequency, power, and aging are taken into account to facilitate uniform aging while meeting the performance, power budget, and maximum safe temperature constraints.
Evaluation of the proposed method is performed using a proposed lifetime reliability simulator for process variation-affected 64-core and 256-core systems. The results demonstrate that the proposed method leads to up to 60% less degradation of system lifetime reliability at the end of five years, compared to state-of-the-art aging-aware mapping methods. In order to take advantage of the execution slack for extending lifetime reliability, a dynamic per-core voltage-frequency management technique is proposed next. It relies on a greedy per-core voltage-frequency selection and employs mathematical models of aging and process variation to account for their impact on the dynamic voltage and frequency scaling (DVFS) voltage-frequency levels. The proposed technique paves the way for meeting the performance requirements of the applications under tighter power and temperature constraints. Extensive experiments are conducted to compare the proposed technique against two state-of-the-art aging-aware mapping methods, for 64-core and 256-core systems running applications from the PARSEC and SPLASH-2 benchmark suites. When compared to existing methods, the proposed technique achieves up to a 3.6X extension of system lifetime reliability, along with 62% fewer failed cores at the end of ten years. A reinforcement learning (RL)-based dynamic task mapping is proposed next to support aperiodic workloads. To facilitate scalability, the proposed technique dynamically segregates the cores into frequency-wise bins of fixed maximum size, from which the mapping heuristics are constructed. In addition, it adaptively capitalizes on the specific aging behavior of the applications with different mapping heuristics. Unlike existing aging-management methods that rely on the uniform treatment of all the cores, the proposed technique relies on a performance-centric management strategy for saving cores from uneven aging.
Experiments are conducted on a 256-core system to compare the proposed technique against the hierarchical mapping technique and a state-of-the-art aging optimization method. When compared to these two approaches, the RL-based mapping technique improves the maximum operating frequency of 69% and 82% of the cores, respectively. It also leads to up to 13% less degradation of system lifetime reliability at the end of the ten thousand simulated scheduling intervals. Communication-aware mapping heuristics, supported by flexible-sized core binning, are then proposed as enhancements to the RL-based mapping technique. In addition, a holistic aging model, combining multiple wearout mechanisms, is introduced to encapsulate real-life conditions. Experiments are conducted to compare the proposed method with a state-of-the-art aging optimization technique on a 256-core system. The results confirm that the proposed method consistently results in less aging for all the cores and 97% less degradation of system lifetime reliability, compared to the other mapping techniques. Moreover, the communication requirements and frequency constraints of the applications are also fully met. The major contributions of the research presented in this thesis to the state-of-the-art are scalable and efficient system-level resource management techniques for extending the lifetime reliability of manycore systems. Finally, the proposed techniques for task mapping and DVFS have led to a scalable and efficient methodology for extending lifetime reliability of manycore systems executing complex applications without requiring prior knowledge of the application characteristics.
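The summary mentions segregating cores into frequency-wise bins of fixed maximum size but does not say how the bins are built. The following is only an illustrative sketch under one simple assumption: cores are sorted by their current maximum operating frequency and the sorted list is cut into chunks. The function name `bin_cores_by_frequency`, the parameter `max_bin_size`, and the example frequencies are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List


def bin_cores_by_frequency(
    max_freq_ghz: Dict[int, float], max_bin_size: int
) -> List[List[int]]:
    """Group core IDs into bins of at most max_bin_size, fastest cores first.

    Assumption (not from the thesis): bins are formed by sorting cores on
    their current maximum frequency and cutting the sorted list into chunks.
    """
    if max_bin_size < 1:
        raise ValueError("max_bin_size must be at least 1")
    # Sort by descending frequency; break ties by core ID for determinism.
    ordered = sorted(max_freq_ghz, key=lambda core: (-max_freq_ghz[core], core))
    return [
        ordered[i : i + max_bin_size]
        for i in range(0, len(ordered), max_bin_size)
    ]


# Hypothetical per-core maximum frequencies in GHz (illustrative values only).
example_freqs = {0: 2.0, 1: 1.8, 2: 2.2, 3: 1.9, 4: 2.1}
example_bins = bin_cores_by_frequency(example_freqs, max_bin_size=2)
```

Under this assumption, calling the function again whenever the measured per-core frequencies change would keep the bins up to date, which is one possible reading of "dynamic" segregation; the thesis itself may construct and update its bins differently.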