2024-03-28T09:35:34Z
https://www.tdx.cat/oai/request
oai:www.tdx.cat:10803/397670
2017-08-29T13:05:38Z
com_10803_183
col_10803_196
nam a 5i 4500
Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications
[Barcelona] :
Universitat Politècnica de Catalunya,
2016
Open access
http://hdl.handle.net/10803/397670
cr |||||||||||
AAMMDDs2016 sp ||||fsm||||0|| 0 eng|c
Subasi, Omer,
author
1 online resource (186 pages)
Thesis
Doctorate
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
2016
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Electronic theses and dissertations
Labarta Mancho, Jesús,
academic supervisor
Ünsal, Osman,
academic supervisor
TDX
As high performance computing (HPC) systems continue to grow, their fault rates increase. Applications running on these systems have to cope with failures that occur on the order of hours or days apart. Furthermore, some studies of future Exascale systems predict failures on the order of minutes apart. As a result, efficient fault tolerance solutions are needed to tolerate such frequent failures.
A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart, because long-running applications are expected to experience many faults during execution. Meanwhile, task-based dataflow parallel programming models (PM) are becoming a popular paradigm for large-scale HPC applications. For instance, task-based dataflow parallelism has been adopted in OpenMP 4.0, the OmpSs PM, Argobots and Intel Threading Building Blocks.
In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, we first design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when they are integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications.
Taken together, our schemes provide fairly high failure coverage, protecting both computation and memory against errors.
p
ES-BaCBU
cat
rda
ES-BaCBU
text
txt
rdacontent
computer
c
rdamedia
online resource
cr
rdacarrier