On algorithmic reductions in task-parallel programming models

Author

Ciesko, Jan

Director

Badia Sala, Rosa M. (Rosa Maria)

Codirector

Labarta Mancho, Jesús

Date of defense

2017-07-24

Pages

178 p.



Department/Institute

Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors

Abstract

Wide adoption of parallel processing hardware in mainstream computing, as well as the interest in efficient parallel programming in developer communities, increases the demand for programming models that support common algorithmic patterns. An algorithmic pattern of particular interest is the reduction. Reductions are iterative memory updates of a program variable and appear in many applications. While their definition is simple, the variety of their implementations, including the use of different loop constructs and calling patterns, makes their support in parallel programming models difficult. Further, their characteristic update operation over arbitrary data types, which requires atomicity, makes their execution computationally expensive and scalable execution challenging. These challenges and their relevance make reductions a benchmark for today's compilers, runtime systems, and hardware architectures. This work advances research on algorithmic reductions. It improves their programmability by adding support for task-parallel and array-type reductions. Task-parallel reductions occur in while-loops and recursive algorithms. While an iterative formulation exists for each recursive algorithm, while-loop programs represent a superclass of for-loop-computable programs and therefore cannot be transformed or substituted. This limitation requires explicit support for reduction algorithms that fall within this class. Since tasks are well suited to a concurrent formulation of these algorithms, the presented work focuses on language extensions to the task construct in OmpSs and OpenMP. In the first section of this work we present generic support for task-parallel reductions in OmpSs and OpenMP and introduce the ideas of reduction scope, reduction domains, and static and on-demand memory allocation. With this foundation and the feedback received from the OpenMP language review board, we develop a formalized proposal to add support for task-parallel reductions in OpenMP.
This engagement led to a fruitful outcome, as our proposal has recently been accepted into OpenMP. As a first step towards support for array-type reductions in a task-parallel programming model, we present a landscape of support techniques and group them by their underlying strategy. Techniques follow the strategy of either direct access (atomics), redirection, or iteration ordering. We call techniques that implement redirection into thread-private data containers techniques with alternative memory layouts (AMLs), and techniques based on iteration ordering techniques with alternative iteration spaces (AIS). Universal support for AML-based techniques in parallel programming models can be achieved by defining the basic interface methods allocate, get, and reduce. As examples of new techniques that implement this interface, we present CachedPrivate and PIBOR. CachedPrivate implements a software cache to reduce communication caused by irregular accesses to remote nodes on distributed-memory systems. PIBOR implements Privatization with In-lined Block-ordering, a technique that improves data locality by redirecting accesses into thread-local bins. Both techniques implement a get method that returns private memory storage for each update operation of the reduction loop. As an example of a technique with an alternative iteration space (AIS), we present Commutative Reductions (ComRed). This technique uses an inspector-executor execution model to generate knowledge about memory access patterns and memory overlaps between participating tasks. This information is used during the execution phase to schedule tasks with overlaps commutatively. We show that this execution model requires only a small set of additional language constructs. Performance results obtained throughout the chapters of this work demonstrate that these software techniques can improve application performance by a factor of 2-4.


The wide adoption of parallel processing hardware in mainstream computing, together with the interest in efficient parallel programming within the developer community, has increased the demand for programming models that support common algorithmic patterns. An algorithmic pattern of particular interest is the reduction. Reductions are iterative memory updates of a program variable and appear in many applications. Although their definition is simple, the variety of their implementations, including the use of different loop constructs and calling patterns, makes their support in parallel programming models difficult and requires careful design with respect to programmability, transparency, and performance. Moreover, the atomicity required when executing these operations makes them computationally expensive and hard to scale. These challenges and their relevance make this class of operations a benchmark for today's compilers, runtime systems, and hardware architectures. Driven by the need for an efficient implementation in our parallel programming model, we developed the new ideas presented in this work. Our contributions are as follows: first, we add support for task-parallel reductions (for while-loops and recursive functions) in the OmpSs programming model and develop a proposal for their inclusion in the OpenMP specification. Second, we develop new techniques to accelerate irregular and quasi-regular array-type reductions and evaluate their impact on several applications across various architectures. Third, we show how these techniques can be supported in OmpSs and OpenMP.
We also show that reductions benefit from intelligent runtime systems implementing an inspector-executor scheme. Our task-parallel reduction proposal has recently been accepted into the OpenMP standard.

Subjects

004 - Computer science and technology. Computing. Data processing

Knowledge Area

UPC subject areas::Computer science

Documents

TJC1de1.pdf

6.335 MB

 

Rights

Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by-nc-nd/4.0/
