During SC14, Michael Klemm from Intel and myself teamed up to give an OpenMP 4.0 overview talk at the OpenMP booth. Our goal was to touch on all important aspects, from thread binding over tasking to accelerator support, and to entertain our audience in doing so. Although not all jokes translate from German to English as we intended, I absolutely think that the resulting video is a fun-oriented 25-minutes run-down of OpenMP 4.0 and worth sharing here:
Since 2001 already, the IT Center (formerly: Center for Computing and Communication) of RWTH Aachen University offers a one week HPC workshop on Parallel Programming during spring time. This course is not restricted to scientists and engineers from our university, in fact we have about 30% of external attendees each time. This year we were very happy about a record attendance of up to 85 persons for the OpenMP lectures on Wednesday. As usual we publish all course materials online, but this year we also created screencasts from all presentations. That means you see the slides and the live demos and you hear the presenter talk. This blog post contains links to both the screencasts as well as the other course material, sorted by topic.
We have three talks as an introduction to OpenMP from Wednesday and two talks on selected topics from Thursday, which were vectorization and tools.
Introduction to OpenMP Programming (part 1), by Christian Terboven:
Getting OpenMP up to Speed, by Ruud van der Pas:
Introduction to OpenMP Programming (part 2), by Christian Terboven:
Vectorization with OpenMP, by Dirk Schmidl:
Tools for OpenMP Programming, by Dirk Schmidl:
We have two talks as an introduction to MPI and one on using the Vampir toolchain, all from Tuesday.
Introduction to MPI Programming (part 1), by Hristo Iliev:
Introduction to MPI Programming (part 2), by Hristo Iliev:
Introduction to VampirTrace and Vampir by Hristo Iliev:
Intel Xeon Phi
We put a special focus on presenting this architecture and we have one overview talk and one talk on using OpenMP 4.0 constructs for this architecture.
Programming the Intel Xeon Phi Coprocessor Overview, by Tim Cramer:
OpenMP 4.0 for Accelerators, by Christian Terboven:
Quoting from openmp.org: OpenMP, the de-facto standard for parallel programming on shared memory systems, continues to extend its reach beyond pure HPC to include embedded systems, real time systems, and accelerators. Release Candidate 1 of the OpenMP 4.0 API specifications currently under development is now available for public discussion. This update includes thread affinity, initial support for Fortran 2003, SIMD constructs to vectorize both serial and parallelized loops, user-defined reductions, and sequentially consistent atomics. The OpenMP ARB plans to integrate the Technical Report on directives for attached accelerators, as well as more new features, in a final Release Candidate 2, to appear sometime in the first Quarter of 2013, followed by the finalized full 4.0 API specifications soon thereafter.
Expect big news on OpenMP 4.0 for next week’s SC12. The OpenMP Language Committee – responsible for developing the standard – always planned to release the next version of the standard as a draft for public comment in time for SC12. We worked very hard during the last weeks to stay within our schedule. And we will do the following:
Release OpenMP 4.0 RC1 as a draft for public review. This document will be in a pretty good shape and will represent the foundation of OpenMP 4.0. It will contain several new features, to be discussed and explained during SC12 at our booth and/or the OpenMP BoF. Among these new features is the SIMD construct, to vectorize both serial as well as parallelized loops, taskgroups (no task dependencies yet), thread binding via places (I talked a lot on this already), array sectioning, basic support for Fortran 2003, and some other minor corrections and improvements.
Publish a Technical Report on OpenMP for Accelerators, more specifically on “Directives for Attached Accelerators”. This was always planned to be the major addition for OpenMP 4.0. However, integrating support for accelerators with the rest of OpenMP is a hard task and a lot of work, and it is not 100% done yet. There were many discussion on how to deal with this situation: do as outlined here, wait for just some more weeks, come up with a completely new schedule and wait until we are completely done, … . Almost all technical aspects have been discussed and answered. But the wording is not yet completed. And support for NVIDIA-like GPUs might not be optimal. However, I personally think the proposal is really good and the big opportunity in making the current state of work public is that the HPC community can take a look at it, think about it, comment on it, and possibly improve it. It is already online: http://openmp.org/wp/openmp-specifications/.
Hoping for constructive feedback and taking the additional time to work on the OpenMP for Accelerator extension, the current plan is to come up with a second draft for public comment (RC2) in January 2013 and then finalize the standard quickly after. Quickly in terms of a few weeks. This plan is still ambitious, but I think this is a good plan.
If you want to learn more, come to the OpenMP booth, and come to the BoF on Tuesday afternoon, 17:30h, which unluckily I will not be able to attend myself :-/. Listen to what the people will show you and let us know what you like and what you dislike.
Exascale machines will employ significantly more threads than today, but even on current architectures controlling thread affinity is crucial to fuel all the cores and to maintain data affinity, but both MPI and OpenMP lack a solution to this problem – this is the first sentence of our IWOMP 2012 paper with the same title as this blog post. The need for thread affinity in OpenMP has been demonstrated several times at several occasions. Inside the OpenMP Language Committee we formed the Affinity Subcommittee and we are working on this topic since several years now. Meanwhile almost all vendors have introduced their own extensions to support thread affinity, but they are all different and thus offer a clearly suboptimal user experience. Furthermore, they do not support nested OpenMP and in general they are static, meaning that only one affinity setting can be used for the whole program. For OpenMP 4.0, which is expected to be released as a draft in November 2012, we have a good thread affinity proposal on the table that not only standardizes existing vendor extensions, but also will add additional capabilities. This blog post will present this proposal along with some information why things are the way the are. I welcome any comments or questions via email.
When we started thinking about Affinity in general, we first tried to define a machine model or rather a machine abstraction and intended to use that to bind threads to cores as well as to possibly define a data layout. Over time I got convinced that this is not the right approach. Whatever method we used to describe the machine topology, we always envisioned systems that would be very complicated to be described. But furthermore, describing the system could end up being a task to be performed by the user, which I think is too complicated for most of them. We also do not want to enforce users to think about an explicit mapping of threads to cores, because for 95 % of the OpenMP programmers we think this is too low level. And last but not least, when there would be a new machine that could not be comfortably described by our method, OpenMP develops too slowly to be extended to support that.
To overcome this problem, the current proposal as developed by Alexandre E. Eichenberger, myself and the members of the OpenMP Language Committee Affinity Subcommittee, introduced the concepts of a place and a place-list. A place is defined as a set of execution units capable of executing OpenMP threads. For now you may think of a place like a set of cores. A place-list is an ordered list of places, the ordered attribute is important. It can be defined by either using abstract names or rather constructing the places by enumerating the cores. The place-list will be used together with an affinity policy to bind the OpenMP threads in a team of a parallel region to the places in the list. It can be specified via the new environment variable OMP_PLACES (the name might still change). Lets illustrate that with an example: The figure below depicts a very standard system (node 0) with two sockets (socket 0 and socket 1), every socket having four cores (core 0 to core 3 on socket 0) and finally every core has two hardware-threads (t0 and t1), i.e. every core can execute two threads simultaneously.
Lets construct a place-list consisting of eight places, every place to be a physical core consisting of two hardware-threads (I often call those logical-threads). All of the following methods are equivalent, but we expect almost all users to use the first option:
As for now we will define three abstract names to describe the place-list: hwthreads, cores and sockets. It is up to the implementation to define what is meant to be a “core” for instance, but of course we will provide some hints. The wording on that is not yet completed, but it will be something along the lines of hwthreads := smallest unit of execution capable of executing an OpenMP thread; cores := set of execution units in which more than one hardware-thread share some resources such as caches; sockets := physical package of multiple cores.
Of course defining a place-list does not lead to any thread affinity. As I said above, the place list is just used to define the places the threads of a parallel region can be bound to. In our proposal, the user does not have to define an explicit mapping of threads to places (or execution units in a place) – instead, the user can specify a so-called affinity policy via the new affinity clause which can be put on a parallel region. Our proposal consists of currently three affinity policies that allow to exploit the place-list in several possible ways (the names might still change):
SPREAD: spread OpenMP threads as evenly as possible among the places. The place-list will be partitioned, so that subsequent threads (i.e. nested OpenMP) will only be allocated within the partition. Given the place-list outlined above, this policy would provide most dedicated hardware resources to the OpenMP program.
CLOSE: pack OpenMP threads near to the master thread. There is no partitioning. Given the place-list from above, this policy would be used if sharing of resources among threads is desirable.
MASTER: collocate OpenMP threads with the master thread (in the same place). This will ensure maximum locality to the master thread.
It is important to understand that these affinity policies influence the allocation of threads to places – not directly to the system topology. In my example the (ordered!) place-list was designed so that two threads far apart from each other also end up on physical cores far apart in the system. Although we expect this to be the standard use case, it does not necessarily have to be this way.
Lets take a closer look at what the affinity policies do by looking at some examples. The figure below shows what SPREAD will do. The green box denotes the place-list, and for every number of threads >=2 the place-list will be partitioned when a parallel region with this affinity clause is encountered. This will support nested OpenMP, as we will see later on. Every thread will receive its own sub-place-list. If there are more threads than places, more than one thread has to be allocated per place. This will occur so that if threads i and i+1 are put together in one place, this will also be the case for the OpenMP thread ids i and i+1 (in this example with 16 threads: threads with OpenMP thread id 0 and 1 are on place 0).
Lets also take a brief look at the two other affinity policies we are proposing, namely CLOSE and MASTER. Both are exampled in the figure below. For CLOSE, threads i and i+1 are meant to reside on place j and j+1, unless more than one thread will be allocated per place. For MASTER, all threads will be put into the same place the master thread is running on, unless this cannot be fulfilled by the implementation for any reason.
When discussing the proprietary support offered by OpenMP implementers, I said that their solutions are static for the whole program lifetime. In our proposal the initial place-list is fixed, but the affinity policy might of course be set dynamically. Furthermore, the figure below shows how nested OpenMP is supported. The outer parallel region uses the SPREAD affinity policy to create partitions and to maximize resource usage. The inner parallel region uses CLOSE to stay within the respective partition.
Whenever a new feature is intended to go into the OpenMP specification, we require the existence of at least one reference implementation to not only prove implementability, but also to get an estimation of the effort it takes to be implemented. The reference implementation for this proposal was done by Alexandre E. Eichenberger in an experimental OpenMP runtime for the IBM BlueGene/Q system. Our proposal does not affect performance critical parts of the implementation, “just” the thread selection and allocation parts. According to Alexandre’s findings the total overhead was less than 1 %, which is in the order of system noise.
Finally, let me summarize a few important properties / implications that I did not discuss in detail so far:
If the place-list is constructed by enumerating the cores, it will be done with the same naming scheme as used by the operating system. This approach is also used by all vendor-proprietary extensions and removes the need to define an explicit naming scheme, which might confuse users if it is different from the operation system and also might become inappropriate for future system topologies that we would not foresee today.
Every implementation will provide a default place-list to an OpenMP program. It has to document what the default place-list is. I guess that implementations will provide something like cores or hwthreads as a default. This corresponds to the behavior that the number of threads to be used if not specified by the user is also implementation defined (some implementations use just 1 thread, others as many as there are cores in the system).
When one (or more) threads are allocated to a place, they are allowed to migrate within this place if it contains more than one execution unit (i.e. physical core). This will allow for both an explicit thread-to-core binding as well as a more flexible as threads to a socket, for example, depending on how the place-list is constructed as well as which affinity policy is used.
The binding of the initial thread may occur as early as the runtime decides to be appropriate, but not later than when the first parallel region is encountered.
Thanks for reading until down here. More details can be found in the paper which is published by Springer in IWOMP 2012. Again, I welcome any comments or questions via email.