Heterogeneous Habanero-C

The Heterogeneous Habanero-C (H2C) language, under development in the Habanero project at Rice University, provides an implementation of the Habanero execution model for modern heterogeneous (CPU + GPU) architectures.

Overview

The Heterogeneous Habanero-C (H2C) language, compiler and runtime framework is specifically designed to achieve portability, productivity and performance on modern heterogeneous (CPU + GPU) architectures. The main goal is to take a machine-independent program written in H2C and generate a machine-specific executable. Some highlights of H2C include:

Minimal, intuitive language extensions make it easier to write new programs and to port existing programs with little effort.
Two-stage compilation targets both domain experts and ninja parallel programmers.
Shared Virtual Memory (SVM) supports recursive pointer data structures on the GPU.
Meta datalayout framework generates a target specific data layout.
Embedded DSLs for stencils and re-use patterns take advantage of local scratchpad buffers.

H2C requires the underlying platform to support OpenCL. The H2C compiler uses a Machine Description (MDes) file, either provided by a user or automatically generated by an auto-tuner, to generate a target-specific executable. H2C uses an offload model in which the CPU is the host and both the CPU and the GPU are devices.

A short summary of the H2C framework is included below. Details on the underlying implementation technologies can be found in the Habanero publications web page. The H2C implementation is at an early stage and still evolving. If you would like to try out H2C, please contact one of the following people: Deepak Majeti or Vivek Sarkar.

H2C Language Summary

The language constructs are classified into communication, computation and synchronization constructs.

Communication Constructs

Heterogeneous Habanero-C extends the Habanero async constructs to target modern heterogeneous architectures. The async construct transfers data asynchronously among multiple devices, so computation can easily be overlapped with the data transfers. The finish statement, finish <stmt>, ensures that all data transfers within <stmt> have completed.

async [copyin (var1, var2, ...)] [copyout (var1, var2, ...)] [at (dev1, dev2, ...)] [partition (ratio)];

The copyin clause specifies the data to be copied from the host to the device.

The copyout clause specifies the data to be copied from the device to the host.

The at clause specifies the target devices.

The partition clause specifies the partition ratio.

Computation Constructs

forasync [in (var1, var2, ...)] [point (ind1, ind2, ...)] [range (siz1, siz2, ...)] [seq (seq1, seq2, ...)] [at (dev1, dev2, ...)] [partition (ratio)] {Body}

Loop indices in each dimension are specified by the point clause.

The number of iterations in each dimension is specified by the range clause.

The tile size is specified by the seq clause.

For CPUs, forasync is lowered and implemented in one of two ways:

1) Chunked Scheduling: Loop iterations are chunked into blocks of the length specified by the seq clause. (Default option)
2) Recursive Scheduling: Loop iterations are recursively partitioned until the block size specified by the seq clause is reached. This is similar to the TBB style. (Use the '-hcc:recursive' option when compiling your program.)
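
The difference between the two strategies can be illustrated with a minimal plain-C sketch over a one-dimensional range. It is not the code the H2C compiler generates: the names block_fn, schedule_chunked, schedule_recursive, tally and count_block are hypothetical, the blocks run sequentially here rather than as parallel tasks, and splitting each range at its midpoint is an assumption about how the recursive partitioning divides the work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: processes the iterations in [lo, hi). */
typedef void (*block_fn)(size_t lo, size_t hi, void *ctx);

/* Chunked scheduling: walk the range in fixed-size blocks of `seq`
 * iterations; the last block may be shorter. */
void schedule_chunked(size_t lo, size_t hi, size_t seq,
                      block_fn run, void *ctx) {
    assert(seq > 0);
    for (size_t start = lo; start < hi; ) {
        size_t end = (hi - start < seq) ? hi : start + seq;
        run(start, end, ctx);
        start = end;
    }
}

/* Recursive scheduling: split the range in half (an assumed split
 * policy) until a block holds at most `seq` iterations. */
void schedule_recursive(size_t lo, size_t hi, size_t seq,
                        block_fn run, void *ctx) {
    assert(seq > 0);
    if (lo >= hi) {
        return;                      /* empty range: nothing to do */
    }
    if (hi - lo <= seq) {
        run(lo, hi, ctx);            /* small enough: run as one block */
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    schedule_recursive(lo, mid, seq, run, ctx);
    schedule_recursive(mid, hi, seq, run, ctx);
}

/* Hypothetical block body: counts blocks and iterations seen. */
struct tally {
    size_t blocks;
    size_t iterations;
};

void count_block(size_t lo, size_t hi, void *ctx) {
    struct tally *t = ctx;
    t->blocks += 1;
    t->iterations += hi - lo;
}
```

For a range of 10 iterations and a block size of 4, the chunked version produces the blocks [0,4), [4,8), [8,10), while the recursive version produces [0,2), [2,5), [5,7), [7,10). Both cover every iteration exactly once; the recursive form simply exposes the blocks as a tree, which is what makes it a natural fit for divide-and-conquer task runtimes.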

forasync targets heterogeneous platforms by automatically generating host code and OpenCL device code.

Note: The semantics of forasync does not include a barrier. An explicit finish must enclose the forasync to synchronize all the iterations.

H2C Compiler and Runtime Framework

H2C uses a two-step compilation.

Current H2C limitations

There are some limitations and pitfalls in the current implementation of the HC programming model. These limitations are not inherent to the programming model, but rather are a result of incompleteness in the current compiler or runtime implementation.

1) Pointers to stack variables (including stack-allocated arrays) cannot be reused across "suspendable" points. A suspendable function is a function that can directly or indirectly call a function containing an async statement or a finish statement. A suspendable point is an async statement, the end of a finish statement, or a call to a suspendable function.

Work-around: copy these stack variables to the heap. There is no limitation on the reuse of heap pointers across suspendable points.
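
A minimal sketch of this work-around in plain C follows. The helper names copy_ints_to_heap and sum_after_heap_copy are hypothetical, and a comment marks where a suspendable point would sit in an actual HC program; no HC constructs are used, so the sketch compiles with an ordinary C compiler. The parameters deliberately omit const, in line with limitation 3 below.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `count` ints from `src` (for example a
 * stack-allocated array) into a freshly allocated heap buffer.
 * Returns NULL if count is 0 or the allocation fails.
 * The caller owns the result and must free() it. */
int *copy_ints_to_heap(int *src, size_t count) {
    assert(src != NULL);
    if (count == 0 || count > SIZE_MAX / sizeof(int)) {
        return NULL;
    }
    int *dst = malloc(count * sizeof *dst);
    if (dst != NULL) {
        memcpy(dst, src, count * sizeof *dst);
    }
    return dst;
}

/* Hypothetical caller: moves a stack array to the heap before the
 * place where a suspendable point would occur. Returns -1 on
 * allocation failure. */
int sum_after_heap_copy(void) {
    int local[4] = {1, 2, 3, 4};          /* stack-allocated array */
    int *heap = copy_ints_to_heap(local, 4);
    if (heap == NULL) {
        return -1;
    }

    /* In an HC program, an async statement, the end of a finish, or a
     * call to a suspendable function would go here. From this point on
     * only `heap` is used, never a pointer into `local`. */

    int total = 0;
    for (size_t i = 0; i < 4; i++) {
        total += heap[i];
    }
    free(heap);
    return total;
}
```

The cost of the work-around is one allocation and one copy per affected variable, plus the obligation to free the buffer once the last task that uses it has completed.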

2) Pointers to HC functions (functions that contain HC constructs or call other HC functions) are not supported in HC.

Work-around: only use pointers to C functions.

3) const modifiers are not supported for function parameters or local variables in HC programs.

Work-around: remove these 'const' modifiers. The semantics of a correct program will remain unchanged, since the only purpose of the 'const' modifiers is to enforce additional compiler checking.

4) The number of tasks registered on a phaser cannot be larger than the number of worker threads specified with the -nproc option when an HC program is invoked. Otherwise, a deadlock may occur.

5) HC function calls must be in canonical form: a call must either be a statement on its own or be the right-hand side of an assignment.

Canonicalized HC function usage

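foo();      // call used as a statement
b = foo();  // call used as the right-hand side of an assignment
a = b + c;  // instead of a = foo() + c; the value of 'foo' must first be assigned to a variable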

Acknowledgement

Partial support for Habanero-C was provided through the CDSC program of the National Science Foundation with an award in the 2009 Expedition in Computing Program.
