
The Heterogeneous Habanero-C (H2C) language, under development in the Habanero project at Rice University, provides an implementation of the Habanero execution model for modern heterogeneous (CPU + GPU) architectures.



Overview

The Heterogeneous Habanero-C (H2C) language, compiler and runtime framework is specifically designed to achieve portability, productivity and performance on modern heterogeneous (CPU + GPU) architectures. The main goal is to take a machine-independent program written in H2C and generate a machine-specific executable.

Some highlights of H2C include:

  1. Minimal, intuitive language extensions make it easier to write new programs and port existing programs.
  2. Two-stage compilation targets both domain experts and ninja parallel programmers.
  3. Shared Virtual Memory (SVM) supports recursive pointer data structures on the GPU.
  4. A data layout framework generates target-specific data layouts.
  5. Embedded DSLs for stencils and re-use patterns take advantage of local scratchpad buffers.

Unlike HJ, which needs a JVM to run, Habanero-C is designed to be mapped onto hardware platforms with lightweight system software stacks, such as the Customizable Heterogeneous Platform (CHP) being developed in the NSF Expeditions Center for Domain-Specific Computing (CDSC), which includes CPUs, GPUs, and FPGAs. The C foundation also makes it easier to integrate HC with communication middleware for cluster systems, such as MPI and GASNet.

H2C requires the underlying platform to support OpenCL. The H2C compiler relies on a Machine Description (MDes) file, which can be provided by the user or automatically generated by an auto-tuner. H2C uses an offload model wherein the CPU is the host and both CPUs and GPUs are devices.

The Habanero-C compiler is written in C++ and is built on top of the ROSE compiler infrastructure, which was also used in the DARPA-funded PACE project at Rice University. The bulk of the Habanero-C runtime has been written from scratch in portable ANSI C. However, a few library routines for low-level synchronization and atomic operations are written in assembly language for the target platform. To date, the Habanero-C runtime has been ported and tested on Intel x86, Cyclops 64, Power7, Sun Niagara 2 and Intel SCC multicore platforms.

A short summary of the H2C framework is included below. Details on the underlying implementation technologies can be found on the Habanero publications web page. The H2C implementation is still at an early stage and evolving. If you would like to try out H2C, please contact Deepak Majeti or Vivek Sarkar.

Habanero-C has two basic primitives for the task parallel programming model borrowed from X10: async and finish. The async statement, async <stmt>, causes the parent task to fork a new child task that executes <stmt>. Execution of the async statement returns immediately, i.e., the parent task can proceed to its following statement without waiting for the child task to complete. The finish statement, finish <stmt>, performs a join operation that causes the parent task to execute <stmt> and then wait until all the tasks created within <stmt> have terminated (including transitively spawned tasks).

The Habanero-C runtime uses a work-stealing scheduler that supports work-first and help-first policies, along with phasers for synchronization and places for locality.

  • Habanero-C uses phasers for synchronization. Phasers are programming constructs that unify collective and point-to-point synchronization in task parallel programming. Phasers are designed for ease of use and safety, which helps programmer productivity in task parallel programming and debugging. The use of phasers guarantees two safety properties: deadlock-freedom and phase-ordering. These properties, along with the generality of their use for dynamic parallelism, distinguish phasers from other synchronization constructs such as barriers, counting semaphores and X10 clocks. In Habanero-C, tasks can register on a phaser in one of three modes: SIGNAL_WAIT_MODE, SIGNAL_ONLY_MODE, WAIT_ONLY_MODE.
  • For locality, Habanero-C uses Hierarchical Place Trees (HPTs). HPTs abstract the underlying hardware using hierarchical trees, allowing the program to spawn tasks at places, which for example could be cores, groups of cores sharing cache, nodes, groups of nodes, or other devices such as GPUs or FPGAs. The work-stealing runtime takes advantage of the hardware hierarchy to preserve locality when executing tasks.

H2C Language Summary

The language constructs are classified into communication, computation and synchronization constructs.

Communication Constructs

The async construct is used to asynchronously transfer data among multiple devices. One can easily overlap computation with the asynchronous data transfers.

The finish statement, finish <stmt>, ensures all the data transfers within <stmt> have completed.


    async [copyin (var1, var2, ...)] [copyout (var1, var2, ...)] [at (dev1, dev2, ...)] [partition (ratio)];

    • 'copyin' clause is used to specify the data that needs to be copied to the device from the host
    • 'copyout' clause is used to specify the data that needs to be copied to the host from the device
    • 'at' clause is used to specify the targeted devices
    • 'partition' clause is used to specify the ratio of partition

    In Habanero-C, the async Stmt construct has the following semantics:

    - Asynchronously start a new task to execute Stmt in parallel with the parent. A destination place can optionally be specified for where the task should execute. The place can be obtained from the runtime using HC runtime functions (see HPT).

    - Any local variable declared in an outer scope that is used in the async has to be specified in an IN (for variables read by the async), OUT (for variables written by the async), or INOUT (for variables both read and written by the async) clause. The IN/OUT/INOUT clauses have copy-in/copy-out semantics for local variables: selected variables are copied in from the parent scope at the start of the async, and out into the parent scope at the end of the async task.

    - An AWAIT clause can optionally be specified, listing all the data-driven futures (DDFs) that the task should wait on before starting its execution.

    - A phased clause can optionally be specified, registering the async on all the phasers specified in the list (ph1, ph2, ...), or on all the phasers of the parent (if the list is not specified).

    finish Stmt

    - Execute Stmt, but wait until all (transitively) spawned asyncs in Stmt's scope have terminated before advancing to the next statement.

    Computation Constructs

    The forasync construct is a data/task parallel loop. It is the programmer's responsibility to ensure that loop iterations are independent.

    forasync [in (var1, var2, ...)] [point (ind1, ind2, ...)] [range (siz1, siz2, ...)] [seq (seq1, seq2, ...)] [scratchpad (var1, var2, ...)] [at (dev1, dev2, ...)] [partition (ratio)] {
        Body
    }

    • 'point' clause is used to specify the loop indices in each dimension
    • 'range' clause is used to specify the number of iterations in each dimension
    • 'seq' clause is used to specify the tile size or the work-group size
    • Body represents the loop iteration

    Synchronization Constructs

    The finish construct ensures all the tasks spawned inside it are completed.


    H2C Compiler and Runtime Framework

    Two Phase Compilation 

    In the first phase, the H2C compiler translates an H2C program down to a C program, an OpenCL kernel and the corresponding host program. Parallelism experts can optionally choose to add optimized OpenCL kernels. The compiler uses a Machine Description (MDes) file to generate target-specific OpenCL kernel and communication code. The MDes file can either be specified by the programmer or automatically generated by an auto-tuner.

    In the second phase, a standard C compiler is used to build an executable from the generated intermediate files along with the H2C runtime and OpenCL runtime.

     


     

    Runtime

    The H2C runtime includes a memory manager and a scheduler, and interfaces with the OpenCL runtime.


    Example H2C program

    Matrix Multiply in H2C

    finish {
        async copyin(a, b) at(dev);
        foo(); // asynchronously copy data while executing foo
    }
    finish {
        forasync in(a,b,c,m,n,p) point(i,j) range(0:m,0:n) seq(4,128) shared(a,b) at(dev) {
            float temp = 0;
            for (int k = 0; k < p; k++) {
                temp += a[i*p+k] * b[k*n+j];
            }
            c[i*n+j] = temp;
        }
    }
    finish {
        async copyout(c) at(dev);
        bar(); // asynchronously copy data while executing bar
    }

     

    -- The semantics of the in clause are the same as in the async case.

    -- Loop indices in each dimension are specified by the point clause.

    -- The number of iterations in each dimension is specified by the range clause.

    -- The tile size is specified by the seq clause.
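For reference, the iteration space of the matrix multiply forasync can be written as an ordinary C loop nest. The sketch below is a sequential equivalent, not the code the H2C compiler generates. It assumes that point(i,j) with range(0:m,0:n) enumerates i in [0, m) and j in [0, n), and it treats seq(4,128) as a 4 x 128 tile. The function name matmul_tiled and the TILE_I and TILE_J macros are hypothetical.

```c
#include <stddef.h>

/* Hypothetical tile sizes mirroring seq(4,128) in the H2C example. */
#define TILE_I 4
#define TILE_J 128

/* Sequential sketch: c (m x n) = a (m x p) * b (p x n), all row-major.
 * The two outer loops walk tiles; the two inner loops walk the points
 * (i, j) inside one tile, which is the body of the forasync. */
void matmul_tiled(const float *a, const float *b, float *c,
                  size_t m, size_t n, size_t p)
{
    for (size_t ti = 0; ti < m; ti += TILE_I) {
        for (size_t tj = 0; tj < n; tj += TILE_J) {
            /* Clamp the tile to the edge of the iteration space. */
            size_t i_end = (m - ti < TILE_I) ? m : ti + TILE_I;
            size_t j_end = (n - tj < TILE_J) ? n : tj + TILE_J;
            for (size_t i = ti; i < i_end; i++) {
                for (size_t j = tj; j < j_end; j++) {
                    float temp = 0.0f;
                    for (size_t k = 0; k < p; k++) {
                        temp += a[i * p + k] * b[k * n + j];
                    }
                    c[i * n + j] = temp;
                }
            }
        }
    }
}
```

Because each (i, j) point writes only c[i*n+j] and reads a and b without modifying them, the iterations are independent, which is the property forasync requires of the programmer.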

    forasync is lowered and implemented for CPUs in two different ways:

    1) Chunked Scheduling: Loop iterations are chunked into blocks of the length specified by the seq clause. This is the default option.

    2) Recursive Scheduling: Loop iterations are recursively partitioned until the block size specified by the seq clause is reached. This is similar to the TBB style. (Use the '-hcc:recursive' option when compiling your program.)

    forasync targets heterogeneous platforms by automatically generating host code and OpenCL device code.

    Note: The semantics of forasync do not include a barrier. An explicit finish must enclose the forasync to synchronize all the iterations.
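The difference between chunked and recursive scheduling can be illustrated in plain C. This is a sequential sketch of how a one-dimensional iteration space is split, not the Habanero-C runtime: in the real system each block would presumably become a task, whereas here a callback is simply invoked in place. The names block_fn, chunked, recursive, struct tally and tally_block are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical callback that processes iterations [lo, hi). */
typedef void (*block_fn)(size_t lo, size_t hi, void *ctx);

/* Chunked: cut [lo, hi) into consecutive blocks of length seq
 * (the last block may be shorter). */
void chunked(size_t lo, size_t hi, size_t seq, block_fn run, void *ctx)
{
    if (seq == 0) return; /* a zero block size would never make progress */
    for (size_t start = lo; start < hi; start += seq) {
        size_t end = (hi - start < seq) ? hi : start + seq;
        run(start, end, ctx);
    }
}

/* Recursive: halve [lo, hi) until a block holds at most seq iterations. */
void recursive(size_t lo, size_t hi, size_t seq, block_fn run, void *ctx)
{
    if (seq == 0 || lo >= hi) return;
    if (hi - lo <= seq) {
        run(lo, hi, ctx);
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    recursive(lo, mid, seq, run, ctx);
    recursive(mid, hi, seq, run, ctx);
}

/* Example callback: counts how many iterations and blocks were visited. */
struct tally { size_t iterations; size_t blocks; };

void tally_block(size_t lo, size_t hi, void *ctx)
{
    struct tally *t = ctx;
    t->iterations += hi - lo;
    t->blocks += 1;
}
```

Both strategies visit every iteration exactly once; they differ only in block boundaries. With 10 iterations and a block size of 4, the chunked version produces blocks of 4, 4 and 2, while the recursive version produces blocks of 2, 3, 2 and 3.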


    Current H2C limitations

    There are some limitations and pitfalls in the current implementation of the H2C programming model. These limitations are not inherent to the programming model, but rather are a result of incompleteness in the current compiler or runtime implementation.

    1) Pointers to stack variables (including stack-allocated arrays) cannot be reused across "suspendable" points. A suspendable function is a function that can directly or indirectly call a function containing an async statement or a finish statement. A suspendable point is an async statement, the end of a finish statement, or a call to a suspendable function.

    Work-around: copy these stack variables to the heap. There is no limitation on the reuse of heap pointers across suspendable points.

    2) Pointers to HC functions (functions that contain HC constructs or call other HC functions) are not supported in HC.

    Work-around: only use pointers to C functions.

    3) const modifiers are not supported for function parameters or local variables in HC programs.

    Work-around: remove these 'const' modifiers. The semantics of a correct program will remain unchanged, since the only purpose of the 'const' modifiers is to enforce additional compiler checking.

    4) The number of tasks registered on a phaser cannot be larger than the number of worker threads specified with the -nproc option when an HC program is invoked. Otherwise, a deadlock may occur.

    5) HC function calls must be in canonical form: either a statement on their own or the right-hand side of an assignment.

    Canonicalized HC function usage

    foo();
    b = foo();
    a = b + c; // the value of 'foo' must first be assigned to a variable

    6) The forasync construct cannot be nested in the current implementation.

    7) There is no compiler check for correctness on the forasync body. Any code pattern not supported on the target device will result in runtime errors.

    8) Pointer arithmetic on arrays that are communicated between CPU and GPU is not supported.
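The work-around for the stack-pointer limitation (copying stack variables to the heap) amounts to ordinary C heap allocation. The sketch below uses only standard C. The helper name copy_to_heap is hypothetical, and the HC constructs that would sit between the copy and the later use of the data are omitted.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a stack-allocated int array on the heap
 * so that the pointer can be reused across suspendable points.
 * Returns NULL if allocation fails; the caller must free() the result. */
int *copy_to_heap(const int *stack_buf, size_t count)
{
    int *heap_buf = malloc(count * sizeof *heap_buf);
    if (heap_buf != NULL) {
        memcpy(heap_buf, stack_buf, count * sizeof *heap_buf);
    }
    return heap_buf;
}
```

A caller would copy a local array with copy_to_heap before the first async or finish, use only the returned heap pointer afterwards, and call free on it once all tasks that touch the data have completed.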


     


    Installation

     

    Dependencies:
    Rose Version: ROSE 0.9.5a: orion.cs.rice.edu:/home/vc8/habanero-git-repo/hc/ROSE.git 
    EDG with H2C keywords:  orion.cs.rice.edu:/home/vc8/habanero-git-repo/hc/ROSE-EDG.git 
    branch.dm14-hc-forasync.remote=origin
    branch.dm14-hc-forasync.merge=refs/heads/dm14-hc-forasync
    Boost Version:  boost_1_38_0

    Steps:
    1. Build boost_1_38_0.
    2. Build ROSE 0.9.5a.
    3. Build Polyopt.
    4. Run the h2cpolyopt.sh script located in the H2C branch. This script copies the necessary files from the polyhedral installation.
    5. Build the H2C branch using the makefile provided.

    Acknowledgement

    Partial support for Heterogeneous Habanero-C was provided through the CDSC program of the National Science Foundation with an award in the 2009 Expeditions in Computing Program.
