**COMP 322: Fundamentals of Parallel Programming** 

## Lecture 38: Algorithms based on Parallel Prefix (Scan) operations

Mack Joyner and Zoran Budimlić {mjoyner, zoran}@<u>rice.edu</u>

Acknowledgements:

- Book chapter on "Prefix Sums and Their Applications", Guy E. Blelloch, CMU
- Slides on "<u>Parallel prefix adders</u>", Kostas Vitoroulis, Concordia University

http://comp322.rice.edu

COMP 322 April 2018 Lecture 38



## Worksheet #37 problem statement: Parallelizing the Split step in Radix Sort

The Radix Sort algorithm loops over the bits in the binary representation of the keys, starting at the lowest bit, and executes a split operation for each bit as shown below. The split operation packs the keys with a 0 in the corresponding bit to the bottom of a vector, and packs the keys with a 1 to the top of the same vector. It maintains the order within both groups. The sort works because each split operation sorts the keys with respect to the current bit and maintains the sorted order of all the lower bits. Your task is to show how the split operation can be performed in parallel using scan operations, and to explain your answer.

```
[101 111 011 001 100 010 111 010]
1.A = [5 7 3 1 4 2 7 2]
2.A(0) = [1 1 1 1 0 0 1 0] //lowest bit
3.A \leftarrow split(A,A(0)) = [4 2 2 5 7 3 1 7]
4.A(1) = [0 1 1 0 1 1 0 1] // middle bit
5.A \leftarrow split(A,A(1)) = [4 5 1 2 2 7 3 7]
6.A(2) = [1 1 0 0 0 1 0 1] // highest bit
7.A \leftarrow split(A,A(2)) = [1 2 2 3 4 5 7 7]
```

## Worksheet #37 solution: Parallelizing the Split step in Radix Sort

| <pre>procedure split(A, Flags) I-down</pre>                                                                       |             |                  |   |             |   |   |             |             |               | ix sum |
|-------------------------------------------------------------------------------------------------------------------|-------------|------------------|---|-------------|---|---|-------------|-------------|---------------|--------|
| $I-up \leftarrow rev(n - scan(+, rev(Flags)) // rev = reverse$<br>in parallel for each index $i$<br>if (Flags[i]) |             |                  |   |             |   |   |             |             |               |        |
| else                                                                                                              |             | ← I-u<br>← I-d   |   | 1           |   |   |             |             |               |        |
| result                                                                                                            |             |                  | - | -           |   |   |             |             |               |        |
| ${ m A}$ Flags                                                                                                    |             | [5]              |   |             |   |   |             |             |               |        |
| I-down<br>I-up<br>Index                                                                                           | =<br>=<br>= | [0<br>[3]<br>] 3 |   | 0<br>5<br>5 |   |   | 1<br>7<br>1 | 2<br>7<br>7 | 2]<br>8<br>2] |        |
| $permute(A,  \mathrm{Index})$                                                                                     | =           | [4               | 2 | 2           | 5 | 7 | 3           | 1           | 7 ]           |        |

#### FIGURE 1.9

The split operation packs the elements with a 0 in the corresponding flag position to the bottom of a vector, and packs the elements with a 1 to the top of the same vector. The permute writes each element of A to the index specified by the corresponding position in Index.



## Parallelizing Prefix Sum (Lecture 13)

Observation: each prefix sum can be decomposed into reusable terms of power-of-2-size e.g.

$$X[6] = A[0] + A[1] + A[2] + A[3] + A[4] + A[5] + A[6]$$
  
=  $(A[0] + A[1] + A[2] + A[3]) + (A[4] + A[5]) + A[6]$ 

Approach:

- Combine reduction tree idea from Parallel Array Sum with partial sum idea
  from Sequential Prefix Sum
- Use an "upward sweep" to perform parallel reduction, while storing partial sum terms in tree nodes
- Use a "downward sweep" to compute prefix sums while reusing partial sum terms stored in upward sweep



## **Parallel Pre-scan Sum: Upward Sweep**

Upward sweep is just like Parallel Reduction, except that partial sums are also stored along the way

- 1. Receive values from left and right children
- 2. Compute left+right and store in box
- 3. Send left+right value to parent



## Parallel Pre-scan Sum: Downward Sweep

- 1. Receive value from parent (root receives 0)
- 2. Send parent's value to LEFT child (prefix sum for elements to left of left child's subtree)
- 3. Send parent's value+ left child's box value to RIGHT child (prefix sum for elements to left of right child's subtree)
- 4. Add A[i] to get inclusive prefix sum





## **Parallel Scan Sum: Upward Sweep**



## Parallel Scan Sum: Downward Sweep





## Worksheet #37 solution: Parallelizing the Split step in Radix Sort

| <pre>procedure split(A, Flags) I-down</pre>                                                                       |             |                  |   |             |   |   |             |             |               | ix sum |
|-------------------------------------------------------------------------------------------------------------------|-------------|------------------|---|-------------|---|---|-------------|-------------|---------------|--------|
| $I-up \leftarrow rev(n - scan(+, rev(Flags)) // rev = reverse$<br>in parallel for each index $i$<br>if (Flags[i]) |             |                  |   |             |   |   |             |             |               |        |
| else                                                                                                              |             | ← I-u<br>← I-d   |   | 1           |   |   |             |             |               |        |
| result                                                                                                            |             |                  | - | -           |   |   |             |             |               |        |
| ${ m A}$ Flags                                                                                                    |             | [5]              |   |             |   |   |             |             |               |        |
| I-down<br>I-up<br>Index                                                                                           | =<br>=<br>= | [0<br>[3]<br>] 3 |   | 0<br>5<br>5 |   |   | 1<br>7<br>1 | 2<br>7<br>7 | 2]<br>8<br>2] |        |
| $permute(A,  \mathrm{Index})$                                                                                     | =           | [4               | 2 | 2           | 5 | 7 | 3           | 1           | 7 ]           |        |

#### FIGURE 1.9

The split operation packs the elements with a 0 in the corresponding flag position to the bottom of a vector, and packs the elements with a 1 to the top of the same vector. The permute writes each element of A to the index specified by the corresponding position in Index.



# **Binary Addition**

| c<br>+     | <b>X</b> 3 | <b>C</b> 2<br><b>X</b> 2 | <sup>C1</sup> <b>X</b> 1 | <sup>C0</sup> <b>X</b> 0 |  |
|------------|------------|--------------------------|--------------------------|--------------------------|--|
|            | <b>y</b> 3 | <b>y</b> 2               | <b>y</b> 1               | <b>y</b> 0               |  |
| <b>S</b> 4 | <b>S</b> 3 | <b>S</b> 2               | <b>S</b> 1               | <b>S</b> 0               |  |

This is the pen and paper addition of two 4-bit binary numbers **x** and **y**. **c** represents the generated carries.

**s** represents the produced sum bits.

A **stage** of the addition is the set of **x** and **y** bits being used to produce the appropriate sum and carry bits. For example the highlighted bits  $x_2$ ,  $y_2$  constitute **stage 2** which generates carry  $c_2$  and sum  $s_2$ .

Each stage *i* adds bits  $a_i$ ,  $b_i$ ,  $c_{i-1}$  and produces bits  $s_i$ ,  $c_i$ The following hold:

| a <sub>i</sub> | b <sub>i</sub> | Ci               | Comment:                                 | Formal definition:                       |
|----------------|----------------|------------------|------------------------------------------|------------------------------------------|
| 0              | 0              | 0                | The stage "kills" an incoming carry.     | "Kill" bit: $k_i = \overline{x_i + y_i}$ |
| 0              | 1              | С <sub>і-1</sub> | The stage "propagates" an incoming carry | "Propagate" bit: $p_i = x_i \oplus y_i$  |
| 1              | 0              | С <sub>і-1</sub> | The stage "propagates" an incoming carry | $P_i  a_i \in \mathcal{F}_i$             |
| 1              | 1              | 1                | The stage "generates" a carry out        | "Generate" bit: $g_i = x_i \bullet y_i$  |

# **Binary Addition**

| a <sub>i</sub> | b <sub>i</sub> | C <sub>i</sub>   | Comment:                                 | Formal definition:                       |
|----------------|----------------|------------------|------------------------------------------|------------------------------------------|
| 0              | 0              | 0                | The stage "kills" an incoming carry.     | "Kill" bit: $k_i = \overline{x_i + y_i}$ |
| 0              | 1              | C <sub>i-1</sub> | The stage "propagates" an incoming carry | "Propagate" bit: $n = r \oplus v$        |
| 1              | 0              | C <sub>i-1</sub> | The stage "propagates" an incoming carry | $p_i = x_i \oplus y_i$                   |
| 1              | 1              | 1                | The stage "generates" a carry out        | "Generate" bit: $g_i = x_i \bullet y_i$  |

The carry  $c_i$  generated by a stage *i* is given by the equation:

$$c_i = g_i + p_i \cdot c_{i-1} = x_i \cdot y_i + (x_i \oplus y_i) \cdot c_{i-1}$$

This equation can be simplified to:

$$c_i = x_i \cdot y_i + (x_i + y_i) \cdot c_{i-1} = g_i + a_i \cdot c_{i-1}$$

The "a<sub>i</sub>" term in the equation being the "alive" bit.

The later form of the equation uses an OR gate instead of an XOR which is a more efficient gate when implemented in CMOS technology. Note that:

$$a_i = k_i$$

Where  $k_i$  is the "kill" bit defined in the table above.

#### COMP 322, Spring 2018 (M. Joyner, Z. Budimlić)

11

## Binary addition as a prefix sum problem.

- We define a new operator: " ° "
- Input is a vector of pairs of 'propagate' and 'generate' bits:

$$(g_n, p_n)(g_{n-1}, p_{n-1})...(g_0, p_0)$$

Output is a new vector of pairs:

$$(G_n, P_n)(G_{n-1}, P_{n-1})...(G_0, P_0)$$

 Each pair of the output vector is calculated by the following definition:

$$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$$

$$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$$

$$(F_i) = (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) \circ (g_i, p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_i) = (g_i + p_i) \circ (g_i, p_$$

## 1973: Kogge-Stone adder



#### The Kogge-Stone adder has:

- Low depth
- □ High node count (implies more area).
- $\Box$  Minimal fan-out of 1 at each node (implies faster<sub>3</sub> performance).

# Summary

• A parallel prefix adder can be seen as a 3-stage process:



- There exist various architectures for the carry calculation part.
- Trade-offs in these architectures involve the
  - $\Box$  area of the adder
  - □ its depth
  - $\hfill\square$  the fan-out of the nodes
  - $\Box$  the overall wiring network.

14

## Parallel Algorithms, Computation Graphs, and Circuits

- Today's lecture shows that parallel algorithms, computation graphs, and circuits represent different approaches to *parallel computational thinking*
- A parallel algorithm unfolds into a computation graph when executing
- A circuit represents an "unrolled" computation graph in hardware e.g., see bitonic sorting network in <u>https://en.wikipedia.org/wiki/Bitonic\_sorter</u>





## **Announcements & Reminders**

- HW5 is due April 20th with automatic extension until April 23rd
  - You may use any remaining slip days after the extension
- Optional quiz available for Unit 10
  - Also due April 20th
  - We will pick the best 9 of your 10 quiz scores
  - As always, please post any issues with quiz questions on Piazza
- April 20th: Course review (scope of Exam 2) in class
- **Fill out evaluation survey for graduate TAs:** https://riceuniversity.co1.gualtrics.com/jfe/form/SV\_bmvhF1ZNdhQPpoF

