# **Provable GPU Data-Races in Static Race Detection**

#### Characterizing True Data-Race Alarms in a Behavioral Type

Dennis Liew, Tiago Cogumbreiro, Julien Lange



# **Overview**

Introduction

BabyCUDA: Syntax

**BabyCUDA: Semantics** 

BabyCUDA: Type System

Conclusion

# Introduction

Introduction

BabyCUDA: Syntax

BabyCUDA: Semantics

BabyCUDA: Type System

Conclusion

# **The CUDA Programming Model**

CUDA is an extension of C, handling parallel code.

CUDA follows the Single-Instruction-Multiple-Threads (SIMT) model, all threads execute a copy of the GPU program (kernel).

\*CUDA's programming model is similar to OpenCL.

```
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}</pre>
```

Single-Precision A·X Plus Y (saxpy) is the "Hello World" of CUDA, our running example

# **Data-races in CUDA**

**Data-races** : When two or more threads access the same memory location, and at least one is a Write; causing unwanted nondeterminism.



# Faial [CAV'21] : DRF Checker

```
verify@7744c486fb5e:/artifact/source/faial/tutorial$ faial saxpy.cu
Program is data-race free!
```

```
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}</pre>
```

\*Faial is **sound but incomplete.** 

| <pre>verify@7744c486fb5e:/artifact/source/faial/tutorial\$ faial saxpyracy.cu     DATA RACE ERROR ***</pre> |                                                                       |  |  |  |  |  |  |
|-------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|--|--|--|--|--|--|
| Array: y[1]<br>T1 mode: W<br>T2 mode: R                                                                     |                                                                       |  |  |  |  |  |  |
| Globals                                                                                                     | <pre>void saxpyracy(int n, float a, float *x, float *y) Value {</pre> |  |  |  |  |  |  |
| blockDim.x                                                                                                  | <pre>2 int i = blockIdx.x*blockDim.x + threadIdx.x;</pre>             |  |  |  |  |  |  |
| blockIdx.x                                                                                                  | <pre>0 if (i &lt; n) y[i+1] = a*x[i] + y[i];</pre>                    |  |  |  |  |  |  |
| gridDim.x                                                                                                   | 1                                                                     |  |  |  |  |  |  |
| n                                                                                                           | 2                                                                     |  |  |  |  |  |  |
|                                                                                                             |                                                                       |  |  |  |  |  |  |
| Locals                                                                                                      | T1 T2                                                                 |  |  |  |  |  |  |
| threadIdx.x                                                                                                 | 0 1                                                                   |  |  |  |  |  |  |

[CAV'21]: Tiago Cogumbreiro, Julien Lange, Dennis Liew & Hannah Zicarelli : Checking Data-Race Freedom of GPU Kernels, Compositionally.

# **Theory Behind Faial [CAV'21]**



- a compositional analysis for DRF, based on memory access protocols (MAPs).
- protocols are **behavioral types** that codify the way threads interact over shared memory.
- mechanized proofs of our theoretical results.
- Faial outperforms the state-of- the-art, (verify at least 1.42× more realworld kernels).

# **False Alarms in Faial**

Over approximates by ignoring array contents (what is being read from / written to arrays).

This over approximation in MAPs is what makes Faial scalable.

The downsides are False Alarms caused by data-dependent kernels.



readindex is actually data-race free (DRF) but Faial reports it as Racy.

# **False Alarms in Static Analysis**

Static Analyzers suffer from false positives (false alarms)

[CAV'21] evaluated several static analyzers on *readindex*, and most report a false alarm.

[ICSE'13], [CACM'18] show that false alarms hinder the adoption and use of static analyzers in industrial settings.

[ICSE'13]: Brittany Johnson, Yoonki Song, Emerson Murphy-Hill & Robert Bowdidge : Why Don't Software Developers Use Static Analysis Tools to Find Bugs?

[CACM'18]: Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon & Ciera Jaspan : Lessons from Building Static Analysis Tools at Google.

### **Can we prove certain alarms as True Alarms?**

| faial read- | index.cu<br>E ERROR *** | ifact/datasets/correctness/faial/synthetic\$ |               |       | /artifact/source/faial/tutorial\$ faial saxpyracy.cu          |
|-------------|-------------------------|----------------------------------------------|---------------|-------|---------------------------------------------------------------|
|             |                         | <pre>void readindex (int* x)</pre>           | Globals       | Value | <pre>void saxpyracy(int n, float a, float *x, float *y)</pre> |
| Globals     | Value                   | {                                            | blockDim.x    | 2     | {                                                             |
| blockDim.x  | 2                       | <pre>x[threadIdx.x] = threadIdx.x;</pre>     | blockIdx.x    | <br>0 | <pre>int i = blockIdx.x*blockDim.x + threadIdx.x;</pre>       |
| Z           | 0                       | <pre>int z = x[threadIdx.x];</pre>           | <br>gridDim.x |       | if $(i < n) y[i+1] = a*x[i] + y[i];$                          |
|             |                         | x[z] = 0;                                    |               |       | }                                                             |
|             |                         | }                                            | n<br>         | 2     |                                                               |
| Locals      | T1 T2                   |                                              |               |       |                                                               |
| threadIdx.x | 1 0<br>                 |                                              | Locals        | T1 T2 | -                                                             |
|             |                         |                                              |               |       |                                                               |

Unknown

Provably True Alarm

# **Our Approach**

Solving the problem of **False Alarms**:

Specify a core language (BabyCUDA) and a behavioral type system, for which the analysis of Faial is sound and complete for well-typed programs.





# **Theoretical Pipeline**





# **BabyCUDA: Syntax**

Introduction

BabyCUDA: Syntax

BabyCUDA: Semantics

BabyCUDA: Type System

Conclusion

# **Source Syntax: BabyCUDA**

$$\mathcal{B} \ni b ::= A[n] := n \mid \text{let } x = A[n] \text{ in } b$$
$$\mid b; b \mid \text{if } c \{b\} \text{ else } \{b\}$$
$$\mid \text{for } x \in n..m \{b\} \mid \text{skip}$$



# **Target Syntax: Memory Access Protocols (Faial)**

# $\mathcal{U} \ni \quad u \quad ::= \quad \text{skip} \mid o[i] \mid u; u \mid \text{if } c \{u\} \text{ else } \{u\} \mid \text{ for } x \in n..m \{u\}$

Following, we show how to infer a Memory Access Protocol from a BabyCUDA program

| Simplified saxpy in BabyCUDA                       | Simplified <i>saxpy as a</i> Memory Access<br>Protocol (Faial) |  |  |
|----------------------------------------------------|----------------------------------------------------------------|--|--|
| if (tid < n) { let $x = A[tid]$ in $A[tid] := x$ } | <pre>if (tid &lt; n) { rd[tid]; wr[tid] }</pre>                |  |  |
| <pre>else { skip }</pre>                           | <pre>else { skip }</pre>                                       |  |  |

# **BabyCUDA: Operational** Semantics

Introduction

BabyCUDA: Syntax

BabyCUDA: Semantics

BabyCUDA: Type System

Conclusion

#### **BabyCUDA: Operational Semantics**



# **BabyCUDA: Operational Semantics**

| Semantics                                                         | lastwrite(j                                                                                                             | $(H) = k$ $A, b \downarrow_{H,i} A$ $H, b \downarrow H$                       |  |  |  |
|-------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|--|--|--|
| LASTWRITE-CURR<br>$\exists i: P(i) = (R, W)$ $W(j) = k$           | LASTWRITE-PREV<br>$\forall i: P(i) = (R, W) \implies j \notin d$                                                        |                                                                               |  |  |  |
| lastwrite(j, P::H) = k                                            | lastwrite(j, P::H) = lastwrite(j, H) $lastwrite(j, [])$                                                                 |                                                                               |  |  |  |
| READ<br>$n \downarrow j$ $(R \cup \{j\}, W), b[x := lastwrite(j)$ | $\{i: (R,W)\}::H)]\downarrow_{Hi}B$                                                                                     | WRITE $n \downarrow j \qquad m \downarrow k$                                  |  |  |  |
| $A, \text{let } x = \mathbb{A}[n] \text{ in } b \downarrow$       | /                                                                                                                       | $\overline{(R,W),\mathbb{A}[n]:=m\downarrow_{H,i}(R,W[j\mapsto k])}$          |  |  |  |
| SEQ                                                               | F-T                                                                                                                     | IF-F                                                                          |  |  |  |
| $A, b_1 \downarrow_{H,i} B  B, b_2 \downarrow_{H,i} C$            | $c \downarrow \texttt{true} \ A, b_1 \downarrow_{H,i} B$                                                                | $c \downarrow \texttt{false} \ \ A, b_2 \downarrow_{H,i} B$                   |  |  |  |
| $A, b_1; b_2 \downarrow_{H,i} C$                                  | A, if $c \{b_1\}$ else $\{b_2\} \downarrow_{H,i} B$                                                                     | $\overline{A, \text{if } c \{b_1\} \text{ else } \{b_2\} \downarrow_{H,i} B}$ |  |  |  |
| FOR-1 FOR-2                                                       |                                                                                                                         |                                                                               |  |  |  |
| $(n \ge m) \downarrow \texttt{true}$ $(n < n)$                    | $(n) \downarrow \texttt{true} \qquad A, b[x \coloneqq n] \downarrow_{H,i}$                                              | $B \qquad B, for^{U} \ x \in n+1m \ \{b\} \downarrow_{H,i} C$                 |  |  |  |
| $\overline{A, for^{U} x \in nm \{b\} \downarrow_{H,i} A}$         | $A, for^{U} x \in nm \{b\} \downarrow_{H,i} C$                                                                          |                                                                               |  |  |  |
| SKIP                                                              | $\frac{Q = \bigcup\{i: A \mid P(i), b[tid := i] \downarrow_{H,i} A \land i \in \mathcal{T}\}}{P::H, b \downarrow Q::H}$ |                                                                               |  |  |  |
| $\overline{A, skip}\downarrow_{H,i} \overline{A}$                 | $P::H,b\downarrow g$                                                                                                    | Q::H                                                                          |  |  |  |

# **BabyCUDA: Parallel semantics**

$$\frac{Q = \bigcup\{i: A \mid P(i), b[\mathsf{tid} := i] \downarrow_{H,i} A \land i \in \mathcal{T}\}}{P::H, b \downarrow Q::H}$$

- For every thread, evaluate the BabyCUDA program, b yielding accesses A.
- Replace tid with their unique thread identifier i.
- Merge all *i:A* into the current phase.



 $\{0: (R_0, W_0), 1: (R_1, W_1), 2: (R_2, W_2)\}\$ 

# **BabyCUDA: Type System**

Introduction

BabyCUDA: Syntax

BabyCUDA: Semantics

BabyCUDA: Type System

Conclusion

# saxpy and saxpyracy are Well-typed







#### **BabyCUDA: Behavioral type system**



#### 24

 $\{tia, n\} \vdash iet x = A[tia] in A[tia + 1] := x \triangleright ra[tia]; wr[tia + 1]$   $\{tia, n\} \vdash skip \triangleright skip$  $\{ \mathsf{tid}, n \} \vdash \mathsf{if} \ (\mathsf{tid} < n) \ \{ \mathsf{let} \ x = \mathsf{A}[\mathsf{tid}] \ \mathsf{in} \ \mathsf{A}[\mathsf{tid}+1] := x \} \ \mathsf{else} \ \{\mathsf{skip}\} \blacktriangleright \mathsf{if} \ (\mathsf{tid} < n) \ \{\mathsf{rd}[\mathsf{tid}] \ ; \ \mathsf{wr}[\mathsf{tid}+1] \} \ \mathsf{else} \ \{\mathsf{skip}\}$ 

$$\frac{1}{\{\text{tid}, n\} \vdash A[\text{tid}+1] := x \blacktriangleright wr[\text{tid}+1]} \overset{\text{WRITE}}{= x \vdash rd[\text{tid}] \cdot wr[\text{tid}+1]} \overset{\text{READ}}{= x \vdash skin} \frac{1}{\{\text{tid}, n\} \vdash skin} \overset{\text{SKIP}}{= x \vdash skin}$$

#### readindex ill-typed derivation



# **BabyCUDA: Main result**

**Theorem 1** (Correctness). Let  $H, b \downarrow H'$  and  $u \downarrow \Lambda$ . If  $\{tid\} \vdash b \models u$ , H is DRF, then H' is DRF if, and only if, P is DRF.

# Well-typed BabyCUDA programs are analyzed correctly (soundly and completely) by Faial.

Memory Access Protocols of well-typed programs preserve and reflect the concurrent accesses of the source program.

# Conclusion

Introduction BabyCUDA: Syntax BabyCUDA: Semantics BabyCUDA: Type System Conclusion

# **Related Work**

[POPL'19] introduce the first DRF static analysis for **multithreaded** programs that is sound and complete, for a subset of all programs.

\*generally inapplicable (or irrelevant) to GPU programming

[POPL'15] develop a deductive system to prove completeness of program analyses over an abstract domain.

[POPL'19]: Nikos Gorogiannis, Peter W. O'Hearn & Ilya Sergey (2019): A True Positives Theorem for a Static Race Detector.

[POPL'15]: Roberto Giacobazzi, Francesco Logozzo & Francesco Ranzato (2015): Analyzing Program Analyses.

# **Conclusion and Future Work**

Introduced a **behavioral type system** that characterizes **true data-races** in Memory Access Protocols (Faial).

Main result guarantees that Faial's analysis are sound and complete.

Future work include: completing the Coq formalization and implementation.

having an empirical evaluation of the type system's applicability.



# BabyCUDA END

$$i \in \mathcal{T} \quad P(i), b[\text{tid} := i] \downarrow_{H,i} A \qquad \qquad \bigcup \{i: A\}$$
  
if  $(\text{tid} < n) \{\text{let } x = A[\text{tid}] \text{ in } A[\text{tid}] := x\} \text{ else } \{\text{skip}\}$   
if  $(\text{tid} < n) \{\text{let } x = A[1] \text{ in } A[1] := x\} \text{ else } \{\text{skip}\}$   
if  $(\text{tid} < n) \{\text{let } x = A[2] \text{ in } A[2] := x\} \text{ else } \{\text{skip}\}$   
$$[\{0: (\{0\}, \{0:0\}), 1: (\{1\}, \{1:1\}), 2: (\{2\}, \{2:2\})\}]$$

# void readindex (int\* x) { x[threadIdx.x] = threadIdx.x; int z = x[threadIdx.x]; x[z] = 0; }

 $\{\mathsf{tid}\} \nvDash \mathsf{A}[\mathsf{tid}] := \mathsf{tid}; \mathsf{let} x = \mathsf{A}[\mathsf{tid}] \mathsf{in} \mathsf{A}[x] := 0 \triangleright \mathsf{wr}[\mathsf{tid}]; \mathsf{rd}[\mathsf{tid}]; \mathsf{wr}[x]$