This document attempts to describe the Apple G13 GPU architecture, as used
in the M1 SoC. This is based on reverse engineering and is likely to have mistakes.
The Metal Shading Language is typically used to program these GPUs,
and this document uses Metal terminology. For example, a CPU SIMD lane
corresponds to a Metal thread, and a CPU thread corresponds to a Metal SIMD-group.
The G13 architecture has 32 threads per SIMD-group. Each SIMD-group has a
stack pointer (sp), a program counter (pc), a 32-bit
execution mask (exec_mask), and up to 128 general purpose registers.
General purpose registers each store one 32-bit value per thread. Each register
can be accessed as a 32-bit register, named r0 to r127,
as the low 16 bits of the register, named r0l to r127l, or
as the high 16 bits of the register, named r0h to r127h.
Some instructions may also use pairs of contiguous 32-bit registers to operate
on 64-bit values, and memory operations may use up to four contiguous 16 or
32-bit registers (in both cases, encoded as the first register).
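The aliasing between a 32-bit register and its two 16-bit halves can be sketched in Python as follows. This is an illustrative model only: the helper names are hypothetical, and it assumes that writing one half leaves the other half unchanged, which the text above implies but does not state outright.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def low_half(r):
    """Value seen through the 16-bit alias rNl (bits 0-15 of rN)."""
    return r & MASK16

def high_half(r):
    """Value seen through the 16-bit alias rNh (bits 16-31 of rN)."""
    return (r >> 16) & MASK16

def write_low_half(r, value):
    """Model a write to rNl; assumes rNh is left untouched."""
    return (r & ~MASK16 & MASK32) | (value & MASK16)

def write_high_half(r, value):
    """Model a write to rNh; assumes rNl is left untouched."""
    return (r & MASK16) | ((value & MASK16) << 16)

r0 = 0x12345678
r0 = write_low_half(r0, 0xBEEF)   # r0 now holds 0x1234BEEF
```

Only one thread's value is modelled here; in hardware each of the 32 threads holds its own copy.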
A number of physical registers are allocated to each SIMD-group, and registers
from r0 to r(N-1) may be used, but accesses to higher
register numbers are not valid, and may read or corrupt data from other
SIMD-groups. Using fewer registers (e.g. by using 16-bit types instead of
32-bit types) allows more SIMD-groups to fit in the physical register file
(higher occupancy), which improves performance.
Certain instructions are hardcoded to use early registers. r0l
tracks the execution mask stack, and r1 is used as the link register.
Shared state is less well understood, but includes 256 32-bit uniform registers
named u0 to u255 and similarly accessible as their
16-bit halves, u0l to u255h. These are used for values
that are the same across all threads, such as threads_per_grid, or
addresses of buffers.
Conditional Execution
Each thread within a SIMD-group may be deactivated, meaning the values in registers
for that thread will keep their current value. Whether or not each thread is active
is tracked in a 32-bit execution mask (in this document, by convention, a one bit
indicates the thread is active and a zero bit indicates it is not).
r0l tracks the execution mask stack. When used with flow-control
instructions, the value in r0l indicates how many 'pop' operations
will be needed to re-enable an inactive thread, or zero if the thread is active.
The execution mask stack instructions (pop_exec, if_*cmp,
else_*cmp and while_*cmp) are typically used to manage
r0l. They also update exec_mask based on the value in
r0l, and are the only known way to manipulate exec_mask.
However, r0l may be manipulated with other instructions. It should
be initialised to zero prior to using execution mask stack instructions, and
break statements may be implemented by conditionally moving
a value (corresponding to the number of stack-levels to break out of) to
r0l and using a pop_exec 0 (which deactivates any
threads that now have non-zero values in r0l).
The jmp_exec_none instruction may be used to jump over an if
statement body if no threads are active, and jmp_exec_any may be used
to jump back to the start of a loop only if one or more threads are still executing
the loop.
Execution masking generally prevents reading or writing values from inactive threads,
however SIMD shuffle instructions can read from inactive threads in some cases, which
can be valuable for research or debugging purposes (e.g. it allows observing non-zero
values in r0l).
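The interaction between r0l and exec_mask described above can be sketched as a small Python simulation. It follows the if_*cmp and pop_exec pseudocode given later in this document, but uses four threads instead of 32 for brevity, and the function names and the list representation of r0l are assumptions made for illustration.

```python
def if_cmp(r0l, cond, n=1):
    """Model of if_*cmp; cond[t] is the comparison result for thread t."""
    out = []
    for v, c in zip(r0l, cond):
        if v != 0:
            v += n      # already inactive: now nested n levels deeper
        elif not c:
            v = 1       # active, but the condition failed: deactivate
        out.append(v)
    return out

def pop_exec(r0l, n=1):
    """Model of pop_exec n: leave n nesting levels, clamping at zero."""
    return [max(v - n, 0) for v in r0l]

def exec_mask(r0l):
    """A thread is active exactly when its r0l value is zero."""
    return [v == 0 for v in r0l]

r0l = [0, 0, 0, 0]                               # initialised to zero
r0l = if_cmp(r0l, [True, True, False, False])    # outer if
r0l = if_cmp(r0l, [True, False, True, False])    # inner if
nested = list(r0l)                               # [0, 1, 2, 2]
r0l = pop_exec(r0l)                              # end of inner if
r0l = pop_exec(r0l)                              # end of outer if

# A break out of two levels for thread 1 only: move 2 to r0l, then pop_exec 0.
brk = [0, 2, 0, 0]
brk = pop_exec(brk, 0)
```

In this sketch, thread 2 ends up with the value 2 inside the inner if even though its inner condition was true, because it was already inactive when the inner if executed.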
Register Cache
The GPU has a register cache, which keeps the contents of recently used general
purpose registers more quickly accessible. When instructions read or write GPRs,
they usually allow hints for the access to be encoded. The cache hint,
on a source or destination operand, indicates the value will be used again,
and should be cached (meaning other values, where this hint was not used, will
be preferred for eviction). The discard hint (on a source operand)
invalidates the value in the register cache after all operands have been read,
without writing it back to the register file.
While the cache hint should only change performance, the
discard hint will make future reads undefined, which could
lead to confusing issues. discard should probably not
be used within conditional execution, as inactive threads within the SIMD-group
may contain data that has not been written back to the register file that would
probably also get discarded. The behaviour of this hint when execution is
partially or completely masked has not been tested.
Either hint may be used multiple times even if the same operand appears twice,
and discard on a source register can be used with cache
on the same destination register.
Instructions
Instructions vary in length in multiples of two bytes up to twelve bytes (so far).
Some instructions have a long and short encoding. This is indicated by a bit
L, which, if zero, indicates the last two (or four) bytes of the
instruction are omitted, and any bits within those bytes should be read as zero.
So far only 12-byte instructions omit the last four bytes, and all others omit
the last two.
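The length rule above can be written down as a short Python sketch. The function names are hypothetical, and the rule encoded here is only the one observed so far (12-byte instructions drop four bytes, all others drop two).

```python
def encoded_length(full_length, l_bit):
    """Bytes occupied by an instruction that has long and short encodings.

    full_length: length in bytes of the long encoding.
    l_bit: value of the L bit (1 = long encoding, 0 = short encoding).
    """
    if l_bit == 1:
        return full_length
    omitted = 4 if full_length == 12 else 2
    return full_length - omitted

def pad_short_encoding(data, full_length):
    """Append zero bytes so the omitted trailing bits read as zero."""
    return data + bytes(full_length - len(data))
```

A decoder could call `pad_short_encoding` before extracting fields, so that the same field positions apply to both encodings.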
The encodings are described in little-endian, meaning bytes go right-to-left
(and top-to-bottom), but bits may be read in the usual numerical order.
Behaviour is mostly described in a Python-like pseudocode for now. In operand
descriptions, the : operator describes bit concatenation, with
the most-significant fields first. Elsewhere, values are considered to be
Python-style arbitrary-precision integers, and floating-point values are
considered to be arbitrary-precision floats (although double-precision with
round-to-odd may be an adequate approximation).
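As an illustration of the : operator, the following Python sketch concatenates bit fields with the most-significant field first. The helper `concat` is hypothetical, and the field widths in the example (2 bits for Dx, 6 bits for D) are assumptions chosen for illustration rather than values taken from the encoding tables.

```python
def concat(*fields):
    """Concatenate (value, width) bit fields, most-significant first."""
    result = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit its width")
        result = (result << width) | value
    return result

# Dx:D with assumed widths of 2 and 6 bits.
Dx, D = 0b10, 0b000011
operand = concat((Dx, 2), (D, 6))   # 0b10000011
```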
Move Instructions
mov (Move 16-bit Immediate)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
0
?
1
1
0
0
0
1
0
Dt
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
imm16
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
?
?
?
?
?
?
?
?
?
?
?
?
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
?
?
?
?
?
?
?
?
?
?
?
?
imm16
L
D
0
?
1
1
0
0
0
1
0
Dt
D = ALUDst(Dx:D, Dt)
D.broadcast_to_active(imm16)
mov (Move 32-bit Immediate)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
1
?
1
1
0
0
0
1
0
Dt
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
imm32
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
?
?
Dx
?
?
?
?
?
?
?
?
?
?
?
?
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
?
?
?
?
?
?
?
?
?
?
?
?
imm32
L
D
1
?
1
1
0
0
0
1
0
Dt
D = ALUDst(Dx:D, Dt)
D.broadcast_to_active(imm32)
get_sr (Move From Special Register)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
1
1
1
0
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
?
?
Dx
SRx
?
?
?
?
SR
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
SRx
?
?
?
?
SR
0
D
Dt
1
1
1
0
0
1
0
D = ALUDst(Dx:D, Dt)
SR = SReg32(SRx:SR)
for each active thread:
    D[thread] = SR.read(thread)
Integer Arithmetic Instructions
iadd (Integer Add or Subtract)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
S
0
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
s1
Bs
Bt
B
N
As
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
?
?
?
?
?
?
?
?
?
?
s2
?
?
?
?
?
?
Dx
Ax
Bx
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
?
?
?
?
?
?
?
?
s2
?
?
?
?
?
?
Dx
Ax
Bx
s1
Bs
Bt
B
N
As
At
A
0
D
Dt
S
0
0
1
1
1
0
D = ALUDst64(Dx:D, Dt)
A = AddSrc(Ax:A, At, As)
B = AddSrc(Bx:B, Bt, Bs)
shift = s2:s1
for each active thread:
    a = A[thread]
    b = B[thread]
    saturating = (S == 1 and shift == 0 and A.thread_bit_size <= 32 and
                  B.thread_bit_size <= 32 and D.thread_bit_size <= 32)
    if N == 1:
        b = -b
    if shift < 5:
        b <<= shift
    else:
        b = 0
    result = a + b
    if saturating:
        signed = (As == 1 or Bs == 1)
        result = saturate_integer(result, D.thread_bit_size, signed)
    D[thread] = result
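The iadd behaviour above can be exercised with a runnable Python sketch for a single thread. `saturate_integer` is not defined in this document, so the clamping version below is an assumption about its meaning, as is the wrap-around applied to non-saturating results; operand decoding and the 64-bit cases are not modelled.

```python
def saturate_integer(value, bits, signed):
    """Clamp value to the representable range (assumed meaning)."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))

def iadd(a, b, *, negate=False, shift=0, saturate=False, signed=False, bits=32):
    """Single-thread model of iadd for destinations of up to 32 bits."""
    if negate:
        b = -b
    b = (b << shift) if shift < 5 else 0
    result = a + b
    if saturate and shift == 0:
        return saturate_integer(result, bits, signed)
    # Assumption: a non-saturating result wraps to the destination width.
    return result & ((1 << bits) - 1)
```

For example, `iadd(5, 3, negate=True, shift=2)` computes 5 - (3 << 2), which wraps to 0xFFFFFFF9 in this model.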
imadd (Integer Multiply-Add or Subtract)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
S
0
1
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
s1
Bs
Bt
B
N
As
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
Cx
s2
?
Cs
Ct
C
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
Cx
s2
?
Cs
Ct
C
s1
Bs
Bt
B
N
As
At
A
0
D
Dt
S
0
1
1
1
1
0
D = ALUDst64(Dx:D, Dt)
A = MulSrc(Ax:A, At, As)
B = MulSrc(Bx:B, Bt, Bs)
C = AddSrc(Cx:C, Ct, Cs)
shift = s2:s1
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    saturating = (S == 1 and shift == 0 and C.thread_bit_size <= 32 and
                  D.thread_bit_size <= 32)
    if N == 1:
        c = -c
    if shift < 5:
        c <<= shift
    else:
        c = 0
    result = a * b + c
    if saturating:
        signed = (As == 1 or Bs == 1 or Cs == 1)
        result = saturate_integer(result, D.thread_bit_size, signed)
    D[thread] = result
convert
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
0
1
1
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
srct
src
round
0
0
0
0
mode
47
46
45
44
43
42
41
40
?
?
Dx
0
0
srcx
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
0
0
srcx
0
0
srct
src
round
0
0
0
0
mode
1
D
Dt
0
1
1
1
1
1
0
D = ALUDst(Dx:D, Dt)
src = ALUSrc(srcx:src, srct)
TODO()
Shift/Bitfield Instructions
bfi (Bitfield Insert/Shift Left)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
0
1
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
m1
Bt
B
0
0
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
m1
Bt
B
0
0
At
A
0
D
Dt
0
1
0
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    shift_amount = (c & 0x7F)
    if m == 0:
        mask = 0xFFFFFFFF
    else:
        mask = (1 << m) - 1
    result = (a & ~(mask << shift_amount)) | ((b & mask) << shift_amount)
    D[thread] = result
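A runnable single-thread Python version of the bfi pseudocode above may help when checking examples by hand. The final truncation to 32 bits is an assumption about how the destination write behaves, and operand decoding is omitted.

```python
MASK32 = 0xFFFFFFFF

def bfi(a, b, c, m):
    """Insert the low m bits of b into a at bit position (c & 0x7F)."""
    shift_amount = c & 0x7F
    mask = MASK32 if m == 0 else (1 << m) - 1
    result = (a & ~(mask << shift_amount)) | ((b & mask) << shift_amount)
    # Assumption: the destination keeps only the low 32 bits.
    return result & MASK32

def shl(x, s):
    """Plain shift left expressed as bfi with a = 0 and m = 0."""
    return bfi(0, x, s, 0)
```

With m = 0 the mask covers all 32 bits, which is why the same instruction doubles as a shift left.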
bfeil (Bitfield Extract and Insert Low/Shift Right)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
0
1
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
m1
Bt
B
0
0
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
m1
Bt
B
0
0
At
A
1
D
Dt
0
1
0
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    shift_amount = (c & 0x7F)
    if m == 0:
        mask = 0xFFFFFFFF
    else:
        mask = (1 << m) - 1
    result = (a & ~mask) | ((b >> shift_amount) & mask)
    D[thread] = result
extr (Extract From Register Pair)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
0
1
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
m1
Bt
B
0
1
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
m1
Bt
B
0
1
At
A
0
D
Dt
0
1
0
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    shift_amount = (c & 0x7F)
    if m == 0:
        mask = 0xFFFFFFFF
    else:
        mask = (1 << m) - 1
    result = (((b << 32) | a) >> shift_amount) & mask
    D[thread] = result
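The extr pseudocode above translates directly into Python for a single thread. The sketch below also shows one possible use, a rotate right built by passing the same value as both halves of the pair; that usage is an illustration, not something this document claims compilers emit.

```python
MASK32 = 0xFFFFFFFF

def extr(a, b, c, m):
    """Extract m bits (32 if m == 0) from the 64-bit pair b:a at bit (c & 0x7F)."""
    shift_amount = c & 0x7F
    mask = MASK32 if m == 0 else (1 << m) - 1
    return (((b << 32) | a) >> shift_amount) & mask

def rotate_right(x, s):
    """Rotate a 32-bit value right by s (0 <= s < 32) using extr."""
    return extr(x, x, s, 0)
```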
shlhi (Shift Left High and Insert)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
0
1
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
m1
Bt
B
1
0
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
m1
Bt
B
1
0
At
A
0
D
Dt
0
1
0
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    shift_amount = (c & 0x7F)
    if m == 0:
        mask = 0xFFFFFFFF
    else:
        mask = (1 << m) - 1
    shifted_mask = mask << max(shift_amount-32, 0)
    result = (((b << shift_amount) >> 32) & shifted_mask) | (a & ~shifted_mask)
    D[thread] = result
shrhi (Shift Right High and Insert)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
0
1
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
m1
Bt
B
1
0
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
m3
?
Dx
Ax
Bx
Cx
?
?
m2
Ct
C
m1
Bt
B
1
0
At
A
1
D
Dt
0
1
0
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    shift_amount = (c & 0x7F)
    if m == 0:
        mask = 0xFFFFFFFF
    else:
        mask = (1 << m) - 1
    shifted_mask = (mask << 32) >> min(shift_amount, 32)
    result = (((b << 32) >> shift_amount) & shifted_mask) | (a & ~shifted_mask)
    D[thread] = result
asr (Arithmetic Shift Right)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
0
1
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
?
?
Bt
B
0
1
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
Bt
B
0
1
At
A
1
D
Dt
0
1
0
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
for each active thread:
    a = A[thread]
    b = B[thread]
    shift_amount = (b & 0x7F)
    result = sign_extend(a, A.thread_bit_size) >> shift_amount
    D[thread] = result
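The asr pseudocode relies on a `sign_extend` helper that this document does not define. The Python sketch below gives one plausible definition and a single-thread model of asr; the truncation of the result to 32 bits is an assumption about the destination write.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def asr(a, b, bits=32):
    """Arithmetic shift right of a `bits`-wide source by (b & 0x7F)."""
    shift_amount = b & 0x7F
    result = sign_extend(a, bits) >> shift_amount
    # Assumption: the destination keeps only the low 32 bits.
    return result & 0xFFFFFFFF
```

Because Python integers are arbitrary precision, shift amounts larger than the source width simply produce all sign bits, which matches the pseudocode as written.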
asrh (Arithmetic Shift Right High)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
0
1
0
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
?
?
Bt
B
1
1
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
Bt
B
1
1
At
A
1
D
Dt
0
1
0
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
for each active thread:
    a = A[thread]
    b = B[thread]
    shift_amount = (b & 0x7F)
    result = (sign_extend(a, A.thread_bit_size) << 32) >> shift_amount
    D[thread] = result
Bit Manipulation Instructions
bitop (Bitwise Operation)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
1
1
1
1
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
tt3
tt2
Bt
B
tt1
tt0
At
A
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
tt3
tt2
Bt
B
tt1
tt0
At
A
0
D
Dt
1
1
1
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
for each active thread:
    a = A[thread]
    b = B[thread]
    if tt0 == tt1 and tt2 == tt3 and tt0 != tt2:
        UNDEFINED()
        result = a
    else:
        result = 0
        if tt0: result |= ~a & ~b
        if tt1: result |= a & ~b
        if tt2: result |= ~a & b
        if tt3: result |= a & b
    D[thread] = result
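The truth-table encoding above can be tried out with the following single-thread Python sketch. Packing the four bits into one integer as tt3:tt2:tt1:tt0 is a convenience of this example, and the sketch raises an error for the combinations the pseudocode marks as undefined rather than guessing at their behaviour.

```python
MASK32 = 0xFFFFFFFF

def bitop(a, b, tt):
    """Apply the 4-bit truth table tt (packed as tt3:tt2:tt1:tt0) to a and b."""
    tt0, tt1, tt2, tt3 = tt & 1, (tt >> 1) & 1, (tt >> 2) & 1, (tt >> 3) & 1
    if tt0 == tt1 and tt2 == tt3 and tt0 != tt2:
        raise ValueError("truth table marked UNDEFINED in the pseudocode")
    na, nb = ~a & MASK32, ~b & MASK32
    result = 0
    if tt0: result |= na & nb
    if tt1: result |= a & nb
    if tt2: result |= na & b
    if tt3: result |= a & b
    return result

# Truth tables for some common operations.
AND, OR, XOR, NOR, NOT_A = 0b1000, 0b1110, 0b0110, 0b0001, 0b0101
```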
bitrev (Reverse Bits)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
0
1
1
1
1
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
0
1
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
?
?
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
?
?
0
0
0
0
0
0
0
0
0
0
0
0
0
1
At
A
0
D
Dt
0
1
1
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
for each active thread:
    a = A[thread]
    result = 0
    i = 0
    while i < 32:
        if a & (1 << i):
            result |= (1 << (31-i))
        i += 1
    D[thread] = result
popcount (Population Count)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
0
1
1
1
1
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
1
0
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
?
?
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
?
?
0
0
0
0
0
0
0
0
0
0
0
0
1
0
At
A
0
D
Dt
0
1
1
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
for each active thread:
    a = A[thread]
    result = 0
    i = 0
    while i < 32:
        if a & (1 << i):
            result += 1
        i += 1
    D[thread] = result
ffs (Find First Set)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
D
Dt
0
1
1
1
1
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
1
1
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
?
?
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
?
?
0
0
0
0
0
0
0
0
0
0
0
0
1
1
At
A
0
D
Dt
0
1
1
1
1
1
0
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
for each active thread:
    a = A[thread]
    result = -1
    i = 31
    while i >= 0:
        if a & (1 << i):
            result = i
            break
        i -= 1
    D[thread] = result
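The three single-source bit manipulation instructions above (bitrev, popcount and ffs) can be modelled for one thread with the Python sketch below. The ffs function follows the pseudocode literally, so the scan starts at bit 31 and the index returned is that of the most significant set bit; how the -1 result is stored in the destination register is not modelled.

```python
def bitrev(a):
    """Reverse the order of the low 32 bits of a."""
    result = 0
    for i in range(32):
        if a & (1 << i):
            result |= 1 << (31 - i)
    return result

def popcount(a):
    """Count the set bits in the low 32 bits of a."""
    return bin(a & 0xFFFFFFFF).count("1")

def ffs(a):
    """Index of the first set bit found scanning down from bit 31, or -1."""
    for i in range(31, -1, -1):
        if a & (1 << i):
            return i
    return -1
```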
Floating-Point Arithmetic
fmadd (Floating-Point Fused Multiply-Add)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
1
1
1
0
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
Bm
Bt
B
Am
At
A
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
Cx
?
?
Cm
Ct
C
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
Cx
?
?
Cm
Ct
C
Bm
Bt
B
Am
At
A
L
D
Dt
S
1
1
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
C = FloatSrc(Cx:C, Ct, Cm)
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    result = fused_multiply_add(a, b, c)
    D[thread] = result
D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)
C = FloatSrc16(Cx:C, Ct, Cm)
for each active thread:
    a = A[thread]
    b = B[thread]
    c = C[thread]
    result = fused_multiply_add(a, b, c)
    D[thread] = result
fadd (Floating-Point Add)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
S
1
0
1
0
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
Bm
Bt
B
Am
At
A
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
Bm
Bt
B
Am
At
A
1
D
Dt
S
1
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
for each active thread:
    a = A[thread]
    b = B[thread]
    result = fused_multiply_add(a, 1.0, b)
    D[thread] = result
fadd16 (Half Precision Floating-Point Add)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
S
1
0
0
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
?
Bm
Bt
B
?
Am
At
A
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
?
Bm
Bt
B
?
Am
At
A
1
D
Dt
S
1
0
0
1
1
0
D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)
for each active thread:
    a = A[thread]
    b = B[thread]
    result = fused_multiply_add(a, 1.0, b)
    D[thread] = result
fmul (Floating-Point Multiply)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
S
0
1
1
0
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
Bm
Bt
B
Am
At
A
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
Bm
Bt
B
Am
At
A
1
D
Dt
S
0
1
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
for each active thread:
    a = A[thread]
    b = B[thread]
    result = fused_multiply_add(a, b, 0.0)
    D[thread] = result
fmul16 (Half Precision Floating-Point Multiply)
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
S
0
1
0
1
1
0
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
?
Bm
Bt
B
?
Am
At
A
47
46
45
44
43
42
41
40
?
?
Dx
Ax
Bx
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
Bx
?
Bm
Bt
B
?
Am
At
A
1
D
Dt
S
0
1
0
1
1
0
D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)
for each active thread:
    a = A[thread]
    b = B[thread]
    result = fused_multiply_add(a, b, 0.0)
    D[thread] = result
floor
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = floor(A[thread])
ceil
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
1
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
1
0
0
0
0
Am
At
A
1
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = ceil(A[thread])
trunc
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
1
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
1
0
0
0
0
0
Am
At
A
1
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = trunc(A[thread])
rint
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
1
1
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
1
1
0
0
0
0
Am
At
A
1
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = rint(A[thread])
rcp
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
1
0
0
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
1
0
0
0
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = reciprocal(A[thread])
rsqrt
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
1
0
0
1
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
1
0
0
1
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = rsqrt(A[thread])
rsqrt_special
rsqrt_special can be used to implement fast sqrt as
rsqrt_special(x) * x, by handling special-cases differently.
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
1
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
0
0
0
1
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = rsqrt_special(A[thread])
sin_pt_1
sin_pt_1 is used together with sin_pt_2 and
supporting ALU instructions to compute the sine function. sin_pt_1 takes an angle
around the circle in the interval [0, 4) and produces an intermediate
result. This intermediate result is then passed to sin_pt_2, and the
two results are multiplied to give sin. The argument reduction to [0, 4)
can be computed with a few ALU instructions: reduce(x) = 4 *
fract(x / tau), where tau is the circle constant
formerly known as twice pi. Calculating cosine follows from the
identity cos(x) = sin(x + tau/4). After multiplying by
1/tau, the bias becomes 1/4, which can be added in the same
cycle via a fused multiply-add. Tangent should be lowered to a division
of sine and cosine.
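The argument reduction described above can be sketched in Python. This models only the reduction step, not sin_pt_1 or sin_pt_2 themselves, whose intermediate results are not described here. The function names are hypothetical, and ordinary double-precision arithmetic stands in for the GPU's fused multiply-add.

```python
import math

TAU = 2.0 * math.pi

def fract(x):
    """Fractional part, in [0, 1) for inputs where rounding does not interfere."""
    return x - math.floor(x)

def reduce_sin(x):
    """Map an angle in radians to the interval [0, 4) expected by sin_pt_1."""
    return 4.0 * fract(x * (1.0 / TAU))

def reduce_cos(x):
    """Same reduction for cosine: the tau/4 phase shift becomes a bias of 1/4."""
    return 4.0 * fract(x * (1.0 / TAU) + 0.25)
```

A quarter turn maps to 1.0, so each unit of the reduced range corresponds to one quadrant of the circle.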
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
1
0
1
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
1
0
1
0
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = sin_pt_1(A[thread])
sin_pt_2
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
1
1
1
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
1
1
1
0
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = sin_pt_2(A[thread])
log2
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
1
1
0
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
1
1
0
0
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = log2(A[thread])
exp2
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
1
1
0
1
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
1
1
0
1
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
for each active thread:
    D[thread] = exp2(A[thread])
dfdx
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
1
0
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
0
1
0
0
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
TODO()
dfdy
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
L
D
Dt
S
0
0
1
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
1
1
0
Am
At
A
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
?
?
Dx
Ax
0
0
0
0
0
0
0
0
0
0
0
1
1
0
Am
At
A
L
D
Dt
S
0
0
1
0
1
0
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
TODO()
Flow Control Instructions
ret
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
reg32
?
?
0
0
1
0
1
0
0
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
reg32
?
?
0
0
1
0
1
0
0
reg32 = Reg32(reg32)
TODO()
stop
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
end_execution()
trap
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
TODO()
call
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
reg32
?
?
0
0
0
0
1
0
0
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
reg32
?
?
0
0
0
0
1
0
0
reg32 = Reg32(reg32)
TODO()
jmp_incomplete
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
0
0
0
0
off
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
0
0
0
0
0
off
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
TODO()
jmp_exec_any
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
off
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
off
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
if any(exec_mask):
    next_pc = pc + sign_extend(off, 32)
jmp_exec_none
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
off
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
off
1
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
if not any(exec_mask):
    next_pc = pc + sign_extend(off, 32)
call
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
1
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
off
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
off
1
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
next_pc = pc + sign_extend(off, 32)
for each active thread:
    r1 = pc + 6
Execution Mask Stack Instructions
pop_exec
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
n
1
1
?
Dt
1
0
1
0
0
1
0
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
n
1
1
?
Dt
1
0
1
0
0
1
0
D = ImplicitR0L(Dt)
for each thread:
    v = D[thread]
    v -= n
    if v < 0:
        v = 0
    D[thread] = v
    exec_mask[thread] = (v == 0)
if_icmp
Bits 15..0:  cc n 0 0 ccn Dt 1 0 1 0 0 1 0
Bits 39..16: 0 0 Bt B 0 0 At A
Bits 47..40: ? ? 0 0 Ax Bx
D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
for each thread:
    v = D[thread]
    if v != 0:
        v += n
    elif not cc.compare(A[thread], B[thread]):
        v = 1
    D[thread] = v
    exec_mask[thread] = (v == 0)
if_fcmp
Bits 15..0:  cc n 0 0 ccn Dt 1 0 0 0 0 1 0
Bits 39..16: Bm Bt B Am At A
Bits 47..40: ? ? 0 0 Ax Bx
D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
for each thread:
    v = D[thread]
    if v != 0:
        v += n
    elif not cc.compare(A[thread], B[thread]):
        v = 1
    D[thread] = v
    exec_mask[thread] = (v == 0)
while_icmp
Bits 15..0:  cc n 1 0 ccn Dt 1 0 1 0 0 1 0
Bits 39..16: 0 0 Bt B 0 0 At A
Bits 47..40: ? ? 0 0 Ax Bx
D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
for each thread:
    v = D[thread]
    if v < n:
        if cc.compare(A[thread], B[thread]):
            v = 0
        else:
            v = n
    D[thread] = v
    exec_mask[thread] = (v == 0)
while_fcmp
Bits 15..0:  cc n 1 0 ccn Dt 1 0 0 0 0 1 0
Bits 39..16: Bm Bt B Am At A
Bits 47..40: ? ? 0 0 Ax Bx
D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
for each thread:
    v = D[thread]
    if v < n:
        if cc.compare(A[thread], B[thread]):
            v = 0
        else:
            v = n
    D[thread] = v
    exec_mask[thread] = (v == 0)
else_icmp
Bits 15..0:  cc n 0 1 ccn Dt 1 0 1 0 0 1 0
Bits 39..16: 0 0 Bt B 0 0 At A
Bits 47..40: ? ? 0 0 Ax Bx
D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
for each thread:
    v = D[thread]
    if v == 0:
        v = n
    elif v == 1:
        if cc.compare(A[thread], B[thread]):
            v = 0
        else:
            v = 1
    D[thread] = v
    exec_mask[thread] = (v == 0)
else_fcmp
Bits 15..0:  cc n 0 1 ccn Dt 1 0 0 0 0 1 0
Bits 39..16: Bm Bt B Am At A
Bits 47..40: ? ? 0 0 Ax Bx
D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
for each thread:
    v = D[thread]
    if v == 0:
        v = n
    elif v == 1:
        if cc.compare(A[thread], B[thread]):
            v = 0
        else:
            v = 1
    D[thread] = v
    exec_mask[thread] = (v == 0)
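The per-thread counter updates in this section can be modelled in a few lines of Python. This is a sketch that follows the pseudocode above, with the comparison result passed in as a boolean; the function names are illustrative, and the example sequence (n = 1 at every step, an always-true condition on the else) is an assumption about how a simple if/else might be expressed, not a statement about what any compiler emits.

```python
def if_cmp(v, n, cond):
    """if_icmp / if_fcmp update for one thread; cond is the comparison result."""
    if v != 0:
        v += n
    elif not cond:
        v = 1
    return v


def else_cmp(v, n, cond):
    """else_icmp / else_fcmp update for one thread."""
    if v == 0:
        v = n
    elif v == 1:
        v = 0 if cond else 1
    return v


def while_cmp(v, n, cond):
    """while_icmp / while_fcmp update for one thread."""
    if v < n:
        v = 0 if cond else n
    return v


def pop_exec(v, n):
    """pop_exec update for one thread (the counter is clamped at zero)."""
    return max(v - n, 0)


def exec_mask(counters):
    """A thread is active exactly when its counter is zero."""
    return [v == 0 for v in counters]


# Four threads evaluate: if (x < 2) { ... } else { ... }
x = [0, 1, 2, 3]
r0l = [0, 0, 0, 0]

r0l = [if_cmp(v, 1, xi < 2) for v, xi in zip(r0l, x)]
print(r0l, exec_mask(r0l))  # [0, 0, 1, 1] [True, True, False, False]

r0l = [else_cmp(v, 1, True) for v in r0l]
print(r0l, exec_mask(r0l))  # [1, 1, 0, 0] [False, False, True, True]

r0l = [pop_exec(v, 1) for v in r0l]
print(r0l, exec_mask(r0l))  # [0, 0, 0, 0] [True, True, True, True]
```

A thread that was already inactive when the if executes has its counter incremented by n, which is what lets the matching pop_exec restore it later.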
Select Instructions
icmpsel (Integer Compare and Select)
Bits 15..0:  L D Dt 0 0 1 0 0 1 0
Bits 39..16: ? ? Bt B ? ? At A
Bits 63..40: cc Yt Y ? ? ? Xt X
Bits 79..64: ? ? Dx Ax Bx Xx Yx ? ? ? ?
cc = ICondition(cc)
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
X = CmpselSrc(Xx:X, Xt, Dt)
Y = CmpselSrc(Yx:Y, Yt, Dt)
for each active thread:
    if cc.compare(A[thread], B[thread]):
        D[thread] = X[thread]
    else:
        D[thread] = Y[thread]
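As an illustration of the pseudocode above, the following Python sketch applies a compare-and-select across a small SIMD-group. The function name `cmpsel` and the use of plain lists for registers are assumptions made for the example; inactive threads keep their previous destination value, as described for conditional execution.

```python
def cmpsel(compare, a, b, x, y, dest, exec_mask):
    """Per-thread compare-and-select; inactive threads keep their old value."""
    return [
        (xv if compare(av, bv) else yv) if active else old
        for av, bv, xv, yv, old, active in zip(a, b, x, y, dest, exec_mask)
    ]


# Signed max(a, b): select a when a > b, otherwise b. Thread 3 is inactive.
a = [1, 5, -3, 7]
b = [4, 2, -1, 7]
result = cmpsel(lambda p, q: p > q, a, b, a, b,
                dest=[0, 0, 0, 0],
                exec_mask=[True, True, True, False])
print(result)  # [4, 5, -1, 0]
```

Passing the same registers as both comparison operands and select operands, as here, gives min/max style operations.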
fcmpsel (Floating-Point Compare and Select)
Bits 15..0:  L D Dt 0 0 0 0 0 1 0
Bits 39..16: Bm Bt B Am At A
Bits 63..40: cc Yt Y ? ? ? Xt X
Bits 79..64: ? ? Dx Ax Bx Xx Yx ? ? ? ?
cc = FCondition(cc)
D = ALUDst(Dx:D, Dt)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
X = CmpselSrc(Xx:X, Xt, Dt)
Y = CmpselSrc(Yx:Y, Yt, Dt)
for each active thread:
    if cc.compare(A[thread], B[thread]):
        D[thread] = X[thread]
    else:
        D[thread] = Y[thread]
SIMD Group and Quad Group Instructions
icmp_ballot
Bits 15..0:  ? D Dt 0 1 1 0 0 1 0
Bits 39..16: 0 0 Bt B 0 0 At A
Bits 63..40: cc 0 0 0 0 0 0 0 0 0 0 0 0 1 ccn ? Dx Ax Bx
D = ALUDst(Dx:D, Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
result = 0
for each active thread:
    a = A[thread]
    b = B[thread]
    if cc.compare(a, b):
        result |= 1 << thread
D.broadcast_to_active(result)
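The ballot pseudocode above can be sketched in Python as follows. The names `ballot` and `broadcast_to_active` are illustrative, registers are modelled as 32-element lists (one entry per thread), and the comparison is passed in as a function rather than decoded from cc.

```python
def ballot(compare, a, b, exec_mask):
    """Build a bitmask with bit t set when active thread t passes the comparison."""
    result = 0
    for thread, active in enumerate(exec_mask):
        if active and compare(a[thread], b[thread]):
            result |= 1 << thread
    return result


def broadcast_to_active(dest, value, exec_mask):
    """Write the same value to every active thread; inactive threads are unchanged."""
    return [value if active else old for old, active in zip(dest, exec_mask)]


a = list(range(32))          # thread index
b = [16] * 32
mask = [True] * 32
mask[0] = False              # thread 0 is inactive, so it cannot vote

bits = ballot(lambda p, q: p < q, a, b, mask)
print(hex(bits))             # 0xfffe
d = broadcast_to_active([0] * 32, bits, mask)
print(d[0], hex(d[1]))       # 0 0xfffe
```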
icmp_quad_ballot
Bits 15..0:  ? D Dt 0 1 1 0 0 1 0
Bits 39..16: 0 0 Bt B 0 0 At A
Bits 63..40: cc 0 0 0 0 0 0 0 0 0 0 0 0 0 ccn ? Dx Ax Bx
D = ALUDst(Dx:D, Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
TODO()
fcmp_ballot
Bits 15..0:  ? D Dt 0 1 0 0 0 1 0
Bits 39..16: Bm Bt B Am At A
Bits 63..40: cc 0 0 0 0 0 0 0 0 0 0 0 0 1 ccn ? Dx Ax Bx
D = ALUDst(Dx:D, Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
result = 0
for each active thread:
    a = A[thread]
    b = B[thread]
    if cc.compare(a, b):
        result |= 1 << thread
D.broadcast_to_active(result)
fcmp_quad_ballot
Bits 15..0:  ? D Dt 0 1 0 0 0 1 0
Bits 39..16: Bm Bt B Am At A
Bits 63..40: cc 0 0 0 0 0 0 0 0 0 0 0 0 0 ccn ? Dx Ax Bx
D = ALUDst(Dx:D, Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
TODO()
simd_shuffle
Bits 15..0:  0 D Dt 1 1 0 1 1 1 1
Bits 39..16: 0 0 Bt B 0 1 At A
Bits 47..40: 0 ? Dx Ax Bx
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc16(Bx:B, Bt)
quad_values = []
for each quad:
    quad_index = 0
    for each thread in quad:
        # NOTE: this is not execution masked, meaning any inactive thread can make
        # simd_broadcast from the whole quad undefined (although it works fine if
        # B is an immediate)
        quad_index |= B[thread] & 3
    quad_values.append(A[quad.start + quad_index])
for each active thread:
    b = B[thread]
    if b < 32:
        result = quad_values[b >> 2]
    D[thread] = result
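The following Python sketch models the simd_shuffle pseudocode above, including the hazard described in the NOTE. It is an illustration only: registers are 32-element lists, the function name is hypothetical, and leaving the destination unchanged when B is 32 or more is an assumption, since that case is not specified above.

```python
def simd_shuffle(a, b, dest, exec_mask):
    """Model of the quad-based shuffle; a, b, dest, exec_mask have 32 entries."""
    quad_values = []
    for start in range(0, len(a), 4):
        quad_index = 0
        for thread in range(start, start + 4):
            # Not execution masked: inactive threads contribute their B too.
            quad_index |= b[thread] & 3
        quad_values.append(a[start + quad_index])
    out = list(dest)
    for thread, active in enumerate(exec_mask):
        if active and b[thread] < 32:
            out[thread] = quad_values[b[thread] >> 2]
        # Assumption: b >= 32 leaves the destination untouched in this model.
    return out


a = [10 * t for t in range(32)]

# Uniform index: every thread asks for lane 5, so all active threads get a[5].
out = simd_shuffle(a, [5] * 32, [0] * 32, [True] * 32)
print(out[0], out[31])  # 50 50

# Thread 6 is inactive but holds a stale index of 6 in B. Its low bits are
# still OR-ed into quad 1's index (1 | 2 = 3), so lane 7 is read instead.
b = [5] * 32
b[6] = 6
mask = [True] * 32
mask[6] = False
out = simd_shuffle(a, b, [0] * 32, mask)
print(out[0], out[6])   # 70 0
```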
simd_shuffle_down
Bits 15..0:  0 D Dt 1 1 0 1 1 1 1
Bits 39..16: 1 1 Bt B 0 1 At A
Bits 47..40: 0 ? Dx Ax Bx
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc16(Bx:B, Bt)
TODO()
Memory and Stack Instructions
wait
Bits 15..0:  ? ? ? ? ? ? ? i 0 0 1 1 1 0 0 0
wait_for_loads()
ld/st_tile
Bits 15..0:  ? D Dt load 0 0 1 0 0 1
Bits 31..16: ? ? ? ? F ? ? ? ? ? ? ? ?
Bits 47..32: ? ? ? ? ? ? ? ? mask u0 rt
Bits 63..48: ? ? Dx ? ? ? ? ? ? ? ? ? ? ? ?
D = ALUDst(Dx:D, Dt)
TODO()
ld_var
The last four bytes are omitted if L=0.
Bits 15..0:  L D Dt perspective 1 0 0 0 0 1
Bits 31..16: mask ? ? ? ? ? ? ? ? index
Bits 47..32: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Bits 63..48: ? ? Dx ? ? ? ? ? ? ? ? ? ? ? ?
D = ALUDst(Dx:D, Dt)
TODO()
uniform_store
uniform_store is used to initialise uniform registers.
R is stored to offset O, which is typically an
index in 16-bit units into the uniform registers. This is encoded like
(and possibly is) a store to device memory, and can move one 16-bit register
to initialise a 16-bit uniform, or two consecutive 16-bit registers to
initialise a 32-bit uniform.
Bits 15..0:  R 0 F 1 0 0 0 1 0 1
Bits 31..16: ? ? 1 1 1 unk Ot Ol 0 0 0 0
Bits 47..32: L b s Rx 0 0 0 0 Oh
Bits 63..48: Ox mask 0 0 Rt ?
R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)
TODO()
device_load
device_load initiates a load from device memory, the result
of which may be used after a wait.
The data can be unpacked from a variety of formats, or passed through as-is.
On each thread, up to four aligned values, each up to 32 bits wide, can be
read from a base address plus an offset (shifted left by the alignment,
with an optional additional left shift of up to two).
The values to read are selected by a mask:
0b0001 loads one value, and 0b1111
loads four values. Non-contiguous masks skip values in memory,
but still write the results to contiguous registers.
Non-packed formats (8, 16, and 32-bit values) are zero
extended. All packed values are unpacked to 16-bit or 32-bit floating-point
values, depending on the size of the register. Bit-packed formats (rgb10a2,
rg11b10f and rgb9e5) are supported, but ignore the optional
shift and the mask: they always read an aligned 32-bit value, and always write to the same number of
registers. However, simple packed values (unorm8, snorm8,
unorm16, snorm16 and srgba8) do not have this limitation.
Unaligned addresses are rounded down to the required alignment. The base address (A)
is a 64-bit value from either uniform or general-purpose registers. The offset (O) may
be a signed 16-bit immediate, or a signed or unsigned 32-bit general-purpose register.
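The effect of the mask on register assignment can be illustrated with a short Python sketch. The function name `masked_load_plan` is hypothetical, and the sketch assumes that mask bit i selects the i-th value in memory (lowest bit, lowest address); only the rule that selected values land in contiguous registers comes from the description above.

```python
def masked_load_plan(mask):
    """Pair each selected memory element with its destination register slot.

    Returns a list of (element_index, register_slot) tuples. Assumes mask
    bit i corresponds to the i-th value in memory.
    """
    plan = []
    slot = 0
    for element in range(4):
        if mask & (1 << element):
            plan.append((element, slot))
            slot += 1  # results are packed into contiguous registers
    return plan


print(masked_load_plan(0b0001))  # [(0, 0)]
print(masked_load_plan(0b1111))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(masked_load_plan(0b0101))  # [(0, 0), (2, 1)]
```

With the non-contiguous mask 0b0101, element 1 is skipped in memory, but the two loaded values still occupy the first two destination registers.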
Bits 15..0:  R F 0 0 0 0 1 0 1
Bits 31..16: ? u2 ? ? At ? Ou Ot Ol Al
Bits 47..32: L ? ? ? s Rx Ah Oh
Bits 63..48: Ox mask ? ? Rt Fx
R = MemoryReg(Rx:R, Rt)
A = MemoryBase(Ah:Al, At)
O = MemoryIndex(Ox:Oh:Ol, Ot)
TODO()
device_store
Bits 15..0:  R F 1 0 0 0 1 0 1
Bits 31..16: ? u2 ? ? At ? Ou Ot Ol Al
Bits 47..32: L ? ? ? s Rx Ah Oh
Bits 63..48: Ox mask ? ? Rt Fx
R = MemoryReg(Rx:R, Rt)
A = MemoryBase(Ah:Al, At)
O = MemoryIndex(Ox:Oh:Ol, Ot)
TODO()
stack_store
Bits 15..0:  R F 1 0 1 1 0 1 0 1
Bits 31..16: ? i6 ? ? ? i1 ? Ot Ol 0 0 0 0
Bits 47..32: L i5 ? ? Rx ? i2 Oh
Bits 63..48: Ox mask Fx Rt ?
R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)
TODO()
stack_load
Bits 15..0:  R F 0 0 1 1 0 1 0 1
Bits 31..16: ? i6 ? ? ? i1 ? Ot Ol 0 0 0 0
Bits 47..32: L i5 ? ? Rx ? i2 Oh
Bits 63..48: Ox mask Fx Rt ?
R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)
TODO()
stack_get_ptr
Bits 15..0:  R i0 0 0 1 1 0 1 0 1
Bits 31..16: ? ? ? ? ? i1 ? ? ? ? ? ? 0 0 0 1
Bits 47..32: 1 i3 ? ? Rx ? i2 ? ? ? ?
Bits 63..48: ? ? ? ? ? ? ? ? i4 1 0
R = StackReg32(Rx:R)
TODO()
stack_adjust
Bits 15..0:  ? ? ? ? ? ? i0 1 0 1 1 0 1 0 1
Bits 31..16: ? ? ? ? ? i1 0 1 v1 0 0 0 1
Bits 47..32: L i3 ? ? ? ? ? i2 v2
Bits 63..48: v3 i4 ? ?
v = v3:v2:v1
TODO()
threadgroup_load
Bits 15..0:  L R Rt ? 1 1 ? 1 0 0 1
Bits 39..16: mask ? Ot O F At A
Bits 63..40: ? ? Rx Ax Ox ? ? ? ? ? ? ? ?
R = ThreadgroupMemoryReg(Rx:R, Rt)
A = ThreadgroupMemoryBase(Ax:A, At)
O = ThreadgroupIndex(Ox:O, Ot)
TODO()
threadgroup_store
Bits 15..0:  L R Rt ? 0 1 ? 1 0 0 1
Bits 39..16: mask ? Ot O F At A
Bits 63..40: ? ? Rx Ax Ox ? ? ? ? ? ? ? ?
R = ThreadgroupMemoryReg(Rx:R, Rt)
A = ThreadgroupMemoryBase(Ax:A, At)
O = ThreadgroupIndex(Ox:O, Ot)
TODO()
texture_sample
The last four bytes are omitted if L=0.
Bits 15..0:  L R Rt 0 0 1 1 0 0 0 1
Bits 31..16: q2 D q1 Ct C
Bits 47..32: q3 n Tt T
Bits 63..48: q5 St S lod mask
Bits 95..64: Ox Sx Ot q6 O Tx Dx Cx Rx q4 U
R = SampleReg(Rx:R, Rt)
U = SampleUReg(U)
T = Texture(Tx:T, Tt)
S = Sampler(Sx:S, St)
C = Coords(Cx:C, Ct)
D = Lod(Dx:D)
O = SampleOff(Ox:O, Ot)
TODO()
texture_load
The last four bytes are omitted if L=0.
Bits 15..0:  L R Rt 0 1 1 1 0 0 0 1
Bits 31..16: q2 D q1 Ct C
Bits 47..32: q3 n Tt T
Bits 63..48: q5 St S lod mask
Bits 95..64: Ox Sx Ot q6 O Tx Dx Cx Rx q4 U
R = SampleReg(Rx:R, Rt)
U = SampleUReg(U)
T = Texture(Tx:T, Tt)
S = Sampler(Sx:S, St)
C = Coords(Cx:C, Ct)
D = Lod(Dx:D)
O = SampleOff(Ox:O, Ot)
TODO()
threadgroup_barrier
Bits 15..0:  ? ? ? ? ? ? ? ? 0 1 1 0 1 0 0 0
TODO()
Operands
ALUDst
ALUDst(value, flags, max_size=32):
    cache_flag = flags & 1
    if flags & 2 and value & 1 and max_size >= 64:
        return Reg64Reference(value >> 1, cache=cache_flag)
    elif flags & 2 and max_size >= 32:
        return Reg32Reference(value >> 1, cache=cache_flag)
    else:
        return Reg16Reference(value, cache=cache_flag)
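A runnable Python rendering of ALUDst may help when checking encodings by hand. This sketch returns plain tuples instead of register reference objects, and the `describe` helper is hypothetical: it assumes a 16-bit register index maps to the low half when even and the high half when odd, and it names 64-bit operands by their first 32-bit register only.

```python
def alu_dst(value, flags, max_size=32):
    """Decode an ALU destination into (kind, index, cache_flag)."""
    cache_flag = bool(flags & 1)
    if flags & 2 and value & 1 and max_size >= 64:
        return ("r64", value >> 1, cache_flag)
    elif flags & 2 and max_size >= 32:
        return ("r32", value >> 1, cache_flag)
    else:
        return ("r16", value, cache_flag)


def describe(operand):
    """Hypothetical pretty-printer for the tuples returned by alu_dst."""
    kind, index, cache_flag = operand
    if kind == "r16":
        # Assumption: even index = low half, odd index = high half.
        name = f"r{index >> 1}{'h' if index & 1 else 'l'}"
    else:
        name = f"r{index}"
    return f"{name} ({kind}{', cache' if cache_flag else ''})"


print(describe(alu_dst(0b000110, 0b10)))            # r3 (r32)
print(describe(alu_dst(0b000101, 0b01)))            # r2h (r16, cache)
print(describe(alu_dst(0b000111, 0b10, max_size=64)))  # r3 (r64)
```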
ALUSrc
ALUSrc(value, flags, max_size=32):
    if flags == 0b0000:
        return BroadcastImmediateReference(value)
    if flags >> 2 == 0b01:
        ureg = value | (flags & 1) << 8
        if flags & 0b10:
            if max_size < 32:
                UNDEFINED()
            return BroadcastUReg32Reference(ureg >> 1)
        else:
            return BroadcastUReg16Reference(ureg)
    if (flags & 0b11) == 0b00: UNDEFINED()
    cache_flag = (flags & 0b11) == 0b10
    discard_flag = (flags & 0b11) == 0b11
    if flags >> 2 == 0b11 and max_size >= 64:
        if value & 1: UNDEFINED()
        return Reg64Reference(value >> 1, cache=cache_flag, discard=discard_flag)
    if flags >> 2 >= 0b10 and max_size >= 32:
        if flags >> 2 != 0b10: UNDEFINED()
        if value & 1: UNDEFINED()
        return Reg32Reference(value >> 1, cache=cache_flag, discard=discard_flag)
    if max_size >= 16:
        if flags >> 2 != 0b00: UNDEFINED()
        return Reg16Reference(value, cache=cache_flag, discard=discard_flag)
MulSrc
MulSrc(value, flags, sx):
    source = ALUSrc(value, flags, max_size=32)
    if sx:
        # Note: 8-bit immediates have already been zero-extended to 16-bit,
        # so do not get sign extended.
        return SignExtendWrapper(source, source.thread_bit_size)
    else:
        return source
AddSrc
AddSrc(value, flags, sx):
    source = ALUSrc(value, flags, max_size=64)
    if sx:
        # Note: 8-bit immediates have already been zero-extended to 16-bit,
        # so do not get sign extended.
        return SignExtendWrapper(source, source.thread_bit_size)
    else:
        return source
CmpselSrc
CmpselSrc(value, flags, destination_flags):
    if flags == 0b100:
        return BroadcastImmediateReference(value)
    if flags >> 1 == 0b11:
        ureg = value | (flags & 1) << 8
        if destination_flags & 2:
            if ureg & 1: UNDEFINED()
            return BroadcastUReg32Reference(ureg >> 1)
        else:
            return BroadcastUReg16Reference(ureg)
    if flags >> 2 == 1: UNDEFINED()
    if (flags & 0b11) == 0b00: UNDEFINED()
    cache_flag = (flags & 0b11) == 0b10
    discard_flag = (flags & 0b11) == 0b11
    if destination_flags & 2:
        if value & 1: UNDEFINED()
        return Reg32Reference(value >> 1, cache=cache_flag, discard=discard_flag)
    else:
        return Reg16Reference(value, cache=cache_flag, discard=discard_flag)
ICondition
ICondition(value, n=0):
    sign_extend = (value & 0b100) != 0
    condition = value & 0b011
    invert_result = (n != 0)
    if condition == 0b00:
        return IntEqualityComparison(sign_extend, invert_result)
    if condition == 0b01:
        return IntLessThanComparison(sign_extend, invert_result)
    if condition == 0b10:
        return IntGreaterThanComparison(sign_extend, invert_result)
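The integer condition decoding can be sketched as runnable Python. This is an interpretation rather than a confirmed description: it assumes the sign_extend bit selects a signed comparison (and unsigned otherwise), assumes 32-bit operands by default, and treats the undescribed condition 0b11 as an error. The names `to_signed` and `icondition` are illustrative.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    sign_bit = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign_bit) - sign_bit


def icondition(value, n=0, bits=32):
    """Return a compare(a, b) function for a 3-bit integer condition code."""
    signed = (value & 0b100) != 0
    condition = value & 0b011
    invert_result = n != 0
    ops = {
        0b00: lambda a, b: a == b,
        0b01: lambda a, b: a < b,
        0b10: lambda a, b: a > b,
    }
    if condition not in ops:
        raise ValueError("condition 0b11 is not described")
    op = ops[condition]
    mask = (1 << bits) - 1

    def compare(a, b):
        if signed:
            a, b = to_signed(a, bits), to_signed(b, bits)
        else:
            a, b = a & mask, b & mask
        return op(a, b) != invert_result

    return compare


print(icondition(0b001)(0xFFFFFFFF, 1))       # False: unsigned 0xFFFFFFFF < 1
print(icondition(0b101)(0xFFFFFFFF, 1))       # True: signed -1 < 1
print(icondition(0b001, n=1)(3, 3))           # True: inverted "less than"
```

Inverting less-than and greater-than is how greater-or-equal and less-or-equal would be expressed in this model.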
FCondition
FCondition(condition, n=0):
    invert_result = (n != 0)
    if condition == 0b000:
        return FloatEqualityComparison(invert_result)
    if condition == 0b001:
        return FloatLessThanComparison(invert_result)
    if condition == 0b010:
        return FloatGreaterThanComparison(invert_result)
    if condition == 0b011:
        return FloatLessThanNanLosesComparison(invert_result)
    if condition == 0b101:
        return FloatLessThanOrEqualComparison(invert_result)
    if condition == 0b110:
        return FloatGreaterThanOrEqualComparison(invert_result)
    if condition == 0b111:
        return FloatGreaterThanNanLosesComparison(invert_result)
MemoryIndex
MemoryIndex(value, flags):
    if flags != 0:
        return BroadcastImmediateReference(sign_extend(value, 16))
    else:
        if value & 1: UNDEFINED()
        if value >= 0x100: UNDEFINED()
        return Reg32Reference(value >> 1)
MemoryBase
MemoryBase(value, flags):
    if value & 1: UNDEFINED()
    if flags != 0:
        return UReg64Reference(value >> 1)
    else:
        return Reg64Reference(value >> 1)
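The two memory operand decoders above translate directly into Python. In this sketch the results are plain tuples, UNDEFINED() is modelled by raising `ValueError`, and the function names are illustrative.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    sign_bit = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign_bit) - sign_bit


def memory_index(value, flags):
    """Decode an offset operand: a signed 16-bit immediate or a 32-bit register."""
    if flags != 0:
        return ("imm", sign_extend(value, 16))
    if value & 1 or value >= 0x100:
        raise ValueError("UNDEFINED")
    return ("r32", value >> 1)


def memory_base(value, flags):
    """Decode a base operand: a 64-bit uniform or general-purpose register pair."""
    if value & 1:
        raise ValueError("UNDEFINED")
    return ("u64" if flags != 0 else "r64", value >> 1)


print(memory_index(0xFFFC, 1))  # ('imm', -4)
print(memory_index(0x06, 0))    # ('r32', 3)
print(memory_base(0x04, 1))     # ('u64', 2)
```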
Helper Pseudocode
decode_float_immediate(value):
    sign = (value & 0x80) >> 7
    exponent = (value & 0x70) >> 4
    fraction = value & 0xF
    if exponent == 0:
        result = fraction / 64.0
    else:
        fraction = 16.0 + fraction
        exponent -= 7
        result = fraction * (2.0 ** exponent)
    if sign != 0:
        result = -result
    return result
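For reference, the same decoding as runnable Python, with a few worked values. The function body follows the pseudocode above; the sample encodings were chosen for illustration.

```python
def decode_float_immediate(value):
    """Decode an 8-bit float immediate: 1 sign, 3 exponent, 4 fraction bits."""
    sign = (value & 0x80) >> 7
    exponent = (value & 0x70) >> 4
    fraction = value & 0xF
    if exponent == 0:
        # Denormal range: steps of 1/64 from 0 up to 15/64.
        result = fraction / 64.0
    else:
        # Normal range: implicit leading one (16 + fraction), biased exponent.
        result = (16.0 + fraction) * (2.0 ** (exponent - 7))
    return -result if sign else result


for encoded in (0x00, 0x01, 0x10, 0x30, 0x40, 0x7F, 0xB0):
    print(f"0x{encoded:02X} -> {decode_float_immediate(encoded)}")
```

The largest magnitude representable this way is 31.0 (0x7F), and the denormal and normal ranges meet without a gap: 0x0F decodes to 15/64 and 0x10 to 16/64.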