This document attempts to describe the Apple G13 GPU architecture, as used in the M1 SoC. It is based on reverse engineering and is likely to contain mistakes. The Metal Shading Language is typically used to program these GPUs, and this document uses Metal terminology. For example, a CPU SIMD-lane is a Metal thread, and a CPU thread is a Metal SIMD-group.

The G13 architecture has 32 threads per SIMD-group. Each SIMD-group has a stack pointer (sp), a program counter (pc), a 32-bit execution mask (exec_mask), and up to 128 general purpose registers.

General purpose registers each store one 32-bit value per thread. Each register can be accessed as a full 32-bit register (r0 to r127), as its low 16 bits (r0l to r127l), or as its high 16 bits (r0h to r127h). Some instructions may also use pairs of contiguous 32-bit registers to operate on 64-bit values, and memory operations may use up to four contiguous 16-bit or 32-bit registers (in both cases, encoded as the first register).
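As a rough illustration of the naming scheme, the following Python sketch models one thread's copy of a 32-bit register and its two 16-bit halves. The helper names are made up for this example, and it assumes (as the "low"/"high" naming suggests) that rNl is bits 0 to 15 and rNh is bits 16 to 31.

```python
def read_half(value32, half):
    """Return the 16-bit half ('l' or 'h') of a 32-bit register value."""
    if half == 'l':
        return value32 & 0xFFFF
    if half == 'h':
        return (value32 >> 16) & 0xFFFF
    raise ValueError("half must be 'l' or 'h'")


def write_half(value32, half, value16):
    """Return value32 with one 16-bit half replaced by value16."""
    value16 &= 0xFFFF
    if half == 'l':
        return (value32 & 0xFFFF0000) | value16
    if half == 'h':
        return (value32 & 0x0000FFFF) | (value16 << 16)
    raise ValueError("half must be 'l' or 'h'")


r5 = 0x12345678                        # hypothetical contents of r5 in one thread
r5l = read_half(r5, 'l')               # what a read of r5l would see
r5h = read_half(r5, 'h')               # what a read of r5h would see
r5_new = write_half(r5, 'h', 0xABCD)   # r5 after a write to r5h
```

Writing one half leaves the other half untouched, which is what makes it possible to pack two independent 16-bit values into a single register.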

A number of physical registers are allocated to each SIMD-group, and registers from r0 to r(N-1) may be used, but accesses to higher register numbers are not valid, and may read or corrupt data from other SIMD-groups. Using fewer registers (e.g. by using 16-bit types instead of 32-bit types) allows more SIMD-groups to fit in the physical register file (higher occupancy), which improves performance.

Certain instructions are hardcoded to use early registers. r0l tracks the execution mask stack, and r1 is used as the link register.

Shared state is less well understood, but includes 256 32-bit uniform registers named u0 to u255 and similarly accessible as their 16-bit halves, u0l to u255h. These are used for values that are the same across all threads, such as threads_per_grid, or addresses of buffers.

Conditional Execution

Each thread within a SIMD-group may be deactivated, meaning the values in registers for that thread will keep their current value. Whether or not each thread is active is tracked in a 32-bit execution mask (in this document, by convention, a one bit indicates the thread is active and a zero bit indicates it is not).

r0l tracks the execution mask stack. When used with flow-control instructions, the value in r0l indicates how many 'pop' operations will be needed to re-enable an inactive thread, or zero if the thread is active.

The execution mask stack instructions (pop_exec, if_*cmp, else_*cmp and while_*cmp) are typically used to manage r0l. They also update exec_mask based on the value in r0l, and are the only known way to manipulate exec_mask. However, r0l may be manipulated with other instructions. It should be initialised to zero prior to using execution mask stack instructions, and break statements may be implemented by conditionally moving a value (corresponding to the number of stack-levels to break out of) to r0l and using a pop_exec 0 (which deactivates any threads that now have non-zero values in r0l).
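The following Python sketch traces r0l through a simple if/else and a two-level break. It transcribes the per-thread rules from the pop_exec, if_*cmp and else_*cmp pseudocode later in this document; the function names, the use of four threads instead of 32, and the particular conditions are assumptions made for illustration only.

```python
def if_cmp(r0l, cond, n=1):
    """if_*cmp: push inactive threads n levels deeper, deactivate failing threads."""
    out = []
    for v, c in zip(r0l, cond):
        if v != 0:
            v += n
        elif not c:
            v = 1
        out.append(v)
    return out


def else_cmp(r0l, cond, n=1):
    """else_*cmp: deactivate the 'then' threads, re-enable threads waiting at depth 1."""
    out = []
    for v, c in zip(r0l, cond):
        if v == 0:
            v = n
        elif v == 1:
            v = 0 if c else 1
        out.append(v)
    return out


def pop_exec(r0l, n):
    """pop_exec: subtract n from every counter, clamping at zero."""
    return [max(v - n, 0) for v in r0l]


def exec_mask(r0l):
    """A thread is active exactly when its r0l counter is zero."""
    return [v == 0 for v in r0l]


thread_ids = [0, 1, 2, 3]
r0l = [0, 0, 0, 0]                       # initialised to zero: all threads active

r0l = if_cmp(r0l, [t < 2 for t in thread_ids])
then_mask = exec_mask(r0l)               # threads 0 and 1 run the 'then' body
r0l = else_cmp(r0l, [True] * 4)
else_mask = exec_mask(r0l)               # threads 2 and 3 run the 'else' body
r0l = pop_exec(r0l, 1)
after_if = list(r0l)                     # everyone active again

# break: threads 1 and 3 conditionally move 2 into r0l, then pop_exec 0
r0l = [2 if t in (1, 3) else v for t, v in zip(thread_ids, r0l)]
r0l = pop_exec(r0l, 0)
break_mask = exec_mask(r0l)              # threads 1 and 3 are now inactive
```

Note that pop_exec 0 does not change r0l at all; its only effect is to recompute exec_mask from the values that were moved into r0l.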

The jmp_exec_none instruction may be used to jump over an if statement body if no threads are active, and jmp_exec_any may be used to jump back to the start of a loop only if one or more threads are still executing the loop.

Execution masking generally prevents reading or writing values from inactive threads; however, SIMD shuffle instructions can read from inactive threads in some cases, which can be valuable for research or debugging purposes (e.g. it allows observing non-zero values in r0l).

Register Cache

The GPU has a register cache, which keeps the contents of recently used general purpose registers more quickly accessible. When instructions read or write GPRs, they usually allow hints for the access to be encoded. The cache hint, on a source or destination operand, indicates the value will be used again, and should be cached (meaning other values, where this hint was not used, will be preferred for eviction). The discard hint (on a source operand) invalidates the value in the register cache after all operands have been read, without writing it back to the register file.

While the cache hint should only change performance, the discard hint will make future reads undefined, which could lead to confusing issues. discard should probably not be used within conditional execution, as inactive threads within the SIMD-group may contain data that has not been written back to the register file and that would probably also get discarded. The behaviour of this hint when execution is partially or completely masked has not been tested.

Either hint may be used multiple times even if the same operand appears twice, and discard on a source register can be used with cache on the same destination register.

Instructions

Instructions vary in length in multiples of two bytes up to twelve bytes (so far). Some instructions have both a long and a short encoding. This is indicated by a bit L, which, if zero, indicates the last two (or four) bytes of the instruction are omitted, and any bits within those bytes should be read as zero. So far only 12-byte instructions omit the last four bytes, and all others omit the last two.
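A decoder might handle the short form by zero-padding it back to the long length before extracting fields. The sketch below assumes the L bit sits in bit 15 of the first 16 bits (which is where the encodings in this document that list an L field show it); the function names and the example bytes are hypothetical.

```python
def encoded_length(first_halfword, long_length, omitted):
    """Number of bytes actually present, given the L bit (assumed to be bit 15)."""
    l_bit = (first_halfword >> 15) & 1
    return long_length if l_bit else long_length - omitted


def expand(encoded, long_length):
    """Zero-pad a (possibly short) encoding to its long length, as a little-endian integer."""
    padded = encoded + bytes(long_length - len(encoded))
    return int.from_bytes(padded, 'little')


# Hypothetical 8-byte instruction whose short form omits the last two bytes.
short_form = bytes([0x62, 0x00, 0x34, 0x12, 0x00, 0x00])
length = encoded_length(int.from_bytes(short_form[:2], 'little'), 8, 2)
bits = expand(short_form, 8)
```

After expansion, fields in the omitted bytes simply read as zero, so the same field extraction code can serve both encodings.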

The encodings are described in little-endian, meaning bytes go right-to-left (and top-to-bottom), but bits may be read in the usual numerical order.

Behaviour is mostly described in a Python-like pseudocode for now. In operand descriptions, the : operator describes bit concatenation, with the most-significant fields first. Elsewhere, values are considered to be Python-style arbitrary-precision integers, and floating-point values are considered to be arbitrary-precision floats (although double-precision with round-to-odd may be an adequate approximation).
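The : operator can be modelled with a small helper that takes (value, width) pairs, most-significant first. The helper name and the field widths in the example are assumptions chosen for illustration; the real widths come from the encoding diagrams.

```python
def concat(*fields):
    """Concatenate (value, width) bit fields, most-significant field first."""
    result = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its width")
        result = (result << width) | value
    return result


# Hypothetical 2-bit extension field Dx in front of a hypothetical 6-bit field D,
# as in an operand description written "Dx:D".
register_index = concat((0b01, 2), (0b000101, 6))
```

Because Python integers are arbitrary precision, the same helper works unchanged for wide immediates or for fields that are split across several parts of an instruction.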


Move Instructions

mov (Move 16-bit Immediate)

1514131211109876543210
LD0?1100010
Dt
31302928272625242322212019181716
imm16
47464544434241403938373635343332
??Dx????????????
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??Dx????????????imm16LD0?1100010
Dt
D = ALUDst(Dx:D, Dt)

D.broadcast_to_active(imm16)

mov (Move 32-bit Immediate)

1514131211109876543210
LD1?1100010
Dt
4746454443424140393837363534333231302928272625242322212019181716
imm32
63626160595857565554535251504948
??Dx????????????
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??Dx????????????imm32LD1?1100010
Dt
D = ALUDst(Dx:D, Dt)

D.broadcast_to_active(imm32)

get_sr (Move From Special Register)

1514131211109876543210
0DDt1110010
31302928272625242322212019181716
??DxSRx????SR
313029282726252423222120191817161514131211109876543210
??DxSRx????SR0DDt1110010
D = ALUDst(Dx:D, Dt)
SR = SReg32(SRx:SR)

for each active thread:
  D[thread] = SR.read(thread)

Integer Arithmetic Instructions

iadd (Integer Add or Subtract)

1514131211109876543210
0DDtS001110
393837363534333231302928272625242322212019181716
s1BsBtBNAsAtA
636261605958575655545352515049484746454443424140
??????????s2??????DxAxBx
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??????????s2??????DxAxBxs1BsBtBNAsAtA0DDtS001110
D = ALUDst64(Dx:D, Dt)
A = AddSrc(Ax:A, At, As)
B = AddSrc(Bx:B, Bt, Bs)
shift = s2:s1

for each active thread:
  a = A[thread]
  b = B[thread]

  saturating = (S == 1 and shift == 0 and A.thread_bit_size <= 32 and
                B.thread_bit_size <= 32 and D.thread_bit_size <= 32)

  if N == 1:
    b = -b

  if shift < 5:
    b <<= shift
  else:
    b = 0

  result = a + b

  if saturating:
    signed = (As == 1 or Bs == 1)
    result = saturate_integer(result, D.thread_bit_size, signed)

  D[thread] = result
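The per-thread behaviour above can be tried out with a minimal runnable sketch. It operates on values that have already been sign- or zero-extended by the operand decode; saturate_integer is assumed to clamp to the destination's range, and the final truncation to the destination width is likewise an assumption rather than something the pseudocode states.

```python
def saturate_integer(value, bits, signed):
    """Clamp value to the range of a bits-wide signed or unsigned integer (assumed behaviour)."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))


def iadd(a, b, *, negate=False, shift=0, saturate=False, signed=False, bits=32):
    """One thread of iadd: a + ((-b or b) << shift), optionally saturated."""
    if negate:                       # the N bit
        b = -b
    b = b << shift if shift < 5 else 0
    result = a + b
    if saturate and shift == 0:      # saturation only applies when shift is zero
        result = saturate_integer(result, bits, signed)
    return result & ((1 << bits) - 1)   # assumed wrap to the destination width
```

For example, iadd(0xFFFFFFFF, 1) wraps to 0, while the same call with saturate=True stays at 0xFFFFFFFF.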

imadd (Integer Multiply-Add or Subtract)

1514131211109876543210
0DDtS011110
393837363534333231302928272625242322212019181716
s1BsBtBNAsAtA
636261605958575655545352515049484746454443424140
??DxAxBxCxs2?CsCtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxCxs2?CsCtCs1BsBtBNAsAtA0DDtS011110
D = ALUDst64(Dx:D, Dt)
A = MulSrc(Ax:A, At, As)
B = MulSrc(Bx:B, Bt, Bs)
C = AddSrc(Cx:C, Ct, Cs)
shift = s2:s1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  saturating = (S == 1 and shift == 0 and C.thread_bit_size <= 32 and
                D.thread_bit_size <= 32)

  if N == 1:
    c = -c

  if shift < 5:
    c <<= shift
  else:
    c = 0

  result = a * b + c

  if saturating:
    signed = (As == 1 or Bs == 1 or Cs == 1)
    result = saturate_integer(result, D.thread_bit_size, signed)

  D[thread] = result

convert

1514131211109876543210
1DDt0111110
393837363534333231302928272625242322212019181716
00srctsrcround0000mode
4746454443424140
??Dx00srcx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??Dx00srcx00srctsrcround0000mode1DDt0111110
D = ALUDst(Dx:D, Dt)
src = ALUSrc(srcx:src, srct)

TODO()

Shift/Bitfield Instructions

bfi (Bitfield Insert/Shift Left)

1514131211109876543210
0DDt0101110
393837363534333231302928272625242322212019181716
m1BtB00AtA
636261605958575655545352515049484746454443424140
m3?DxAxBxCx??m2CtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
m3?DxAxBxCx??m2CtCm1BtB00AtA0DDt0101110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  result = (a & ~(mask << shift_amount)) | ((b & mask) << shift_amount)

  D[thread] = result
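A runnable version of the bfi pseudocode makes the "shift left" use visible: with m == 0 and a == 0 it degenerates to a plain left shift. The function names are made up here, and the truncation of the result to 32 bits is an assumption about how a 32-bit destination is written.

```python
MASK32 = 0xFFFFFFFF


def field_mask(m):
    """m == 0 selects all 32 bits; otherwise the low m bits."""
    return MASK32 if m == 0 else (1 << m) - 1


def bfi(a, b, c, m):
    """One thread of bfi: insert the low bits of b into a at bit position (c & 0x7F)."""
    shift = c & 0x7F
    mask = field_mask(m)
    return ((a & ~(mask << shift)) | ((b & mask) << shift)) & MASK32
```

For instance, bfi(0xFFFFFFFF, 0xA, 8, 4) replaces bits 8 to 11 of an all-ones value with 0xA, and bfi(0, x, s, 0) behaves as x << s.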

bfeil (Bitfield Extract and Insert Low/Shift Right)

1514131211109876543210
1DDt0101110
393837363534333231302928272625242322212019181716
m1BtB00AtA
636261605958575655545352515049484746454443424140
m3?DxAxBxCx??m2CtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
m3?DxAxBxCx??m2CtCm1BtB00AtA1DDt0101110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  result = (a & ~mask) | ((b >> shift_amount) & mask)

  D[thread] = result

extr (Extract From Register Pair)

1514131211109876543210
0DDt0101110
393837363534333231302928272625242322212019181716
m1BtB01AtA
636261605958575655545352515049484746454443424140
m3?DxAxBxCx??m2CtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
m3?DxAxBxCx??m2CtCm1BtB01AtA0DDt0101110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  result = (((b << 32) | a) >> shift_amount) & mask

  D[thread] = result
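As a sketch of how extr can be used, the code below implements the pseudocode directly and shows that passing the same value as both A and B gives a rotate right. The helper names are hypothetical and the 32-bit truncation of the result is an assumption.

```python
MASK32 = 0xFFFFFFFF


def field_mask(m):
    """m == 0 selects all 32 bits; otherwise the low m bits."""
    return MASK32 if m == 0 else (1 << m) - 1


def extr(a, b, c, m):
    """One thread of extr: shift the 64-bit pair b:a right and keep the low field."""
    shift = c & 0x7F
    return ((((b << 32) | a) >> shift) & field_mask(m)) & MASK32


def rotate_right(x, amount):
    """Rotate right built from extr by using x for both halves of the pair."""
    return extr(x, x, amount & 31, 0)
```

With a = 0x89ABCDEF, b = 0x01234567 and a shift of 16, the result is the middle 32 bits of the pair, 0x456789AB.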

shlhi (Shift Left High and Insert)

1514131211109876543210
0DDt0101110
393837363534333231302928272625242322212019181716
m1BtB10AtA
636261605958575655545352515049484746454443424140
m3?DxAxBxCx??m2CtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
m3?DxAxBxCx??m2CtCm1BtB10AtA0DDt0101110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  shifted_mask = mask << max(shift_amount-32, 0)
  result = (((b << shift_amount) >> 32) & shifted_mask) | (a & ~shifted_mask)

  D[thread] = result

shrhi (Shift Right High and Insert)

1514131211109876543210
1DDt0101110
393837363534333231302928272625242322212019181716
m1BtB10AtA
636261605958575655545352515049484746454443424140
m3?DxAxBxCx??m2CtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
m3?DxAxBxCx??m2CtCm1BtB10AtA1DDt0101110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  shifted_mask = (mask << 32) >> min(shift_amount, 32)
  result = (((b << 32) >> shift_amount) & shifted_mask) | (a & ~shifted_mask)

  D[thread] = result

asr (Arithmetic Shift Right)

1514131211109876543210
1DDt0101110
393837363534333231302928272625242322212019181716
??BtB01AtA
636261605958575655545352515049484746454443424140
??DxAxBx????????????????
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBx??????????????????BtB01AtA1DDt0101110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each active thread:
  a = A[thread]
  b = B[thread]

  shift_amount = (b & 0x7F)

  result = sign_extend(a, A.thread_bit_size) >> shift_amount

  D[thread] = result
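The following sketch spells out sign_extend and the resulting shift for a single thread. The sign_extend helper is an assumed two's-complement interpretation of the name used in the pseudocode, and masking the result to 32 bits is an assumption about a 32-bit destination.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def asr(a, b, bits=32):
    """One thread of asr: arithmetic shift of a bits-wide source by (b & 0x7F)."""
    shift = b & 0x7F
    return (sign_extend(a, bits) >> shift) & 0xFFFFFFFF
```

Since the shift amount is taken from seven bits, shifts of 32 or more are well defined here: negative inputs become all ones and non-negative inputs become zero.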

asrh (Arithmetic Shift Right High)

1514131211109876543210
1DDt0101110
393837363534333231302928272625242322212019181716
??BtB11AtA
636261605958575655545352515049484746454443424140
??DxAxBx????????????????
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBx??????????????????BtB11AtA1DDt0101110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each active thread:
  a = A[thread]
  b = B[thread]

  shift_amount = (b & 0x7F)

  result = (sign_extend(a, A.thread_bit_size) << 32) >> shift_amount

  D[thread] = result

Bit Manipulation Instructions

bitop (Bitwise Operation)

1514131211109876543210
0DDt1111110
393837363534333231302928272625242322212019181716
tt3tt2BtBtt1tt0AtA
4746454443424140
??DxAxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxtt3tt2BtBtt1tt0AtA0DDt1111110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each active thread:
  a = A[thread]
  b = B[thread]

  if tt0 == tt1 and tt2 == tt3 and tt0 != tt2:
    UNDEFINED()
    result = a
  else:
    result = 0
    if tt0: result |= ~a & ~b
    if tt1: result |=  a & ~b
    if tt2: result |= ~a &  b
    if tt3: result |=  a &  b

  D[thread] = result
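The four tt bits form a truth table, so familiar operations are just particular constants. In the sketch below the bits are packed into one integer as tt3:tt2:tt1:tt0 (a convention chosen for this example only), and the combinations the pseudocode marks as UNDEFINED raise an error rather than guessing at a result.

```python
MASK32 = 0xFFFFFFFF

# Truth tables packed as tt3:tt2:tt1:tt0 (packing is this example's convention).
AND, OR, XOR, NOR, NOT_A = 0b1000, 0b1110, 0b0110, 0b0001, 0b0101


def bitop(a, b, tt):
    """One thread of bitop on 32-bit values."""
    tt0, tt1, tt2, tt3 = ((tt >> i) & 1 for i in range(4))
    if tt0 == tt1 and tt2 == tt3 and tt0 != tt2:
        raise ValueError("truth table marked UNDEFINED in the pseudocode")
    na, nb = ~a & MASK32, ~b & MASK32
    result = 0
    if tt0:
        result |= na & nb
    if tt1:
        result |= a & nb
    if tt2:
        result |= na & b
    if tt3:
        result |= a & b
    return result
```

The two rejected tables, 0b1100 and 0b0011, are the ones whose result would depend only on b.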

bitrev (Reverse Bits)

1514131211109876543210
0DDt0111110
31302928272625242322212019181716
000001AtA
47464544434241403938373635343332
??DxAx??00000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx??00000000000001AtA0DDt0111110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)

for each active thread:
  a = A[thread]

  result = 0

  i = 0
  while i < 32:
    if a & (1 << i):
      result |= (1 << (31-i))
    i += 1

  D[thread] = result

popcount (Population Count)

1514131211109876543210
0DDt0111110
31302928272625242322212019181716
000010AtA
47464544434241403938373635343332
??DxAx??00000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx??00000000000010AtA0DDt0111110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)

for each active thread:
  a = A[thread]

  result = 0

  i = 0
  while i < 32:
    if a & (1 << i):
      result += 1
    i += 1

  D[thread] = result

ffs (Find First Set)

1514131211109876543210
0DDt0111110
31302928272625242322212019181716
000011AtA
47464544434241403938373635343332
??DxAx??00000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx??00000000000011AtA0DDt0111110
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)

for each active thread:
  a = A[thread]

  result = -1

  i = 31
  while i >= 0:
    if a & (1 << i):
      result = i
      break
    i -= 1

  D[thread] = result
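For reference, here are the three bit manipulation operations as plain Python functions on a single 32-bit value. They follow the pseudocode as written, so ffs scans downward from bit 31 and therefore returns the index of the most significant set bit, or -1 for zero. How -1 is represented in the destination register is not modelled.

```python
def bitrev(a):
    """Reverse the order of the low 32 bits of a."""
    result = 0
    for i in range(32):
        if a & (1 << i):
            result |= 1 << (31 - i)
    return result


def popcount(a):
    """Count the set bits among the low 32 bits of a."""
    return bin(a & 0xFFFFFFFF).count('1')


def ffs(a):
    """Index of the highest set bit among bits 31..0, or -1 if none is set."""
    for i in range(31, -1, -1):
        if a & (1 << i):
            return i
    return -1
```

Under these definitions a count of leading zeros could be derived as 31 - ffs(a), and the lowest set bit could be located by applying bitrev first.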

Floating-Point Arithmetic

fmadd (Floating-Point Fused Multiply-Add)

1514131211109876543210
LDDtS111010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
636261605958575655545352515049484746454443424140
??DxAxBxCx??CmCtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxCx??CmCtCBmBtBAmAtALDDtS111010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
C = FloatSrc(Cx:C, Ct, Cm)

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  result = fused_multiply_add(a, b, c)

  D[thread] = result
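To show what "fused" means here, the sketch below computes a * b + c exactly with rationals and rounds only once. It is a model in Python's double precision, not the hardware's 32-bit format, and it only handles finite inputs (it also does not preserve the sign of zero); the function name simply mirrors the one used in the pseudocode.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """a * b + c computed exactly, then rounded once to a Python float (double)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fused_multiply_add(a, b, c)   # exact product 1 - 2**-60 survives until the add
unfused = a * b + c                   # product is rounded to 1.0 first, so this is 0.0
```

The same helper covers the related instructions as they are described in this document: an add is fused_multiply_add(a, 1.0, b) and a multiply is fused_multiply_add(a, b, 0.0).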

fmadd16 (Half Precision Floating-Point Fused Multiply-Add)

1514131211109876543210
LDDtS110110
393837363534333231302928272625242322212019181716
?BmBtB?AmAtA
636261605958575655545352515049484746454443424140
??DxAxBxCx???CmCtC
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxCx???CmCtC?BmBtB?AmAtALDDtS110110
D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)
C = FloatSrc16(Cx:C, Ct, Cm)

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  result = fused_multiply_add(a, b, c)

  D[thread] = result

fadd (Floating-Point Add)

1514131211109876543210
1DDtS101010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
4746454443424140
??DxAxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxBmBtBAmAtA1DDtS101010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, 1.0, b)

  D[thread] = result

fadd16 (Half Precision Floating-Point Add)

1514131211109876543210
1DDtS100110
393837363534333231302928272625242322212019181716
?BmBtB?AmAtA
4746454443424140
??DxAxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBx?BmBtB?AmAtA1DDtS100110
D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, 1.0, b)

  D[thread] = result

fmul (Floating-Point Multiply)

1514131211109876543210
1DDtS011010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
4746454443424140
??DxAxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxBmBtBAmAtA1DDtS011010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, b, 0.0)

  D[thread] = result

fmul16 (Half Precision Floating-Point Multiply)

1514131211109876543210
1DDtS010110
393837363534333231302928272625242322212019181716
?BmBtB?AmAtA
4746454443424140
??DxAxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBx?BmBtB?AmAtA1DDtS010110
D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, b, 0.0)

  D[thread] = result

floor

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
0000AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000000000AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = floor(A[thread])

ceil

1514131211109876543210
1DDtS001010
31302928272625242322212019181716
0000AmAtA
47464544434241403938373635343332
??DxAx0000000001
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000010000AmAtA1DDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = ceil(A[thread])

trunc

1514131211109876543210
1DDtS001010
31302928272625242322212019181716
0000AmAtA
47464544434241403938373635343332
??DxAx0000000010
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000100000AmAtA1DDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = trunc(A[thread])

rint

1514131211109876543210
1DDtS001010
31302928272625242322212019181716
0000AmAtA
47464544434241403938373635343332
??DxAx0000000011
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000110000AmAtA1DDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = rint(A[thread])

rcp

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
1000AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000001000AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = reciprocal(A[thread])

rsqrt

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
1001AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000001001AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = rsqrt(A[thread])

rsqrt_special

rsqrt_special can be used to implement fast sqrt as rsqrt_special(x) * x, by handling special-cases differently.

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
0001AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000000001AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = rsqrt_special(A[thread])

sin_pt_1

sin_pt_1 is used together with sin_pt_2 and supporting ALU to compute the sine function. sin_pt_1 takes an angle around the circle in the interval [0, 4) and produces an intermediate result. This intermediate result is then passed to sin_pt_2, and the two results are multiplied to give sin. The argument reduction to [0, 4) can be computed with a few ALU instructions: reduce(x) = 4 fract(x / tau), where tau is the circle constant formerly known as twice pi. Calculating cosine follows from the identity cos(x) = sin(x + tau/4). After multiplying by 1/tau, the bias becomes 1/4, which can be added in the same cycle via a fused multiply-add. Tangent should be lowered to a division of sine and cosine.
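The argument reduction can be sketched in Python as follows. Only the reduction is modelled; sin_pt_1 and sin_pt_2 themselves are not, so the check at the end uses math.sin on the reduced angle instead. The function names are made up for this example, and fract is assumed to mean x - floor(x).

```python
import math


def fract(x):
    """Fractional part in [0, 1), assumed to be x - floor(x)."""
    return x - math.floor(x)


def reduce_sin(x):
    """Map an angle in radians to quarter-turns in [0, 4)."""
    return 4.0 * fract(x * (1.0 / math.tau))


def reduce_cos(x):
    """Same reduction with the 1/4 turn bias folded in: cos(x) = sin(x + tau/4)."""
    return 4.0 * fract(x * (1.0 / math.tau) + 0.25)


def sin_from_reduced(r):
    """Stand-in for the sin_pt_1 / sin_pt_2 pair: sine of r quarter-turns."""
    return math.sin(r * math.tau / 4.0)
```

In reduce_cos the multiply by 1/tau and the add of 0.25 have exactly the shape of a single fused multiply-add, which is the point made in the text above.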

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
1010AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000001010AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = sin_pt_1(A[thread])

sin_pt_2

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
1110AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000001110AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = sin_pt_2(A[thread])

log2

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
1100AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000001100AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = log2(A[thread])

exp2

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
1101AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000001101AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = exp2(A[thread])

dfdx

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
0100AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000000100AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

TODO()

dfdy

1514131211109876543210
LDDtS001010
31302928272625242322212019181716
0110AmAtA
47464544434241403938373635343332
??DxAx0000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAx00000000000110AmAtALDDtS001010
D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

TODO()

Flow Control Instructions

ret

1514131211109876543210
reg32??0010100
1514131211109876543210
reg32??0010100
reg32 = Reg32(reg32)

TODO()

stop

1514131211109876543210
0000000010001000
1514131211109876543210
0000000010001000
end_execution()

trap

1514131211109876543210
0000000000001000
1514131211109876543210
0000000000001000
TODO()

call

1514131211109876543210
reg32??0000100
1514131211109876543210
reg32??0000100
reg32 = Reg32(reg32)

TODO()

jmp_incomplete

1514131211109876543210
0000000000000000
31302928272625242322212019181716
00000000off
313029282726252423222120191817161514131211109876543210
00000000off0000000000000000
TODO()

jmp_exec_any

1514131211109876543210
1100000000000000
4746454443424140393837363534333231302928272625242322212019181716
off
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
off1100000000000000
if any(exec_mask):
  next_pc = pc + sign_extend(off, 32)

jmp_exec_none

1514131211109876543210
1100000000100000
4746454443424140393837363534333231302928272625242322212019181716
off
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
off1100000000100000
if not any(exec_mask):
  next_pc = pc + sign_extend(off, 32)

call

1514131211109876543210
1100000000010000
4746454443424140393837363534333231302928272625242322212019181716
off
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
off1100000000010000
next_pc = pc + sign_extend(off, 32)

for each active thread:
  r1 = pc + 6

Execution Mask Stack Instructions

pop_exec

1514131211109876543210
000n11?Dt1010010
31302928272625242322212019181716
0000000000000000
47464544434241403938373635343332
0000000000000000
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
00000000000000000000000000000000000n11?Dt1010010
D = ImplicitR0L(Dt)

for each thread:
  v = D[thread]
  v -= n
  if v < 0:
    v = 0
  D[thread] = v
  exec_mask[thread] = (v == 0)

if_icmp

1514131211109876543210
ccn00ccnDt1010010
393837363534333231302928272625242322212019181716
00BtB00AtA
4746454443424140
??00AxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??00AxBx00BtB00AtAccn00ccnDt1010010
D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each thread:
  v = D[thread]
  if v != 0:
    v += n
  elif not cc.compare(A[thread], B[thread]):
    v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)

if_fcmp

1514131211109876543210
ccn00ccnDt1000010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
4746454443424140
??00AxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??00AxBxBmBtBAmAtAccn00ccnDt1000010
D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each thread:
  v = D[thread]
  if v != 0:
    v += n
  elif not cc.compare(A[thread], B[thread]):
    v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)

while_icmp

1514131211109876543210
ccn10ccnDt1010010
393837363534333231302928272625242322212019181716
00BtB00AtA
4746454443424140
??00AxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??00AxBx00BtB00AtAccn10ccnDt1010010
D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each thread:
  v = D[thread]
  if v < n:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = n
  D[thread] = v
  exec_mask[thread] = (v == 0)
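To see how the while rule lets threads drop out of a loop one at a time, here is a small simulation in which each thread counts up to its own limit. The loop shape (re-evaluate the condition, run the body on active threads, repeat while any thread is active, then pop) is an assumption about how these instructions might be combined, and four threads stand in for 32.

```python
def while_cmp(r0l, cond, n):
    """while_*cmp: threads below depth n are re-tested; failing threads park at depth n."""
    out = []
    for v, c in zip(r0l, cond):
        if v < n:
            v = 0 if c else n
        out.append(v)
    return out


def pop_exec(r0l, n):
    """pop_exec: subtract n from every counter, clamping at zero."""
    return [max(v - n, 0) for v in r0l]


limits = [0, 3, 1, 2]        # per-thread loop bounds
counters = [0, 0, 0, 0]      # per-thread loop variable
r0l = [0, 0, 0, 0]
n = 1
iterations = 0

while True:
    r0l = while_cmp(r0l, [i < lim for i, lim in zip(counters, limits)], n)
    active = [v == 0 for v in r0l]
    if not any(active):      # where a jmp_exec_any back to the loop start would fall through
        break
    counters = [i + 1 if act else i for i, act in zip(counters, active)]
    iterations += 1

r0l = pop_exec(r0l, n)       # re-enable every thread after the loop
```

The SIMD-group as a whole runs for as many iterations as its slowest thread needs, while threads that finish early simply sit inactive with a non-zero r0l.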

while_fcmp

1514131211109876543210
ccn10ccnDt1000010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
4746454443424140
??00AxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??00AxBxBmBtBAmAtAccn10ccnDt1000010
D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each thread:
  v = D[thread]
  if v < n:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = n
  D[thread] = v
  exec_mask[thread] = (v == 0)

else_icmp

1514131211109876543210
ccn01ccnDt1010010
393837363534333231302928272625242322212019181716
00BtB00AtA
4746454443424140
??00AxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??00AxBx00BtB00AtAccn01ccnDt1010010
D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each thread:
  v = D[thread]
  if v == 0:
    v = n
  elif v == 1:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)

else_fcmp

1514131211109876543210
ccn01ccnDt1000010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
4746454443424140
??00AxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??00AxBxBmBtBAmAtAccn01ccnDt1000010
D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each thread:
  v = D[thread]
  if v == 0:
    v = n
  elif v == 1:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)

Select Instructions

icmpsel (Integer Compare and Select)

1514131211109876543210
LDDt0010010
393837363534333231302928272625242322212019181716
??BtB??AtA
636261605958575655545352515049484746454443424140
ccYtY???XtX
79787776757473727170696867666564
??DxAxBxXxYx????
797877767574737271706968676665646362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxXxYx????ccYtY???XtX??BtB??AtALDDt0010010
cc = ICondition(cc)
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
X = CmpselSrc(Xx:X, Xt, Dt)
Y = CmpselSrc(Yx:Y, Yt, Dt)

for each active thread:
  if cc.compare(A[thread], B[thread]):
    D[thread] = X[thread]
  else:
    D[thread] = Y[thread]
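A compare-and-select with X = A and Y = B gives a per-thread minimum or maximum, for example. The sketch below models that across a few threads, including the effect of the execution mask (inactive threads keep their old destination value). The function signature and the use of Python's operator module for the condition are illustrative assumptions.

```python
import operator


def icmpsel(compare, A, B, X, Y, D, exec_mask):
    """Per-thread select; inactive threads keep their previous value of D."""
    return [
        (x if compare(a, b) else y) if active else d
        for a, b, x, y, d, active in zip(A, B, X, Y, D, exec_mask)
    ]


A = [1, 7, 3, 9]
B = [4, 2, 3, 5]
old_D = [100, 101, 102, 103]
mask = [True, True, True, False]     # thread 3 is inactive

minimum = icmpsel(operator.lt, A, B, A, B, old_D, mask)
```

Here thread 3 retains 103 because it is masked off, while the other threads receive the smaller of their two inputs.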

fcmpsel (Floating-Point Compare and Select)

1514131211109876543210
LDDt0000010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
636261605958575655545352515049484746454443424140
ccYtY???XtX
79787776757473727170696867666564
??DxAxBxXxYx????
797877767574737271706968676665646362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??DxAxBxXxYx????ccYtY???XtXBmBtBAmAtALDDt0000010
cc = FCondition(cc)
D = ALUDst(Dx:D, Dt)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
X = CmpselSrc(Xx:X, Xt, Dt)
Y = CmpselSrc(Yx:Y, Yt, Dt)

for each active thread:
  if cc.compare(A[thread], B[thread]):
    D[thread] = X[thread]
  else:
    D[thread] = Y[thread]

SIMD Group and Quad Group Instructions

icmp_ballot

1514131211109876543210
?DDt0110010
393837363534333231302928272625242322212019181716
00BtB00AtA
636261605958575655545352515049484746454443424140
cc0000000000001ccn?DxAxBx
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
cc0000000000001ccn?DxAxBx00BtB00AtA?DDt0110010
D = ALUDst(Dx:D, Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

result = 0

for each active thread:
  a = A[thread]
  b = B[thread]

  if cc.compare(a, b):
    result |= 1 << thread

D.broadcast_to_active(result)
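The ballot can be modelled as below: each active thread whose comparison succeeds contributes one bit, and every active thread then receives the same combined value. The function names and the four-thread example are assumptions made for illustration.

```python
import operator


def ballot(compare, A, B, exec_mask):
    """Bit `thread` of the result is set if that thread is active and its compare passes."""
    result = 0
    for thread, active in enumerate(exec_mask):
        if active and compare(A[thread], B[thread]):
            result |= 1 << thread
    return result


def broadcast_to_active(D, value, exec_mask):
    """Write value to D in active threads; inactive threads keep their old value."""
    return [value if active else d for d, active in zip(D, exec_mask)]


A = [0, 1, 2, 3]
B = [2, 2, 2, 2]
all_active = [True] * 4
some_active = [False, True, True, True]

full = ballot(operator.lt, A, B, all_active)
partial = ballot(operator.lt, A, B, some_active)
D = broadcast_to_active([9, 9, 9, 9], partial, some_active)
```

Inactive threads never set a bit, so a ballot with an always-true comparison would yield the current execution mask under this model.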

icmp_quad_ballot

1514131211109876543210
?DDt0110010
393837363534333231302928272625242322212019181716
00BtB00AtA
636261605958575655545352515049484746454443424140
cc0000000000000ccn?DxAxBx
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
cc0000000000000ccn?DxAxBx00BtB00AtA?DDt0110010
D = ALUDst(Dx:D, Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

TODO()

fcmp_ballot

1514131211109876543210
?DDt0100010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
636261605958575655545352515049484746454443424140
cc0000000000001ccn?DxAxBx
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
cc0000000000001ccn?DxAxBxBmBtBAmAtA?DDt0100010
D = ALUDst(Dx:D, Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

result = 0

for each active thread:
  a = A[thread]
  b = B[thread]

  if cc.compare(a, b):
    result |= 1 << thread

D.broadcast_to_active(result)

fcmp_quad_ballot

1514131211109876543210
?DDt0100010
393837363534333231302928272625242322212019181716
BmBtBAmAtA
636261605958575655545352515049484746454443424140
cc0000000000000ccn?DxAxBx
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
cc0000000000000ccn?DxAxBxBmBtBAmAtA?DDt0100010
D = ALUDst(Dx:D, Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

TODO()

simd_shuffle

1514131211109876543210
0DDt1101111
393837363534333231302928272625242322212019181716
00BtB01AtA
4746454443424140
0?DxAxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
0?DxAxBx00BtB01AtA0DDt1101111
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc16(Bx:B, Bt)

quad_values = []

for each quad:
  quad_index = 0

  for each thread in quad:
    # NOTE: this is not execution masked, meaning any inactive thread can make
    # simd_broadcast from the whole quad undefined (although it works fine if
    # B is an immediate)

    quad_index |= B[thread] & 3

  quad_values.append(A[quad.start + quad_index])

for each active thread:
  b = B[thread]

  if b < 32:
    result = quad_values[b >> 2]

    D[thread] = result
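
The following Python sketch models the behaviour described by the pseudocode, using each thread's B value as the shuffle index. The function name `simd_shuffle` and the list-based register model are assumptions for illustration. It makes the caveat in the note easy to reproduce: a stale B value in an inactive thread changes which element the rest of the SIMD-group reads from that quad.

```python
def simd_shuffle(a, b, d, exec_mask):
    """Model simd_shuffle for a 32-thread SIMD-group."""
    quad_values = []
    for start in range(0, 32, 4):
        quad_index = 0
        # Not execution masked: inactive threads contribute their B bits too.
        for thread in range(start, start + 4):
            quad_index |= b[thread] & 3
        quad_values.append(a[start + quad_index])

    result = list(d)
    for thread in range(32):
        if (exec_mask >> thread) & 1:
            index = b[thread]
            if index < 32:
                result[thread] = quad_values[index >> 2]
    return result
```

With a uniform index of 5 in every thread, each active thread reads element 5 of A. If thread 6 is inactive and holds 3 in B, the second quad's index becomes 1 | 3 = 3, so the active threads read element 7 instead.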

simd_shuffle_down

1514131211109876543210
0DDt1101111
393837363534333231302928272625242322212019181716
11BtB01AtA
4746454443424140
0?DxAxBx
47464544434241403938373635343332313029282726252423222120191817161514131211109876543210
0?DxAxBx11BtB01AtA0DDt1101111
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc16(Bx:B, Bt)

TODO()

Memory and Stack Instructions

wait

1514131211109876543210
???????i00111000
wait_for_loads()

ld/st_tile

1514131211109876543210
?DDtload001001
31302928272625242322212019181716
????F????????
47464544434241403938373635343332
????????masku0rt
63626160595857565554535251504948
??Dx????????????
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??Dx????????????????????masku0rt????F?????????DDtload001001
D = ALUDst(Dx:D, Dt)

TODO()

ld_var

The last four bytes are omitted if L=0.

1514131211109876543210
LDDtperspective100001
31302928272625242322212019181716
mask????????index
47464544434241403938373635343332
????????????????
63626160595857565554535251504948
??Dx????????????
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??Dx????????????????????????????mask????????indexLDDtperspective100001
D = ALUDst(Dx:D, Dt)

TODO()

uniform_store

uniform_store is used to initialise uniform registers. R is stored to offset O, which is typically an index in 16-bit units into the uniform registers. This is encoded like (and possibly is) a store to device memory, and can move one 16-bit register to initialise a 16-bit uniform, or two consecutive 16-bit registers to initialise a 32-bit uniform.

1514131211109876543210
R0F1000101
31302928272625242322212019181716
??111unkOtOl0000
47464544434241403938373635343332
LbsRx0000Oh
63626160595857565554535251504948
Oxmask00Rt?
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
Oxmask00Rt?LbsRx0000Oh??111unkOtOl0000R0F1000101
R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

device_load

device_load initiates a load from device memory, the result of which may be used after a wait. The data can be unpacked from a variety of formats, or passed through as-is. On each thread, up to four aligned values, each up to 32 bits wide, can be read from a base address plus an offset (the offset is shifted left by the alignment, with an optional additional left shift of up to two).

The values to read are selected by a four-bit mask: 0b0001 loads one value, and 0b1111 loads four values. Non-contiguous masks skip values in memory, but still write the results to contiguous registers.
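
To make the mask behaviour concrete, here is a small Python sketch that lists which memory slot ends up in which register. The helper name `masked_load_layout` is hypothetical, and the sketch assumes that bit 0 of the mask corresponds to the value at the lowest address, which the text above does not state explicitly.

```python
def masked_load_layout(mask, first_reg):
    """Return (memory_slot, register) pairs for a 4-bit load mask.

    Assumption: mask bit i selects the i-th value in memory. Selected values
    are packed into consecutive registers starting at first_reg.
    """
    pairs = []
    reg = first_reg
    for slot in range(4):
        if (mask >> slot) & 1:
            pairs.append((slot, reg))
            reg += 1
    return pairs
```

For example, under these assumptions a mask of 0b0101 starting at register 0 reads memory slots 0 and 2 into registers 0 and 1.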

Non-packed formats (8, 16, and 32-bit values) are zero extended. All packed values are unpacked to 16-bit or 32-bit floating-point values, depending on the size of the register. Bit-packed formats (rgb10a2, rg11b10f and rgb9e5) are supported, but ignore the optional shift and the mask. They always read an aligned 32-bit value, and write to the same number of registers. Simple packed values (unorm8, snorm8, unorm16, snorm16 and srgba8), however, do not have this limitation.

Unaligned addresses are rounded down to the required alignment. The base address (A) is a 64-bit value from either uniform or general-purpose registers. The offset (O) may be a signed 16-bit immediate, or a signed or unsigned 32-bit general-purpose register.
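
The address calculation described above might be sketched as follows. The function name `device_address` and its parameters are hypothetical; `align_log2` stands for the base-2 logarithm of the element alignment in bytes, and wrapping to 64 bits is an assumption rather than something that has been confirmed.

```python
def device_address(base, offset, align_log2, shift=0):
    """Sketch of the effective address of the first loaded value.

    offset is in elements and may be negative; shift is the optional extra
    left shift (0 to 2). The result is rounded down to the alignment.
    """
    assert 0 <= shift <= 2
    scaled = offset << (align_log2 + shift)  # also valid for negative ints
    address = (base + scaled) & 0xFFFFFFFFFFFFFFFF  # assumed 64-bit wrap
    return address & ~((1 << align_log2) - 1)
```

For 32-bit elements (`align_log2 = 2`), a base of 0x1000 with offset 3 gives 0x100C, and a misaligned base of 0x1001 with offset 1 and shift 1 gives 0x1008.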

1514131211109876543210
RF0000101
31302928272625242322212019181716
?u2??At?OuOtOlAl
47464544434241403938373635343332
L???sRxAhOh
63626160595857565554535251504948
Oxmask??RtFx
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
Oxmask??RtFxL???sRxAhOh?u2??At?OuOtOlAlRF0000101
R = MemoryReg(Rx:R, Rt)
A = MemoryBase(Ah:Al, At)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

device_store

1514131211109876543210
RF1000101
31302928272625242322212019181716
?u2??At?OuOtOlAl
47464544434241403938373635343332
L???sRxAhOh
63626160595857565554535251504948
Oxmask??RtFx
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
Oxmask??RtFxL???sRxAhOh?u2??At?OuOtOlAlRF1000101
R = MemoryReg(Rx:R, Rt)
A = MemoryBase(Ah:Al, At)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

stack_store

1514131211109876543210
RF10110101
31302928272625242322212019181716
?i6???i1?OtOl0000
47464544434241403938373635343332
Li5??Rx?i2Oh
63626160595857565554535251504948
OxmaskFxRt?
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
OxmaskFxRt?Li5??Rx?i2Oh?i6???i1?OtOl0000RF10110101
R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

stack_load

1514131211109876543210
RF00110101
31302928272625242322212019181716
?i6???i1?OtOl0000
47464544434241403938373635343332
Li5??Rx?i2Oh
63626160595857565554535251504948
OxmaskFxRt?
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
OxmaskFxRt?Li5??Rx?i2Oh?i6???i1?OtOl0000RF00110101
R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

stack_get_ptr

1514131211109876543210
Ri000110101
31302928272625242322212019181716
?????i1??????0001
47464544434241403938373635343332
1i3??Rx?i2????
63626160595857565554535251504948
????????i410
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
????????i4101i3??Rx?i2?????????i1??????0001Ri000110101
R = StackReg32(Rx:R)

TODO()

stack_adjust

1514131211109876543210
??????i010110101
31302928272625242322212019181716
?????i101v10001
47464544434241403938373635343332
Li3?????i2v2
63626160595857565554535251504948
v3i4??
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
v3i4??Li3?????i2v2?????i101v10001??????i010110101
v = v3:v2:v1

TODO()

threadgroup_load

1514131211109876543210
LRRt?11?1001
393837363534333231302928272625242322212019181716
mask?OtOFAtA
636261605958575655545352515049484746454443424140
??RxAxOx????????
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??RxAxOx????????mask?OtOFAtALRRt?11?1001
R = ThreadgroupMemoryReg(Rx:R, Rt)
A = ThreadgroupMemoryBase(Ax:A, At)
O = ThreadgroupIndex(Ox:O, Ot)

TODO()

threadgroup_store

1514131211109876543210
LRRt?01?1001
393837363534333231302928272625242322212019181716
mask?OtOFAtA
636261605958575655545352515049484746454443424140
??RxAxOx????????
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
??RxAxOx????????mask?OtOFAtALRRt?01?1001
R = ThreadgroupMemoryReg(Rx:R, Rt)
A = ThreadgroupMemoryBase(Ax:A, At)
O = ThreadgroupIndex(Ox:O, Ot)

TODO()

texture_sample

The last four bytes are omitted if L=0.

1514131211109876543210
LRRt00110001
31302928272625242322212019181716
q2Dq1CtC
47464544434241403938373635343332
q3nTtT
63626160595857565554535251504948
q5StSlodmask
9594939291908988878685848382818079787776757473727170696867666564
OxSxOtq6OTxDxCxRxq4U
95949392919089888786858483828180797877767574737271706968676665646362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
OxSxOtq6OTxDxCxRxq4Uq5StSlodmaskq3nTtTq2Dq1CtCLRRt00110001
R = SampleReg(Rx:R, Rt)
U = SampleUReg(U)
T = Texture(Tx:T, Tt)
S = Sampler(Sx:S, St)
C = Coords(Cx:C, Ct)
D = Lod(Dx:D)
O = SampleOff(Ox:O, Ot)

TODO()

texture_load

The last four bytes are omitted if L=0.

1514131211109876543210
LRRt01110001
31302928272625242322212019181716
q2Dq1CtC
47464544434241403938373635343332
q3nTtT
63626160595857565554535251504948
q5StSlodmask
9594939291908988878685848382818079787776757473727170696867666564
OxSxOtq6OTxDxCxRxq4U
95949392919089888786858483828180797877767574737271706968676665646362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
OxSxOtq6OTxDxCxRxq4Uq5StSlodmaskq3nTtTq2Dq1CtCLRRt01110001
R = SampleReg(Rx:R, Rt)
U = SampleUReg(U)
T = Texture(Tx:T, Tt)
S = Sampler(Sx:S, St)
C = Coords(Cx:C, Ct)
D = Lod(Dx:D)
O = SampleOff(Ox:O, Ot)

TODO()

threadgroup_barrier

1514131211109876543210
????????01101000
TODO()

Operands

ALUDst

ALUDst(value, flags, max_size=32):
  cache_flag = flags & 1
  if flags & 2 and value & 1 and max_size >= 64:
    return Reg64Reference(value >> 1, cache=cache_flag)
  elif flags & 2 and max_size >= 32:
    return Reg32Reference(value >> 1, cache=cache_flag)
  else:
    return Reg16Reference(value, cache=cache_flag)
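
For experimenting with encodings, the same decode can be written as runnable Python. This is a sketch only: the name `alu_dst` and the tuple representation `(kind, index, cache)` are illustrative stand-ins for the reference objects used in the pseudocode.

```python
def alu_dst(value, flags, max_size=32):
    """Decode a destination operand into (kind, index, cache)."""
    cache = bool(flags & 1)
    if flags & 2 and value & 1 and max_size >= 64:
        return ("r64", value >> 1, cache)
    if flags & 2 and max_size >= 32:
        return ("r32", value >> 1, cache)
    # 16-bit registers are indexed in half-register units.
    return ("r16", value, cache)
```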

ALUDst64

ALUDst64(value, flags):
  return ALUDst(value, flags, max_size=64)

FloatDst

FloatDst(value, flags, saturating, max_size=32):
  destination = ALUDst(value, flags, max_size=max_size)
  if destination.thread_bit_size == 32:
    wrapper = RoundToFloat32Wrapper(destination, flush_to_zero=True)
  else:
    wrapper = RoundToFloat16Wrapper(destination, flush_to_zero=False)

  if saturating:
    wrapper = SaturateRealWrapper(wrapper)

  return wrapper

FloatDst16

FloatDst16(value, flags, saturating):
  return FloatDst(value, flags, saturating, max_size=16)

ALUSrc

ALUSrc(value, flags, max_size=32):
  if flags == 0b0000:
    return BroadcastImmediateReference(value)

  if flags >> 2 == 0b01:
    ureg = value | (flags & 1) << 8
    if flags & 0b10:
      if max_size < 32:
        UNDEFINED()
      return BroadcastUReg32Reference(ureg >> 1)
    else:
      return BroadcastUReg16Reference(ureg)

  if (flags & 0b11) == 0b00: UNDEFINED()

  cache_flag   = (flags & 0b11) == 0b10
  discard_flag = (flags & 0b11) == 0b11

  if flags >> 2 == 0b11 and max_size >= 64:
    if value & 1: UNDEFINED()
    return Reg64Reference(value >> 1, cache=cache_flag, discard=discard_flag)

  if flags >> 2 >= 0b10 and max_size >= 32:
    if flags >> 2 != 0b10: UNDEFINED()
    if value & 1: UNDEFINED()
    return Reg32Reference(value >> 1, cache=cache_flag, discard=discard_flag)

  if max_size >= 16:
    if flags >> 2 != 0b00: UNDEFINED()
    return Reg16Reference(value, cache=cache_flag, discard=discard_flag)
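
A runnable Python version of the same decision tree can help when checking individual encodings by hand. Everything named here (`alu_src`, the `Undefined` exception and the tuple tags) is a hypothetical stand-in for the pseudocode's reference objects and `UNDEFINED()`.

```python
class Undefined(Exception):
    """Raised where the pseudocode calls UNDEFINED()."""


def alu_src(value, flags, max_size=32):
    """Decode a source operand into a descriptive tuple."""
    if flags == 0b0000:
        return ("imm", value)

    size_bits = flags >> 2
    low = flags & 0b11

    if size_bits == 0b01:
        # Uniform register: flag bit 0 supplies bit 8 of the index.
        ureg = value | ((flags & 1) << 8)
        if flags & 0b10:
            if max_size < 32:
                raise Undefined()
            return ("u32", ureg >> 1)
        return ("u16", ureg)

    if low == 0b00:
        raise Undefined()
    cache, discard = low == 0b10, low == 0b11

    if size_bits == 0b11 and max_size >= 64:
        if value & 1:
            raise Undefined()
        return ("r64", value >> 1, cache, discard)
    if size_bits >= 0b10 and max_size >= 32:
        if size_bits != 0b10 or value & 1:
            raise Undefined()
        return ("r32", value >> 1, cache, discard)
    if size_bits != 0b00:
        raise Undefined()
    return ("r16", value, cache, discard)
```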

MulSrc

MulSrc(value, flags, sx):
  source = ALUSrc(value, flags, max_size=32)
  if sx:
    # Note: 8-bit immediates have already been zero-extended to 16-bit,
    # so do not get sign extended.
    return SignExtendWrapper(source, source.thread_bit_size)
  else:
    return source

AddSrc

AddSrc(value, flags, sx):
  source = ALUSrc(value, flags, max_size=64)
  if sx:
    # Note: 8-bit immediates have already been zero-extended to 16-bit,
    # so do not get sign extended.
    return SignExtendWrapper(source, source.thread_bit_size)
  else:
    return source

CmpselSrc

CmpselSrc(value, flags, destination_flags):
  if flags == 0b100:
    return BroadcastImmediateReference(value)

  if flags >> 1 == 0b11:
    ureg = value | (flags & 1) << 8
    if destination_flags & 2:
      if ureg & 1: UNDEFINED()
      return BroadcastUReg32Reference(ureg >> 1)
    else:
      return BroadcastUReg16Reference(ureg)

  if flags >> 2 == 1: UNDEFINED()
  if (flags & 0b11) == 0b00: UNDEFINED()

  cache_flag   = (flags & 0b11) == 0b10
  discard_flag = (flags & 0b11) == 0b11

  if destination_flags & 2:
    if value & 1: UNDEFINED()
    return Reg32Reference(value >> 1, cache=cache_flag, discard=discard_flag)
  else:
    return Reg16Reference(value, cache=cache_flag, discard=discard_flag)

FloatSrc

FloatSrc(value, flags, modifier, max_size=32):
  source = ALUSrc(value, flags, max_size)

  if source.is_immediate:
    float = BroadcastRealReference(decode_float_immediate(source))

  elif source.thread_bit_size == 16:
    float = Float16ToRealWrapper(source, flush_to_zero=False)
  elif source.thread_bit_size == 32:
    float = Float32ToRealWrapper(source, flush_to_zero=True)

  if modifier & 0b01: float = FloatAbsoluteValueWrapper(float)
  if modifier & 0b10: float = FloatNegateWrapper(float)

  return float

FloatSrc16

FloatSrc16(value, flags, modifier):
  return FloatSrc(value, flags, modifier, max_size=16)

Reg32

Reg32(value):
  return Reg32Reference(value)

ICondition

ICondition(value, n=0):
  sign_extend   = (value & 0b100) != 0
  condition     =  value & 0b011
  invert_result = (n != 0)

  if condition == 0b00:
    return IntEqualityComparison(sign_extend, invert_result)
  if condition == 0b01:
    return IntLessThanComparison(sign_extend, invert_result)
  if condition == 0b10:
    return IntGreaterThanComparison(sign_extend, invert_result)
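
One possible runnable reading of this operand is sketched below. The names `icondition` and `to_signed` are hypothetical, and treating the `sign_extend` bit as "compare the operands as signed integers" is an assumption. Condition 0b11 is not described above, so the sketch rejects it.

```python
import operator

_OPS = {0b00: operator.eq, 0b01: operator.lt, 0b10: operator.gt}


def to_signed(value, bits):
    """Reinterpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def icondition(value, n=0, bits=32):
    """Return a compare(a, b) function for an integer condition code."""
    signed = (value & 0b100) != 0
    op = _OPS.get(value & 0b011)
    if op is None:
        raise ValueError("condition 0b11 is not described")
    mask = (1 << bits) - 1

    def compare(a, b):
        if signed:
            a, b = to_signed(a, bits), to_signed(b, bits)
        else:
            a, b = a & mask, b & mask
        # n inverts the result of the comparison.
        return op(a, b) != (n != 0)

    return compare
```

Under these assumptions, 0xFFFFFFFF is less than 1 for the signed less-than condition (0b101) but not for the unsigned one (0b001).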

FCondition

FCondition(condition, n=0):
  invert_result = (n != 0)

  if condition == 0b000:
    return FloatEqualityComparison(invert_result)
  if condition == 0b001:
    return FloatLessThanComparison(invert_result)
  if condition == 0b010:
    return FloatGreaterThanComparison(invert_result)
  if condition == 0b011:
    return FloatLessThanNanLosesComparison(invert_result)
  if condition == 0b101:
    return FloatLessThanOrEqualComparison(invert_result)
  if condition == 0b110:
    return FloatGreaterThanOrEqualComparison(invert_result)
  if condition == 0b111:
    return FloatGreaterThanNanLosesComparison(invert_result)

MemoryIndex

MemoryIndex(value, flags):
  if flags != 0:
    return BroadcastImmediateReference(sign_extend(value, 16))
  else:
    if value & 1: UNDEFINED()
    if value >= 0x100: UNDEFINED()
    return Reg32Reference(value >> 1)
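
The `sign_extend` helper used for the immediate case is not defined in this document. A conventional two's complement definition, given here as an assumption about what is intended, would be:

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit
```

With this definition a 16-bit immediate of 0xFFFF yields an offset of -1, and 0x7FFF yields 32767.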

MemoryBase

MemoryBase(value, flags):
  if value & 1: UNDEFINED()
  if flags != 0:
    return UReg64Reference(value >> 1)
  else:
    return Reg64Reference(value >> 1)

Helper Pseudocode

decode_float_immediate(value):
  sign = (value & 0x80) >> 7
  exponent = (value & 0x70) >> 4
  fraction = value & 0xF

  if exponent == 0:
    result = fraction / 64.0
  else:
    fraction = 16.0 + fraction
    exponent -= 7
    result = fraction * (2.0 ** exponent)

  if sign != 0:
    result = -result

  return result
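
The same decoding can be run directly in Python to tabulate the available immediates. This sketch follows the pseudocode above step for step; only the use of `math.ldexp` for the power-of-two scaling is a stylistic choice.

```python
import math


def decode_float_immediate(value):
    """Decode an 8-bit float immediate: 1 sign, 3 exponent, 4 fraction bits."""
    sign = (value >> 7) & 1
    exponent = (value >> 4) & 0x7
    fraction = value & 0xF

    if exponent == 0:
        # Denormal range: multiples of 1/64.
        result = fraction / 64.0
    else:
        # Implicit leading one (16 + fraction), scaled by 2 ** (exponent - 7).
        result = math.ldexp(16 + fraction, exponent - 7)

    return -result if sign else result
```

With this decoding, 0x30 is 1.0, 0xB0 is -1.0, the smallest non-zero magnitude is 1/64 (0x01) and the largest is 31.0 (0x7F).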