This document attempts to describe the Apple G13 GPU architecture, as used in the M1 SoC. This is based on reverse engineering and is likely to have mistakes. The Metal Shading Language is typically used to program these GPUs, and this document uses Metal terminology. For example, a CPU SIMD-lane is a Metal thread, and a CPU thread is a Metal SIMD-group.

The G13 architecture has 32 threads per SIMD-group. Each SIMD-group has a stack pointer (sp), a program counter (pc), a 32-bit execution mask (exec_mask), and up to 128 general purpose registers.

General purpose registers each store one 32-bit value per thread. Each register can be accessed as a full 32-bit register (r0 to r127), as its low 16 bits (r0l to r127l), or as its high 16 bits (r0h to r127h). Some instructions may also use pairs of contiguous 32-bit registers to operate on 64-bit values, and memory operations may use up to four contiguous 16 or 32-bit registers (in both cases, encoded as the first register).
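
As a rough illustration of the naming scheme, the sketch below models one thread's view of a single 32-bit register and its two 16-bit halves. The helper names (read_low, read_high, write_low) are hypothetical, and the assumption that writing one half leaves the other half untouched is inferred from the aliasing described above rather than tested.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def read_low(r):
    """Value seen through rNl: the low 16 bits of the 32-bit register."""
    return r & MASK16


def read_high(r):
    """Value seen through rNh: the high 16 bits of the 32-bit register."""
    return (r >> 16) & MASK16


def write_low(r, value):
    """Write rNl, keeping the high half (assumed to be left untouched)."""
    return (r & ~MASK16 & MASK32) | (value & MASK16)


r0 = 0x12345678
print(hex(read_low(r0)))            # 0x5678
print(hex(read_high(r0)))           # 0x1234
print(hex(write_low(r0, 0xABCD)))   # 0x1234abcd
```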

A number of physical registers are allocated to each SIMD-group, and registers from r0 to r(N-1) may be used, but accesses to higher register numbers are not valid, and may read or corrupt data from other SIMD-groups. Using fewer registers (e.g. by using 16-bit types instead of 32-bit types) allows more SIMD-groups to fit in the physical register file (higher occupancy), which improves performance.

Certain instructions are hardcoded to use early registers. r0l tracks the execution mask stack, and r1 is used as the link register.

Shared state is less well understood, but includes 256 32-bit uniform registers named u0 to u255 and similarly accessible as their 16-bit halves, u0l to u255h. These are used for values that are the same across all threads, such as threads_per_grid, or addresses of buffers.

Conditional Execution

Each thread within a SIMD-group may be deactivated, meaning the values in registers for that thread will keep their current value. Whether or not each thread is active is tracked in a 32-bit execution mask (in this document, by convention, a one bit indicates the thread is active and a zero bit indicates it is not).

r0l tracks the execution mask stack. When used with flow-control instructions, the value in r0l indicates how many 'pop' operations will be needed to re-enable an inactive thread, or zero if the thread is active.

The execution mask stack instructions (pop_exec, if_*cmp, else_*cmp and while_*cmp) are typically used to manage r0l. They also update exec_mask based on the value in r0l, and are the only known way to manipulate exec_mask. However, r0l may be manipulated with other instructions. It should be initialised to zero prior to using execution mask stack instructions, and break statements may be implemented by conditionally moving a value (corresponding to the number of stack-levels to break out of) to r0l and using a pop_exec 0 (which deactivates any threads that now have non-zero values in r0l).
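
The relationship between r0l and exec_mask, and the break idiom, can be sketched in Python. This is a simplified model under stated assumptions: it only covers the pop_exec 0 case described above (recomputing the mask so that any thread with a non-zero r0l is inactive), and the function names are hypothetical.

```python
THREADS = 32


def recompute_exec_mask(r0l):
    """Bit t is set when thread t is active, i.e. when r0l[t] == 0."""
    return sum(1 << t for t in range(THREADS) if r0l[t] == 0)


def conditional_break(r0l, exec_mask, wants_break, levels):
    """Model of a break: active threads that want to leave store `levels`
    into r0l, then a 'pop_exec 0' style step recomputes the mask."""
    new_r0l = list(r0l)
    for t in range(THREADS):
        active = (exec_mask >> t) & 1
        if active and wants_break[t]:
            new_r0l[t] = levels
    return new_r0l, recompute_exec_mask(new_r0l)


r0l = [0] * THREADS                      # initialised to zero: all threads active
exec_mask = recompute_exec_mask(r0l)
wants_break = [t % 2 == 1 for t in range(THREADS)]  # odd threads break
r0l, exec_mask = conditional_break(r0l, exec_mask, wants_break, levels=2)
print(hex(exec_mask))                    # 0x55555555
```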

The jmp_exec_none instruction may be used to jump over an if statement body if no threads are active, and jmp_exec_any may be used to jump back to the start of a loop only if one or more threads are still executing it.

Execution masking generally prevents reading or writing values from inactive threads. However, SIMD shuffle instructions can read from inactive threads in some cases, which can be valuable for research or debugging purposes (e.g. it allows observing non-zero values in r0l).

Register Cache

The GPU has a register cache, which keeps the contents of recently used general purpose registers more quickly accessible. When instructions read or write GPRs, they usually allow hints for the access to be encoded. The cache hint, on a source or destination operand, indicates the value will be used again, and should be cached (meaning other values, where this hint was not used, will be preferred for eviction). The discard hint (on a source operand) invalidates the value in the register cache after all operands have been read, without writing it back to the register file.

While the cache hint should only change performance, the discard hint will make future reads undefined, which could lead to confusing issues. discard should probably not be used within conditional execution, as inactive threads within the SIMD-group may contain data that has not been written back to the register file and that would probably also get discarded. The behaviour of this hint when execution is partially or completely masked has not been tested.

Either hint may be used multiple times even if the same operand appears twice, and discard on a source register can be used with cache on the same destination register.

Instructions

Instructions vary in length in multiples of two bytes up to twelve bytes (so far). Some instructions have a long and short encoding. This is indicated by a bit L, which, if zero, indicates the last two (or four) bytes of the instruction are omitted, and any bits within those bytes should be read as zero. So far only 12-byte instructions omit the last four bytes, and all others omit the last two.
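
A decoder can treat a short encoding as the long one with the missing trailing bytes zero-filled. The following is a minimal sketch of that idea; expand_short_encoding is a hypothetical helper, and the caller is assumed to already know the full length of the instruction.

```python
def expand_short_encoding(encoded: bytes, full_length: int) -> int:
    """Zero-fill omitted trailing bytes and return the instruction
    as an integer, reading the bytes as little-endian."""
    if len(encoded) > full_length:
        raise ValueError("encoding is longer than the full instruction")
    padded = encoded + bytes(full_length - len(encoded))
    return int.from_bytes(padded, "little")


# A 4-byte short form of a 6-byte instruction: the last two bytes read as zero.
print(hex(expand_short_encoding(bytes([0x62, 0x02, 0x34, 0x12]), 6)))  # 0x12340262
```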

The encodings are described in little-endian, meaning bytes go right-to-left (and top-to-bottom), but bits may be read in the usual numerical order.

Behaviour is mostly described in a Python-like pseudocode for now. In operand descriptions, the : operator describes bit concatenation, with the most-significant fields first. Elsewhere, values are considered to be Python-style arbitrary-precision integers, and floating-point values are considered to be arbitrary-precision floats (although double-precision with round-to-odd may be an adequate approximation).

Move Instructions

mov (Move 16-bit Immediate)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						0	?	1	1	0	0	0	1	0
							Dt

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
imm16

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?	imm16																L	D						0	?	1	1	0	0	0	1	0
																																							Dt

D = ALUDst(Dx:D, Dt)

D.broadcast_to_active(imm16)
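
To make the diagram and the Dx:D concatenation concrete, here is a hedged field-extraction sketch for this encoding. The bit positions are read off the diagram above (L at bit 15, D at bits 14 to 9, Dt at bits 8 to 7, imm16 at bits 31 to 16, Dx at bits 45 to 44); the helper names are hypothetical, and ALUDst itself is not modelled.

```python
def bits(value, hi, lo):
    """Return bits hi..lo (inclusive) of an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mov_imm16(insn):
    """Split a 'mov (Move 16-bit Immediate)' word into the diagram's fields."""
    if bits(insn, 6, 0) != 0b1100010 or bits(insn, 8, 8) != 0:
        raise ValueError("not a 16-bit immediate mov")
    dx, d = bits(insn, 45, 44), bits(insn, 14, 9)
    return {
        "L": bits(insn, 15, 15),
        "Dx:D": (dx << 6) | d,   # ':' concatenation, Dx is most significant
        "Dt": bits(insn, 8, 7),
        "imm16": bits(insn, 31, 16),
    }


# Short encoding (L = 0) with the omitted last two bytes read as zero.
word = int.from_bytes(bytes([0x62, 0x02, 0x34, 0x12, 0x00, 0x00]), "little")
print(decode_mov_imm16(word))
```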

mov (Move 32-bit Immediate)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						1	?	1	1	0	0	0	1	0
							Dt

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
imm32

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?	imm32																																L	D						1	?	1	1	0	0	0	1	0
																																																							Dt

D = ALUDst(Dx:D, Dt)

D.broadcast_to_active(imm32)

get_sr (Move From Special Register)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		1	1	1	0	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	Dx		SRx		?	?	?	?	SR

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		SRx		?	?	?	?	SR						0	D						Dt		1	1	1	0	0	1	0

D = ALUDst(Dx:D, Dt)
SR = SReg32(SRx:SR)

for each active thread:
  D[thread] = SR.read(thread)

Integer Arithmetic Instructions

iadd (Integer Add or Subtract)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		S	0	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
s1	Bs	Bt				B						N	As	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	?	?	?	?	?	?	?	?	s2		?	?	?	?	?	?	Dx		Ax		Bx

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	?	?	?	?	?	?	?	?	s2		?	?	?	?	?	?	Dx		Ax		Bx		s1	Bs	Bt				B						N	As	At				A						0	D						Dt		S	0	0	1	1	1	0

D = ALUDst64(Dx:D, Dt)
A = AddSrc(Ax:A, At, As)
B = AddSrc(Bx:B, Bt, Bs)
shift = s2:s1

for each active thread:
  a = A[thread]
  b = B[thread]

  saturating = (S == 1 and shift == 0 and A.thread_bit_size <= 32 and
                B.thread_bit_size <= 32 and D.thread_bit_size <= 32)

  if N == 1:
    b = -b

  if shift < 5:
    b <<= shift
  else:
    b = 0

  result = a + b

  if saturating:
    signed = (As == 1 or Bs == 1)
    result = saturate_integer(result, D.thread_bit_size, signed)

  D[thread] = result
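
The per-thread behaviour above can be tried out with a small runnable model. This is a sketch under assumptions: saturate_integer is assumed to clamp to the destination range, operands are passed as already-decoded Python integers (so signed sources must be passed as negative numbers), only operands of 32 bits or fewer are covered, and the final result is assumed to wrap to the destination width.

```python
def saturate_integer(value, bits, signed):
    """Clamp value to the representable range (assumed semantics)."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))


def iadd_thread(a, b, *, N=0, S=0, shift=0, signed=False, bits=32):
    """One thread of iadd for operands no wider than 32 bits."""
    saturating = (S == 1 and shift == 0)
    if N == 1:
        b = -b                      # subtract instead of add
    b = (b << shift) if shift < 5 else 0
    result = a + b
    if saturating:
        result = saturate_integer(result, bits, signed)
    # Assumption: the destination keeps only the low `bits` bits.
    return result & ((1 << bits) - 1)


print(hex(iadd_thread(0xFFFFFFFF, 1)))        # 0x0 (wraps)
print(hex(iadd_thread(0xFFFFFFFF, 1, S=1)))   # 0xffffffff (saturates)
print(iadd_thread(10, 3, shift=2))            # 22
```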

imadd (Integer Multiply-Add or Subtract)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		S	0	1	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
s1	Bs	Bt				B						N	As	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx		Cx		s2		?	Cs	Ct				C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		Cx		s2		?	Cs	Ct				C						s1	Bs	Bt				B						N	As	At				A						0	D						Dt		S	0	1	1	1	1	0

D = ALUDst64(Dx:D, Dt)
A = MulSrc(Ax:A, At, As)
B = MulSrc(Bx:B, Bt, Bs)
C = AddSrc(Cx:C, Ct, Cs)
shift = s2:s1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  saturating = (S == 1 and shift == 0 and C.thread_bit_size <= 32 and
                D.thread_bit_size <= 32)

  if N == 1:
    c = -c

  if shift < 5:
    c <<= shift
  else:
    c = 0

  result = a * b + c

  if saturating:
    signed = (As == 1 or Bs == 1 or Cs == 1)
    result = saturate_integer(result, D.thread_bit_size, signed)

  D[thread] = result

convert

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		0	1	1	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	srct				src						round		0	0	0	0	mode

47	46	45	44	43	42	41	40
?	?	Dx		0	0	srcx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		0	0	srcx		0	0	srct				src						round		0	0	0	0	mode						1	D						Dt		0	1	1	1	1	1	0

D = ALUDst(Dx:D, Dt)
src = ALUSrc(srcx:src, srct)

TODO()

Shift/Bitfield Instructions

bfi (Bitfield Insert/Shift Left)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		0	1	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
m1		Bt				B						0	0	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C						m1		Bt				B						0	0	At				A						0	D						Dt		0	1	0	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  result = (a & ~(mask << shift_amount)) | ((b & mask) << shift_amount)

  D[thread] = result
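
A runnable version of the per-thread logic helps show why this instruction doubles as a shift left: with a = 0 and m = 0 the insert reduces to b << shift_amount. The sketch below assumes 32-bit operands and that the destination keeps only the low 32 bits; bfi_thread is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def bfi_thread(a, b, c, m):
    """One thread of bfi: insert the low m bits of b into a at bit c."""
    shift_amount = c & 0x7F
    mask = MASK32 if m == 0 else (1 << m) - 1
    result = (a & ~(mask << shift_amount)) | ((b & mask) << shift_amount)
    return result & MASK32   # assumed truncation to a 32-bit destination


print(hex(bfi_thread(0xFFFFFFFF, 0x0, 8, 4)))   # 0xfffff0ff (4-bit field cleared)
print(hex(bfi_thread(0, 0xAB, 4, 8)))           # 0xab0 (8-bit field inserted)
print(hex(bfi_thread(0, 0x80000001, 1, 0)))     # 0x2 (plain shift left by 1)
```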

bfeil (Bitfield Extract and Insert Low/Shift Right)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		0	1	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
m1		Bt				B						0	0	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C						m1		Bt				B						0	0	At				A						1	D						Dt		0	1	0	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  result = (a & ~mask) | ((b >> shift_amount) & mask)

  D[thread] = result

extr (Extract From Register Pair)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		0	1	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
m1		Bt				B						0	1	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C						m1		Bt				B						0	1	At				A						0	D						Dt		0	1	0	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  result = (((b << 32) | a) >> shift_amount) & mask

  D[thread] = result
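
Because the two sources are treated as one 64-bit value, this behaves like a funnel shift, and passing the same value as both sources gives a rotate right. A minimal sketch of the per-thread logic, assuming 32-bit sources (extr_thread is a hypothetical name):

```python
def extr_thread(a, b, c, m):
    """One thread of extr: shift the pair b:a right, then keep m bits."""
    shift_amount = c & 0x7F
    mask = 0xFFFFFFFF if m == 0 else (1 << m) - 1
    return (((b << 32) | a) >> shift_amount) & mask


print(hex(extr_thread(0x89ABCDEF, 0x01234567, 16, 0)))  # 0x456789ab
print(hex(extr_thread(0x12345678, 0x12345678, 8, 0)))   # 0x78123456 (rotate right by 8)
```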

shlhi (Shift Left High and Insert)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		0	1	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
m1		Bt				B						1	0	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C						m1		Bt				B						1	0	At				A						0	D						Dt		0	1	0	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  shifted_mask = mask << max(shift_amount-32, 0)
  result = (((b << shift_amount) >> 32) & shifted_mask) | (a & ~shifted_mask)

  D[thread] = result

shrhi (Shift Right High and Insert)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		0	1	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
m1		Bt				B						1	0	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C						m1		Bt				B						1	0	At				A						1	D						Dt		0	1	0	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
C = ALUSrc(Cx:C, Ct)
m = m3:m2:m1

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  shift_amount = (c & 0x7F)

  if m == 0:
    mask = 0xFFFFFFFF
  else:
    mask = (1 << m) - 1

  shifted_mask = (mask << 32) >> min(shift_amount, 32)
  result = (((b << 32) >> shift_amount) & shifted_mask) | (a & ~shifted_mask)

  D[thread] = result

asr (Arithmetic Shift Right)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		0	1	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	Bt				B						0	1	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx		?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	Bt				B						0	1	At				A						1	D						Dt		0	1	0	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each active thread:
  a = A[thread]
  b = B[thread]

  shift_amount = (b & 0x7F)

  result = sign_extend(a, A.thread_bit_size) >> shift_amount

  D[thread] = result
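
The sign_extend helper is not defined in this document, so the sketch below supplies one plausible definition and applies it as in the pseudocode. It assumes the destination keeps the low 32 bits of the arbitrary-precision result.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def asr_thread(a, b, bits=32):
    """One thread of asr; `bits` stands in for A.thread_bit_size."""
    shift_amount = b & 0x7F
    result = sign_extend(a, bits) >> shift_amount
    return result & 0xFFFFFFFF   # assumed truncation to a 32-bit destination


print(hex(asr_thread(0x80000000, 4)))    # 0xf8000000
print(hex(asr_thread(0x80000000, 40)))   # 0xffffffff (shift past the width)
print(hex(asr_thread(0x7FFFFFFF, 31)))   # 0x0
```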

asrh (Arithmetic Shift Right High)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		0	1	0	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	Bt				B						1	1	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx		?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	Bt				B						1	1	At				A						1	D						Dt		0	1	0	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each active thread:
  a = A[thread]
  b = B[thread]

  shift_amount = (b & 0x7F)

  result = (sign_extend(a, A.thread_bit_size) << 32) >> shift_amount

  D[thread] = result

Bit Manipulation Instructions

bitop (Bitwise Operation)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		1	1	1	1	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
tt3	tt2	Bt				B						tt1	tt0	At				A

47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		tt3	tt2	Bt				B						tt1	tt0	At				A						0	D						Dt		1	1	1	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each active thread:
  a = A[thread]
  b = B[thread]

  if tt0 == tt1 and tt2 == tt3 and tt0 != tt2:
    UNDEFINED()
    result = a
  else:
    result = 0
    if tt0: result |= ~a & ~b
    if tt1: result |=  a & ~b
    if tt2: result |= ~a &  b
    if tt3: result |=  a &  b

  D[thread] = result
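
The four tt bits form a truth table, so common operations correspond to fixed values of tt3:tt2:tt1:tt0. The sketch below models one thread on 32-bit values and derives AND, OR, XOR and NOR from the pseudocode. The constant names are illustrative only, and the combinations marked UNDEFINED above are rejected rather than modelled.

```python
MASK32 = 0xFFFFFFFF


def bitop_thread(a, b, tt):
    """One thread of bitop; tt is the 4-bit value tt3:tt2:tt1:tt0."""
    tt0, tt1, tt2, tt3 = ((tt >> i) & 1 for i in range(4))
    if tt0 == tt1 and tt2 == tt3 and tt0 != tt2:
        raise ValueError("combination marked UNDEFINED in the pseudocode")
    not_a, not_b = ~a & MASK32, ~b & MASK32
    result = 0
    if tt0:
        result |= not_a & not_b
    if tt1:
        result |= a & not_b
    if tt2:
        result |= not_a & b
    if tt3:
        result |= a & b
    return result


AND, OR, XOR, NOR = 0b1000, 0b1110, 0b0110, 0b0001

a, b = 0b1100, 0b1010
print(bin(bitop_thread(a, b, AND)))   # 0b1000
print(bin(bitop_thread(a, b, OR)))    # 0b1110
print(bin(bitop_thread(a, b, XOR)))   # 0b110
```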

bitrev (Reverse Bits)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		0	1	1	1	1	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	0	1	At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		?	?	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		?	?	0	0	0	0	0	0	0	0	0	0	0	0	0	1	At				A						0	D						Dt		0	1	1	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)

for each active thread:
  a = A[thread]

  result = 0

  i = 0
  while i < 32:
    if a & (1 << i):
      result |= (1 << (31-i))
    i += 1

  D[thread] = result
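
For quick experimentation, the per-thread bit reversal can be written as a small Python function. This is a straightforward model of a 32-bit reversal, not a statement about how the hardware computes it.

```python
def bitrev32(a):
    """Reverse the order of the low 32 bits of a."""
    result = 0
    for i in range(32):
        if a & (1 << i):
            result |= 1 << (31 - i)
    return result


print(hex(bitrev32(0x00000001)))   # 0x80000000
print(hex(bitrev32(0x12345678)))   # 0x1e6a2c48
```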

popcount (Population Count)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		0	1	1	1	1	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	1	0	At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		?	?	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		?	?	0	0	0	0	0	0	0	0	0	0	0	0	1	0	At				A						0	D						Dt		0	1	1	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)

for each active thread:
  a = A[thread]

  result = 0

  i = 0
  while i < 32:
    if a & (1 << i):
      result += 1
    i += 1

  D[thread] = result
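
The same count can be expressed compactly in Python, which is handy for checking expected values when testing. The function name popcount32 is only illustrative.

```python
def popcount32(a):
    """Count the set bits among the low 32 bits of a."""
    return sum((a >> i) & 1 for i in range(32))


print(popcount32(0xFF))         # 8
print(popcount32(0x12345678))   # 13
```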

ffs (Find First Set)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		0	1	1	1	1	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	1	1	At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		?	?	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		?	?	0	0	0	0	0	0	0	0	0	0	0	0	1	1	At				A						0	D						Dt		0	1	1	1	1	1	0

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)

for each active thread:
  a = A[thread]

  result = -1

  i = 31
  while i >= 0:
    if a & (1 << i):
      result = i
      break
    i -= 1

  D[thread] = result

Floating-Point Arithmetic

fmadd (Floating-Point Fused Multiply-Add)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	1	1	1	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx		Cx		?	?	Cm		Ct				C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		Cx		?	?	Cm		Ct				C						Bm		Bt				B						Am		At				A						L	D						Dt		S	1	1	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
C = FloatSrc(Cx:C, Ct, Cm)

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  result = fused_multiply_add(a, b, c)

  D[thread] = result

fmadd16 (Half Precision Floating-Point Fused Multiply-Add)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	1	1	0	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	Bm		Bt			B						?	Am		At			A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx		Cx		?	?	?	Cm		Ct			C

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		Cx		?	?	?	Cm		Ct			C						?	Bm		Bt			B						?	Am		At			A						L	D						Dt		S	1	1	0	1	1	0

D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)
C = FloatSrc16(Cx:C, Ct, Cm)

for each active thread:
  a = A[thread]
  b = B[thread]
  c = C[thread]

  result = fused_multiply_add(a, b, c)

  D[thread] = result

fadd (Floating-Point Add)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		S	1	0	1	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		Bm		Bt				B						Am		At				A						1	D						Dt		S	1	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, 1.0, b)

  D[thread] = result

fadd16 (Half Precision Floating-Point Add)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		S	1	0	0	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	Bm		Bt			B						?	Am		At			A

47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		?	Bm		Bt			B						?	Am		At			A						1	D						Dt		S	1	0	0	1	1	0

D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, 1.0, b)

  D[thread] = result

fmul (Floating-Point Multiply)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		S	0	1	1	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		Bm		Bt				B						Am		At				A						1	D						Dt		S	0	1	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, b, 0.0)

  D[thread] = result

fmul16 (Half Precision Floating-Point Multiply)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		S	0	1	0	1	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	Bm		Bt			B						?	Am		At			A

47	46	45	44	43	42	41	40
?	?	Dx		Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		?	Bm		Bt			B						?	Am		At			A						1	D						Dt		S	0	1	0	1	1	0

D = FloatDst16(Dx:D, Dt, S)
A = FloatSrc16(Ax:A, At, Am)
B = FloatSrc16(Bx:B, Bt, Bm)

for each active thread:
  a = A[thread]
  b = B[thread]

  result = fused_multiply_add(a, b, 0.0)

  D[thread] = result

floor

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	0	0	0	0	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = floor(A[thread])

ceil

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	1

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	1	0	0	0	0	Am		At				A						1	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = ceil(A[thread])

trunc

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	1	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	1	0	0	0	0	0	Am		At				A						1	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = trunc(A[thread])

rint

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	1	1

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	1	1	0	0	0	0	Am		At				A						1	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = rint(A[thread])

rcp

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
1	0	0	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	1	0	0	0	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = reciprocal(A[thread])

rsqrt

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
1	0	0	1	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	1	0	0	1	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = rsqrt(A[thread])

rsqrt_special

rsqrt_special can be used to implement fast sqrt as rsqrt_special(x) * x, by handling special-cases differently.

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	1	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	0	0	0	1	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = rsqrt_special(A[thread])

sin_pt_1

sin_pt_1 is used together with sin_pt_2 and supporting ALU instructions to compute the sine function. sin_pt_1 takes an angle around the circle in the interval [0, 4) and produces an intermediate result. This intermediate result is then passed to sin_pt_2, and the two results are multiplied to give sin. The argument reduction to [0, 4) can be computed with a few ALU instructions: reduce(x) = 4 fract(x / tau), where tau is the circle constant formerly known as twice pi. Calculating cosine follows from the identity cos(x) = sin(x + tau/4). After multiplying by 1/tau, the bias becomes 1/4, which can be added in the same cycle via a fused multiply-add. Tangent should be lowered to a division of sine and cosine.
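
The argument reduction can be sketched in Python as follows. Only the reduction step is modelled; the internals of sin_pt_1 and sin_pt_2 are not known here, and the function names are hypothetical.

```python
import math

TAU = 2.0 * math.pi


def fract(x):
    """Fractional part, always in [0, 1) for moderate inputs."""
    return x - math.floor(x)


def reduce_for_sin(x):
    """reduce(x) = 4 fract(x / tau): maps an angle in radians to [0, 4)."""
    return 4.0 * fract(x / TAU)


def reduce_for_cos(x):
    """cos(x) = sin(x + tau/4); after scaling by 1/tau the bias is 1/4,
    so a single multiply-add feeds fract."""
    return 4.0 * fract(x * (1.0 / TAU) + 0.25)


print(reduce_for_sin(math.pi / 2))   # a quarter turn maps to 1.0
print(reduce_for_cos(0.0))           # cos(0) uses the same input as sin(tau/4): 1.0
```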

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
1	0	1	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	1	0	1	0	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = sin_pt_1(A[thread])

sin_pt_2

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
1	1	1	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	1	1	1	0	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = sin_pt_2(A[thread])

log2

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
1	1	0	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	1	1	0	0	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = log2(A[thread])

exp2

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
1	1	0	1	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	1	1	0	1	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

for each active thread:
  D[thread] = exp2(A[thread])

dfdx

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	1	0	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	0	1	0	0	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

TODO()

dfdy

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		S	0	0	1	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	1	1	0	Am		At				A

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		0	0	0	0	0	0	0	0	0	0	0	1	1	0	Am		At				A						L	D						Dt		S	0	0	1	0	1	0

D = FloatDst(Dx:D, Dt, S)
A = FloatSrc(Ax:A, At, Am)

TODO()

Flow Control Instructions

ret

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
reg32							?	?	0	0	1	0	1	0	0

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
reg32							?	?	0	0	1	0	1	0	0

reg32 = Reg32(reg32)

TODO()

stop

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0

end_execution()

trap

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0

TODO()

call

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
reg32							?	?	0	0	0	0	1	0	0

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
reg32							?	?	0	0	0	0	1	0	0

reg32 = Reg32(reg32)

TODO()

jmp_incomplete

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	0	0	0	0	off

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	off								0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

TODO()

jmp_exec_any

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
off

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
off																																1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0

if any(exec_mask):
  next_pc = pc + sign_extend(off, 32)

jmp_exec_none

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
off

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
off																																1	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0

if not any(exec_mask):
  next_pc = pc + sign_extend(off, 32)

call

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
off

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
off																																1	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0

next_pc = pc + sign_extend(off, 32)

for each active thread:
  r1 = pc + 6
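
The target arithmetic shared by the three relative branches above can be sketched in Python. This is a minimal sketch rather than a description of the hardware: the function names and INSTRUCTION_BYTES are made up for illustration, and the fall-through address of pc + 6 is an assumption based on the 48-bit encoding and on call writing pc + 6 to r1.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

# Assumption: a 48-bit instruction occupies 6 bytes, so execution falls
# through to pc + 6 when the branch is not taken.
INSTRUCTION_BYTES = 6

def jmp_exec_any(pc, off, exec_mask):
    # Taken when at least one thread is active.
    if exec_mask != 0:
        return pc + sign_extend(off, 32)
    return pc + INSTRUCTION_BYTES

def jmp_exec_none(pc, off, exec_mask):
    # Taken when no thread is active.
    if exec_mask == 0:
        return pc + sign_extend(off, 32)
    return pc + INSTRUCTION_BYTES

def call(pc, off):
    # Returns (next_pc, link), where link is the value written to r1.
    return pc + sign_extend(off, 32), pc + INSTRUCTION_BYTES
```

For example, an offset of 0xFFFFFFFA sign-extends to -6, so a taken branch at pc 0x100 lands on 0xFA.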

Execution Mask Stack Instructions

pop_exec

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	n		1	1	?	Dt	1	0	1	0	0	1	0

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	n		1	1	?	Dt	1	0	1	0	0	1	0

D = ImplicitR0L(Dt)

for each thread:
  v = D[thread]
  v -= n
  if v < 0:
    v = 0
  D[thread] = v
  exec_mask[thread] = (v == 0)

if_icmp

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			n		0	0	ccn	Dt	1	0	1	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	Bt				B						0	0	At				A

47	46	45	44	43	42	41	40
?	?	0	0	Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	0	0	Ax		Bx		0	0	Bt				B						0	0	At				A						cc			n		0	0	ccn	Dt	1	0	1	0	0	1	0

D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each thread:
  v = D[thread]
  if v != 0:
    v += n
  elif not cc.compare(A[thread], B[thread]):
    v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)

if_fcmp

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			n		0	0	ccn	Dt	1	0	0	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

47	46	45	44	43	42	41	40
?	?	0	0	Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	0	0	Ax		Bx		Bm		Bt				B						Am		At				A						cc			n		0	0	ccn	Dt	1	0	0	0	0	1	0

D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each thread:
  v = D[thread]
  if v != 0:
    v += n
  elif not cc.compare(A[thread], B[thread]):
    v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)

while_icmp

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			n		1	0	ccn	Dt	1	0	1	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	Bt				B						0	0	At				A

47	46	45	44	43	42	41	40
?	?	0	0	Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	0	0	Ax		Bx		0	0	Bt				B						0	0	At				A						cc			n		1	0	ccn	Dt	1	0	1	0	0	1	0

D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each thread:
  v = D[thread]
  if v < n:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = n
  D[thread] = v
  exec_mask[thread] = (v == 0)

while_fcmp

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			n		1	0	ccn	Dt	1	0	0	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

47	46	45	44	43	42	41	40
?	?	0	0	Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	0	0	Ax		Bx		Bm		Bt				B						Am		At				A						cc			n		1	0	ccn	Dt	1	0	0	0	0	1	0

D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each thread:
  v = D[thread]
  if v < n:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = n
  D[thread] = v
  exec_mask[thread] = (v == 0)

else_icmp

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			n		0	1	ccn	Dt	1	0	1	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	Bt				B						0	0	At				A

47	46	45	44	43	42	41	40
?	?	0	0	Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	0	0	Ax		Bx		0	0	Bt				B						0	0	At				A						cc			n		0	1	ccn	Dt	1	0	1	0	0	1	0

D = ImplicitR0L(Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

for each thread:
  v = D[thread]
  if v == 0:
    v = n
  elif v == 1:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)

else_fcmp

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			n		0	1	ccn	Dt	1	0	0	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

47	46	45	44	43	42	41	40
?	?	0	0	Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	0	0	Ax		Bx		Bm		Bt				B						Am		At				A						cc			n		0	1	ccn	Dt	1	0	0	0	0	1	0

D = ImplicitR0L(Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

for each thread:
  v = D[thread]
  if v == 0:
    v = n
  elif v == 1:
    if cc.compare(A[thread], B[thread]):
      v = 0
    else:
      v = 1
  D[thread] = v
  exec_mask[thread] = (v == 0)
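
The counter updates above can be traced with a small Python model. This is a sketch under simplifying assumptions: the comparison is passed in as a precomputed per-thread boolean list instead of being built from A, B and cc, only four threads are modelled, and the function names are illustrative.

```python
def exec_mask(counters):
    # A thread is active exactly when its r0l counter is zero.
    return [v == 0 for v in counters]

def if_cmp(counters, cond, n):
    out = []
    for v, c in zip(counters, cond):
        if v != 0:
            v += n          # already inactive: nest n levels deeper
        elif not c:
            v = 1           # active but condition failed: deactivate
        out.append(v)
    return out

def else_cmp(counters, cond, n):
    out = []
    for v, c in zip(counters, cond):
        if v == 0:
            v = n           # ran the "if" side: deactivate
        elif v == 1:
            v = 0 if c else 1   # skipped the "if" side: reactivate on cond
        out.append(v)
    return out

def pop_exec(counters, n):
    # Subtract n, clamping at zero.
    return [max(v - n, 0) for v in counters]

counters = [0, 0, 0, 0]
counters = if_cmp(counters, [True, False, True, False], 1)   # outer if
counters = if_cmp(counters, [True, True, False, False], 1)   # nested if
nested = list(counters)                                      # [0, 2, 1, 2]
counters = pop_exec(counters, 1)                             # leave nested if
counters = else_cmp(counters, [True] * 4, 1)                 # outer else
flipped = list(counters)                                     # [1, 0, 1, 0]
counters = pop_exec(counters, 1)                             # leave outer if
```

In this trace, a thread that was deactivated by the outer if has its counter raised to 2 by the nested if, so the pop_exec for the nested block alone does not reactivate it.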

Select Instructions

icmpsel (Integer Compare and Select)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		0	0	1	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	Bt				B						?	?	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
cc			Yt			Y						?	?	?	Xt			X

79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64
?	?	Dx		Ax		Bx		Xx		Yx		?	?	?	?

79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64	63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		Xx		Yx		?	?	?	?	cc			Yt			Y						?	?	?	Xt			X						?	?	Bt				B						?	?	At				A						L	D						Dt		0	0	1	0	0	1	0

cc = ICondition(cc)
D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)
X = CmpselSrc(Xx:X, Xt, Dt)
Y = CmpselSrc(Yx:Y, Yt, Dt)

for each active thread:
  if cc.compare(A[thread], B[thread]):
    D[thread] = X[thread]
  else:
    D[thread] = Y[thread]
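
As an illustration of the pseudocode above, the following Python sketch applies a compare-and-select to four threads. The list-based register model, the function name cmpsel, and the use of operator.lt for the condition are assumptions made for the example; passing A and B again as X and Y gives a per-thread minimum.

```python
import operator

def cmpsel(compare, a, b, x, y, d, exec_mask):
    """Per-thread select; inactive threads keep their old destination value."""
    out = list(d)
    for thread in range(len(d)):
        if (exec_mask >> thread) & 1:
            if compare(a[thread], b[thread]):
                out[thread] = x[thread]
            else:
                out[thread] = y[thread]
    return out

a = [1, 7, 3, 9]
b = [4, 2, 8, 0]
# Thread 2 is inactive (mask bit 2 is clear), so it keeps -1.
minimum = cmpsel(operator.lt, a, b, a, b, [-1] * 4, 0b1011)
```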

fcmpsel (Floating-Point Compare and Select)

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		0	0	0	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
cc			Yt			Y						?	?	?	Xt			X

79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64
?	?	Dx		Ax		Bx		Xx		Yx		?	?	?	?

79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64	63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		Ax		Bx		Xx		Yx		?	?	?	?	cc			Yt			Y						?	?	?	Xt			X						Bm		Bt				B						Am		At				A						L	D						Dt		0	0	0	0	0	1	0

cc = FCondition(cc)
D = ALUDst(Dx:D, Dt)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)
X = CmpselSrc(Xx:X, Xt, Dt)
Y = CmpselSrc(Yx:Y, Yt, Dt)

for each active thread:
  if cc.compare(A[thread], B[thread]):
    D[thread] = X[thread]
  else:
    D[thread] = Y[thread]

SIMD Group and Quad Group Instructions

icmp_ballot

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	D						Dt		0	1	1	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	Bt				B						0	0	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
cc			0	0	0	0	0	0	0	0	0	0	0	0	1	ccn	?	Dx		Ax		Bx

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			0	0	0	0	0	0	0	0	0	0	0	0	1	ccn	?	Dx		Ax		Bx		0	0	Bt				B						0	0	At				A						?	D						Dt		0	1	1	0	0	1	0

D = ALUDst(Dx:D, Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

result = 0

for each active thread:
  a = A[thread]
  b = B[thread]

  if cc.compare(a, b):
    result |= 1 << thread

D.broadcast_to_active(result)
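
The ballot above can be modelled in Python as follows. This is a sketch: the function name and list-based registers are illustrative, and broadcast_to_active is assumed to write the same 32-bit result to the destination of every active thread.

```python
import operator

def ballot(compare, a, b, d, exec_mask):
    """Return (result, new_d) for a 32-thread SIMD-group."""
    result = 0
    for thread in range(32):
        active = (exec_mask >> thread) & 1
        if active and compare(a[thread], b[thread]):
            result |= 1 << thread
    # Assumed meaning of broadcast_to_active: only active threads are written.
    new_d = [result if (exec_mask >> t) & 1 else d[t] for t in range(32)]
    return result, new_d

a = list(range(32))
b = [16] * 32
full, d_full = ballot(operator.lt, a, b, [0] * 32, 0xFFFFFFFF)
part, d_part = ballot(operator.lt, a, b, [0] * 32, 0x0000000F)
```

With all threads active, threads 0 to 15 satisfy a < 16 and the result is 0x0000FFFF. With only threads 0 to 3 active, inactive threads contribute no bits and the result is 0xF.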

icmp_quad_ballot

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	D						Dt		0	1	1	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	Bt				B						0	0	At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
cc			0	0	0	0	0	0	0	0	0	0	0	0	0	ccn	?	Dx		Ax		Bx

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			0	0	0	0	0	0	0	0	0	0	0	0	0	ccn	?	Dx		Ax		Bx		0	0	Bt				B						0	0	At				A						?	D						Dt		0	1	1	0	0	1	0

D = ALUDst(Dx:D, Dt)
cc = ICondition(cc, ccn)
A = ALUSrc(Ax:A, At)
B = ALUSrc(Bx:B, Bt)

TODO()

fcmp_ballot

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	D						Dt		0	1	0	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
cc			0	0	0	0	0	0	0	0	0	0	0	0	1	ccn	?	Dx		Ax		Bx

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			0	0	0	0	0	0	0	0	0	0	0	0	1	ccn	?	Dx		Ax		Bx		Bm		Bt				B						Am		At				A						?	D						Dt		0	1	0	0	0	1	0

D = ALUDst(Dx:D, Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

result = 0

for each active thread:
  a = A[thread]
  b = B[thread]

  if cc.compare(a, b):
    result |= 1 << thread

D.broadcast_to_active(result)

fcmp_quad_ballot

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	D						Dt		0	1	0	0	0	1	0

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
Bm		Bt				B						Am		At				A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
cc			0	0	0	0	0	0	0	0	0	0	0	0	0	ccn	?	Dx		Ax		Bx

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
cc			0	0	0	0	0	0	0	0	0	0	0	0	0	ccn	?	Dx		Ax		Bx		Bm		Bt				B						Am		At				A						?	D						Dt		0	1	0	0	0	1	0

D = ALUDst(Dx:D, Dt)
cc = FCondition(cc, ccn)
A = FloatSrc(Ax:A, At, Am)
B = FloatSrc(Bx:B, Bt, Bm)

TODO()

simd_shuffle

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		1	1	0	1	1	1	1

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
0	0	Bt				B						0	1	At				A

47	46	45	44	43	42	41	40
0	?	Dx		Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	?	Dx		Ax		Bx		0	0	Bt				B						0	1	At				A						0	D						Dt		1	1	0	1	1	1	1

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc16(Bx:B, Bt)

quad_values = []

for each quad:
  quad_index = 0

  for each thread in quad:
    # NOTE: this is not execution masked, meaning any inactive thread can make
    # simd_broadcast from the whole quad undefined (although it works fine if
    # B is an immediate)

    quad_index |= B[thread] & 3

  quad_values.append(A[quad.start + quad_index])

for each active thread:
  b = B[thread]

  if b < 32:
    result = quad_values[b >> 2]

    D[thread] = result
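
A direct Python transcription of the pseudocode above makes the effect of the unmasked OR visible. This is a sketch only: registers are modelled as 32-element lists, and the quad to read from is taken to be the per-thread value of B shifted right by two, which is an interpretation of the pseudocode.

```python
def simd_shuffle(a, b, d, exec_mask):
    quad_values = []
    for quad_start in range(0, 32, 4):
        quad_index = 0
        for thread in range(quad_start, quad_start + 4):
            # Not execution masked: every thread in the quad contributes.
            quad_index |= b[thread] & 3
        quad_values.append(a[quad_start + quad_index])

    out = list(d)
    for thread in range(32):
        if (exec_mask >> thread) & 1 and b[thread] < 32:
            out[thread] = quad_values[b[thread] >> 2]
    return out

a = [100 + t for t in range(32)]

# Uniform index: every thread reads lane 5.
uniform = simd_shuffle(a, [5] * 32, [0] * 32, 0xFFFFFFFF)

# One thread in quad 1 supplies 6 instead of 5, so the OR becomes 3
# and the value read from quad 1 is lane 7 rather than lane 5.
mixed_b = [5] * 32
mixed_b[4] = 6
mixed = simd_shuffle(a, mixed_b, [0] * 32, 0xFFFFFFFF)
```

In this model a single differing index within the source quad changes the lane that every reader of that quad receives.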

simd_shuffle_down

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	D						Dt		1	1	0	1	1	1	1

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
1	1	Bt				B						0	1	At				A

47	46	45	44	43	42	41	40
0	?	Dx		Ax		Bx

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	?	Dx		Ax		Bx		1	1	Bt				B						0	1	At				A						0	D						Dt		1	1	0	1	1	1	1

D = ALUDst(Dx:D, Dt)
A = ALUSrc(Ax:A, At)
B = ALUSrc16(Bx:B, Bt)

TODO()

Memory and Stack Instructions

wait

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	?	?	?	?	?	i	0	0	1	1	1	0	0	0

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	?	?	?	?	?	i	0	0	1	1	1	0	0	0

wait_for_loads()

ld/st_tile

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	D						Dt		load	0	0	1	0	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	?	?	F				?	?	?	?	?	?	?	?

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	?	?	?	?	?	?	mask				u0	rt

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	mask				u0	rt			?	?	?	?	F				?	?	?	?	?	?	?	?	?	D						Dt		load	0	0	1	0	0	1

D = ALUDst(Dx:D, Dt)

TODO()

ld_var

The last four bytes are omitted if L=0.

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	D						Dt		perspective	1	0	0	0	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
mask				?	?	?	?	?	?	?	?	index

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Dx		?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	?	mask				?	?	?	?	?	?	?	?	index				L	D						Dt		perspective	1	0	0	0	0	1

D = ALUDst(Dx:D, Dt)

TODO()

uniform_store

uniform_store is used to initialise uniform registers. R is stored to offset O, which is typically an index in 16-bit units into the uniform registers. This is encoded like (and possibly is) a store to device memory, and can move one 16-bit register to initialise a 16-bit uniform, or two consecutive 16-bit registers to initialise a 32-bit uniform.
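
To make the 16-bit indexing concrete, the following Python sketch maps an offset in 16-bit units to a uniform register half. It assumes that even offsets address the low half and odd offsets the high half of each 32-bit uniform, which is not confirmed; the function name is illustrative.

```python
def uniform_half_name(index):
    """Name the 16-bit uniform half addressed by an offset in 16-bit units."""
    if not 0 <= index < 512:
        raise ValueError("256 uniforms give 512 16-bit halves")
    # Assumption: even index is the low half, odd index is the high half.
    half = "h" if index & 1 else "l"
    return f"u{index >> 1}{half}"

def uniform_store_targets(index, halves):
    """Halves written when storing one (16-bit) or two (32-bit) registers."""
    return [uniform_half_name(index + i) for i in range(halves)]
```

Under this assumption, storing two consecutive 16-bit registers at offset 4 would initialise u2l and u2h, that is, the 32-bit uniform u2.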

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
R						0	F		1	0	0	0	1	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	1	1	1	unk		Ot	Ol				0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
L	b			s		Rx		0	0	0	0	Oh

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
Ox								mask				0	0	Rt	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Ox								mask				0	0	Rt	?	L	b			s		Rx		0	0	0	0	Oh				?	?	1	1	1	unk		Ot	Ol				0	0	0	0	R						0	F		1	0	0	0	1	0	1

R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

device_load

device_load initiates a load from device memory, the result of which may be used after a wait. The data can be unpacked from a variety of formats, or passed through as-is. On each thread, up to four aligned values, each up to 32 bits wide, can be read from a base address plus an offset (the offset is shifted left by the alignment, with an optional additional left shift of up to two).

The number of values to read is described by a mask: 0b0001 loads one value, and 0b1111 loads four values. Non-contiguous masks skip values in memory, but still write the results to contiguous registers.
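
The mask behaviour can be sketched in Python. This assumes that bit i of the mask selects the i-th value in memory, counting from the least significant bit; the function name is illustrative.

```python
def mask_to_slots(mask):
    """Return (memory_element, destination_register_offset) pairs for a mask."""
    slots = []
    register_offset = 0
    for element in range(4):
        if mask & (1 << element):
            # Skipped elements do not consume a destination register.
            slots.append((element, register_offset))
            register_offset += 1
    return slots
```

For example, mask_to_slots(0b0101) pairs memory elements 0 and 2 with the first and second destination registers, skipping element 1 in memory.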

Non-packed formats (8, 16, and 32-bit values) are zero-extended. All packed values are unpacked to 16-bit or 32-bit floating-point values, depending on the size of the register. Bit-packed formats (rgb10a2, rg11b10f and rgb9e5) are supported, but ignore the optional shift and the mask: they always read an aligned 32-bit value, and always write to the same number of registers. However, simple packed values (unorm8, snorm8, unorm16, snorm16 and srgba8) do not have this limitation.

Unaligned addresses are rounded down to the required alignment. The base address (A) is a 64-bit value from either uniform or general-purpose registers. The offset (O) may be a signed 16-bit immediate, or a signed or unsigned 32-bit general-purpose register.
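
The address arithmetic described above can be sketched in Python. Several details here are assumptions rather than confirmed behaviour: the alignment is taken to be the element size in bytes, the rounding is applied after the offset is added, and the result wraps to 64 bits. The function and parameter names are illustrative.

```python
def effective_address(base, offset, element_bytes, extra_shift=0):
    """Sketch of base + (offset << alignment << extra_shift), rounded down."""
    if element_bytes not in (1, 2, 4):
        raise ValueError("element size must be 1, 2 or 4 bytes")
    if not 0 <= extra_shift <= 2:
        raise ValueError("the optional shift is at most two")
    align_shift = element_bytes.bit_length() - 1   # 1 -> 0, 2 -> 1, 4 -> 2
    address = base + (offset << (align_shift + extra_shift))
    address &= ~(element_bytes - 1)                # round down to alignment
    return address & (2**64 - 1)                   # assumed 64-bit wraparound
```

For example, with 4-byte elements an offset of 3 from base 0x1000 gives 0x100C, and a misaligned base of 0x1001 is rounded down to the same address.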

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
R						F			0	0	0	0	1	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	u2	?	?	At	?	Ou	Ot	Ol				Al

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
L	?	?	?	s		Rx		Ah				Oh

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
Ox								mask				?	?	Rt	Fx

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Ox								mask				?	?	Rt	Fx	L	?	?	?	s		Rx		Ah				Oh				?	u2	?	?	At	?	Ou	Ot	Ol				Al				R						F			0	0	0	0	1	0	1

R = MemoryReg(Rx:R, Rt)
A = MemoryBase(Ah:Al, At)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

device_store

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
R						F			1	0	0	0	1	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	u2	?	?	At	?	Ou	Ot	Ol				Al

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
L	?	?	?	s		Rx		Ah				Oh

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
Ox								mask				?	?	Rt	Fx

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Ox								mask				?	?	Rt	Fx	L	?	?	?	s		Rx		Ah				Oh				?	u2	?	?	At	?	Ou	Ot	Ol				Al				R						F			1	0	0	0	1	0	1

R = MemoryReg(Rx:R, Rt)
A = MemoryBase(Ah:Al, At)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

stack_store

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
R						F		1	0	1	1	0	1	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	i6	?	?	?	i1	?	Ot	Ol				0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
L	i5			?	?	Rx		?	i2			Oh

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
Ox								mask				Fx		Rt	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Ox								mask				Fx		Rt	?	L	i5			?	?	Rx		?	i2			Oh				?	i6	?	?	?	i1	?	Ot	Ol				0	0	0	0	R						F		1	0	1	1	0	1	0	1

R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

stack_load

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
R						F		0	0	1	1	0	1	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	i6	?	?	?	i1	?	Ot	Ol				0	0	0	0

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
L	i5			?	?	Rx		?	i2			Oh

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
Ox								mask				Fx		Rt	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Ox								mask				Fx		Rt	?	L	i5			?	?	Rx		?	i2			Oh				?	i6	?	?	?	i1	?	Ot	Ol				0	0	0	0	R						F		0	0	1	1	0	1	0	1

R = MemoryReg(Rx:R, Rt)
O = MemoryIndex(Ox:Oh:Ol, Ot)

TODO()

stack_get_ptr

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
R						i0		0	0	1	1	0	1	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	?	?	?	i1	?	?	?	?	?	?	0	0	0	1

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
1	i3			?	?	Rx		?	i2			?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
?	?	?	?	?	?	?	?	i4						1	0

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	?	?	?	?	?	?	i4						1	0	1	i3			?	?	Rx		?	i2			?	?	?	?	?	?	?	?	?	i1	?	?	?	?	?	?	0	0	0	1	R						i0		0	0	1	1	0	1	0	1

R = StackReg32(Rx:R)

TODO()

stack_adjust

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	?	?	?	?	i0		1	0	1	1	0	1	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
?	?	?	?	?	i1	0	1	v1				0	0	0	1

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
L	i3			?	?	?	?	?	i2			v2

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
v3								i4						?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
v3								i4						?	?	L	i3			?	?	?	?	?	i2			v2				?	?	?	?	?	i1	0	1	v1				0	0	0	1	?	?	?	?	?	?	i0		1	0	1	1	0	1	0	1

v = v3:v2:v1

TODO()

threadgroup_load

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	R						Rt	?	1	1	?	1	0	0	1

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
mask				?	Ot	O						F				At		A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	Rx		Ax		Ox										?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Rx		Ax		Ox										?	?	?	?	?	?	?	?	mask				?	Ot	O						F				At		A						L	R						Rt	?	1	1	?	1	0	0	1

R = ThreadgroupMemoryReg(Rx:R, Rt)
A = ThreadgroupMemoryBase(Ax:A, At)
O = ThreadgroupIndex(Ox:O, Ot)

TODO()

threadgroup_store

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	R						Rt	?	0	1	?	1	0	0	1

39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
mask				?	Ot	O						F				At		A

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
?	?	Rx		Ax		Ox										?	?	?	?	?	?	?	?

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	Rx		Ax		Ox										?	?	?	?	?	?	?	?	mask				?	Ot	O						F				At		A						L	R						Rt	?	0	1	?	1	0	0	1

R = ThreadgroupMemoryReg(Rx:R, Rt)
A = ThreadgroupMemoryBase(Ax:A, At)
O = ThreadgroupIndex(Ox:O, Ot)

TODO()

texture_sample

The last four bytes are omitted if L=0.

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	R						Rt	0	0	1	1	0	0	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
q2		D						q1	Ct	C

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
q3					n			Tt		T

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
q5	St	S						lod				mask

95	94	93	92	91	90	89	88	87	86	85	84	83	82	81	80	79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64
Ox		Sx		Ot	q6					O						Tx		Dx		Cx		Rx		q4			U

95	94	93	92	91	90	89	88	87	86	85	84	83	82	81	80	79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64	63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Ox		Sx		Ot	q6					O						Tx		Dx		Cx		Rx		q4			U					q5	St	S						lod				mask				q3					n			Tt		T						q2		D						q1	Ct	C						L	R						Rt	0	0	1	1	0	0	0	1

R = SampleReg(Rx:R, Rt)
U = SampleUReg(U)
T = Texture(Tx:T, Tt)
S = Sampler(Sx:S, St)
C = Coords(Cx:C, Ct)
D = Lod(Dx:D)
O = SampleOff(Ox:O, Ot)

TODO()

texture_load

The last four bytes are omitted if L=0.

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
L	R						Rt	0	1	1	1	0	0	0	1

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16
q2		D						q1	Ct	C

47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32
q3					n			Tt		T

63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48
q5	St	S						lod				mask

95	94	93	92	91	90	89	88	87	86	85	84	83	82	81	80	79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64
Ox		Sx		Ot	q6					O						Tx		Dx		Cx		Rx		q4			U

95	94	93	92	91	90	89	88	87	86	85	84	83	82	81	80	79	78	77	76	75	74	73	72	71	70	69	68	67	66	65	64	63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40	39	38	37	36	35	34	33	32	31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Ox		Sx		Ot	q6					O						Tx		Dx		Cx		Rx		q4			U					q5	St	S						lod				mask				q3					n			Tt		T						q2		D						q1	Ct	C						L	R						Rt	0	1	1	1	0	0	0	1

R = SampleReg(Rx:R, Rt)
U = SampleUReg(U)
T = Texture(Tx:T, Tt)
S = Sampler(Sx:S, St)
C = Coords(Cx:C, Ct)
D = Lod(Dx:D)
O = SampleOff(Ox:O, Ot)

TODO()

threadgroup_barrier

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	?	?	?	?	?	?	0	1	1	0	1	0	0	0

15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
?	?	?	?	?	?	?	?	0	1	1	0	1	0	0	0

TODO()

Operands

ALUDst

ALUDst(value, flags, max_size=32):
  cache_flag = flags & 1
  if flags & 2 and value & 1 and max_size >= 64:
    return Reg64Reference(value >> 1, cache=cache_flag)
  elif flags & 2 and max_size >= 32:
    return Reg32Reference(value >> 1, cache=cache_flag)
  else:
    return Reg16Reference(value, cache=cache_flag)
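
A runnable Python version of ALUDst is sketched below. It returns plain tuples instead of register reference objects, and the describe helper is hypothetical: it assumes that bit 0 of a 16-bit register number selects the high half, and it writes 64-bit pairs in an invented rN:rN+1 notation.

```python
def alu_dst(value, flags, max_size=32):
    """Decode a destination into (kind, index, cache)."""
    cache = bool(flags & 1)
    if flags & 2 and value & 1 and max_size >= 64:
        return ("reg64", value >> 1, cache)
    elif flags & 2 and max_size >= 32:
        return ("reg32", value >> 1, cache)
    else:
        return ("reg16", value, cache)

def describe(kind, index, cache):
    # Hypothetical naming helper, for illustration only.
    if kind == "reg16":
        return f"r{index >> 1}{'h' if index & 1 else 'l'}"
    if kind == "reg32":
        return f"r{index}"
    return f"r{index}:r{index + 1}"
```

Note that in this decoding the low bit of value distinguishes a 64-bit destination from a 32-bit one only when max_size allows 64-bit operands; otherwise it is dropped by the shift.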

ALUDst64

ALUDst64(value, flags):
  return ALUDst(value, flags, max_size=64)

FloatDst

FloatDst(value, flags, saturating, max_size=32):
  destination = ALUDst(value, flags, max_size=max_size)
  if destination.thread_bit_size == 32:
    wrapper = RoundToFloat32Wrapper(destination, flush_to_zero=True)
  else:
    wrapper = RoundToFloat16Wrapper(destination, flush_to_zero=False)

  if saturating:
    wrapper = SaturateRealWrapper(wrapper)

  return wrapper

FloatDst16

FloatDst16(value, flags, saturating):
  return FloatDst(value, flags, saturating, max_size=16)

ALUSrc

ALUSrc(value, flags, max_size=32):
  if flags == 0b0000:
    return BroadcastImmediateReference(value)

  if flags >> 2 == 0b01:
    ureg = value | (flags & 1) << 8
    if flags & 0b10:
      if max_size < 32:
        UNDEFINED()
      return BroadcastUReg32Reference(ureg >> 1)
    else:
      return BroadcastUReg16Reference(ureg)

  if (flags & 0b11) == 0b00: UNDEFINED()

  cache_flag   = (flags & 0b11) == 0b10
  discard_flag = (flags & 0b11) == 0b11

  if flags >> 2 == 0b11 and max_size >= 64:
    if value & 1: UNDEFINED()
    return Reg64Reference(value >> 1, cache=cache_flag, discard=discard_flag)

  if flags >> 2 >= 0b10 and max_size >= 32:
    if flags >> 2 != 0b10: UNDEFINED()
    if value & 1: UNDEFINED()
    return Reg32Reference(value >> 1, cache=cache_flag, discard=discard_flag)

  if max_size >= 16:
    if flags >> 2 != 0b00: UNDEFINED()
    return Reg16Reference(value, cache=cache_flag, discard=discard_flag)
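
The ALUSrc decoding can be exercised with the following Python sketch. It returns tuples rather than reference objects, raises a hypothetical Undefined exception in place of UNDEFINED(), and treats the unhandled case where max_size is below 16 as undefined, which is an assumption.

```python
class Undefined(Exception):
    """Stands in for UNDEFINED() in the pseudocode."""

def alu_src(value, flags, max_size=32):
    if flags == 0b0000:
        return ("imm", value)

    if flags >> 2 == 0b01:
        ureg = value | ((flags & 1) << 8)
        if flags & 0b10:
            if max_size < 32:
                raise Undefined
            return ("ureg32", ureg >> 1)
        return ("ureg16", ureg)

    low = flags & 0b11
    if low == 0b00:
        raise Undefined
    hint = {0b01: None, 0b10: "cache", 0b11: "discard"}[low]

    size = flags >> 2
    if size == 0b11 and max_size >= 64:
        if value & 1:
            raise Undefined
        return ("reg64", value >> 1, hint)
    if size >= 0b10 and max_size >= 32:
        if size != 0b10 or value & 1:
            raise Undefined
        return ("reg32", value >> 1, hint)
    if max_size >= 16:
        if size != 0b00:
            raise Undefined
        return ("reg16", value, hint)
    raise Undefined  # assumption: the pseudocode leaves this case open

def is_undefined(value, flags, max_size=32):
    try:
        alu_src(value, flags, max_size)
    except Undefined:
        return True
    return False
```

For example, alu_src(6, 0b1001) decodes to the 32-bit register r3 with no hint, while the same flags with an odd value are undefined.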

MulSrc

MulSrc(value, flags, sx):
  source = ALUSrc(value, flags, max_size=32)
  if sx:
    # Note: 8-bit immediates have already been zero-extended to 16-bit,
    # so do not get sign extended.
    return SignExtendWrapper(source, source.thread_bit_size)
  else:
    return source

AddSrc

AddSrc(value, flags, sx):
  source = ALUSrc(value, flags, max_size=64)
  if sx:
    # Note: 8-bit immediates have already been zero-extended to 16-bit,
    # so do not get sign extended.
    return SignExtendWrapper(source, source.thread_bit_size)
  else:
    return source

CmpselSrc

CmpselSrc(value, flags, destination_flags):
  if flags == 0b100:
    return BroadcastImmediateReference(value)

  if flags >> 1 == 0b11:
    ureg = value | (flags & 1) << 8
    if destination_flags & 2:
      if ureg & 1: UNDEFINED()
      return BroadcastUReg32Reference(ureg >> 1)
    else:
      return BroadcastUReg16Reference(ureg)

  if flags >> 2 == 1: UNDEFINED()
  if (flags & 0b11) == 0b00: UNDEFINED()

  cache_flag   = (flags & 0b11) == 0b10
  discard_flag = (flags & 0b11) == 0b11

  if destination_flags & 2:
    if value & 1: UNDEFINED()
    return Reg32Reference(value >> 1, cache=cache_flag, discard=discard_flag)
  else:
    return Reg16Reference(value, cache=cache_flag, discard=discard_flag)

FloatSrc

FloatSrc(value, flags, modifier, max_size=32):
  source = ALUSrc(value, flags, max_size)

  if source.is_immediate:
    float = BroadcastRealReference(decode_float_immediate(source))

  elif source.thread_bit_size == 16:
    float = Float16ToRealWrapper(source, flush_to_zero=False)
  elif source.thread_bit_size == 32:
    float = Float32ToRealWrapper(source, flush_to_zero=True)

  if modifier & 0b01: float = FloatAbsoluteValueWrapper(float)
  if modifier & 0b10: float = FloatNegateWrapper(float)

  return float

FloatSrc16

FloatSrc16(value, flags, modifier):
  return FloatSrc(value, flags, modifier, max_size=16)

Reg32

Reg32(value):
  return Reg32Reference(value)

ICondition

ICondition(value, n=0):
  sign_extend   = (value & 0b100) != 0
  condition     =  value & 0b011
  invert_result = (n != 0)

  if condition == 0b00:
    return IntEqualityComparison(sign_extend, invert_result)
  if condition == 0b01:
    return IntLessThanComparison(sign_extend, invert_result)
  if condition == 0b10:
    return IntGreaterThanComparison(sign_extend, invert_result)
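
The integer conditions can be modelled in Python as below. This is a sketch under assumptions: operands are treated as 32-bit values, the sign_extend flag is taken to mean a signed comparison, and the encoding 0b11, which the pseudocode does not list, raises an error. The function names are illustrative.

```python
import operator

def sign_extend(value, bits):
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def icondition(value, n=0, bits=32):
    """Return compare(a, b) for the given cc field and invert bit n."""
    signed = (value & 0b100) != 0
    condition = value & 0b011
    invert = n != 0
    operations = {0b00: operator.eq, 0b01: operator.lt, 0b10: operator.gt}
    if condition not in operations:
        raise ValueError("condition 0b11 is not described")

    def compare(a, b):
        if signed:
            a, b = sign_extend(a, bits), sign_extend(b, bits)
        else:
            mask = (1 << bits) - 1
            a, b = a & mask, b & mask
        # Inverting "greater than" yields "less than or equal", and so on.
        return operations[condition](a, b) != invert

    return compare
```

For example, 0xFFFFFFFF is not less than 1 as an unsigned comparison (cc = 0b001), but it is less than 1 when compared as signed (cc = 0b101).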

FCondition

FCondition(condition, n=0):
  invert_result = (n != 0)

  if condition == 0b000:
    return FloatEqualityComparison(invert_result)
  if condition == 0b001:
    return FloatLessThanComparison(invert_result)
  if condition == 0b010:
    return FloatGreaterThanComparison(invert_result)
  if condition == 0b011:
    return FloatLessThanNanLosesComparison(invert_result)
  if condition == 0b101:
    return FloatLessThanOrEqualComparison(invert_result)
  if condition == 0b110:
    return FloatGreaterThanOrEqualComparison(invert_result)
  if condition == 0b111:
    return FloatGreaterThanNanLosesComparison(invert_result)

MemoryIndex

MemoryIndex(value, flags):
  if flags != 0:
    return BroadcastImmediateReference(sign_extend(value, 16))
  else:
    if value & 1: UNDEFINED()
    if value >= 0x100: UNDEFINED()
    return Reg32Reference(value >> 1)

MemoryBase

MemoryBase(value, flags):
  if value & 1: UNDEFINED()
  if flags != 0:
    return UReg64Reference(value >> 1)
  else:
    return Reg64Reference(value >> 1)

Helper Pseudocode

decode_float_immediate(value):
  sign = (value & 0x80) >> 7
  exponent = (value & 0x70) >> 4
  fraction = value & 0xF

  if exponent == 0:
    result = fraction / 64.0
  else:
    fraction = 16.0 + fraction
    exponent -= 7
    result = fraction * (2.0 ** exponent)

  if sign != 0:
    result = -result

  return result
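
The helper above runs as ordinary Python with only minor changes, which makes it easy to tabulate the encodable values. The sketch below follows the pseudocode step by step; the sample encodings in the usage are worked out from it rather than taken from hardware traces.

```python
def decode_float_immediate(value):
    sign = (value & 0x80) >> 7
    exponent = (value & 0x70) >> 4
    fraction = value & 0xF

    if exponent == 0:
        # Denormal range: multiples of 1/64 from 0 to 15/64.
        result = fraction / 64.0
    else:
        # Normal range: implicit leading one (16), scaled by 2**(exponent - 7).
        result = (16.0 + fraction) * (2.0 ** (exponent - 7))

    if sign != 0:
        result = -result
    return result

# All non-negative encodable values, in encoding order.
positive_values = [decode_float_immediate(v) for v in range(0x80)]
```

Under this decoding 0x30 is 1.0, 0x38 is 1.5, 0x40 is 2.0, and the largest magnitude is 0x7F, which is 31.0. The smallest normal value, 0x10 (0.25), follows directly after the largest denormal, 0x0F (0.234375), so the values increase monotonically with the encoding.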


63	62	61	60	59	58	57	56	55	54	53	52	51	50	49	48	47	46	45	44	43	42	41	40
m3	?	Dx		Ax		Bx		Cx		?	?	m2		Ct				C
