Apple Microarchitecture Research by Dougall Johnson


AUTDA

Test 1: uops

Code:

  autda x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | ? int output thing (e9) | ? int retires (ef)
1004 | 6029 | 1001 | 1001 | 1000 | 52825 | 1011 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000

Test 2: Latency 1->1

Code:

  autda x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 6.0029

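retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10104 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 0 | 530325 | 0 | 0 | 10200 | 200 | 0 | 0 | 200 | 10101 | 10100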

1000 unrolls and 10 iterations

Result (median cycles for code): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10025 | 60058 | 10024 | 10024 | 10031 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529885 | 10031 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
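
The reported 6.0029 is consistent with taking the median of the cycle (02) counter over the ten runs and dividing by the number of times the test code executed (unrolls times iterations). A minimal Python sketch of that arithmetic follows. The helper name cycles_per_instance is hypothetical, and the cycle values assume the counter rows above are read with cycle (02) as the second field; neither comes from the original measurement tooling.

```python
from statistics import median

def cycles_per_instance(cycle_samples, unrolls, iterations):
    """Median cycle count divided by the number of times the test code ran."""
    return median(cycle_samples) / (unrolls * iterations)

# cycle (02) values for the ten runs of each configuration (assumed reading of the rows)
runs_100x100 = [60029] * 10
runs_1000x10 = [60029, 60029, 60058, 60029, 60029,
                60029, 60029, 60029, 60029, 60029]

latency_a = cycles_per_instance(runs_100x100, unrolls=100, iterations=100)
latency_b = cycles_per_instance(runs_1000x10, unrolls=1000, iterations=10)
print(f"{latency_a:.4f} {latency_b:.4f}")  # 6.0029 6.0029
```

Using the median rather than the mean keeps the single 60058-cycle outlier in the second configuration from shifting the result.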

Test 3: Latency 1->2

Chain cycles: 1

Code:

  add x1, x0, x0
  mov x0, 0
  autda x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 70029 | 20201 | 20201 | 20202 | 1429562 | 20202 | 10204 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20336 | 20125 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20242 | 20106 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 20101 | 30100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

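retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
30024 | 70029 | 20021 | 20021 | 20022 | 1429191 | 20022 | 10024 | 20028 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20022 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 0 | 0 | 20011 | 0 | 0 | 30010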
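
For this test the per-instance cycle count includes the extra ADD in the dependency chain, so the stated chain cycle is subtracted before reporting. A rough Python sketch of that adjustment, assuming a median cycle (02) value of 70029 for both configurations; the function name latency_minus_chain is hypothetical:

```python
def latency_minus_chain(median_cycles, unrolls, iterations, chain_cycles):
    """Cycles per executed instance of the test code, minus the chain overhead."""
    per_instance = median_cycles / (unrolls * iterations)
    return per_instance - chain_cycles

# 70029 is the assumed median of the cycle (02) counter; chain_cycles=1 is
# the "Chain cycles: 1" value stated for this test.
result_100x100 = latency_minus_chain(70029, unrolls=100, iterations=100, chain_cycles=1)
result_1000x10 = latency_minus_chain(70029, unrolls=1000, iterations=10, chain_cycles=1)
print(f"{result_100x100:.4f} {result_1000x10:.4f}")  # 6.0029 6.0029
```

Both configurations give the same figure as Test 2, which suggests the latency to the result is the same from either source operand.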

Test 4: throughput

Count: 8

Code:

  autda x0, x8
  autda x1, x8
  autda x2, x8
  autda x3, x8
  autda x4, x8
  autda x5, x8
  autda x6, x8
  autda x7, x8

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 2.0004

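retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360379 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360580 | 80220 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80205 | 160064 | 80211 | 80211 | 0 | 80220 | 1360481 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360566 | 80219 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 0 | 80101 | 0 | 0 | 80100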

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 2.0004

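retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 160030 | 80021 | 80021 | 80022 | 0 | 1359890 | 0 | 80022 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80022 | 0 | 1359880 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010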
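
The throughput figure divides the per-pass cycle count once more by the number of independent AUTDA instructions in the test code (Count: 8). A small Python sketch, assuming a median cycle (02) value of 160030; the function name cycles_per_instruction is hypothetical:

```python
def cycles_per_instruction(median_cycles, unrolls, iterations, count):
    """Cycles per pass over the test code, divided by its instruction count."""
    return median_cycles / (unrolls * iterations) / count

# 160030 is the assumed median of the cycle (02) counter in both configurations.
tp = cycles_per_instruction(160030, unrolls=1000, iterations=10, count=8)
print(f"{tp:.6f}")  # 2.000375
```

To four decimal places this matches the 2.0004 cycles per instruction reported for both configurations.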