Apple Microarchitecture Research by Dougall Johnson


AUTIA

Test 1: uops

Code:

  autia x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | ? int output thing (e9) | ? int retires (ef)
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000

Test 2: Latency 1->1

Code:

  autia x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530425 | 10211 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 6.0029

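retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10027 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529785 | 0 | 10020 | 20 | 0 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 0 | 529885 | 0 | 10031 | 20 | 0 | 20 | 10011 | 10010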
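
The reported latency appears to be the median of the cycle (02) counter divided by the number of code copies executed (unrolls times iterations). A minimal Python sketch of that arithmetic, assuming this reading of the counter rows above; the function name `cycles_per_copy` is hypothetical and not part of the original tooling:

```python
from statistics import median

def cycles_per_copy(cycle_counts, unrolls, iterations):
    """Median cycle count divided by the number of executed code copies."""
    return median(cycle_counts) / (unrolls * iterations)

# cycle (02) values for the ten runs of each configuration
# (assumed split of the counter rows; every run reports 60029).
runs_100x100 = [60029] * 10
runs_1000x10 = [60029] * 10

latency_a = cycles_per_copy(runs_100x100, unrolls=100, iterations=100)
latency_b = cycles_per_copy(runs_1000x10, unrolls=1000, iterations=10)
print(f"{latency_a:.4f} {latency_b:.4f}")  # 6.0029 6.0029
```

Both configurations execute 10000 copies in total, which is why they give the same figure.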

Test 3: Latency 1->2

Chain cycles: 1

Code:

  add x1, x0, x0
  mov x0, 0
  autia x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
30205 | 70058 | 20205 | 20205 | 20227 | 1429574 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20272 | 0 | 20109 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 20101 | 0 | 0 | 30100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 70029 | 20021 | 20021 | 20022 | 1429181 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30025 | 70072 | 20025 | 20025 | 20046 | 1429191 | 20022 | 10024 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20060 | 20015 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429437 | 20047 | 10040 | 20020 | 20011 | 30010
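
For this test the result is stated as cycles for the code minus the "Chain cycles" value of 1. A minimal Python sketch of that adjustment, assuming the cycle (02) counter is the second field of each row and that the per-copy figure is the median divided by unrolls times iterations; `chained_latency` is a hypothetical helper name:

```python
from statistics import median

def chained_latency(cycle_counts, unrolls, iterations, chain_cycles):
    """Per-copy median cycles, with the chain overhead subtracted."""
    per_copy = median(cycle_counts) / (unrolls * iterations)
    return per_copy - chain_cycles

# cycle (02) values for the 100 unrolls x 100 iterations runs
# (assumed split of the rows; the first run is an outlier at 70058).
cycles = [70058] + [70029] * 9

result = chained_latency(cycles, unrolls=100, iterations=100, chain_cycles=1)
print(f"{result:.4f}")  # 6.0029
```

Taking the median rather than the mean keeps the single outlier run from shifting the result.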

Test 4: throughput

Count: 8

Code:

  autia x0, x8
  autia x1, x8
  autia x2, x8
  autia x3, x8
  autia x4, x8
  autia x5, x8
  autia x6, x8
  autia x7, x8

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 2.0004

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360430 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360566 | 80219 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80110 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 2.0004

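retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 160030 | 80021 | 80021 | 80022 | 0 | 1359880 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1360144 | 0 | 80058 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1360026 | 0 | 80039 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 80020 | 0 | 1359931 | 0 | 80020 | 20 | 0 | 20 | 80011 | 80010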
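
The throughput result divides the per-copy cycle count by the "Count" of 8 independent AUTIA instructions in the code. A minimal Python sketch of that arithmetic, assuming the cycle (02) counter reads 160030 in every run (an assumed split of the rows above); `cycles_per_instruction` is a hypothetical helper name:

```python
from statistics import median

def cycles_per_instruction(cycle_counts, unrolls, iterations, count):
    """Median cycles per code copy, divided by instructions per copy."""
    per_copy = median(cycle_counts) / (unrolls * iterations)
    return per_copy / count

# cycle (02) values for the 1000 unrolls x 10 iterations runs (assumed).
cycles = [160030] * 10

cpi = cycles_per_instruction(cycles, unrolls=1000, iterations=10, count=8)
print(f"{cpi:.4f}")      # 2.0004 cycles per AUTIA
print(f"{1 / cpi:.2f}")  # 0.50 AUTIA per cycle
```

Read this way, the measurement corresponds to roughly one AUTIA issued every two cycles.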