Apple Microarchitecture Research by Dougall Johnson

AUTDB

Test 1: uops

Code:

  autdb x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | ? int output thing (e9) | ? int retires (ef)
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000

Test 2: Latency 1->1

Code:

  autdb x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10025 | 60058 | 10024 | 10024 | 0 | 10031 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 0 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
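
Both results in this test are consistent with taking the median of the cycle (02) counter over the ten runs and dividing it by the number of executed copies of the code (unrolls times iterations). The Python sketch below reproduces that arithmetic. It assumes this is how the figure is derived; the function name `cycles_per_copy` is hypothetical, and the samples are the cycle values read from the rows above.

```python
from statistics import median

def cycles_per_copy(cycle_samples, unrolls, iterations):
    """Median cycle count divided by the number of executed copies of the code."""
    return median(cycle_samples) / (unrolls * iterations)

# cycle (02) column, 100 unrolls and 100 iterations (all ten runs agree)
run_100x100 = [60029] * 10
# cycle (02) column, 1000 unrolls and 10 iterations (one noisy run: 60058)
run_1000x10 = [60029] * 5 + [60058] + [60029] * 4

latency_100x100 = cycles_per_copy(run_100x100, 100, 100)
latency_1000x10 = cycles_per_copy(run_1000x10, 1000, 10)
print(f"{latency_100x100:.4f} {latency_1000x10:.4f}")  # prints 6.0029 6.0029
```

Using the median rather than the mean keeps the single 60058-cycle run from shifting the second result.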

Test 3: Latency 1->2

Chain cycles: 1

Code:

  add x1, x0, x0
  mov x0, 0
  autdb x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

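retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
30204 | 70029 | 20201 | 20201 | 20202 | 1429562 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70129 | 20213 | 20213 | 20254 | 1429568 | 20202 | 10204 | 20240 | 0 | 0 | 20107 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429967 | 20228 | 10220 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70079 | 20208 | 20208 | 20229 | 1429568 | 20202 | 10204 | 20338 | 0 | 0 | 20125 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1430219 | 20256 | 10239 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70080 | 20207 | 20207 | 20229 | 1429568 | 20202 | 10204 | 20434 | 0 | 0 | 20143 | 0 | 0 | 30100
30204 | 71230 | 20351 | 20351 | 20832 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100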

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

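retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
30024 | 70029 | 20021 | 20021 | 20022 | 0 | 1429189 | 0 | 0 | 20022 | 10024 | 0 | 0 | 20060 | 0 | 20015 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20060 | 0 | 20016 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 0 | 1429198 | 0 | 0 | 20020 | 10020 | 0 | 0 | 20020 | 0 | 20011 | 0 | 0 | 30010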
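
In this test the result has to pass through `add x1, x0, x0` before it reaches the second operand of `autdb`, so the measured loop-carried chain is longer than the instruction's own latency by the stated 1 chain cycle. The Python sketch below shows the subtraction, assuming the same median-and-divide normalisation as the reported figure appears to use. The function name `latency_minus_chain` is hypothetical, and the samples are the cycle (02) values read from the 100-unroll rows above.

```python
from statistics import median

def latency_minus_chain(cycle_samples, copies, chain_cycles):
    """Median cycles per copy of the code, minus the cycles spent in the helper chain."""
    return median(cycle_samples) / copies - chain_cycles

# cycle (02) column, 100 unrolls and 100 iterations (several noisy runs)
cycles_100x100 = [70029, 70029, 70029, 70129, 70029,
                  70079, 70029, 70080, 71230, 70029]

latency_1_to_2 = latency_minus_chain(cycles_100x100, copies=100 * 100, chain_cycles=1)
print(f"{latency_1_to_2:.4f}")  # prints 6.0029
```

Six of the ten runs report exactly 70029 cycles, so the median ignores the four slower runs, and the result matches the 1->1 latency from Test 2.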

Test 4: throughput

Count: 8

Code:

  autdb x0, x8
  autdb x1, x8
  autdb x2, x8
  autdb x3, x8
  autdb x4, x8
  autdb x5, x8
  autdb x6, x8
  autdb x7, x8

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 2.0004

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
8020416003080201802018020213603798020220050423380262118180317
80204 | 160030 | 80201 | 80201 | 80202 | 1360478 | 80220 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 80109 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360580 | 80220 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 1360481 | 80202 | 200 | 200 | 0 | 80101 | 0 | 0 | 80100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 2.0004

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 160030 | 80021 | 80021 | 0 | 80022 | 1359839 | 80022 | 20 | 20 | 80021 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 80020 | 1359931 | 80020 | 20 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 80020 | 1359931 | 80020 | 20 | 20 | 80021 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 80020 | 1359931 | 80020 | 20 | 20 | 80011 | 80010
80025 | 160064 | 80030 | 80030 | 0 | 80040 | 1359931 | 80020 | 20 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 80020 | 1359931 | 80020 | 20 | 20 | 80011 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 80020 | 1359931 | 80020 | 20 | 20 | 80011 | 80010
80024 | 160078 | 80034 | 80034 | 0 | 80039 | 1359931 | 80020 | 20 | 20 | 80021 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 80020 | 1359931 | 80020 | 20 | 20 | 80011 | 80010
80502 | 162665 | 80434 | 80293 | 141 | 80256 | 1359890 | 80022 | 20 | 20 | 80011 | 80010
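
The throughput figure divides the per-copy cycle count once more by the count of independent `autdb` instructions in the code (8). The Python sketch below shows that arithmetic, assuming the median of the cycle (02) counter is the starting point. The function name `cycles_per_instruction` is hypothetical, and the samples are the cycle values read from the 1000-unroll rows above.

```python
from statistics import median

def cycles_per_instruction(cycle_samples, copies, count):
    """Median cycles per copy of the code, divided by the instructions per copy."""
    return median(cycle_samples) / copies / count

# cycle (02) column, 1000 unrolls and 10 iterations (three noisy runs)
cycles_1000x10 = [160030, 160030, 160030, 160030, 160064,
                  160030, 160030, 160078, 160030, 162665]

throughput = cycles_per_instruction(cycles_1000x10, copies=1000 * 10, count=8)
print(f"{throughput:.4f}")  # prints 2.0004
```

A value of about 2 cycles per instruction corresponds to roughly one `autdb` issued every two cycles when the instructions are independent of one another.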