Apple Microarchitecture Research by Dougall Johnson

PACIA

Test 1: uops

Code:

  pacia x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | ? int output thing (e9) | ? int retires (ef)
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000

Test 2: Latency 1->1

Code:

  pacia x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530425 | 10211 | 202 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 10200 | 530325 | 10200 | 200 | 200 | 10101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10014 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529885 | 10031 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
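
The reported latency appears to be the median of the cycle (02) counter divided by the number of PACIA instances executed (unrolls times iterations). A minimal sketch of that arithmetic follows. The function name is hypothetical, and the cycle values are the ones read from the rows above, so treat them as an assumption about how the columns split.

```python
from statistics import median

def cycles_per_instruction(cycle_samples, unrolls, iterations):
    """Median total cycles divided by the number of instruction instances.

    Assumes the 'cycle (02)' counter covers the whole unrolled loop.
    """
    return median(cycle_samples) / (unrolls * iterations)

# Cycle (02) values as read from the ten runs of each configuration
# (assumption: every run in both configurations shows 60029 cycles).
run_100x100 = [60029] * 10
run_1000x10 = [60029] * 10

latency_a = cycles_per_instruction(run_100x100, unrolls=100, iterations=100)
latency_b = cycles_per_instruction(run_1000x10, unrolls=1000, iterations=10)

print(f"{latency_a:.4f}")  # 6.0029
print(f"{latency_b:.4f}")  # 6.0029
```

Both configurations execute 10000 dependent instances, which is consistent with the two runs reporting the same figure.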

Test 3: Latency 1->2

Chain cycles: 1

Code:

  add x1, x0, x0
  mov x0, 0
  pacia x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

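retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
30204 | 70029 | 20201 | 20201 | 20202 | 1429581 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429838 | 20226 | 10220 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20208 | 0 | 0 | 20101 | 0 | 0 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 1429568 | 20202 | 10204 | 20240 | 0 | 0 | 20105 | 0 | 0 | 30100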

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30025 | 70058 | 20025 | 20025 | 20047 | 1429208 | 20022 | 10024 | 20028 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
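
For the 1->2 latency, the measured chain contains an extra ADD that carries the dependency from x0 into x1, and the page subtracts its 1 cycle ("Chain cycles: 1") from the per-group time. The sketch below reproduces that correction. The helper name is hypothetical, and the cycle values are the ones read from the rows above, so they are an assumption about the column split.

```python
from statistics import median

def latency_minus_chain(cycle_samples, unrolls, iterations, chain_cycles):
    """Per-group median cycles with the helper chain's latency removed."""
    per_group = median(cycle_samples) / (unrolls * iterations)
    return per_group - chain_cycles

# Cycle (02) values as read from the two Test 3 configurations (assumption).
cycles_100x100 = [70029] * 10
cycles_1000x10 = [70058] + [70029] * 9  # first run is slightly slower

result_100x100 = latency_minus_chain(cycles_100x100, 100, 100, chain_cycles=1)
result_1000x10 = latency_minus_chain(cycles_1000x10, 1000, 10, chain_cycles=1)

print(f"{result_100x100:.4f}")  # 6.0029
print(f"{result_1000x10:.4f}")  # 6.0029
```

Under these assumptions the result matches the 1->1 figure, which suggests both PACIA source operands see the same latency.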

Test 4: throughput

Count: 8

Code:

  pacia x0, x8
  pacia x1, x8
  pacia x2, x8
  pacia x3, x8
  pacia x4, x8
  pacia x5, x8
  pacia x6, x8
  pacia x7, x8

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 2.0004

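retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360379 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80109 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 80202 | 0 | 1360481 | 0 | 80202 | 200 | 0 | 200 | 80101 | 80100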

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 2.0004

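retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80025 | 160064 | 80031 | 80031 | 0 | 0 | 80040 | 0 | 1359890 | 80022 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80021 | 0 | 0 | 80010
84684 | 176621 | 83734 | 82297 | 46 | 1391 | 82076 | 33 | 1359931 | 80020 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80019 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 80011 | 0 | 0 | 80010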
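
The throughput figure divides the median cycle count by the number of loop bodies executed and then by the 8 independent PACIA instructions in each body ("Count: 8"). The sketch below works through that arithmetic and also shows why a median is a sensible choice here: one run in the last configuration looks disturbed, and a mean would be pulled upward by it. The function name is hypothetical, and the cycle values are the ones read from the rows above, which is an assumption about the column split.

```python
from statistics import mean, median

def cycles_per_op(total_cycles, unrolls, iterations, count):
    """Cycles per instruction when each unrolled body holds `count` ops."""
    return total_cycles / (unrolls * iterations) / count

# Cycle (02) values as read from the 1000-unroll, 10-iteration runs
# (assumption); the fifth run appears to have been interrupted.
cycles_1000x10 = [160064, 160030, 160030, 160030, 176621,
                  160030, 160030, 160030, 160030, 160030]

throughput = cycles_per_op(median(cycles_1000x10), 1000, 10, count=8)
mean_based = cycles_per_op(mean(cycles_1000x10), 1000, 10, count=8)

print(f"median-based: {throughput:.4f}")  # median-based: 2.0004
print(f"mean-based:   {mean_based:.4f}")
```

A result of about 2 cycles per instruction would be consistent with one PACIA starting every other cycle, although the counters alone do not say which execution unit imposes that limit.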