Apple Microarchitecture Research by Dougall Johnson

PACDA

Test 1: uops

Code:

  pacda x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | ? int output thing (e9) | ? int retires (ef)
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000
1004 | 6029 | 1001 | 1001 | 1000 | 52725 | 1000 | 1001 | 1000

Test 2: Latency 1->1

Code:

  pacda x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 6.0029

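retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530425 | 10211 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100
10204 | 60029 | 10201 | 10201 | 0 | 0 | 10200 | 0 | 530325 | 10200 | 200 | 200 | 10101 | 10100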

1000 unrolls and 10 iterations

Result (median cycles for code): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10025 | 60058 | 10024 | 10024 | 10031 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529885 | 10031 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
10024 | 60029 | 10021 | 10021 | 10020 | 529785 | 10020 | 20 | 20 | 10011 | 10010
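The reported latency appears to be the median of the cycle (02) readings divided by unrolls times iterations. That relationship is inferred from the numbers on this page rather than from a documented formula, and the helper name `cycles_per_instance` is hypothetical. A minimal Python sketch, using the ten cycle readings from the 1000-unroll run:

```python
from statistics import median

def cycles_per_instance(cycle_counts, unrolls, iterations):
    """Median cycle count divided by the number of times the code ran."""
    return median(cycle_counts) / (unrolls * iterations)

# cycle (02) readings from the ten runs; the third run is an outlier (60058)
cycles = [60029, 60029, 60058, 60029, 60029,
          60029, 60029, 60029, 60029, 60029]

latency = cycles_per_instance(cycles, unrolls=1000, iterations=10)
print(f"{latency:.4f}")  # 6.0029
```

Taking the median rather than the mean keeps the single disturbed run from shifting the result.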

Test 3: Latency 1->2

Chain cycles: 1

Code:

  add x1, x0, x0
  mov x0, 0
  pacda x0, x1
  mov x0, 1

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429581 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429568 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429568 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429568 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429568 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429568 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429568 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429861 | 0 | 20227 | 10220 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 0 | 1429568 | 0 | 20202 | 10204 | 0 | 20208 | 20101 | 30100
30204 | 70029 | 20201 | 20201 | 20202 | 130 | 1437963 | 4042 | 20576 | 10537 | 158 | 20274 | 20109 | 30100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 6.0029

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 70029 | 20021 | 20021 | 20022 | 1429189 | 20022 | 10024 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429455 | 20046 | 10040 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429208 | 20022 | 10024 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
30024 | 70029 | 20021 | 20021 | 20020 | 1429198 | 20020 | 10020 | 20020 | 20011 | 30010
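For the 1->2 latency, the chain includes one extra ADD, so the stated "Chain cycles: 1" is subtracted from the per-iteration cycle count. The sketch below shows that arithmetic in Python; the function name `latency_minus_chain` is hypothetical, and the formula is inferred from the result label rather than taken from the test harness.

```python
def latency_minus_chain(median_cycles, unrolls, iterations, chain_cycles):
    """Cycles per pass through the dependency chain, minus the helper instructions."""
    per_pass = median_cycles / (unrolls * iterations)
    return per_pass - chain_cycles

# 70029 is the cycle (02) reading shared by all ten runs of this test
result = latency_minus_chain(70029, unrolls=1000, iterations=10, chain_cycles=1)
print(f"{result:.4f}")  # 6.0029
```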

Test 4: throughput

Count: 8

Code:

  pacda x0, x8
  pacda x1, x8
  pacda x2, x8
  pacda x3, x8
  pacda x4, x8
  pacda x5, x8
  pacda x6, x8
  pacda x7, x8

(requires arm64e binary, with arm64e_preview_abi boot arg)

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 2.0004

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360430 | 80202 | 200 | 200 | 80111 | 80100
82283 | 163675 | 81938 | 81284 | 654 | 81247 | 1360529 | 80220 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360430 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80205 | 160064 | 80211 | 80211 | 0 | 80220 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80111 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100
80204 | 160030 | 80201 | 80201 | 0 | 80202 | 1360481 | 80202 | 200 | 200 | 80101 | 80100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 2.0004

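retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80022 | 0 | 1359941 | 80022 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1360033 | 80040 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010
80024 | 160030 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 1359931 | 80020 | 20 | 20 | 0 | 0 | 80011 | 0 | 0 | 80010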
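The throughput figure divides the per-iteration cycle count by the number of independent PACDA instructions in the block (Count: 8). A short Python sketch of that calculation follows; the function name `cycles_per_instruction` is hypothetical, and the reciprocal is added here only to express the same measurement as instructions per cycle.

```python
def cycles_per_instruction(median_cycles, unrolls, iterations, count):
    """Cycles for one pass over the block, divided by instructions in the block."""
    return median_cycles / (unrolls * iterations) / count

# 160030 is the cycle (02) reading shared by all ten runs of this test
recip_throughput = cycles_per_instruction(160030, unrolls=1000, iterations=10, count=8)
print(f"{recip_throughput:.4f}")      # 2.0004 cycles per PACDA
print(f"{1 / recip_throughput:.3f}")  # 0.500 PACDA per cycle
```

A sustained rate of about one PACDA every two cycles, against a latency of about six cycles, suggests the operation is only partially pipelined on this core; that reading is an interpretation of the numbers, not something the page states.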