Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

PRFM (register, PSTL3STRM)

Test 1: uops

Code:

  prfm pstl3strm, [x6]
  mov x0, 0

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 0.001

Load/store unit issues: 1.000

SIMD/FP unit issues: 0.000

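retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch ldst uop (58) | simd uops in schedulers (5a) | dispatch uop (78) | map ldst uop (7d) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed)
1004 | 2115 | 1001 | 1 | 1000 | 1000 | 34904 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2099 | 1001 | 1 | 1000 | 1000 | 34906 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2098 | 1001 | 1 | 1000 | 1000 | 34914 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2098 | 1001 | 1 | 1000 | 1000 | 34892 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2098 | 1001 | 1 | 1000 | 1000 | 34906 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2097 | 1001 | 1 | 1000 | 1000 | 34928 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2108 | 1001 | 1 | 1000 | 1000 | 34786 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2105 | 1001 | 1 | 1000 | 1000 | 34924 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2097 | 1001 | 1 | 1000 | 1000 | 34936 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2097 | 1001 | 1 | 1000 | 1000 | 34910 | 1000 | 1000 | 1000 | 1 | 1000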

Test 2: throughput

Code:

  prfm pstl3strm, [x6]
  add x6, x6, 64

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 2.0176
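
As a rough cross-check of the figure above, the sketch below divides the median of the cycle (02) counter by the number of executed copies of the code (unrolls times iterations). The ten cycle values are read from the counter rows listed for this configuration; splitting those rows into columns is an interpretation, and the helper name `cycles_per_copy` is hypothetical. This is not necessarily how the page derives its result.

```python
from statistics import median

def cycles_per_copy(cycle_counts, unrolls, iterations):
    """Median raw cycle count divided by the number of executed code copies.

    Assumption: loop overhead and harness overhead are not subtracted here.
    """
    return median(cycle_counts) / (unrolls * iterations)

# cycle (02) values for the "100 unrolls and 100 iterations" runs,
# as interpreted from the raw counter rows (an assumption, see lead-in).
cycles = [21151, 20135, 20039, 20135, 20135,
          20135, 20135, 19961, 20083, 20135]

value = cycles_per_copy(cycles, unrolls=100, iterations=100)
print(f"{value:.4f}")
```

The value this produces is close to, but not identical to, the reported 2.0176, so the reported median presumably comes from separate runs or a slightly different normalization. Either way, both point to roughly two cycles per PRFM/ADD pair.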

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
20204 | 21151 | 20102 | 10102 | 10000 | 10104 | 10006 | 61297 | 350653 | 20117 | 10213 | 10013 | 10213 | 10013 | 10007 | 10000 | 10100
20204 | 20135 | 20101 | 10101 | 10000 | 10100 | 10000 | 61256 | 350623 | 20100 | 10202 | 10002 | 10204 | 10004 | 10003 | 10000 | 10100
20204 | 20039 | 20103 | 10103 | 10000 | 10102 | 10000 | 61256 | 350623 | 20100 | 10202 | 10002 | 10208 | 10008 | 10002 | 10000 | 10100
20204 | 20135 | 20101 | 10101 | 10000 | 10100 | 10000 | 61256 | 350623 | 20100 | 10202 | 10002 | 10202 | 10002 | 10001 | 10000 | 10100
20204 | 20135 | 20101 | 10101 | 10000 | 10100 | 10000 | 61256 | 350623 | 20100 | 10202 | 10002 | 10202 | 10002 | 10001 | 10000 | 10100
20204 | 20135 | 20101 | 10101 | 10000 | 10100 | 10000 | 61256 | 350623 | 20100 | 10202 | 10002 | 10202 | 10002 | 10001 | 10000 | 10100
20204 | 20135 | 20101 | 10101 | 10000 | 10100 | 10002 | 61447 | 349717 | 20110 | 10210 | 10010 | 10202 | 10002 | 10001 | 10000 | 10100
20204 | 19961 | 20103 | 10103 | 10000 | 10102 | 10002 | 61287 | 350185 | 20110 | 10210 | 10010 | 10208 | 10008 | 10002 | 10000 | 10100
20204 | 20083 | 20105 | 10105 | 10000 | 10108 | 10000 | 61454 | 350659 | 20100 | 10202 | 10002 | 10202 | 10002 | 10001 | 10000 | 10100
20204 | 20135 | 20101 | 10101 | 10000 | 10100 | 10000 | 61256 | 350623 | 20100 | 10202 | 10002 | 10202 | 10002 | 10001 | 10000 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 2.0169

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
20024 | 22738 | 20017 | 10017 | 10000 | 10023 | 10000 | 60909 | 351399 | 20013 | 10025 | 10005 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20346 | 20011 | 10011 | 10000 | 10013 | 10000 | 61041 | 351187 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20172 | 20011 | 10011 | 10000 | 10010 | 10000 | 60878 | 351429 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20171 | 20011 | 10011 | 10000 | 10010 | 10000 | 61079 | 350331 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20137 | 20011 | 10011 | 10000 | 10010 | 10000 | 61004 | 351339 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20072 | 20011 | 10011 | 10000 | 10010 | 10000 | 60918 | 352005 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20158 | 20011 | 10011 | 10000 | 10010 | 10000 | 60932 | 351087 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20235 | 20011 | 10011 | 10000 | 10010 | 10000 | 60633 | 350565 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20099 | 20011 | 10011 | 10000 | 10010 | 10000 | 60897 | 350487 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010
20024 | 20158 | 20011 | 10011 | 10000 | 10010 | 10000 | 60938 | 351447 | 20010 | 10020 | 10000 | 10020 | 10000 | 10001 | 10000 | 10010

Test 3: throughput

Code:

  prfm pstl3strm, [x6]
  mov x7, 8

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 2.0958

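retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
10204 | 20938 | 10101 | 101 | 10000 | 100 | 10006 | 300 | 365398 | 10106 | 200 | 10012 | 200 | 10012 | 1 | 10000 | 100
10204 | 20958 | 10101 | 101 | 10000 | 100 | 10006 | 300 | 365450 | 10106 | 200 | 10012 | 200 | 10012 | 1 | 10000 | 100
10204 | 20958 | 10101 | 101 | 10000 | 100 | 10006 | 300 | 365450 | 10106 | 200 | 10012 | 200 | 10012 | 1 | 10000 | 100
10204 | 20958 | 10101 | 101 | 10000 | 100 | 10006 | 300 | 365450 | 10106 | 200 | 10012 | 200 | 10012 | 1 | 10000 | 100
10204 | 20944 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 365044 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 20770 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 360660 | 10100 | 200 | 10004 | 200 | 10012 | 1 | 10000 | 100
10204 | 20891 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 364304 | 10100 | 200 | 10008 | 200 | 10004 | 1 | 10000 | 100
10204 | 20481 | 10101 | 101 | 10000 | 100 | 10006 | 300 | 363866 | 10106 | 200 | 10012 | 200 | 10004 | 1 | 10000 | 100
10204 | 20868 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 363808 | 10100 | 200 | 10004 | 200 | 10008 | 1 | 10000 | 100
10204 | 20880 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 356222 | 10100 | 200 | 10008 | 200 | 10004 | 1 | 10000 | 100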

1000 unrolls and 10 iterations

Result (median cycles for code): 1.8731

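retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
10024 | 19321 | 10011 | 11 | 10000 | 10 | 10004 | 30 | 331658 | 10014 | 20 | 10012 | 20 | 10000 | 1 | 10000 | 10
10024 | 19479 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 339210 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19371 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 333518 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19398 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 338048 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19594 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 336554 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19541 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 333716 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19547 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 338112 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19697 | 10071 | 11 | 10060 | 10 | 10000 | 30 | 338272 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19601 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 335272 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10
10024 | 19351 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 325728 | 10010 | 20 | 10000 | 20 | 10000 | 1 | 10000 | 10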