Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

STP (pre-index, 32-bit)

Test 1: uops

Code:

  stp w0, w1, [x6, #8]!

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 2.000

Integer unit issues: 1.001

Load/store unit issues: 1.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map ldst uop (7d) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed)
1005 | 1890 | 2059 | 1041 | 1018 | 1040 | 1000 | 4685 | 18001 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1082 | 2001 | 1001 | 1000 | 1000 | 1000 | 4681 | 17443 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1073 | 2001 | 1001 | 1000 | 1000 | 1000 | 4681 | 17371 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1092 | 2001 | 1001 | 1000 | 1000 | 1000 | 4681 | 18451 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1096 | 2001 | 1001 | 1000 | 1000 | 1000 | 4681 | 17353 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1063 | 2001 | 1001 | 1000 | 1000 | 1000 | 4681 | 17353 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1070 | 2001 | 1001 | 1000 | 1000 | 1000 | 4685 | 17515 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1071 | 2001 | 1001 | 1000 | 1000 | 1000 | 4681 | 17407 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1079 | 2001 | 1001 | 1000 | 1000 | 1000 | 4657 | 18073 | 2000 | 1000 | 3000 | 1001 | 1000
1004 | 1096 | 2001 | 1001 | 1000 | 1000 | 1000 | 4641 | 18019 | 2000 | 1000 | 3000 | 1001 | 1000

Test 2: Latency 3->3

Code:

  stp w0, w1, [x6, #8]!

(fused SUBS/B.cc loop)
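
The test chains the same instruction through x6, so each STP needs the base address written back by the previous one. The sketch below is a minimal Python model of that write-back, not part of the original measurements. It assumes that "3->3" refers to the third operand (the base register) as both input and output, and that memory is little-endian. The function name, register dictionary, and byte buffer are hypothetical helpers.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def stp_pre_index_32(regs, mem, rt, rt2, rn, imm):
    """Model `stp Wt, Wt2, [Xn, #imm]!` on a dict of 64-bit registers."""
    addr = (regs[rn] + imm) & MASK64  # pre-index: the offset is applied first
    # Store the low 32 bits of each source register (little-endian assumed).
    mem[addr:addr + 4] = (regs[rt] & MASK32).to_bytes(4, "little")
    mem[addr + 4:addr + 8] = (regs[rt2] & MASK32).to_bytes(4, "little")
    regs[rn] = addr  # write-back: the base register receives the new address

regs = {"x0": 0x11111111, "x1": 0x22222222, "x6": 0}
mem = bytearray(32)

stp_pre_index_32(regs, mem, "x0", "x1", "x6", 8)
stp_pre_index_32(regs, mem, "x0", "x1", "x6", 8)  # consumes x6 from the first
print(regs["x6"])  # 16
```

In this model the second call cannot compute its address until the first has updated x6, which is the dependency the latency test measures.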

100 unrolls and 100 iterations

Result (median cycles for code): 1.0087

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
10209 | 11893 | 20404 | 10314 | 10090 | 10314 | 10003 | 107113 | 170715 | 20109 | 200 | 10010 | 200 | 30030 | 10001 | 10000 | 100
10204 | 10088 | 20104 | 10104 | 10000 | 10104 | 10002 | 43629 | 170791 | 20106 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100
10204 | 10087 | 20104 | 10104 | 10000 | 10104 | 10002 | 43629 | 170791 | 20106 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100
10204 | 10128 | 20104 | 10104 | 10000 | 10104 | 10002 | 43629 | 171007 | 20106 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100
10204 | 10107 | 20104 | 10104 | 10000 | 10104 | 10002 | 43631 | 170989 | 20106 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100
10204 | 10115 | 20104 | 10104 | 10000 | 10104 | 10001 | 43627 | 171166 | 20105 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100
10204 | 10120 | 20104 | 10104 | 10000 | 10104 | 10001 | 43628 | 171076 | 20105 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100
10204 | 10096 | 20104 | 10104 | 10000 | 10104 | 10002 | 43631 | 171151 | 20106 | 200 | 10008 | 200 | 30024 | 10003 | 10000 | 100
10204 | 10086 | 20104 | 10104 | 10000 | 10104 | 10002 | 43629 | 170791 | 20106 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100
10204 | 10086 | 20104 | 10104 | 10000 | 10104 | 10002 | 43629 | 170791 | 20106 | 200 | 10008 | 200 | 30024 | 10004 | 10000 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0094

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
10029 | 11353 | 20307 | 10217 | 10090 | 10218 | 10002 | 43007 | 170917 | 20016 | 20 | 10008 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170929 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170947 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170929 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170929 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170929 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170929 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170929 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10094 | 20011 | 10011 | 10000 | 10010 | 10000 | 42995 | 170929 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
10024 | 10093 | 20011 | 10011 | 10000 | 10010 | 10000 | 42991 | 170875 | 20010 | 20 | 10000 | 20 | 30000 | 10001 | 10000 | 10
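
The reported latency for the 1000-unroll run can be approximated from the cycle (02) column. The sketch below assumes the result is the median cycle count over the ten samples divided by the number of executed instructions (unrolls times iterations). That is a guess at the method and not a description of the page's actual tooling; the function name is hypothetical.

```python
from statistics import median

# cycle (02) values from the ten samples of the 1000-unroll, 10-iteration run
cycles = [11353, 10094, 10094, 10094, 10094, 10094, 10094, 10094, 10094, 10093]
unrolls, iterations = 1000, 10

def cycles_per_instruction(samples, unrolls, iterations):
    """Median cycle count divided by the number of executed instructions."""
    return median(samples) / (unrolls * iterations)

latency = cycles_per_instruction(cycles, unrolls, iterations)
print(f"{latency:.4f}")  # 1.0094
```

Taking the median rather than the mean keeps the cold first sample (11353 cycles) from skewing the figure.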

Test 3: throughput

Count: 8

Code:

  stp w0, w1, [x6, #8]!
  stp w0, w1, [x7, #8]!
  stp w0, w1, [x8, #8]!
  stp w0, w1, [x9, #8]!
  stp w0, w1, [x10, #8]!
  stp w0, w1, [x11, #8]!
  stp w0, w1, [x12, #8]!
  stp w0, w1, [x13, #8]!
  mov x7, x6
  mov x8, x6
  mov x9, x6
  mov x10, x6
  mov x11, x6
  mov x12, x6
  mov x13, x6

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 1.0006

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
80209 | 81553 | 160401 | 80311 | 80090 | 80311 | 80002 | 240312 | 1360064 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80056 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80053 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80043 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1361923 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80048 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80048 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80048 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80048 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80048 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
80204 | 80048 | 160105 | 80105 | 80000 | 80104 | 80002 | 240312 | 1360051 | 160106 | 200 | 80008 | 200 | 240024 | 80005 | 80000 | 100
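
For the throughput test the cycle count is also divided by Count, because each unrolled copy holds eight independent STPs. The sketch below works through that normalization for the 100-unroll run and converts the result to stores per cycle. It assumes the figure is the median of the cycle (02) column; the variable names are illustrative only.

```python
from statistics import median

# cycle (02) values from the ten samples of the 100-unroll, 100-iteration run
cycles = [81553, 80056, 80053, 80043, 80048, 80048, 80048, 80048, 80048, 80048]
unrolls, iterations, count = 100, 100, 8

stores = unrolls * iterations * count        # 80000 STPs per sample
cycles_per_store = median(cycles) / stores   # reciprocal throughput
stores_per_cycle = 1 / cycles_per_store      # sustained throughput

print(f"{cycles_per_store:.4f}")  # 1.0006
print(f"{stores_per_cycle:.4f}")  # 0.9994
```

Under these assumptions the independent stores sustain about one STP per cycle, the same rate as the dependent chain in Test 2.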

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 1.0007

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
80029 | 80965 | 160305 | 80215 | 80090 | 80214 | 80002 | 240042 | 1360211 | 160016 | 20 | 80008 | 20 | 240024 | 80005 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240000 | 80001 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240000 | 80001 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240264 | 80069 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80035 | 240149 | 1360807 | 160085 | 20 | 80048 | 20 | 240000 | 80001 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240120 | 80033 | 80000 | 10
80025 | 80115 | 160064 | 80047 | 80017 | 80050 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240000 | 80001 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240000 | 80001 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240000 | 80001 | 80000 | 10
80024 | 80056 | 160011 | 80011 | 80000 | 80010 | 80000 | 240030 | 1360205 | 160010 | 20 | 80000 | 20 | 240000 | 80001 | 80000 | 10