Apple Microarchitecture Research by Dougall Johnson


NGCS (register, 64-bit)

Test 1: uops

Code:

  ngcs x0, x0
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000

Test 2: Latency 1->2

Code:

  ngcs x0, x0
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251562 | 0 | 10109 | 10210 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20296 | 10015 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251819 | 0 | 10107 | 10208 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 0 | 251774 | 0 | 10108 | 10208 | 0 | 20216 | 10001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 253256 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20118 | 10025 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253519 | 10066 | 10068 | 20020 | 10011 | 10010

Test 3: Latency 1->3

Chain cycles: 1

Code:

  ngcs x0, x1
  tst x0, 1
  mov x0, 1
  mov x1, 2
  mov x2, 3

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20201 | 20201 | 20208 | 508824 | 20211 | 20212 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20206 | 20090 | 20231 | 20231 | 20284 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20021 | 20021 | 20028 | 509965 | 20031 | 20032 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 510364 | 20066 | 20068 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 30020 | 20011 | 10010
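
The "minus 1 chain cycle" adjustment in the Test 3 results can be illustrated with a short sketch. It assumes (the page does not state this, though the numbers agree) that the cycle (02) reading is divided by unrolls times iterations to get cycles per unrolled copy of the code, and that the chain cycles of the helper instruction are then subtracted. The function name `latency_estimate` is hypothetical, and 20030 is the steady-state cycle (02) reading from the Test 3 tables.

```python
def latency_estimate(cycles, unrolls, iterations, chain_cycles=0):
    """Cycles per unrolled copy of the test code, minus helper-chain cycles.

    Assumption: the measured code is repeated `unrolls` times inside a loop
    that runs `iterations` times, so one copy costs
    cycles / (unrolls * iterations).
    """
    per_copy = cycles / (unrolls * iterations)
    return per_copy - chain_cycles


# Test 3 (ngcs + tst): one chain cycle is attributed to the helper tst.
print(f"{latency_estimate(20030, 100, 100, chain_cycles=1):.4f}")  # 1.0030
print(f"{latency_estimate(20030, 1000, 10, chain_cycles=1):.4f}")  # 1.0030
```

With `chain_cycles=0` the same calculation matches the tests that have no helper instruction, for example 10030 cycles over 100 x 100 copies in Test 2.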

Test 4: Latency 3->2

Chain cycles: 1

Code:

  ngcs x0, x1
  cset x1, cc
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20214 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20205 | 20060 | 20115 | 20115 | 20147 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30221 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519495 | 20018 | 20036 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010

Test 5: Latency 3->3

Code:

  ngcs x0, x1
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10033 | 10201 | 10201 | 10212 | 253344 | 10210 | 10212 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

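retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 0 | 253427 | 0 | 0 | 10029 | 10032 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 253383 | 0 | 0 | 10020 | 10020 | 0 | 0 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 0 | 252876 | 0 | 0 | 10030 | 10032 | 0 | 0 | 20044 | 10011 | 10010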

Test 6: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ngcs x0, x8
  ands xzr, xzr, xzr
  ngcs x1, x8
  ands xzr, xzr, xzr
  ngcs x2, x8
  ands xzr, xzr, xzr
  ngcs x3, x8
  ands xzr, xzr, xzr
  ngcs x4, x8
  ands xzr, xzr, xzr
  ngcs x5, x8
  ands xzr, xzr, xzr
  ngcs x6, x8
  ands xzr, xzr, xzr
  ngcs x7, x8
  mov x8, 9
  mov x9, 10

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7991

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160204 | 64059 | 160112 | 160112 | 160116 | 681833 | 160120 | 160222 | 160218 | 160013 | 80100
160204 | 63934 | 160113 | 160113 | 160118 | 653795 | 160119 | 160221 | 160216 | 160011 | 80100
160204 | 63933 | 160111 | 160111 | 160115 | 682061 | 160115 | 160216 | 160220 | 160015 | 80100
160204 | 63939 | 160111 | 160111 | 160115 | 682512 | 160118 | 160220 | 160216 | 160010 | 80100
160204 | 63915 | 160112 | 160112 | 160118 | 682253 | 160118 | 160220 | 160220 | 160015 | 80100
160204 | 63912 | 160113 | 160113 | 160119 | 682162 | 160115 | 160216 | 160256 | 160048 | 80100
160204 | 63935 | 160115 | 160115 | 160119 | 682224 | 160115 | 160216 | 160220 | 160015 | 80100
160204 | 63936 | 160111 | 160111 | 160116 | 682166 | 160119 | 160220 | 160216 | 160011 | 80100
160204 | 63937 | 160111 | 160111 | 160115 | 682137 | 160115 | 160216 | 160216 | 160011 | 80100
160204 | 63921 | 160112 | 160112 | 160116 | 682170 | 160119 | 160220 | 160216 | 160010 | 80100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7929

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 64838 | 160024 | 160024 | 0 | 0 | 160028 | 0 | 644784 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63814 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 656683 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63343 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 648722 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63409 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 649005 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63650 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 656376 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63359 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 647471 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63347 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 654943 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63359 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 644298 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63414 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 643269 | 160010 | 160020 | 160020 | 160001 | 80010
160024 | 63369 | 160011 | 160011 | 0 | 0 | 160010 | 0 | 648623 | 160010 | 160020 | 160020 | 160001 | 80010

Test 7: throughput

Count: 4

Code:

  fcmp s0, s0
  ngcs x0, x4
  ngcs x1, x4
  ngcs x2, x4
  ngcs x3, x4
  mov x4, 5
  mov x5, 6

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.6208

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24792 | 50108 | 40104 | 10004 | 40116 | 10005 | 309321 | 40013 | 50114 | 40211 | 10003 | 80226 | 20008 | 40002 | 40100
50204 | 24787 | 50108 | 40104 | 10004 | 40116 | 10005 | 309308 | 40013 | 50114 | 40211 | 10003 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80288 | 20024 | 40034 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309521 | 40041 | 50152 | 40242 | 10010 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.6197

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24858 | 50016 | 40014 | 10002 | 40024 | 10004 | 311152 | 40017 | 50028 | 40034 | 10004 | 80020 | 20000 | 40001 | 40010
50024 | 24686 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 40010

Test 8: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ngcs x0, x7
  ngcs x1, x7
  ngcs x2, x7
  ngcs x3, x7
  ngcs x4, x7
  ngcs x5, x7
  ngcs x6, x7
  mov x7, 8
  mov x8, 9
  mov x9, 10

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5844

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 40920 | 80110 | 80110 | 80119 | 540514 | 80115 | 80216 | 140234 | 80010 | 70100
80204 | 40922 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 540465 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 541028 | 80112 | 80212 | 140220 | 80007 | 70100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5837

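retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80024 | 40937 | 80015 | 80015 | 80027 | 545743 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543052 | 80010 | 80020 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010
80024 | 40862 | 80011 | 80011 | 80010 | 543747 | 80061 | 80071 | 140020 | 0 | 0 | 80001 | 0 | 0 | 70010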
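
The "median cycles for code divided by count" figure for Test 8 can be approximated from the cycle (02) readings with the sketch below. It assumes (not stated on the page) that the median of the listed samples is divided by unrolls times iterations and then by the count of NGCS instructions per copy. The function name `throughput_estimate` is hypothetical, and the sample lists are the cycle (02) values read from the two Test 8 tables. Reported figures for other tests may come from samples not shown in their tables.

```python
from statistics import median


def throughput_estimate(cycle_samples, unrolls, iterations, count):
    """Median cycles per unrolled copy, divided by instructions under test.

    Assumption: each copy of the code holds `count` instances of the
    measured instruction, and the loop runs `unrolls * iterations` copies.
    """
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy / count


# cycle (02) readings for Test 8, in table order.
test8_100x100 = [40920, 40922] + [40905] * 8
test8_1000x10 = [40937] + [40862] * 9

print(f"{throughput_estimate(test8_100x100, 100, 100, 7):.4f}")  # 0.5844
print(f"{throughput_estimate(test8_1000x10, 1000, 10, 7):.4f}")  # 0.5837
```

Using the median rather than the mean keeps the warm-up readings (40920, 40922, 40937) from shifting the result.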