Apple Microarchitecture Research by Dougall Johnson

NGCS (register, 32-bit)
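
For orientation: NGCS Wd, Wm is the architectural alias of SBCS Wd, WZR, Wm, so it reads Wm and the carry flag and writes Wd and the NZCV flags. The operand numbers in the latency test names below appear to follow that order (1 = Wd, 2 = Wm, 3 = flags); that reading is an inference from the test code, not something the page states. A minimal Python sketch of the 32-bit semantics, where the function name `ngcs32` and the dictionary layout of the flags are illustrative choices only:

```python
MASK32 = 0xFFFFFFFF


def ngcs32(wm, carry_in):
    """Model NGCS Wd, Wm as SBCS Wd, WZR, Wm: Wd = 0 + NOT(Wm) + C."""
    inverted = ~wm & MASK32                # NOT(Wm), truncated to 32 bits
    unsigned_sum = inverted + carry_in     # first operand is WZR, i.e. 0
    result = unsigned_sum & MASK32

    # Signed views, needed only for the overflow flag.
    signed_inverted = inverted - (1 << 32) if inverted >> 31 else inverted
    signed_sum = signed_inverted + carry_in
    signed_result = result - (1 << 32) if result >> 31 else result

    flags = {
        "N": result >> 31,                      # sign bit of the result
        "Z": int(result == 0),                  # result is zero
        "C": int(unsigned_sum > MASK32),        # carry out of bit 31
        "V": int(signed_result != signed_sum),  # signed overflow
    }
    return result, flags


# With C set, NGCS behaves like a plain negate: 5 -> -5 (0xFFFFFFFB).
print(hex(ngcs32(5, 1)[0]))
```

Because both the register result and the flags depend on the incoming carry, a chain of NGCS instructions can be serialized through either the register or the flags, which is what the separate latency tests probe.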

Test 1: uops

Code:

  ngcs w0, w0
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 2000 | 1001 | 1000

Test 2: Latency 1->2

Code:

  ngcs w0, w0
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10101 | 10101 | 10107 | 251533 | 10109 | 10210 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10106 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251787 | 10147 | 10248 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10106 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 20216 | 10001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10028 | 253292 | 10028 | 10030 | 20122 | 10025 | 10010
10024 | 10030 | 10021 | 10021 | 10028 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 20020 | 10011 | 10010
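
The reported latency can be reproduced from the raw counters, assuming the cycle (02) field reads 10030 in every run of both configurations and that the result is the median cycle count divided by the number of executed copies of the test code (unrolls times iterations). The helper name `cycles_per_instance` is illustrative only.

```python
from statistics import median


def cycles_per_instance(cycle_samples, unrolls, iterations):
    """Median cycle count divided by the number of executed copies of the code."""
    return median(cycle_samples) / (unrolls * iterations)


# cycle (02) readings, assumed to be 10030 in each of the ten runs
runs_100x100 = [10030] * 10   # 100 unrolls and 100 iterations
runs_1000x10 = [10030] * 10   # 1000 unrolls and 10 iterations

print(f"{cycles_per_instance(runs_100x100, 100, 100):.4f}")  # 1.0030
print(f"{cycles_per_instance(runs_1000x10, 1000, 10):.4f}")  # 1.0030
```

Both configurations execute 10000 copies of the instruction, so the same cycle total gives the same per-instruction latency.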

Test 3: Latency 1->3

Chain cycles: 1

Code:

  ngcs w0, w1
  tst x0, 1
  mov x0, 1
  mov x1, 2
  mov x2, 3

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20201 | 20201 | 20208 | 509055 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 30212 | 20101 | 10100
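
When a helper instruction closes the dependency loop (here TST, which turns the NGCS register result back into a flags input), its latency is subtracted as the "chain cycles". A small sketch of that arithmetic, assuming the cycle (02) field reads 20030 in each run and using the hypothetical helper name `chain_latency`:

```python
from statistics import median


def chain_latency(cycle_samples, unrolls, iterations, chain_cycles):
    """Per-copy cycles of the whole chain, minus the cycles spent in helper instructions."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy - chain_cycles


# cycle (02) readings, assumed to be 20030 in each of the ten runs
cycles = [20030] * 10

# NGCS + TST take about 2.003 cycles per copy; removing 1 chain cycle leaves the NGCS latency.
print(f"{chain_latency(cycles, 100, 100, chain_cycles=1):.4f}")  # 1.0030
```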

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20021 | 20021 | 20028 | 0 | 509965 | 0 | 0 | 20031 | 20032 | 0 | 0 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 0 | 509904 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 0 | 509904 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 0 | 509904 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 0 | 509904 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 0 | 509904 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 0 | 509904 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30038 | 20011 | 10010
20024200302002120021200204220588347371665622947727390504546300202001110010
20024 | 20030 | 20021 | 20021 | 20031 | 0 | 509782 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 0 | 509904 | 0 | 0 | 20020 | 20020 | 0 | 0 | 30020 | 20011 | 10010

Test 4: Latency 3->2

Chain cycles: 1

Code:

  ngcs w0, w1
  cset x1, cc
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519339 | 20108 | 20216 | 30221 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 20100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20018 | 519454 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 20010

Test 5: Latency 3->3

Code:

  ngcs w0, w1
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10210 | 253262 | 10208 | 10208 | 20228 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253318 | 10210 | 10210 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 20216 | 10101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10030 | 253235 | 10029 | 10032 | 20044 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 20020 | 10011 | 10010

Test 6: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ngcs w0, w8
  ands xzr, xzr, xzr
  ngcs w1, w8
  ands xzr, xzr, xzr
  ngcs w2, w8
  ands xzr, xzr, xzr
  ngcs w3, w8
  ands xzr, xzr, xzr
  ngcs w4, w8
  ands xzr, xzr, xzr
  ngcs w5, w8
  ands xzr, xzr, xzr
  ngcs w6, w8
  ands xzr, xzr, xzr
  ngcs w7, w8
  mov x8, 9
  mov x9, 10

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7992

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160205 | 63903 | 160145 | 160145 | 160155 | 0 | 681921 | 0 | 0 | 160113 | 160214 | 0 | 0 | 160218 | 160013 | 80100
160204 | 63502 | 160115 | 160115 | 160120 | 0 | 682134 | 0 | 0 | 160113 | 160214 | 0 | 0 | 160214 | 160009 | 80100
160204 | 63892 | 160112 | 160112 | 160118 | 0 | 681997 | 0 | 0 | 160116 | 160216 | 0 | 0 | 160220 | 160011 | 80100
160204 | 63935 | 160115 | 160115 | 160119 | 0 | 654216 | 0 | 0 | 160116 | 160217 | 0 | 0 | 160220 | 160012 | 80100
160204 | 63934 | 160111 | 160111 | 160115 | 0 | 682185 | 0 | 0 | 160118 | 160218 | 0 | 0 | 160220 | 160015 | 80100
160204 | 63913 | 160113 | 160113 | 160119 | 0 | 682238 | 0 | 0 | 160115 | 160216 | 0 | 0 | 160220 | 160011 | 80100
160204 | 63924 | 160112 | 160112 | 160116 | 0 | 682170 | 0 | 0 | 160119 | 160220 | 0 | 0 | 160220 | 160013 | 80100
160204 | 63920 | 160110 | 160110 | 160115 | 0 | 682153 | 0 | 0 | 160115 | 160216 | 0 | 0 | 160220 | 160015 | 80100
160204 | 63939 | 160111 | 160111 | 160115 | 0 | 682330 | 0 | 0 | 160118 | 160220 | 0 | 0 | 160220 | 160011 | 80100
160204 | 63903 | 160111 | 160111 | 160115 | 0 | 682162 | 0 | 0 | 160115 | 160216 | 0 | 0 | 160220 | 160011 | 80100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7928

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 65260 | 160020 | 160020 | 160026 | 0 | 678908 | 0 | 0 | 160030 | 160042 | 0 | 0 | 160076 | 160048 | 80010
160024 | 63832 | 160023 | 160023 | 160028 | 0 | 658127 | 0 | 0 | 160010 | 160020 | 0 | 0 | 160020 | 160001 | 80010
1600246341416001116001116001068746255598914043839310819869916651481400171216004216001480010
160024 | 63598 | 160011 | 160011 | 160010 | 0 | 651616 | 0 | 0 | 160010 | 160020 | 0 | 0 | 160020 | 160001 | 80010
160024 | 63332 | 160011 | 160011 | 160010 | 0 | 651992 | 0 | 0 | 160010 | 160020 | 0 | 0 | 160020 | 160001 | 80010
160024 | 64034 | 160011 | 160011 | 160010 | 0 | 657941 | 0 | 0 | 160010 | 160020 | 0 | 0 | 160020 | 160001 | 80010
160024 | 63548 | 160011 | 160011 | 160010 | 0 | 655537 | 0 | 0 | 160010 | 160020 | 0 | 0 | 160020 | 160001 | 80010
160024 | 63465 | 160011 | 160011 | 160010 | 0 | 659312 | 0 | 0 | 160010 | 160020 | 0 | 0 | 160020 | 160001 | 80010
160024 | 63377 | 160011 | 160011 | 160010 | 0 | 655742 | 0 | 0 | 160064 | 160076 | 0 | 0 | 160020 | 160001 | 80010
160024 | 63339 | 160011 | 160011 | 160010 | 0 | 656581 | 0 | 0 | 160010 | 160020 | 0 | 0 | 160020 | 160001 | 80010

Test 7: throughput

Count: 4

Code:

  fcmp s0, s0
  ngcs w0, w4
  ngcs w1, w4
  ngcs w2, w4
  ngcs w3, w4
  mov x4, 5
  mov x5, 6

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.6208

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24845 | 50104 | 40101 | 10003 | 40112 | 10004 | 308950 | 40013 | 50114 | 40211 | 10003 | 80226 | 20008 | 40002 | 40100
50204 | 24832 | 50104 | 40102 | 10002 | 40113 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309569 | 40046 | 50155 | 40244 | 10011 | 80222 | 20006 | 40003 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 40100
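
The throughput figure divides the per-copy cycle count by "Count", which here appears to be the four NGCS instructions in the block (the FCMP only supplies fresh flags and is not counted). A sketch of that calculation, assuming the cycle (02) field of the ten runs above reads 24845, 24832, and then 24831 eight times; the helper name `reciprocal_throughput` is illustrative only:

```python
from statistics import median


def reciprocal_throughput(cycle_samples, unrolls, iterations, count):
    """Median cycles per copy of the test code, divided by the measured instructions per copy."""
    return median(cycle_samples) / (unrolls * iterations) / count


# cycle (02) readings from the 100 unrolls and 100 iterations runs (assumed field split)
cycles = [24845, 24832] + [24831] * 8

value = reciprocal_throughput(cycles, 100, 100, count=4)
print(f"{value:.4f} cycles per NGCS")     # 0.6208
print(f"{1 / value:.2f} NGCS per cycle")  # 1.61
```

Using the median keeps the two slower warm-up runs from skewing the result.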

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.6197

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
50024 | 24888 | 50014 | 40012 | 10002 | 0 | 40019 | 10003 | 309559 | 40000 | 50010 | 40020 | 10000 | 80114 | 0 | 20024 | 40028 | 0 | 0 | 40010
50024 | 24688 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24826 | 50016 | 40014 | 10002 | 0 | 40024 | 10004 | 311173 | 40017 | 50028 | 40034 | 10004 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 309586 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 0 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 80020 | 0 | 20000 | 40001 | 0 | 0 | 40010

Test 8: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ngcs w0, w7
  ngcs w1, w7
  ngcs w2, w7
  ngcs w3, w7
  ngcs w4, w7
  ngcs w5, w7
  ngcs w6, w7
  mov x7, 8
  mov x8, 9
  mov x9, 10

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5844

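retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80204 | 40934 | 80106 | 80106 | 80114 | 0 | 540657 | 0 | 0 | 80116 | 80216 | 0 | 0 | 140228 | 0 | 0 | 80004 | 0 | 0 | 70100
80204 | 40922 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140286 | 0 | 0 | 80037 | 0 | 0 | 70100
80204 | 40915 | 80108 | 80108 | 80116 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100
80204 | 40904 | 80105 | 80105 | 80113 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 140220 | 0 | 0 | 80007 | 0 | 0 | 70100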

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5839

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80025 | 41036 | 80060 | 80060 | 0 | 0 | 80078 | 0 | 545998 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542378 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 140020 | 80011 | 70010