Apple Microarchitecture Research by Dougall Johnson

SBCS (64-bit)

Test 1: uops

Code:

  sbcs x0, x0, x1
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000

Test 2: Latency 1->2

Code:

  sbcs x0, x0, x1
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10101 | 10101 | 10108 | 251641 | 10109 | 10210 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251658 | 10106 | 10206 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10107 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10109 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10028 | 253256 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253613 | 10068 | 10068 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010

Test 3: Latency 1->3

Code:

  sbcs x0, x1, x0
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10101 | 10101 | 10109 | 251563 | 10106 | 10206 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 253129 | 10030 | 10030 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253284 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010

Test 4: Latency 1->4

Chain cycles: 1

Code:

  sbcs x0, x1, x2
  tst x0, 1
  mov x0, 1
  mov x1, 2
  mov x2, 3

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20201 | 20201 | 20208 | 508746 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 20101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20021 | 20021 | 20029 | 509756 | 20030 | 20032 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
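
The "minus 1 chain cycle" result can be reproduced from the cycle (02) counter. The following is a minimal sketch, assuming the result is the median cycle count divided by the number of unrolled instances (unrolls times iterations), with the helper instruction's chain cycles subtracted afterwards. The function name `cycles_per_instance` and its parameters are illustrative and not part of the original tooling.

```python
from statistics import median

def cycles_per_instance(cycle_samples, unrolls, iterations, chain_cycles=0):
    """Median cycles per unrolled instance, minus cycles spent in helper instructions.

    Assumption: the page's "Result" line is derived this way; the
    original measurement harness is not shown in the source.
    """
    instances = unrolls * iterations
    return median(cycle_samples) / instances - chain_cycles

# cycle (02) values from Test 4, 100 unrolls and 100 iterations (all ten runs agree)
test4_cycles = [20030] * 10
result = cycles_per_instance(test4_cycles, unrolls=100, iterations=100, chain_cycles=1)
print(f"{result:.4f}")  # prints 1.0030
```

With `chain_cycles=0` and the Test 2 cycle count of 10030, the same function gives 1.0030, which matches the results reported for the tests without a helper instruction.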

Test 5: Latency 4->2

Chain cycles: 1

Code:

  sbcs x0, x1, x2
  cset x1, cc
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519312 | 20107 | 20212 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519834 | 20147 | 20261 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519371 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010

Test 6: Latency 4->3

Chain cycles: 1

Code:

  sbcs x0, x1, x2
  cset x2, cc
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519222 | 20107 | 20212 | 40228 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519507 | 20017 | 20032 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 20010

Test 7: Latency 4->4

Code:

  sbcs x0, x1, x2
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10208 | 253108 | 10212 | 10214 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 253220 | 10030 | 10032 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253427 | 10029 | 10032 | 30131 | 10025 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010

Test 8: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  sbcs x0, x8, x9
  ands xzr, xzr, xzr
  sbcs x1, x8, x9
  ands xzr, xzr, xzr
  sbcs x2, x8, x9
  ands xzr, xzr, xzr
  sbcs x3, x8, x9
  ands xzr, xzr, xzr
  sbcs x4, x8, x9
  ands xzr, xzr, xzr
  sbcs x5, x8, x9
  ands xzr, xzr, xzr
  sbcs x6, x8, x9
  ands xzr, xzr, xzr
  sbcs x7, x8, x9
  mov x8, 9
  mov x9, 10
  mov x10, 11

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7991

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160204 | 64082 | 160115 | 160115 | 160122 | 682068 | 160115 | 160216 | 240284 | 160047 | 80100
160204 | 63935 | 160109 | 160109 | 160113 | 682133 | 160115 | 160216 | 240230 | 160015 | 80100
160204 | 63921 | 160112 | 160112 | 160116 | 682095 | 160122 | 160222 | 240224 | 160012 | 80100
160204 | 63936 | 160115 | 160115 | 160119 | 682499 | 160118 | 160220 | 240230 | 160011 | 80100
160204 | 63892 | 160112 | 160112 | 160118 | 682061 | 160115 | 160216 | 240224 | 160011 | 80100
160204 | 63936 | 160111 | 160111 | 160115 | 682065 | 160115 | 160216 | 240230 | 160015 | 80100
160204 | 63918 | 160111 | 160111 | 160115 | 682142 | 160115 | 160216 | 240224 | 160011 | 80100
160204 | 63920 | 160110 | 160110 | 160115 | 682240 | 160116 | 160216 | 240230 | 160015 | 80100
160204 | 63899 | 160110 | 160110 | 160115 | 682149 | 160115 | 160216 | 240230 | 160011 | 80100
160205 | 63955 | 160149 | 160149 | 160156 | 682133 | 160115 | 160216 | 240230 | 160015 | 80100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7985

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 65224 | 160020 | 160020 | 160026 | 682042 | 160031 | 160042 | 240053 | 160016 | 80010
160024 | 64124 | 160011 | 160011 | 160010 | 682158 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63868 | 160011 | 160011 | 160010 | 686221 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63880 | 160011 | 160011 | 160010 | 682346 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63867 | 160011 | 160011 | 160010 | 682511 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63866 | 160011 | 160011 | 160010 | 682599 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63873 | 160011 | 160011 | 160010 | 682516 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63874 | 160011 | 160011 | 160010 | 682400 | 160066 | 160077 | 240020 | 160001 | 80010
160024 | 63893 | 160011 | 160011 | 160010 | 682121 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63878 | 160011 | 160011 | 160010 | 682498 | 160010 | 160020 | 240020 | 160001 | 80010
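
The throughput figure "median cycles for code divided by count" can be checked the same way against the cycle (02) counter. This is a sketch under the assumption that "count" is the number of SBCS instructions in the unrolled block, and that the median is taken over the ten runs before dividing. The name `cycles_per_instruction` is hypothetical.

```python
from statistics import median

def cycles_per_instruction(cycle_samples, unrolls, iterations, count):
    """Median cycles per unrolled block, divided by the instructions measured per block.

    Assumption: this mirrors how the page's throughput "Result" is computed;
    the original harness is not shown in the source.
    """
    block_cycles = median(cycle_samples) / (unrolls * iterations)
    return block_cycles / count

# cycle (02) values from Test 8, 100 unrolls and 100 iterations
test8_cycles = [64082, 63935, 63921, 63936, 63892,
                63936, 63918, 63920, 63899, 63955]
cpi = cycles_per_instruction(test8_cycles, unrolls=100, iterations=100, count=8)
print(f"{cpi:.4f}")      # prints 0.7991
print(f"{1 / cpi:.2f}")  # prints 1.25 (SBCS instructions per cycle in this test)
```

With an even number of runs, `statistics.median` averages the two middle samples (63921 and 63935 here), which is what makes the computed value line up with the reported 0.7991.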

Test 9: throughput

Count: 4

Code:

  fcmp s0, s0
  sbcs x0, x4, x5
  sbcs x1, x4, x5
  sbcs x2, x4, x5
  sbcs x3, x4, x5
  mov x4, 5
  mov x5, 6
  mov x6, 7

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.6208

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24844 | 50105 | 40103 | 10002 | 40111 | 10003 | 309349 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40002 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 310384 | 40053 | 50160 | 40247 | 10013 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.6197

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24869 | 50016 | 40014 | 10002 | 40024 | 10004 | 311202 | 40017 | 50026 | 40032 | 10004 | 120020 | 20000 | 40001 | 40010
50024 | 24790 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120047 | 20006 | 40002 | 40010
50024 | 24798 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311342 | 40045 | 50066 | 40065 | 10011 | 120020 | 20000 | 40001 | 40010

Test 10: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  sbcs x0, x7, x8
  sbcs x1, x7, x8
  sbcs x2, x7, x8
  sbcs x3, x7, x8
  sbcs x4, x7, x8
  sbcs x5, x7, x8
  sbcs x6, x7, x8
  mov x7, 8
  mov x8, 9
  mov x9, 10

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5844

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 41090 | 80106 | 80106 | 0 | 80114 | 541270 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40880 | 80106 | 80106 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 0 | 80112 | 541028 | 80112 | 80212 | 210230 | 80006 | 70100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5839

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 41021 | 80033 | 80033 | 0 | 0 | 80041 | 0 | 543427 | 80036 | 80036 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40899 | 80027 | 80027 | 0 | 0 | 80037 | 0 | 541666 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80025 | 40915 | 80060 | 80060 | 0 | 0 | 80078 | 0 | 545066 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40899 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 538455 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542540 | 80036 | 80036 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010