Apple Microarchitecture Research by Dougall Johnson

ADCS (32-bit)

Test 1: uops

Code:

  adcs w0, w0, w1   // instruction under test
  mov x0, 1         // register initialisation, outside the unrolled code
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
1004 | 1030 | 1001 | 1001 | 1000 | 25043 | 1000 | 1000 | 3000 | 1001 | 1000
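
As a rough cross-check of the summary figures for this test, the raw counts from one table row can be divided by the 1000 unrolled copies of the instruction. This is only a sketch: the values are read off the first row, the dictionary is shorthand for this example, and the small excess over whole numbers is assumed to be harness overhead rather than part of ADCS itself.

```python
# Counts from one row of the Test 1 table (1000 unrolls, 1 iteration).
UNROLLS = 1000
row = {
    "retire uop (01)": 1004,
    "schedule uop (52)": 1001,
    "schedule int uop (53)": 1001,
    "map int uop inputs (7f)": 3000,
}

# Normalise each counter to a per-instruction figure.
per_instruction = {name: count / UNROLLS for name, count in row.items()}
for name, value in per_instruction.items():
    print(f"{name}: {value:.3f}")
```

Three inputs per uop is consistent with ADCS reading two source registers plus the carry flag.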

Test 2: Latency 1->2

Code:

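  adcs w0, w0, w1   // result w0 feeds the first source of the next adcs
  mov x0, 1         // register initialisation, outside the unrolled code
  mov x1, 2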

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10101 | 10101 | 10108 | 251641 | 10106 | 10206 | 30218 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30344 | 10015 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10028 | 253292 | 10028 | 10030 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30131 | 10032 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
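
The two configurations of this test retire slightly different instruction totals (10204 and 10024), which can be used to separate per-iteration loop overhead from a fixed cost. A minimal sketch, assuming retired instructions follow `body + per_iter * iterations + fixed`; that linear model and the variable names are assumptions of this example, not something the page states.

```python
from fractions import Fraction

# (iterations, retired instructions) for the two Test 2 configurations.
run_a = (100, 10204)   # 100 unrolls x 100 iterations
run_b = (10, 10024)    # 1000 unrolls x 10 iterations
BODY = 10000           # unrolled ADCS instructions executed in both runs

# Overhead beyond the unrolled body: per_iter * iterations + fixed.
extra_a = run_a[1] - BODY
extra_b = run_b[1] - BODY
per_iter = Fraction(extra_a - extra_b, run_a[0] - run_b[0])
fixed = extra_a - per_iter * run_a[0]
print(per_iter, fixed)  # 2 4
```

Two retired instructions per iteration matches the SUBS and B.cc of the loop, and a fixed cost of 4 matches the 1004 retires that Test 1 reports for 1000 unrolls with no loop.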

Test 3: Latency 1->3

Code:

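  adcs w0, w1, w0   // result w0 feeds the second source of the next adcs
  mov x0, 1         // register initialisation, outside the unrolled code
  mov x1, 2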

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10101 | 10101 | 10109 | 251774 | 10108 | 10208 | 30230 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10106 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251774 | 10108 | 10208 | 30224 | 10001 | 10100
10204 | 10030 | 10101 | 10101 | 10108 | 251520 | 10106 | 10206 | 30224 | 10001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10028 | 253256 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253161 | 10020 | 10020 | 30020 | 10011 | 10010

Test 4: Latency 1->4

Chain cycles: 1

Code:

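  adcs w0, w1, w2   // reads the carry flag written by the preceding tst
  tst x0, 1         // helper: consumes the adcs result and rewrites the flags
  mov x0, 1         // register initialisation, outside the unrolled code
  mov x1, 2
  mov x2, 3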

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
2020420030202012020120208509018202082020834409241516821108129529
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100
20204 | 20030 | 20201 | 20201 | 20208 | 509018 | 20208 | 20208 | 40216 | 0 | 20101 | 0 | 0 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20021 | 20021 | 20029 | 509854 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 510224 | 20069 | 20070 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
20024 | 20030 | 20021 | 20021 | 20020 | 509904 | 20020 | 20020 | 40020 | 20011 | 10010
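
The reported result can be reproduced from the cycle counter: divide the cycles by the number of executed copies of the code, then subtract the chain cycle. The sketch below uses the 20030 cycles that the rows of this test report. The function name is made up for this example, and attributing the chain cycle to the `tst` helper is an assumption.

```python
def latency_per_instruction(cycles, unrolls, iterations, chain_cycles=0):
    """Cycles per executed copy of the code, minus the helper's chain cycles."""
    return cycles / (unrolls * iterations) - chain_cycles

# Test 4: adcs + tst, with 1 chain cycle subtracted for the helper.
result = latency_per_instruction(20030, 100, 100, chain_cycles=1)
print(f"{result:.4f}")  # 1.0030
```

With `chain_cycles=0` and 10030 cycles, the same function gives the 1.0030 reported for Tests 2 and 3.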

Test 5: Latency 4->2

Chain cycles: 1

Code:

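  adcs w0, w1, w2   // writes the flags read by cset
  cset x1, cc       // helper: turns the carry flag into x1, the first source of the next adcs
  mov x0, 1         // register initialisation, outside the unrolled code
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5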

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519234 | 20108 | 20216 | 40224 | 0 | 20001 | 0 | 20100
20204 | 20030 | 20101 | 20101 | 20107 | 519416 | 20107 | 20212 | 40232 | 0 | 20001 | 0 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 0 | 20001 | 0 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 0 | 20001 | 0 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 0 | 20001 | 0 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 0 | 20001 | 0 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 0 | 20001 | 0 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 0 | 20001 | 0 | 20100
2020420030201012010120108519548201082021640557219201175420239
20204 | 20030 | 20101 | 20101 | 20108 | 519434 | 20107 | 20214 | 40232 | 0 | 20001 | 0 | 20100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 0 | 519507 | 0 | 0 | 20017 | 20032 | 0 | 0 | 40020 | 20001 | 20010
20025200602002520025200589467784150126864106446361426711103347400522000120010
20024 | 20030 | 20011 | 20011 | 20017 | 0 | 519852 | 0 | 0 | 20058 | 20083 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20019 | 0 | 519476 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519956 | 0 | 0 | 20058 | 20084 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010

Test 6: Latency 4->3

Chain cycles: 1

Code:

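  adcs w0, w1, w2   // writes the flags read by cset
  cset x2, cc       // helper: turns the carry flag into x2, the second source of the next adcs
  mov x0, 1         // register initialisation, outside the unrolled code
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5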

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519416 | 20107 | 20212 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40328 | 20015 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 20100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

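retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 0 | 519507 | 0 | 0 | 20017 | 20032 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40052 | 20001 | 20010
20024 | 20030 | 20011 | 20011 | 20010 | 0 | 519598 | 0 | 0 | 20010 | 20020 | 0 | 0 | 40020 | 20001 | 20010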

Test 7: Latency 4->4

Code:

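  adcs w0, w1, w2   // carry flag written here is read by the next adcs
  mov x0, 1         // register initialisation, outside the unrolled code
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5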

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10212 | 253108 | 10212 | 10214 | 30230 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10210 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100
10204 | 10030 | 10201 | 10201 | 10208 | 253432 | 10208 | 10208 | 30224 | 10101 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 253167 | 10029 | 10032 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010
10024 | 10030 | 10021 | 10021 | 10020 | 253383 | 10020 | 10020 | 30020 | 10011 | 10010

Test 8: throughput

Count: 8

Code:

  ands xzr, xzr, xzr   // sets NZCV; result discarded
  adcs w0, w8, w9      // one of the 8 measured adcs (Count: 8)
  ands xzr, xzr, xzr
  adcs w1, w8, w9
  ands xzr, xzr, xzr
  adcs w2, w8, w9
  ands xzr, xzr, xzr
  adcs w3, w8, w9
  ands xzr, xzr, xzr
  adcs w4, w8, w9
  ands xzr, xzr, xzr
  adcs w5, w8, w9
  ands xzr, xzr, xzr
  adcs w6, w8, w9
  ands xzr, xzr, xzr
  adcs w7, w8, w9
  mov x8, 9            // register initialisation, outside the unrolled code
  mov x9, 10
  mov x10, 11

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7992

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160204 | 64079 | 160114 | 160114 | 160120 | 0 | 682083 | 0 | 160118 | 160220 | 0 | 240230 | 160013 | 80100
160204 | 63937 | 160113 | 160113 | 160117 | 0 | 682516 | 0 | 160118 | 160220 | 0 | 240224 | 160011 | 80100
160204 | 63893 | 160111 | 160111 | 160118 | 0 | 681963 | 0 | 160116 | 160216 | 0 | 240224 | 160012 | 80100
160204 | 63938 | 160111 | 160111 | 160115 | 0 | 682256 | 0 | 160119 | 160220 | 0 | 240230 | 160015 | 80100
160204 | 63924 | 160111 | 160111 | 160118 | 0 | 682238 | 0 | 160115 | 160216 | 0 | 240224 | 160010 | 80100
160204 | 63939 | 160115 | 160115 | 160119 | 0 | 682205 | 0 | 160119 | 160220 | 0 | 240230 | 160015 | 80100
160204 | 63908 | 160111 | 160111 | 160115 | 0 | 682141 | 0 | 160115 | 160216 | 0 | 240284 | 160049 | 80100
160204 | 63892 | 160112 | 160112 | 160118 | 0 | 681993 | 0 | 160119 | 160220 | 0 | 240224 | 160011 | 80100
160204 | 63937 | 160113 | 160113 | 160119 | 0 | 682162 | 0 | 160115 | 160216 | 0 | 240230 | 160011 | 80100
160204 | 63922 | 160110 | 160110 | 160115 | 0 | 682170 | 0 | 160119 | 160220 | 0 | 240224 | 160010 | 80100
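
The throughput figure for this configuration can be reproduced from the cycle (02) column: take a median over the ten runs, then divide by the executed copies of the block and by Count. The sketch below assumes the upper median (`statistics.median_high`); that choice is a guess that happens to reproduce 0.7992, whereas the plain median of these ten values would round to 0.7991.

```python
import statistics

# cycle (02) values from the ten runs with 100 unrolls x 100 iterations.
cycles = [64079, 63937, 63893, 63938, 63924,
          63939, 63908, 63892, 63937, 63922]
UNROLLS, ITERATIONS, COUNT = 100, 100, 8

# Upper median of the runs (assumed), then cycles per measured adcs.
median_cycles = statistics.median_high(cycles)
per_adcs = median_cycles / (UNROLLS * ITERATIONS * COUNT)
print(f"{per_adcs:.4f}")  # 0.7992
```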

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7985

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 65324 | 160023 | 160023 | 160028 | 681616 | 160026 | 160036 | 240020 | 160001 | 80010
160024 | 64061 | 160011 | 160011 | 160010 | 682544 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63887 | 160011 | 160011 | 160010 | 682315 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63881 | 160011 | 160011 | 160010 | 682619 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63872 | 160011 | 160011 | 160010 | 682995 | 160068 | 160078 | 240020 | 160001 | 80010
160025 | 63838 | 160058 | 160058 | 160066 | 685949 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63860 | 160011 | 160011 | 160010 | 682690 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63881 | 160011 | 160011 | 160010 | 682598 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63896 | 160011 | 160011 | 160010 | 682560 | 160010 | 160020 | 240020 | 160001 | 80010
160024 | 63878 | 160011 | 160011 | 160010 | 682570 | 160010 | 160020 | 240020 | 160001 | 80010

Test 9: throughput

Count: 4

Code:

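  fcmp s0, s0       // sets NZCV from a floating-point compare
  adcs w0, w4, w5   // one of the 4 measured adcs (Count: 4)
  adcs w1, w4, w5
  adcs w2, w4, w5
  adcs w3, w4, w5
  mov x4, 5         // register initialisation, outside the unrolled code
  mov x5, 6
  mov x6, 7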

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.6208

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24830 | 50105 | 40103 | 10002 | 40111 | 10003 | 308693 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24840 | 50104 | 40101 | 10003 | 40112 | 10004 | 308503 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309224 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100
50204 | 24831 | 50104 | 40101 | 10003 | 40112 | 10004 | 309222 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 40100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.6197

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24878 | 50016 | 40014 | 10002 | 40024 | 10004 | 313082 | 40013 | 50022 | 40029 | 10003 | 120020 | 20000 | 40001 | 40010
50024 | 24783 | 50011 | 40011 | 10000 | 40010 | 10000 | 311962 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311976 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010
50024 | 24789 | 50011 | 40011 | 10000 | 40010 | 10000 | 311978 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 40010

Test 10: throughput

Count: 7

Code:

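  ands xzr, xzr, xzr   // sets NZCV; result discarded
  adcs w0, w7, w8      // one of the 7 measured adcs (Count: 7)
  adcs w1, w7, w8
  adcs w2, w7, w8
  adcs w3, w7, w8
  adcs w4, w7, w8
  adcs w5, w7, w8
  adcs w6, w7, w8
  mov x7, 8            // register initialisation, outside the unrolled code
  mov x8, 9
  mov x9, 10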

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5844

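retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 40921 | 80110 | 80110 | 80116 | 0 | 540886 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40886 | 80106 | 80106 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210344 | 80038 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100
80204 | 40905 | 80107 | 80107 | 80112 | 0 | 541028 | 0 | 0 | 80112 | 80212 | 0 | 0 | 210230 | 80007 | 70100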

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5839

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 41031 | 80025 | 80025 | 80036 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40845 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010
80024 | 40876 | 80021 | 80021 | 80020 | 542624 | 80020 | 80020 | 210020 | 80011 | 70010