Apple Microarchitecture Research by Dougall Johnson


CCMP (register, 32-bit)
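For orientation, here is a minimal Python sketch of what the tested form, ccmp w0, w1, #0, hi, computes architecturally: if the HI condition holds on the incoming flags (C set and Z clear), NZCV is set from the 32-bit comparison of the two registers, otherwise NZCV is set to the immediate. The function names are made up for illustration and are not part of the test harness.

```python
def nzcv_from_sub32(a: int, b: int) -> int:
    """Return NZCV (as a 4-bit value) for the 32-bit comparison a - b."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = result >> 31                 # sign bit of the result
    z = int(result == 0)             # operands were equal
    c = int(a >= b)                  # carry set means no unsigned borrow
    # Signed overflow: operands differ in sign and the result's sign differs from a.
    v = (((a ^ b) & (a ^ result)) >> 31) & 1
    return (n << 3) | (z << 2) | (c << 1) | v


def cond_hi(nzcv: int) -> bool:
    """HI condition: C == 1 and Z == 0."""
    c = (nzcv >> 1) & 1
    z = (nzcv >> 2) & 1
    return c == 1 and z == 0


def ccmp_hi(wn: int, wm: int, imm_nzcv: int, nzcv_in: int) -> int:
    """Model of `ccmp wn, wm, #imm_nzcv, hi`: returns the new NZCV value."""
    if cond_hi(nzcv_in):
        return nzcv_from_sub32(wn, wm)
    return imm_nzcv & 0xF


if __name__ == "__main__":
    # Condition holds (C=1, Z=0): flags come from comparing 5 with 3.
    print(bin(ccmp_hi(5, 3, 0, 0b0010)))
    # Condition fails (Z=1): flags come from the immediate #0.
    print(bin(ccmp_hi(5, 3, 0, 0b0110)))
```

The model takes three inputs (two registers and the incoming flags) and produces only flags, which is why the latency tests below use CSET to turn the flag result back into a register value.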

Test 1: uops

Code:

  ccmp w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25183 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001

Test 2: Latency 3->1

Chain cycles: 1

Code:

  ccmp w0, w1, #0, hi
  cset x0, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519752 | 20147 | 20258 | 40228 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519866 | 20148 | 20260 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 520019 | 20152 | 20260 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
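The result for this run can be reproduced from the cycle (02) counts, read here as 20030 in all ten runs. This is a minimal sketch that assumes the result is the median cycle count divided by unrolls times iterations, minus the chain cycle; the function name is hypothetical and the harness may compute it differently.

```python
from statistics import median


def latency_from_cycles(cycles, unrolls, iterations, chain_cycles):
    """Cycles per unrolled copy of the test code, minus the known chain latency."""
    per_copy = median(cycles) / (unrolls * iterations)
    return per_copy - chain_cycles


if __name__ == "__main__":
    # cycle (02) counts for the ten runs of Test 2 (100 unrolls, 100 iterations)
    cycles = [20030] * 10
    latency = latency_from_cycles(cycles, unrolls=100, iterations=100, chain_cycles=1)
    print(f"{latency:.4f}")  # prints 1.0030
```

The one chain cycle subtracted here is the CSET that feeds the flag result back into a register input of the next CCMP.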

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010

Test 3: Latency 3->2

Chain cycles: 1

Code:

  ccmp w0, w1, #0, hi
  cset x1, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0475

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519328 | 20107 | 20214 | 40232 | 20001 | 10100
20204 | 20331 | 20233 | 20233 | 20377 | 523288 | 20464 | 20586 | 40797 | 20121 | 10100
20204 | 20346 | 20229 | 20229 | 20374 | 523666 | 20506 | 20626 | 41066 | 20192 | 10100
20204 | 20374 | 20248 | 20248 | 20416 | 522834 | 20416 | 20544 | 41246 | 20234 | 10100
20204 | 20279 | 20208 | 20208 | 20329 | 523285 | 20459 | 20582 | 41055 | 20192 | 10100
20204 | 20401 | 20250 | 20250 | 20417 | 523288 | 20460 | 20579 | 40976 | 20171 | 10100
20204 | 20278 | 20210 | 20210 | 20330 | 523283 | 20505 | 20636 | 40974 | 20170 | 10100
20204 | 20524 | 20313 | 20313 | 20548 | 522342 | 20371 | 20492 | 40796 | 20127 | 10100
20204 | 20524 | 20313 | 20313 | 20548 | 523303 | 20460 | 20579 | 40882 | 20149 | 10100
20204 | 20477 | 20294 | 20294 | 20506 | 523739 | 20507 | 20639 | 41174 | 20213 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519516 | 20018 | 20034 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010

Test 4: Latency 3->3

Code:

  ccmp w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30242 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10212 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 255005 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30059 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10

Test 5: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7889

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160204 | 63213 | 160115 | 160115 | 160120 | 686407 | 160118 | 160218 | 240233 | 160015 | 100
160204 | 63145 | 160113 | 160113 | 160117 | 689534 | 160118 | 160220 | 240230 | 160015 | 100
160204 | 63087 | 160112 | 160112 | 160118 | 688308 | 160120 | 160220 | 240224 | 160012 | 100
160204 | 63082 | 160112 | 160112 | 160118 | 689796 | 160118 | 160220 | 240230 | 160014 | 100
160204 | 63096 | 160117 | 160117 | 160121 | 672010 | 160116 | 160217 | 240230 | 160015 | 100
160204 | 63112 | 160112 | 160112 | 160118 | 685997 | 160118 | 160220 | 240224 | 160010 | 100
160205 | 62704 | 160148 | 160148 | 160154 | 688958 | 160120 | 160220 | 240230 | 160014 | 100
160204 | 63129 | 160112 | 160112 | 160118 | 688458 | 160120 | 160224 | 240230 | 160015 | 100
160204 | 63114 | 160109 | 160109 | 160118 | 688958 | 160120 | 160220 | 240230 | 160013 | 100
160204 | 63129 | 160112 | 160112 | 160118 | 689253 | 160120 | 160220 | 240230 | 160012 | 100
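As a cross-check, the throughput figure for this run can be recomputed from the cycle (02) counts. This is a minimal sketch that assumes the cycle count is the second field of each row and that the result is the median cycle count divided by unrolls times iterations times count; the function name is hypothetical.

```python
from statistics import median


def cycles_per_count(cycles, unrolls, iterations, count):
    """Median cycles per unrolled copy, divided by the test's 'Count' value."""
    return median(cycles) / (unrolls * iterations * count)


if __name__ == "__main__":
    # cycle (02) counts for the ten runs of Test 5 (100 unrolls, 100 iterations)
    cycles = [63213, 63145, 63087, 63082, 63096,
              63112, 62704, 63129, 63114, 63129]
    result = cycles_per_count(cycles, unrolls=100, iterations=100, count=8)
    print(f"{result:.4f}")      # prints 0.7889
    print(f"{1 / result:.3f}")  # counted units per cycle
```

The reciprocal is printed only as a convenience for reading the figure as a rate instead of a cost.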

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7831

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 64562 | 160023 | 160023 | 160027 | 692247 | 160027 | 160038 | 240047 | 160011 | 10
160024 | 63217 | 160023 | 160023 | 160030 | 670623 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62668 | 160011 | 160011 | 160010 | 671578 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62515 | 160011 | 160011 | 160010 | 669902 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62666 | 160011 | 160011 | 160010 | 671483 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62507 | 160011 | 160011 | 160010 | 669982 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62671 | 160011 | 160011 | 160010 | 671751 | 160071 | 160082 | 240020 | 160001 | 10
160024 | 62527 | 160011 | 160011 | 160010 | 670286 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62669 | 160011 | 160011 | 160010 | 672033 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62567 | 160011 | 160011 | 160010 | 670569 | 160010 | 160020 | 240020 | 160001 | 10

Test 6: throughput

Count: 4

Code:

  fcmp s0, s0
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5998

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24003 | 50110 | 40106 | 10004 | 40117 | 10005 | 315434 | 40017 | 50118 | 40214 | 10004 | 120242 | 20008 | 40003 | 100
50204 | 23994 | 50103 | 40101 | 10002 | 40112 | 10004 | 315122 | 40017 | 50118 | 40214 | 10004 | 120227 | 20006 | 40001 | 100
50204 | 24004 | 50106 | 40103 | 10003 | 40112 | 10004 | 315063 | 40013 | 50112 | 40209 | 10003 | 120236 | 20008 | 40001 | 100
50204 | 23982 | 50104 | 40101 | 10003 | 40109 | 10003 | 315026 | 40017 | 50116 | 40212 | 10004 | 120227 | 20006 | 40001 | 100
50204 | 23981 | 50104 | 40101 | 10003 | 40112 | 10004 | 315105 | 40016 | 50120 | 40216 | 10004 | 120236 | 20008 | 40001 | 100
50204 | 24000 | 50103 | 40101 | 10002 | 40112 | 10004 | 314993 | 40013 | 50112 | 40209 | 10003 | 120227 | 20006 | 40003 | 100
50204 | 24004 | 50106 | 40103 | 10003 | 40112 | 10004 | 315509 | 40018 | 50116 | 40212 | 10004 | 120236 | 20008 | 40002 | 100
50204 | 23988 | 50104 | 40101 | 10003 | 40109 | 10003 | 315436 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 100
50204 | 23967 | 50103 | 40101 | 10002 | 40109 | 10003 | 315898 | 40017 | 50116 | 40212 | 10004 | 120332 | 20022 | 40028 | 100
50204 | 23993 | 50103 | 40101 | 10002 | 40109 | 10003 | 314891 | 40012 | 50112 | 40209 | 10003 | 120248 | 20008 | 40007 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5998

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24255 | 50012 | 40011 | 10001 | 40016 | 10002 | 316440 | 40017 | 50028 | 40034 | 10004 | 120020 | 20000 | 40001 | 10
50024 | 23991 | 50011 | 40011 | 10000 | 40010 | 10000 | 315828 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23962 | 50011 | 40011 | 10000 | 40010 | 10000 | 316838 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23984 | 50011 | 40011 | 10000 | 40010 | 10000 | 316679 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 316186 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 315708 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23975 | 50011 | 40011 | 10000 | 40010 | 10000 | 316012 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23970 | 50011 | 40011 | 10000 | 40010 | 10000 | 316365 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 315746 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23978 | 50011 | 40011 | 10000 | 40010 | 10000 | 316301 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10

Test 7: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  ccmp w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5568

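retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80204 | 38907 | 80106 | 80106 | 0 | 0 | 80113 | 0 | 550439 | 80117 | 80218 | 210236 | 0 | 0 | 80006 | 0 | 0 | 100
80204 | 38902 | 80103 | 80103 | 0 | 0 | 80112 | 0 | 549992 | 80111 | 80212 | 210230 | 0 | 0 | 80003 | 0 | 0 | 100
80204 | 38969 | 80103 | 80103 | 0 | 0 | 80111 | 0 | 550563 | 80111 | 80212 | 210230 | 0 | 0 | 80005 | 0 | 0 | 100
80204 | 38990 | 80105 | 80105 | 0 | 0 | 80114 | 0 | 548392 | 80114 | 80216 | 210230 | 0 | 0 | 80003 | 0 | 0 | 100
80204 | 39003 | 80104 | 80104 | 0 | 0 | 80111 | 0 | 549137 | 80111 | 80212 | 210242 | 0 | 0 | 80004 | 0 | 0 | 100
80204 | 38985 | 80103 | 80103 | 0 | 0 | 80111 | 0 | 549902 | 80111 | 80212 | 210242 | 0 | 0 | 80006 | 0 | 0 | 100
80204 | 38974 | 80104 | 80104 | 0 | 0 | 80114 | 0 | 550563 | 80111 | 80212 | 210353 | 0 | 0 | 80037 | 0 | 0 | 100
80204 | 38899 | 80105 | 80105 | 0 | 0 | 80113 | 0 | 550985 | 80108 | 80208 | 210242 | 0 | 0 | 80007 | 0 | 0 | 100
80204 | 38967 | 80107 | 80107 | 0 | 0 | 80116 | 0 | 546806 | 80116 | 80217 | 210257 | 0 | 0 | 80011 | 0 | 0 | 100
86310 | 46464 | 85242 | 83048 | 23 | 2171 | 82850 | 23 | 547999 | 80119 | 80220 | 210245 | 0 | 0 | 80011 | 0 | 0 | 100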

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5557

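retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 39083 | 80030 | 80030 | 80039 | 0 | 547374 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38881 | 80021 | 80021 | 80020 | 0 | 549057 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38911 | 80021 | 80021 | 80020 | 0 | 545323 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38930 | 80021 | 80021 | 80020 | 0 | 548904 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38892 | 80021 | 80021 | 80020 | 0 | 547401 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38901 | 80021 | 80021 | 80020 | 0 | 547667 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38927 | 80021 | 80021 | 80020 | 0 | 546571 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38861 | 80021 | 80021 | 80020 | 0 | 545450 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38905 | 80021 | 80021 | 80020 | 0 | 545493 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10
80024 | 38928 | 80021 | 80021 | 80020 | 0 | 548621 | 0 | 80020 | 80020 | 0 | 210020 | 80011 | 10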