Apple Microarchitecture Research by Dougall Johnson

CCMN (register, 64-bit)
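
For reference, a minimal Python sketch of what CCMN (register) computes, following the Arm architecture description: when the condition holds, the flags are set from Xn + Xm; otherwise they are loaded from the 4-bit immediate. The helper names (condition_holds, ccmn, cset) are illustrative and do not come from this page, and only the two condition codes used in the tests below are modelled.

```python
MASK64 = (1 << 64) - 1

def condition_holds(cond, flags):
    """Evaluate a condition code against an (N, Z, C, V) tuple."""
    n, z, c, v = flags
    if cond == "hi":   # unsigned higher: C set and Z clear
        return c == 1 and z == 0
    if cond == "cc":   # carry clear
        return c == 0
    raise ValueError(f"condition not modelled in this sketch: {cond}")

def ccmn(xn, xm, nzcv_imm, cond, flags):
    """Return the (N, Z, C, V) flags after `ccmn xn, xm, #nzcv_imm, cond`."""
    if not condition_holds(cond, flags):
        # Condition failed: flags come straight from the immediate.
        return ((nzcv_imm >> 3) & 1, (nzcv_imm >> 2) & 1,
                (nzcv_imm >> 1) & 1, nzcv_imm & 1)
    xn &= MASK64
    xm &= MASK64
    total = xn + xm                  # compare-negative is an addition
    result = total & MASK64
    n = result >> 63
    z = int(result == 0)
    c = int(total > MASK64)          # unsigned carry out of bit 63
    # Signed overflow: both operands differ in sign from the result.
    v = ((xn ^ result) & (xm ^ result)) >> 63
    return (n, z, c, v)

def cset(cond, flags):
    """Return the register value written by `cset xd, cond`."""
    return int(condition_holds(cond, flags))

flags = ccmn(1, 2, 0, "hi", (0, 0, 1, 0))   # condition holds, flags from 1 + 2
print(flags, cset("cc", flags))
```

The instruction reads two registers and the flags, and writes only the flags. That is why the latency tests below route the flag result back into x0 or x1 through CSET, or rely on the flags directly, to form a dependency chain.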

Test 1: uops

Code:

  ccmn x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001

Test 2: Latency 3->1

Chain cycles: 1

Code:

  ccmn x0, x1, #0, hi
  cset x0, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030
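
The reported figure can be reproduced from the counter table that follows. This is a sketch, assuming the result is the median of the cycle (02) counter divided by the number of executed copies (unrolls x iterations), less the one chain cycle attributed to CSET. The cycle counter reads 20030 in all ten runs; the function name is illustrative.

```python
from statistics import median

def latency_cycles(cycle_samples, unrolls, iterations, chain_cycles):
    """Median cycles per copy of the test code, minus the chain's own cost."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy - chain_cycles

# cycle (02) counter from the ten runs of the 100 x 100 configuration
cycle_samples = [20030] * 10

result = latency_cycles(cycle_samples, unrolls=100, iterations=100, chain_cycles=1)
print(f"{result:.4f}")
```

Each copy of the code takes about 2.003 cycles, so subtracting the CSET cycle leaves roughly 1 cycle for CCMN from its x0 input to its flag output.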

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519415 | 20107 | 20214 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519516 | 20018 | 20034 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010

Test 3: Latency 3->2

Chain cycles: 1

Code:

  ccmn x0, x1, #0, hi
  cset x1, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519311 | 20107 | 20214 | 40228 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519454 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519956 | 20058 | 20080 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010

Test 4: Latency 3->3

Code:

  ccmn x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10212 | 254581 | 10211 | 10214 | 30242 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 255086 | 10029 | 10030 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10

Test 5: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7889
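
The throughput figure follows the same pattern, with the per-copy cycle count divided by the number of CCMN instructions in the code (Count: 8) and no chain cycle subtracted. This is a sketch under the assumption that the second field of each row in the table that follows is the cycle (02) counter; the function name is illustrative. The estimate lands close to, but not exactly on, the reported 0.7889, presumably because the reported value comes from separate timing runs.

```python
from statistics import median

def cycles_per_instruction(cycle_samples, unrolls, iterations, count):
    """Median cycles per copy of the test code, divided by instructions per copy."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy / count

# cycle (02) counter from the ten runs of the 100 x 100 configuration
cycle_samples = [63251, 63083, 63127, 63141, 63147,
                 63125, 62591, 63130, 63144, 63130]

estimate = cycles_per_instruction(cycle_samples, unrolls=100, iterations=100, count=8)
print(f"{estimate:.4f} cycles per CCMN, {1 / estimate:.2f} CCMN per cycle")
```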

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
160204 | 63251 | 160112 | 160112 | 160118 | 686288 | 160122 | 160222 | 240227 | 0 | 160013 | 0 | 0 | 100
160204 | 63083 | 160112 | 160112 | 160117 | 691935 | 160118 | 160220 | 240236 | 0 | 160018 | 0 | 0 | 100
160204 | 63127 | 160115 | 160115 | 160120 | 686358 | 160118 | 160220 | 240236 | 0 | 160014 | 0 | 0 | 100
160204 | 63141 | 160117 | 160117 | 160123 | 691935 | 160118 | 160220 | 240230 | 0 | 160011 | 0 | 0 | 100
160204 | 63147 | 160113 | 160113 | 160119 | 687239 | 160115 | 160216 | 240224 | 0 | 160010 | 0 | 0 | 100
160204 | 63125 | 160112 | 160112 | 160118 | 688095 | 160118 | 160220 | 240230 | 0 | 160012 | 0 | 0 | 100
160205 | 62591 | 160148 | 160148 | 160153 | 692101 | 160118 | 160220 | 240230 | 0 | 160012 | 0 | 0 | 100
160204 | 63130 | 160115 | 160115 | 160120 | 686737 | 160124 | 160224 | 240230 | 0 | 160012 | 0 | 0 | 100
160204 | 63144 | 160115 | 160115 | 160120 | 690395 | 160120 | 160220 | 240224 | 0 | 160010 | 0 | 0 | 100
160204 | 63130 | 160115 | 160115 | 160120 | 688458 | 160120 | 160224 | 240230 | 0 | 160012 | 0 | 0 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7830

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 64516 | 160021 | 160021 | 160026 | 697644 | 160026 | 160038 | 240020 | 160001 | 10
160024 | 63340 | 160011 | 160011 | 160010 | 671925 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62555 | 160011 | 160011 | 160010 | 671333 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62697 | 160011 | 160011 | 160010 | 671386 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62560 | 160011 | 160011 | 160010 | 669800 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62637 | 160011 | 160011 | 160010 | 672174 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62590 | 160011 | 160011 | 160010 | 670030 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62704 | 160011 | 160011 | 160010 | 672298 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62561 | 160011 | 160011 | 160010 | 669949 | 160010 | 160020 | 240020 | 160001 | 10
160025 | 62702 | 160064 | 160064 | 160072 | 668811 | 160010 | 160020 | 240020 | 160001 | 10

Test 6: throughput

Count: 4

Code:

  fcmp s0, s0
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5996

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24023 | 50105 | 40102 | 10003 | 40112 | 10004 | 315483 | 40017 | 50118 | 40214 | 10004 | 120239 | 20008 | 40002 | 100
50204 | 24101 | 50106 | 40103 | 10003 | 40113 | 10004 | 315011 | 40012 | 50112 | 40209 | 10003 | 120242 | 20008 | 40004 | 100
50204 | 23982 | 50105 | 40102 | 10003 | 40112 | 10004 | 315447 | 40012 | 50112 | 40209 | 10003 | 120227 | 20006 | 40001 | 100
50204 | 24138 | 50105 | 40103 | 10002 | 40110 | 10003 | 315132 | 40017 | 50116 | 40212 | 10004 | 120227 | 20006 | 40001 | 100
50204 | 23987 | 50104 | 40101 | 10003 | 40112 | 10004 | 314801 | 40012 | 50112 | 40209 | 10003 | 120236 | 20008 | 40002 | 100
50204 | 23979 | 50103 | 40101 | 10002 | 40109 | 10003 | 315522 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 100
50204 | 23999 | 50105 | 40102 | 10003 | 40112 | 10004 | 315087 | 40017 | 50119 | 40216 | 10004 | 120227 | 20006 | 40003 | 100
50204 | 23992 | 50105 | 40102 | 10003 | 40112 | 10004 | 315619 | 40017 | 50116 | 40212 | 10004 | 120227 | 20006 | 40001 | 100
50204 | 23966 | 50104 | 40101 | 10003 | 40109 | 10003 | 315070 | 40013 | 50112 | 40209 | 10003 | 120236 | 20008 | 40001 | 100
50204 | 23994 | 50110 | 40107 | 10003 | 40115 | 10004 | 315702 | 40012 | 50112 | 40209 | 10003 | 120236 | 20008 | 40001 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5998

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24127 | 50018 | 40015 | 10003 | 40020 | 10003 | 317171 | 40022 | 50032 | 40037 | 10005 | 120020 | 20000 | 40001 | 10
50024 | 23972 | 50011 | 40011 | 10000 | 40010 | 10000 | 316169 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24053 | 50011 | 40011 | 10000 | 40010 | 10000 | 315708 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24003 | 50011 | 40011 | 10000 | 40010 | 10000 | 316461 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24038 | 50011 | 40011 | 10000 | 40010 | 10000 | 316681 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24030 | 50011 | 40011 | 10000 | 40010 | 10000 | 315898 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24022 | 50011 | 40011 | 10000 | 40010 | 10000 | 316333 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23981 | 50011 | 40011 | 10000 | 40010 | 10000 | 316677 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23974 | 50011 | 40011 | 10000 | 40010 | 10000 | 317393 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 316143 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10

Test 7: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  ccmn x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5568

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 38944 | 80109 | 80109 | 80113 | 550147 | 80117 | 80218 | 210236 | 80007 | 100
80204 | 38922 | 80106 | 80106 | 80112 | 548136 | 80155 | 80256 | 210242 | 80004 | 100
80204 | 38967 | 80103 | 80103 | 80111 | 549137 | 80108 | 80208 | 210242 | 80004 | 100
80204 | 38949 | 80104 | 80104 | 80108 | 549880 | 80111 | 80212 | 210230 | 80003 | 100
80204 | 38989 | 80109 | 80109 | 80116 | 548756 | 80116 | 80216 | 210242 | 80005 | 100
80204 | 39005 | 80105 | 80105 | 80114 | 550503 | 80116 | 80216 | 210230 | 80003 | 100
80204 | 39003 | 80104 | 80104 | 80111 | 550390 | 80116 | 80216 | 210242 | 80004 | 100
80204 | 38937 | 80104 | 80104 | 80114 | 548220 | 80111 | 80212 | 210242 | 80006 | 100
80204 | 38984 | 80103 | 80103 | 80108 | 550455 | 80114 | 80216 | 210221 | 80005 | 100
80204 | 38970 | 80104 | 80104 | 80114 | 549185 | 80108 | 80208 | 210230 | 80004 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5558

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 39215 | 80030 | 80030 | 80039 | 0 | 549469 | 0 | 0 | 80039 | 80040 | 0 | 0 | 210065 | 80017 | 10
80024 | 38891 | 80028 | 80028 | 80038 | 0 | 547424 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024389818002880028800387084800340158584438596253932398403792100208001110
80025 | 38946 | 80058 | 80058 | 80082 | 0 | 548164 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38851 | 80021 | 80021 | 80020 | 0 | 546608 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38881 | 80021 | 80021 | 80020 | 0 | 547196 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210185 | 80056 | 10
80024 | 38911 | 80021 | 80021 | 80020 | 0 | 544422 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38938 | 80021 | 80021 | 80020 | 0 | 548164 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38872 | 80021 | 80021 | 80020 | 0 | 548130 | 0 | 0 | 80079 | 80081 | 0 | 0 | 210020 | 80011 | 10
80024 | 38957 | 80021 | 80021 | 80020 | 0 | 545261 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10