Apple Microarchitecture Research by Dougall Johnson

CCMN (register, 32-bit)

Test 1: uops

Code:

  ccmn w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
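
The raw counters can be normalized to per-instruction figures by dividing by the unroll count. The sketch below does this for the first counter row of Test 1. The field boundaries are an assumption, inferred from the ten column labels and the run of digits; the page does not state how the headline "Retires" and "Issues" figures are derived from these counters.

```python
UNROLLS = 1000  # "1000 unrolls and 1 iteration"

# First counter row of Test 1, split into fields.
# The split points are an assumption; the source row has no separators.
row = {
    "retire uop (01)": 1004,
    "cycle (02)": 1030,
    "schedule uop (52)": 1001,
    "schedule int uop (53)": 1001,
    "dispatch int uop (56)": 1000,
    "int uops in schedulers (59)": 25192,
    "dispatch uop (78)": 1000,
    "map int uop (7c)": 1000,
    "map int uop inputs (7f)": 3000,
    "? int output thing (e9)": 1001,
}

# Divide every counter by the number of unrolled CCMN copies.
per_instruction = {name: value / UNROLLS for name, value in row.items()}

for name, value in per_instruction.items():
    print(f"{name:30s} {value:8.3f}")
```

Under this split, "schedule int uop" comes out at 1.001 per instruction, matching the "Integer unit issues" figure, and "map int uop inputs" comes out at 3.0, which would be consistent with CCMN reading two registers plus the incoming flags.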

Test 2: Latency 3->1

Chain cycles: 1

Code:

  ccmn w0, w1, #0, hi
  cset x0, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519320 | 20108 | 20214 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40228 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 520470 | 20195 | 20307 | 40319 | 20022 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40417 | 20044 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519496 | 20018 | 20034 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
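
The latency result can be recomputed from the cycle counter. The sketch below follows the result label ("median cycles for code, minus 1 chain cycle"): take the median cycle count, divide by the number of executed copies of the code (unrolls times iterations), and subtract the chain cycles contributed by the helper `cset`. The function name is hypothetical, and reading the second field of each counter row as the cycle count (20030 in every sample of this test) is an assumption about how the rows split.

```python
from statistics import median


def latency_cycles(cycle_samples, unrolls, iterations, chain_cycles):
    """Median cycles per executed copy of the code, minus the helper chain."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy - chain_cycles


# "cycle (02)" field of the ten samples in the 100-unroll, 100-iteration run.
samples = [20030] * 10

result = latency_cycles(samples, unrolls=100, iterations=100, chain_cycles=1)
print(f"{result:.4f}")  # 1.0030
```

The 1000-unroll, 10-iteration run has the same cycle count and the same number of executed copies, so it gives the same figure.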

Test 3: Latency 3->2

Chain cycles: 1

Code:

  ccmn w0, w1, #0, hi
  cset x1, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519320 | 20108 | 20214 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20205 | 20060 | 20115 | 20115 | 20147 | 519866 | 20148 | 20261 | 40228 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40505 | 20065 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40231 | 20001 | 10100
20204 | 20080 | 20124 | 20124 | 20153 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 520019 | 20152 | 20261 | 40232 | 20001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519476 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010

Test 4: Latency 3->3

Code:

  ccmn w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10209 | 254612 | 10211 | 10214 | 30242 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10028 | 255052 | 10029 | 10030 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10

Test 5: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7889

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160204 | 63279 | 160111 | 160111 | 160116 | 684128 | 160118 | 160218 | 240239 | 160018 | 100
160204 | 63130 | 160114 | 160114 | 160120 | 690295 | 160120 | 160220 | 240233 | 160017 | 100
160204 | 63110 | 160110 | 160110 | 160116 | 689181 | 160120 | 160220 | 240230 | 160012 | 100
160205 | 63131 | 160152 | 160152 | 160159 | 686864 | 160118 | 160220 | 240284 | 160051 | 100
160204 | 63113 | 160119 | 160119 | 160124 | 687376 | 160118 | 160220 | 240227 | 160012 | 100
160204 | 62510 | 160111 | 160111 | 160116 | 689382 | 160123 | 160224 | 240230 | 160012 | 100
160204 | 63088 | 160112 | 160112 | 160118 | 688171 | 160120 | 160220 | 240230 | 160013 | 100
160204 | 63102 | 160113 | 160113 | 160118 | 689272 | 160120 | 160220 | 240296 | 160056 | 100
160204 | 63127 | 160112 | 160112 | 160118 | 684087 | 160156 | 160256 | 240233 | 160013 | 100
160204 | 63107 | 160119 | 160119 | 160124 | 689066 | 160117 | 160218 | 240230 | 160014 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7830

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
160024 | 64425 | 160026 | 160026 | 160033 | 690544 | 160027 | 160038 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 63178 | 160011 | 160011 | 160010 | 671641 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62541 | 160011 | 160011 | 160010 | 669241 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62649 | 160011 | 160011 | 160010 | 671944 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62551 | 160011 | 160011 | 160010 | 670516 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62661 | 160011 | 160011 | 160010 | 672087 | 160067 | 160077 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62545 | 160011 | 160011 | 160010 | 669769 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62691 | 160011 | 160011 | 160010 | 672147 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62540 | 160011 | 160011 | 160010 | 669717 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
160024 | 62631 | 160011 | 160011 | 160010 | 671028 | 160010 | 160020 | 240020 | 0 | 160001 | 0 | 0 | 10
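
The throughput result can be recomputed in the same spirit. The sketch below follows the result label ("median cycles for code divided by count"): the median cycle count is divided by the number of executed copies and then by Count, the number of CCMN instructions per copy. The function name is hypothetical, and the cycle samples assume the second field of each counter row is the cycle count. The page does not say exactly which samples feed its median, so agreement to the last digit is not guaranteed in general.

```python
from statistics import median


def cycles_per_instruction(cycle_samples, unrolls, iterations, count):
    """Median cycles divided by the total number of measured instructions."""
    return median(cycle_samples) / (unrolls * iterations * count)


# "cycle (02)" field of each sample in Test 5 (field split is an assumption).
samples_100x100 = [63279, 63130, 63110, 63131, 63113,
                   62510, 63088, 63102, 63127, 63107]
samples_1000x10 = [64425, 63178, 62541, 62649, 62551,
                   62661, 62545, 62691, 62540, 62631]

cpi_a = cycles_per_instruction(samples_100x100, unrolls=100, iterations=100, count=8)
cpi_b = cycles_per_instruction(samples_1000x10, unrolls=1000, iterations=10, count=8)

print(f"{cpi_a:.4f}")  # 0.7889
print(f"{cpi_b:.4f}")  # 0.7830
```

For these two runs the recomputed values agree with the reported 0.7889 and 0.7830.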

Test 6: throughput

Count: 4

Code:

  fcmp s0, s0
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5996

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24018 | 50108 | 40105 | 10003 | 40117 | 10005 | 315002 | 40022 | 50122 | 40217 | 10005 | 120236 | 20008 | 40001 | 100
50204 | 23976 | 50105 | 40102 | 10003 | 40112 | 10004 | 315388 | 40018 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 100
50204 | 24000 | 50106 | 40103 | 10003 | 40109 | 10003 | 315502 | 40012 | 50112 | 40209 | 10003 | 120227 | 20006 | 40001 | 100
50204 | 23973 | 50109 | 40105 | 10004 | 40116 | 10004 | 315302 | 40013 | 50112 | 40209 | 10003 | 120236 | 20008 | 40001 | 100
50204 | 23979 | 50104 | 40101 | 10003 | 40112 | 10004 | 315423 | 40049 | 50156 | 40244 | 10012 | 120227 | 20006 | 40003 | 100
50204 | 23992 | 50104 | 40101 | 10003 | 40109 | 10003 | 315003 | 40017 | 50116 | 40212 | 10004 | 120227 | 20006 | 40001 | 100
50204 | 23979 | 50106 | 40103 | 10003 | 40109 | 10003 | 315364 | 40018 | 50116 | 40212 | 10004 | 120248 | 20008 | 40007 | 100
50204 | 24013 | 50104 | 40101 | 10003 | 40112 | 10004 | 315393 | 40013 | 50112 | 40209 | 10003 | 120236 | 20008 | 40001 | 100
50204 | 23972 | 50110 | 40107 | 10003 | 40115 | 10004 | 315063 | 40013 | 50112 | 40209 | 10003 | 120236 | 20008 | 40002 | 100
50204 | 23982 | 50105 | 40102 | 10003 | 40112 | 10004 | 315133 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.6000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24091 | 50018 | 40015 | 10003 | 40020 | 10003 | 316206 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24011 | 50011 | 40011 | 10000 | 40010 | 10000 | 316470 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24064 | 50011 | 40011 | 10000 | 40010 | 10000 | 316208 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23972 | 50011 | 40011 | 10000 | 40010 | 10000 | 317188 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23968 | 50011 | 40011 | 10000 | 40010 | 10000 | 317029 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24013 | 50011 | 40011 | 10000 | 40010 | 10000 | 315863 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23983 | 50011 | 40011 | 10000 | 40010 | 10000 | 316316 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23980 | 50011 | 40011 | 10000 | 40010 | 10000 | 316366 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24001 | 50011 | 40011 | 10000 | 40010 | 10000 | 315957 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24030 | 50011 | 40011 | 10000 | 40010 | 10000 | 316342 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10

Test 7: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  ccmn w0, w1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5568

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 39026 | 80107 | 80107 | 80116 | 551094 | 80155 | 80255 | 210242 | 80007 | 100
80204 | 39052 | 80109 | 80109 | 80114 | 549980 | 80116 | 80216 | 210242 | 80007 | 100
80204 | 38984 | 80103 | 80103 | 80111 | 549899 | 80111 | 80212 | 210230 | 80003 | 100
80204 | 38923 | 80104 | 80104 | 80109 | 550985 | 80108 | 80208 | 210245 | 80010 | 100
80204 | 38983 | 80107 | 80107 | 80119 | 548462 | 80114 | 80216 | 210230 | 80003 | 100
80204 | 39022 | 80105 | 80105 | 80116 | 548570 | 80114 | 80216 | 210230 | 80004 | 100
80204 | 39042 | 80105 | 80105 | 80111 | 549137 | 80108 | 80208 | 210230 | 80003 | 100
80204 | 38970 | 80103 | 80103 | 80108 | 550985 | 80108 | 80208 | 210230 | 80003 | 100
80204 | 38974 | 80109 | 80109 | 80116 | 550563 | 80111 | 80212 | 210233 | 80011 | 100
80204 | 38996 | 80111 | 80111 | 80117 | 549066 | 80112 | 80214 | 210242 | 80006 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5555

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 39177 | 80030 | 80030 | 80044 | 549562 | 80038 | 80038 | 210179 | 80053 | 10
80024 | 38877 | 80027 | 80027 | 80037 | 548801 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38867 | 80021 | 80021 | 80020 | 547608 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38856 | 80021 | 80021 | 80020 | 546785 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38917 | 80021 | 80021 | 80020 | 547873 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38930 | 80021 | 80021 | 80020 | 548500 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38900 | 80021 | 80021 | 80020 | 547608 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38918 | 80021 | 80021 | 80020 | 546097 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38910 | 80021 | 80021 | 80020 | 546380 | 80020 | 80020 | 210020 | 80011 | 10
80024 | 38871 | 80021 | 80021 | 80020 | 546112 | 80020 | 80020 | 210020 | 80011 | 10
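
The three throughput tests report cycles per CCMN, which can be inverted into per-cycle rates for easier comparison. The sketch below uses the reported 1000-unroll results. The group sizes are an assumption: they count only the instructions listed above `mov x0, 1` in each listing, on the basis that the retire counts come to roughly 16, 5 and 8 uops per unrolled copy. The function and variable names are hypothetical.

```python
# Reported 1000-unroll results:
# (cycles per CCMN, CCMN count per group, instructions per group).
# Group sizes exclude the trailing "mov" lines (an assumption about the listings).
tests = {
    "Test 5": (0.7830, 8, 16),  # CCMN alternating with ANDS
    "Test 6": (0.6000, 4, 5),   # one FCMP, then four CCMN
    "Test 7": (0.5555, 7, 8),   # one ANDS, then seven CCMN
}


def per_cycle_rates(cycles_per_ccmn, count, group_size):
    """Return (CCMN per cycle, all listed instructions per cycle)."""
    group_cycles = cycles_per_ccmn * count
    return count / group_cycles, group_size / group_cycles


rates = {name: per_cycle_rates(*values) for name, values in tests.items()}

for name, (ccmn_rate, total_rate) in rates.items():
    print(f"{name}: {ccmn_rate:.2f} CCMN/cycle, {total_rate:.2f} instructions/cycle")
```

Under these assumptions the CCMN rate comes out at roughly 1.3, 1.7 and 1.8 per cycle for Tests 5, 6 and 7 respectively.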