Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

CCMN (immediate, 64-bit)

Test 1: uops

Code:

  ccmn x1, #3, #0, hi
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
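
As a rough cross-check, the raw counters from the runs above can be normalized by the unroll count. This is only a sketch: the split of each run into fields is inferred from the header order, and the helper name `per_instruction` is made up for illustration.

```python
# Raw counters from one Test 1 run (all ten runs are identical).
# Assumption: the fields follow the order of the counter names in the header.
UNROLLS = 1000

counters = {
    "retire uop (01)": 1004,
    "cycle (02)": 1030,
    "schedule uop (52)": 1001,
    "schedule int uop (53)": 1001,
    "dispatch int uop (56)": 1000,
    "dispatch uop (78)": 1000,
    "map int uop (7c)": 1000,
    "map int uop inputs (7f)": 2000,
}


def per_instruction(counts, unrolls):
    """Divide each raw counter by the number of unrolled instructions."""
    return {name: value / unrolls for name, value in counts.items()}


normalized = per_instruction(counters, UNROLLS)
for name, value in normalized.items():
    print(f"{name}: {value:.3f}")
```

The value for schedule int uop (53) lines up with the "Integer unit issues: 1.001" summary, and map int uop inputs (7f) works out to two inputs per CCMN. The retire counter comes out slightly above the reported 1.000, so the summary figures presumably subtract a few uops of harness overhead; that is a guess, not something the page states.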

Test 2: Latency 2->1

Chain cycles: 1

Code:

  ccmn x1, #3, #0, hi
  cset x1, cc
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519339 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30359 | 20043 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
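
The Test 2 result can be reproduced from the cycle counter with a little arithmetic. The sketch below assumes the cycle (02) value is 20030 in every run, as the counter rows suggest, and that the "1 chain cycle" is the latency attributed to the CSET that closes the dependency chain. The function name is hypothetical.

```python
def latency_per_instruction(cycles, unrolls, iterations, chain_cycles=0):
    """Cycles per unrolled copy of the code, minus the known chain latency."""
    return cycles / (unrolls * iterations) - chain_cycles


# Test 2: ccmn -> cset chain; 1 cycle is subtracted for the cset.
lat_100x100 = latency_per_instruction(20030, unrolls=100, iterations=100, chain_cycles=1)
lat_1000x10 = latency_per_instruction(20030, unrolls=1000, iterations=10, chain_cycles=1)

print(f"{lat_100x100:.4f}")
print(f"{lat_1000x10:.4f}")
```

Both configurations execute 10000 copies of the chain in total, which is why they give the same figure. The same function with `chain_cycles=0` applies to a test whose result is reported without a chain adjustment.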

Test 3: Latency 2->2

Code:

  ccmn x0, #3, #0, hi
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10212 | 254524 | 10212 | 10214 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10029 | 254996 | 10029 | 10032 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255236 | 10029 | 10032 | 20020 | 10011 | 10

Test 4: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  mov x0, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7890

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160205 | 63304 | 160149 | 160149 | 160157 | 687542 | 160118 | 160220 | 160220 | 160011 | 100
160204 | 63107 | 160119 | 160119 | 160124 | 688661 | 160118 | 160218 | 160224 | 160014 | 100
160204 | 63115 | 160112 | 160112 | 160118 | 687376 | 160118 | 160220 | 160218 | 160013 | 100
160204 | 63133 | 160115 | 160115 | 160120 | 689396 | 160118 | 160220 | 160220 | 160015 | 100
160204 | 63122 | 160113 | 160113 | 160118 | 688887 | 160118 | 160220 | 160220 | 160010 | 100
160204 | 63103 | 160114 | 160114 | 160120 | 687260 | 160118 | 160220 | 160220 | 160012 | 100
160204 | 63135 | 160114 | 160114 | 160119 | 686593 | 160116 | 160216 | 160220 | 160009 | 100
160204 | 63080 | 160115 | 160115 | 160120 | 689787 | 160120 | 160224 | 160220 | 160012 | 100
160204 | 63142 | 160114 | 160114 | 160120 | 689502 | 160160 | 160261 | 160220 | 160015 | 100
160204 | 63127 | 160112 | 160112 | 160118 | 689285 | 160118 | 160220 | 160224 | 160017 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7882

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 64637 | 160026 | 160026 | 160030 | 699569 | 160030 | 160042 | 160020 | 160001 | 10
160024 | 63264 | 160011 | 160011 | 160010 | 696455 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63044 | 160011 | 160011 | 160010 | 697366 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63055 | 160011 | 160011 | 160010 | 698136 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63071 | 160011 | 160011 | 160010 | 697141 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63060 | 160011 | 160011 | 160010 | 700908 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63031 | 160011 | 160011 | 160010 | 700179 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63076 | 160011 | 160011 | 160010 | 698792 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63049 | 160011 | 160011 | 160010 | 702284 | 160010 | 160020 | 160020 | 160001 | 10
160024 | 63041 | 160011 | 160011 | 160010 | 698952 | 160010 | 160020 | 160020 | 160001 | 10
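
The throughput result is described as median cycles divided by count. A minimal sketch of that calculation follows. It assumes the cycle (02) values listed below are the second field of each Test 4 run, and that the total also has to be divided by unrolls times iterations to get a per-instruction figure. The function name is hypothetical.

```python
from statistics import median


def cycles_per_instruction(cycle_samples, unrolls, iterations, count):
    """Median cycle count divided by the number of measured instructions."""
    return median(cycle_samples) / (unrolls * iterations * count)


# cycle (02) values from the ten runs of each Test 4 configuration.
cycles_100x100 = [63304, 63107, 63115, 63133, 63122, 63103, 63135, 63080, 63142, 63127]
cycles_1000x10 = [64637, 63264, 63044, 63055, 63071, 63060, 63031, 63076, 63049, 63041]

tp_100x100 = cycles_per_instruction(cycles_100x100, unrolls=100, iterations=100, count=8)
tp_1000x10 = cycles_per_instruction(cycles_1000x10, unrolls=1000, iterations=10, count=8)

print(f"{tp_100x100:.4f}")
print(f"{tp_1000x10:.4f}")
```

The values land within rounding distance of the reported 0.7890 and 0.7882. An exact match is not guaranteed, because the page does not say whether the reported result comes from these same runs or how the median of an even number of samples is taken.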

Test 5: throughput

Count: 4

Code:

  fcmp s0, s0
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  mov x0, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5998

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24000 | 50106 | 40103 | 10003 | 40111 | 10003 | 315064 | 40012 | 50112 | 40209 | 10003 | 80234 | 20010 | 40007 | 100
50204 | 23991 | 50111 | 40107 | 10004 | 40117 | 10005 | 315220 | 40017 | 50119 | 40216 | 10004 | 80224 | 20008 | 40001 | 100
50204 | 23999 | 50106 | 40103 | 10003 | 40109 | 10003 | 315097 | 40012 | 50112 | 40209 | 10003 | 80224 | 20008 | 40001 | 100
50204 | 23990 | 50104 | 40101 | 10003 | 40112 | 10004 | 315068 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 100
50204 | 23997 | 50106 | 40103 | 10003 | 40109 | 10003 | 315300 | 40013 | 50112 | 40209 | 10003 | 80224 | 20008 | 40001 | 100
50204 | 23997 | 50106 | 40103 | 10003 | 40109 | 10003 | 315444 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 100
50204 | 23990 | 50103 | 40101 | 10002 | 40109 | 10003 | 315300 | 40013 | 50112 | 40209 | 10003 | 80224 | 20008 | 40001 | 100
50204 | 23987 | 50104 | 40101 | 10003 | 40109 | 10003 | 315334 | 40012 | 50112 | 40209 | 10003 | 80224 | 20008 | 40001 | 100
50204 | 24000 | 50103 | 40101 | 10002 | 40112 | 10004 | 315447 | 40012 | 50112 | 40209 | 10003 | 80218 | 20006 | 40003 | 100
50204 | 24000 | 50103 | 40101 | 10002 | 40112 | 10004 | 315068 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5995

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24173 | 50019 | 40016 | 10003 | 40024 | 10004 | 316257 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 24008 | 50011 | 40011 | 10000 | 40010 | 10000 | 316420 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23956 | 50011 | 40011 | 10000 | 40010 | 10000 | 315937 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23956 | 50011 | 40011 | 10000 | 40010 | 10000 | 316349 | 40000 | 50010 | 40020 | 10000 | 80082 | 20014 | 40021 | 10
50024 | 23938 | 50011 | 40011 | 10000 | 40010 | 10000 | 316296 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 316341 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 316456 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 316363 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 24009 | 50011 | 40011 | 10000 | 40010 | 10000 | 316811 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 24009 | 50011 | 40011 | 10000 | 40010 | 10000 | 315457 | 40000 | 50010 | 40020 | 10000 | 80020 | 20000 | 40001 | 10

Test 6: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  ccmn x0, #3, #0, hi
  mov x0, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5568

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 38968 | 80104 | 80104 | 80114 | 552073 | 80108 | 80208 | 140214 | 80004 | 100
80204 | 39026 | 80102 | 80102 | 80111 | 550563 | 80111 | 80212 | 140224 | 80006 | 100
80204 | 38941 | 80106 | 80106 | 80116 | 547262 | 80116 | 80216 | 140290 | 80038 | 100
80204 | 38933 | 80106 | 80106 | 80114 | 549055 | 80111 | 80212 | 140220 | 80003 | 100
80204 | 38933 | 80106 | 80106 | 80114 | 550632 | 80116 | 80216 | 140214 | 80004 | 100
80204 | 38981 | 80104 | 80104 | 80114 | 547003 | 80108 | 80208 | 140228 | 80006 | 100
80204 | 39012 | 80106 | 80106 | 80116 | 551366 | 80108 | 80208 | 140228 | 80006 | 100
80204 | 38955 | 80104 | 80104 | 80108 | 551366 | 80108 | 80208 | 140228 | 80007 | 100
80204 | 38933 | 80106 | 80106 | 80114 | 547003 | 80108 | 80208 | 140214 | 80004 | 100
80204 | 38979 | 80107 | 80107 | 80116 | 551366 | 80108 | 80208 | 140214 | 80004 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5561

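retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 39141 | 80032 | 80032 | 0 | 0 | 80044 | 0 | 0 | 549460 | 0 | 0 | 80041 | 80042 | 0 | 0 | 140020 | 80011 | 10
80024 | 38936 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 549871 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38972 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 551106 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38971 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 550934 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38972 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 551068 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38885 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 551106 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38941 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 551106 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38897 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 551233 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38912 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 551233 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10
80024 | 38931 | 80021 | 80021 | 0 | 0 | 80020 | 0 | 0 | 548246 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 80011 | 10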
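
The throughput results count only the CCMN instructions, so the three throughput tests can also be compared by total retired uops per cycle. The sketch below takes the retire uop (01) and cycle (02) values from one 100x100 run of each test. Which run is used is arbitrary, the field split is inferred from the header order, and the counts include loop and setup overhead, so the figures are approximate.

```python
# (retire uop (01), cycle (02)) from one 100x100 run of each throughput test.
# Assumption: these are the first two fields of the chosen run.
runs = {
    "Test 4 (ands/ccmn pairs)": (160204, 63115),
    "Test 5 (fcmp + 4 ccmn)": (50204, 23997),
    "Test 6 (ands + 7 ccmn)": (80204, 38968),
}


def retired_per_cycle(retired_uops, cycles):
    """Average number of uops retired per cycle over the whole run."""
    return retired_uops / cycles


rates = {name: retired_per_cycle(*values) for name, values in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} uops/cycle")
```

On these numbers the interleaved ANDS/CCMN sequence retires roughly 2.5 uops per cycle, while the two sequences with back-to-back CCMN instructions retire roughly 2.1.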