Apple Microarchitecture Research by Dougall Johnson

CCMP (register, 64-bit)
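
As background for the tests below, here is a minimal Python sketch of what `ccmp x0, x1, #0, hi` computes, following the architectural definition of CCMP: if the HI condition holds on the current flags, the flags are set as for `cmp x0, x1`; otherwise NZCV is loaded from the immediate. The function names (`cmp_flags`, `cond_hi`, `ccmp_hi`) are illustrative only and do not come from the original page.

```python
MASK64 = (1 << 64) - 1

def cmp_flags(a, b):
    """NZCV produced by a 64-bit CMP a, b (a SUBS whose result is discarded)."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = result >> 63                      # sign bit of the result
    z = int(result == 0)                  # result is zero
    c = int(a >= b)                       # carry set means no unsigned borrow
    v = ((a ^ b) & (a ^ result)) >> 63    # signed overflow of a - b
    return (n, z, c, v)

def cond_hi(flags):
    """HI: unsigned higher, i.e. C set and Z clear."""
    n, z, c, v = flags
    return c == 1 and z == 0

def ccmp_hi(flags, a, b, nzcv_imm):
    """Model of `ccmp a, b, #nzcv_imm, hi` acting on flags = (N, Z, C, V)."""
    if cond_hi(flags):
        return cmp_flags(a, b)
    # Condition failed: the 4-bit immediate supplies N, Z, C, V directly.
    return ((nzcv_imm >> 3) & 1, (nzcv_imm >> 2) & 1,
            (nzcv_imm >> 1) & 1, nzcv_imm & 1)

# With x0 = 1 and x1 = 2, the register values the tests set up:
print(ccmp_hi((0, 0, 1, 0), 1, 2, 0))  # HI holds, flags come from the compare
print(ccmp_hi((0, 1, 1, 0), 1, 2, 0))  # HI fails, flags come from #0
```

The point of the sketch is that the instruction reads two registers plus the incoming flags and writes new flags, which is why the latency tests can chain through either a register input or the flags.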

Test 1: uops

Code:

  ccmp x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 3000 | 1001

Test 2: Latency 3->1

Chain cycles: 1

Code:

  ccmp x0, x1, #0, hi
  cset x0, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 0 | 519311 | 0 | 20107 | 20214 | 0 | 40232 | 20001 | 10100
20205 | 20060 | 20115 | 20115 | 20148 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 20108 | 20216 | 0 | 40232 | 20001 | 10100
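
The latency result for this run can be rechecked from the cycle counter with a few lines of Python. This is a sketch, not the page's own tooling: the function name `latency_per_instruction` is hypothetical, and the `cycles` list assumes the cycle (02) value is the second five-digit field of each of the ten rows above.

```python
from statistics import median

def latency_per_instruction(cycles, unrolls, iterations, chain_cycles):
    """Median cycles per unrolled copy of the code, minus the known latency
    of the other instructions in the dependency chain."""
    per_copy = median(cycles) / (unrolls * iterations)
    return per_copy - chain_cycles

# cycle (02) for the ten runs of the 100-unroll, 100-iteration case
cycles = [20030, 20060, 20030, 20030, 20030,
          20030, 20030, 20030, 20030, 20030]

result = latency_per_instruction(cycles, unrolls=100, iterations=100,
                                 chain_cycles=1)
print(f"{result:.4f}")
```

Here the chain is CCMP followed by CSET, so one cycle is subtracted for the CSET and the remainder is attributed to CCMP.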

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519516 | 20018 | 20034 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40140 | 20015 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010

Test 3: Latency 3->2

Chain cycles: 1

Code:

  ccmp x0, x1, #0, hi
  cset x1, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519338 | 20108 | 20214 | 40228 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519866 | 20148 | 20264 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 40232 | 20001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 40020 | 20001 | 10010

Test 4: Latency 3->3

Code:

  ccmp x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10208 | 254716 | 10212 | 10214 | 30242 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10211 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 30224 | 10101 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10028 | 255052 | 10029 | 10030 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10025 | 10060 | 10035 | 10035 | 10069 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 30020 | 10011 | 10

Test 5: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7889

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
160204 | 63310 | 160114 | 160114 | 160118 | 688477 | 160120 | 160222 | 240239 | 0 | 160019 | 0 | 100
160204 | 63136 | 160112 | 160112 | 160117 | 690058 | 160118 | 160218 | 240236 | 0 | 160014 | 0 | 100
160204 | 63143 | 160108 | 160108 | 160115 | 692303 | 160120 | 160220 | 240230 | 0 | 160013 | 0 | 100
160204 | 63129 | 160112 | 160112 | 160118 | 686358 | 160118 | 160220 | 240224 | 0 | 160011 | 0 | 100
160204 | 63134 | 160114 | 160114 | 160120 | 671516 | 160119 | 160221 | 240230 | 0 | 160010 | 0 | 100
160204 | 63127 | 160112 | 160112 | 160118 | 686358 | 160118 | 160220 | 240224 | 0 | 160011 | 0 | 100
160204 | 63160 | 160112 | 160112 | 160118 | 691305 | 160123 | 160224 | 240230 | 0 | 160012 | 0 | 100
160204 | 63098 | 160112 | 160112 | 160118 | 686358 | 160118 | 160220 | 240236 | 0 | 160014 | 0 | 100
160204 | 63118 | 160112 | 160112 | 160118 | 691935 | 160118 | 160220 | 240230 | 0 | 160012 | 0 | 100
160204 | 63081 | 160114 | 160114 | 160120 | 687919 | 160118 | 160220 | 240224 | 0 | 160011 | 0 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7830

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 64591 | 160026 | 160026 | 160033 | 690571 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 63199 | 160011 | 160011 | 160010 | 671270 | 160064 | 160076 | 240020 | 160001 | 10
160024 | 62727 | 160011 | 160011 | 160010 | 671333 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62554 | 160011 | 160011 | 160010 | 670248 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62646 | 160011 | 160011 | 160010 | 671641 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62519 | 160011 | 160011 | 160010 | 669667 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62564 | 160011 | 160011 | 160010 | 669982 | 160010 | 160020 | 240020 | 160001 | 10
160025 | 62708 | 160061 | 160061 | 160068 | 671474 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62566 | 160011 | 160011 | 160010 | 669464 | 160010 | 160020 | 240020 | 160001 | 10
160024 | 62631 | 160011 | 160011 | 160010 | 671884 | 160010 | 160020 | 240020 | 160001 | 10

Test 6: throughput

Count: 4

Code:

  fcmp s0, s0
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5998

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24024 | 50105 | 40102 | 10003 | 40112 | 10004 | 315845 | 40022 | 50122 | 40217 | 10005 | 120242 | 20008 | 40004 | 100
50204 | 24004 | 50106 | 40103 | 10003 | 40112 | 10004 | 315126 | 40017 | 50118 | 40214 | 10004 | 120248 | 20010 | 40004 | 100
50204 | 23993 | 50103 | 40101 | 10002 | 40109 | 10003 | 314822 | 40017 | 50116 | 40212 | 10004 | 120227 | 20006 | 40003 | 100
50204 | 23979 | 50106 | 40103 | 10003 | 40109 | 10003 | 315434 | 40012 | 50112 | 40209 | 10003 | 120227 | 20006 | 40001 | 100
50204 | 23994 | 50105 | 40102 | 10003 | 40112 | 10004 | 315617 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40002 | 100
50204 | 23983 | 50103 | 40101 | 10002 | 40109 | 10003 | 315110 | 40017 | 50116 | 40212 | 10004 | 120236 | 20008 | 40002 | 100
50204 | 23991 | 50103 | 40101 | 10002 | 40112 | 10004 | 315137 | 40017 | 50116 | 40212 | 10004 | 120227 | 20006 | 40001 | 100
50204 | 23984 | 50103 | 40101 | 10002 | 40109 | 10003 | 315896 | 40017 | 50116 | 40212 | 10004 | 120248 | 20008 | 40007 | 100
50204 | 24013 | 50104 | 40101 | 10003 | 40112 | 10004 | 315287 | 40012 | 50112 | 40209 | 10003 | 120236 | 20008 | 40002 | 100
50204 | 23991 | 50103 | 40101 | 10002 | 40112 | 10004 | 315271 | 40018 | 50116 | 40212 | 10004 | 120236 | 20008 | 40001 | 100
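
The throughput figure for this run follows the same pattern as the latency results, except that the median is divided by the count of measured instructions and nothing is subtracted. The sketch below is illustrative: `cycles_per_instruction` is a hypothetical helper, and the `cycles` list assumes the cycle (02) value is the second five-digit field of each of the ten rows above.

```python
from statistics import median

def cycles_per_instruction(cycles, unrolls, iterations, count):
    """Median cycles per unrolled copy of the code, divided by the number
    of measured instructions in each copy."""
    return median(cycles) / (unrolls * iterations) / count

# cycle (02) for the ten runs of the 100-unroll, 100-iteration case
cycles = [24024, 24004, 23993, 23979, 23994,
          23983, 23991, 23984, 24013, 23991]

cpi = cycles_per_instruction(cycles, unrolls=100, iterations=100, count=4)
print(f"{cpi:.4f} cycles per CCMP, about {1 / cpi:.2f} CCMP per cycle")
```

The count is 4 here because only the four CCMP instructions are being measured; the leading FCMP is present in the code but is not counted.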

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5998

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50024 | 24153 | 50017 | 40014 | 10003 | 40022 | 10004 | 316945 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23976 | 50011 | 40011 | 10000 | 40010 | 10000 | 315482 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24027 | 50011 | 40011 | 10000 | 40010 | 10000 | 316143 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 315966 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 24002 | 50011 | 40011 | 10000 | 40010 | 10000 | 316531 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 315457 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 316681 | 40000 | 50010 | 40020 | 10000 | 120158 | 20024 | 40025 | 10
50024 | 24051 | 50011 | 40011 | 10000 | 40010 | 10000 | 316096 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23992 | 50011 | 40011 | 10000 | 40010 | 10000 | 315621 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10
50024 | 23993 | 50011 | 40011 | 10000 | 40010 | 10000 | 316360 | 40000 | 50010 | 40020 | 10000 | 120020 | 20000 | 40001 | 10

Test 7: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  ccmp x0, x1, #0, hi
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5568

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 39015 | 80104 | 80104 | 80111 | 547994 | 80114 | 80214 | 210221 | 80003 | 100
80204 | 38932 | 80107 | 80107 | 80113 | 549924 | 80268 | 80369 | 210242 | 80004 | 100
80204 | 38970 | 80104 | 80104 | 80114 | 549465 | 80111 | 80212 | 210221 | 80003 | 100
80205 | 39028 | 80140 | 80140 | 80152 | 549021 | 80114 | 80216 | 210242 | 80005 | 100
80204 | 38941 | 80106 | 80106 | 80114 | 548963 | 80116 | 80216 | 210242 | 80004 | 100
80204 | 38997 | 80103 | 80103 | 80111 | 549465 | 80111 | 80212 | 210230 | 80004 | 100
80204 | 39003 | 80104 | 80104 | 80111 | 549137 | 80108 | 80208 | 210230 | 80003 | 100
80204 | 39009 | 80106 | 80106 | 80116 | 549968 | 80111 | 80212 | 210242 | 80006 | 100
80204 | 38969 | 80103 | 80103 | 80111 | 549657 | 80111 | 80212 | 210230 | 80003 | 100
80204 | 39004 | 80103 | 80103 | 80111 | 549608 | 80111 | 80212 | 210242 | 80008 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5557

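retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 39097 | 80027 | 80027 | 80035 | 0 | 548681 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38900 | 80021 | 80021 | 80020 | 0 | 547121 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38924 | 80021 | 80021 | 80020 | 0 | 545912 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38932 | 80021 | 80021 | 80020 | 0 | 548339 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38879 | 80021 | 80021 | 80020 | 0 | 546309 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38904 | 80021 | 80021 | 80020 | 0 | 549018 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38945 | 80021 | 80021 | 80020 | 0 | 544927 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10
80024 | 38898 | 80021 | 80021 | 80020 | 0 | 548589 | 0 | 0 | 80171 | 80171 | 0 | 0 | 210332 | 80130 | 10
80025 | 38932 | 80056 | 80056 | 80076 | 0 | 544645 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210065 | 80017 | 10
80024 | 38900 | 80027 | 80027 | 80037 | 0 | 545183 | 0 | 0 | 80020 | 80020 | 0 | 0 | 210020 | 80011 | 10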