Apple Microarchitecture Research by Dougall Johnson

CCMP (immediate, 32-bit)
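
For reference while reading the tests, the architectural behaviour of the instruction under test can be modelled in a few lines. This Python sketch is not part of the original measurements. The function names (cmp_flags32, cond_hi, ccmp_imm32) are hypothetical, and only the HI condition used in these tests is modelled. It illustrates that CCMP reads the NZCV flags to evaluate its condition and then writes them, either from a 32-bit compare or from the immediate #nzcv field.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags32(a, b):
    """Return (N, Z, C, V) as produced by a 32-bit compare of a against b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31                      # sign bit of the 32-bit result
    z = int(result == 0)                  # result is zero
    c = int(a >= b)                       # carry set means no unsigned borrow
    v = ((a ^ b) & (a ^ result)) >> 31    # signed overflow on subtraction
    return (n, z, c, v)

def cond_hi(flags):
    """HI condition: unsigned higher, i.e. C set and Z clear."""
    n, z, c, v = flags
    return c == 1 and z == 0

def ccmp_imm32(wn, imm5, nzcv, cond_holds, flags):
    """Model of: ccmp wn, #imm5, #nzcv, <cond> (returns the new flags)."""
    if cond_holds(flags):
        return cmp_flags32(wn, imm5)
    # Condition failed: flags are loaded directly from the 4-bit immediate.
    return ((nzcv >> 3) & 1, (nzcv >> 2) & 1, (nzcv >> 1) & 1, nzcv & 1)

# Example: ccmp w1, #3, #0, hi with w1 = 5 and incoming flags where HI holds.
print(ccmp_imm32(5, 3, 0, cond_hi, (0, 0, 1, 0)))
```

Because the output flags depend on the input flags, a sequence of CCMP instructions forms a dependency chain through NZCV even when the register operand never changes.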

Test 1: uops

Code:

  ccmp w1, #3, #0, hi
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001
1004 | 1030 | 1001 | 1001 | 1000 | 25192 | 1000 | 1000 | 2000 | 1001

Test 2: Latency 2->1

Chain cycles: 1

Code:

  ccmp w1, #3, #0, hi
  cset x1, cc
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20108 | 519339 | 20108 | 20216 | 30221 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519454 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
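
The latency result for this test can be reproduced from the counter data with simple arithmetic. The sketch below is an illustration rather than the harness that produced the numbers. It assumes the cycle (02) counter reads 20030 in every sample of both runs, and that the result is the median cycle count divided by the number of executed copies (unrolls times iterations), less the stated chain cycles. The function name latency_per_instruction is hypothetical.

```python
from statistics import median

def latency_per_instruction(cycle_samples, unrolls, iterations, chain_cycles):
    """Median cycles per executed copy of the code, minus the chain overhead."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy - chain_cycles

# Assumed cycle (02) readings: ten samples of 20030 cycles each.
samples = [20030] * 10

# Both configurations execute 10000 copies (100 x 100 and 1000 x 10).
for unrolls, iterations in [(100, 100), (1000, 10)]:
    result = latency_per_instruction(samples, unrolls, iterations, chain_cycles=1)
    print(f"{unrolls} unrolls x {iterations} iterations: {result:.4f}")
```

With these inputs both configurations give 1.0030, matching the reported result.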

Test 3: Latency 2->2

Code:

  ccmp w0, #3, #0, hi
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(non-fused SUB/CBNZ loop)

100 unrolls and 100 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 10030 | 10201 | 10201 | 10212 | 254709 | 10208 | 10208 | 20224 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10211 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10205 | 10060 | 10217 | 10217 | 10253 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100
10204 | 10030 | 10201 | 10201 | 10208 | 254709 | 10208 | 10208 | 20216 | 10101 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 10030 | 10021 | 10021 | 10028 | 255150 | 10028 | 10032 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10
10024 | 10030 | 10021 | 10021 | 10020 | 255193 | 10020 | 10020 | 20020 | 10011 | 10

Test 4: throughput

Count: 8

Code:

  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  mov x0, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.7891

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160206 | 63322 | 160188 | 160188 | 0 | 160193 | 688331 | 160118 | 160220 | 160216 | 160011 | 100
160204 | 63154 | 160112 | 160112 | 0 | 160118 | 689272 | 160120 | 160220 | 160218 | 160011 | 100
160204 | 63109 | 160110 | 160110 | 0 | 160115 | 689378 | 160118 | 160220 | 160216 | 160011 | 100
160204 | 63133 | 160115 | 160115 | 0 | 160120 | 691091 | 160124 | 160226 | 160220 | 160013 | 100
160204 | 63091 | 160112 | 160112 | 0 | 160118 | 688408 | 160160 | 160262 | 160220 | 160015 | 100
162282 | 76987 | 161916 | 161233 | 683 | 161184 | 689313 | 160121 | 160222 | 160224 | 160017 | 100
160204 | 63176 | 160112 | 160112 | 0 | 160118 | 686358 | 160118 | 160220 | 160220 | 160012 | 100
160204 | 63123 | 160112 | 160112 | 0 | 160118 | 687496 | 160115 | 160216 | 160220 | 160014 | 100
160204 | 63144 | 160115 | 160115 | 0 | 160120 | 687496 | 160115 | 160216 | 160220 | 160012 | 100
160204 | 63176 | 160112 | 160112 | 0 | 160118 | 688458 | 160120 | 160224 | 160220 | 160014 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.7887

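retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
160024 | 64602 | 160035 | 160035 | 0 | 160039 | 0 | 696058 | 0 | 0 | 160042 | 160044 | 0 | 0 | 160020 | 160011 | 10
160024 | 63365 | 160021 | 160021 | 0 | 160020 | 0 | 697468 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63087 | 160021 | 160021 | 0 | 160020 | 0 | 695438 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63084 | 160021 | 160021 | 0 | 160020 | 0 | 689914 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63051 | 160021 | 160021 | 0 | 160020 | 0 | 696900 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63097 | 160021 | 160021 | 0 | 160020 | 0 | 692652 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63127 | 160021 | 160021 | 0 | 160020 | 0 | 697520 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63066 | 160021 | 160021 | 0 | 160020 | 0 | 698400 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63067 | 160021 | 160021 | 0 | 160020 | 0 | 695287 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10
160024 | 63110 | 160021 | 160021 | 0 | 160020 | 0 | 694137 | 0 | 0 | 160020 | 160020 | 0 | 0 | 160020 | 160011 | 10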
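
The throughput figure is the same kind of arithmetic, with the cycle count additionally divided by "Count", the number of measured instructions per copy of the code. The sketch below is an illustration under assumptions, not the original harness. It reads the cycle (02) values of the 1000-unroll run above as five-digit numbers between 63051 and 64602, and the function name cycles_per_counted_instruction is hypothetical.

```python
from statistics import median

def cycles_per_counted_instruction(cycle_samples, count, unrolls, iterations):
    """Median cycles divided by the number of counted instructions executed."""
    return median(cycle_samples) / (count * unrolls * iterations)

# Assumed cycle (02) readings for the 1000-unroll, 10-iteration run of Test 4.
cycles = [64602, 63365, 63087, 63084, 63051, 63097, 63127, 63066, 63067, 63110]

# Test 4 counts the 8 CCMP instructions in each copy of the code.
result = cycles_per_counted_instruction(cycles, count=8, unrolls=1000, iterations=10)
print(f"{result:.5f} cycles per counted instruction")
```

Under these assumptions the value is about 0.7887, in line with the reported result. The tables and the reported median may come from different runs, so small differences in the last digit are to be expected for other tests.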

Test 5: throughput

Count: 4

Code:

  fcmp s0, s0
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  mov x0, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5998

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50204 | 24006 | 50110 | 40107 | 10003 | 40117 | 10005 | 315176 | 40013 | 50112 | 40209 | 10003 | 80232 | 20008 | 40005 | 100
50204 | 24004 | 50106 | 40103 | 10003 | 40112 | 10004 | 315036 | 40012 | 50112 | 40209 | 10003 | 80232 | 20008 | 40005 | 100
50204 | 23990 | 50103 | 40101 | 10002 | 40109 | 10003 | 315086 | 40018 | 50116 | 40212 | 10004 | 80232 | 20008 | 40007 | 100
50204 | 23985 | 50106 | 40103 | 10003 | 40109 | 10003 | 315036 | 40012 | 50112 | 40209 | 10003 | 80228 | 20008 | 40003 | 100
50204 | 23995 | 50104 | 40101 | 10003 | 40109 | 10003 | 315223 | 40018 | 50116 | 40212 | 10004 | 80228 | 20008 | 40003 | 100
50204 | 23987 | 50105 | 40102 | 10003 | 40112 | 10004 | 315434 | 40017 | 50116 | 40212 | 10004 | 80224 | 20008 | 40001 | 100
50204 | 23983 | 50103 | 40101 | 10002 | 40109 | 10003 | 315070 | 40013 | 50112 | 40209 | 10003 | 80298 | 20026 | 40029 | 100
50204 | 23993 | 50103 | 40101 | 10002 | 40109 | 10003 | 315528 | 40012 | 50114 | 40211 | 10003 | 80218 | 20006 | 40002 | 100
50204 | 23979 | 50104 | 40101 | 10003 | 40109 | 10003 | 315302 | 40013 | 50112 | 40209 | 10003 | 80218 | 20006 | 40003 | 100
50204 | 24013 | 50104 | 40101 | 10003 | 40112 | 10004 | 315449 | 40012 | 50112 | 40209 | 10003 | 80224 | 20008 | 40001 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5996

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? int retires (ef)
50025 | 24160 | 50051 | 40041 | 10010 | 40054 | 10011 | 0 | 316063 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23996 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 316685 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 24006 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 315631 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23959 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 316679 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23959 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 316377 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23975 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 317393 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23974 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 316877 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23985 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 317393 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23975 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 316679 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10
50024 | 23975 | 50011 | 40011 | 10000 | 40010 | 10000 | 0 | 316748 | 0 | 40000 | 50010 | 40020 | 0 | 10000 | 80020 | 20000 | 40001 | 10

Test 6: throughput

Count: 7

Code:

  ands xzr, xzr, xzr
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  ccmp w0, #3, #0, hi
  mov x0, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5569

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 39028 | 80107 | 80107 | 80116 | 549212 | 80111 | 80212 | 140238 | 80010 | 100
80205 | 39009 | 80134 | 80134 | 80151 | 548332 | 80111 | 80212 | 140228 | 80007 | 100
80204 | 38919 | 80106 | 80106 | 80116 | 548113 | 80116 | 80216 | 140220 | 80003 | 100
80204 | 38962 | 80103 | 80103 | 80114 | 549992 | 80111 | 80212 | 140228 | 80004 | 100
80204 | 39032 | 80107 | 80107 | 80116 | 549837 | 80111 | 80212 | 140228 | 80004 | 100
80204 | 38967 | 80107 | 80107 | 80116 | 549608 | 80111 | 80212 | 140228 | 80007 | 100
80204 | 39003 | 80104 | 80104 | 80111 | 548458 | 80114 | 80216 | 140228 | 80005 | 100
80204 | 38970 | 80103 | 80103 | 80108 | 550877 | 80116 | 80216 | 140228 | 80004 | 100
80204 | 38969 | 80103 | 80103 | 80111 | 549196 | 80116 | 80216 | 140228 | 80008 | 100
80204 | 39004 | 80103 | 80103 | 80111 | 549512 | 80114 | 80216 | 140214 | 80003 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5562

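retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80024 | 39193 | 80034 | 80034 | 80042 | 0 | 549821 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38905 | 80021 | 80021 | 80020 | 0 | 549663 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38924 | 80021 | 80021 | 80020 | 0 | 549044 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38917 | 80021 | 80021 | 80020 | 0 | 551785 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38954 | 80021 | 80021 | 80020 | 0 | 549391 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38926 | 80021 | 80021 | 80020 | 0 | 548624 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38927 | 80021 | 80021 | 80020 | 0 | 549172 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38915 | 80021 | 80021 | 80020 | 0 | 551960 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38922 | 80021 | 80021 | 80020 | 0 | 549172 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10
80024 | 38915 | 80021 | 80021 | 80020 | 0 | 549172 | 0 | 0 | 80020 | 80020 | 0 | 0 | 140020 | 0 | 0 | 80011 | 0 | 0 | 10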
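
The three throughput results are reported as cycles per counted instruction. Taking the reciprocal gives counted instructions per cycle, which can be easier to compare across tests. The sketch below only restates the reported 1000-unroll results for Tests 4, 5 and 6. The dictionary labels and the function name instructions_per_cycle are hypothetical, and no interpretation of the underlying hardware limits is implied.

```python
def instructions_per_cycle(cycles_per_instruction):
    """Reciprocal throughput: counted instructions completed per cycle."""
    return 1.0 / cycles_per_instruction

# Reported results (median cycles for code divided by count), 1000-unroll runs.
reported = {
    "Test 4 (ANDS before every CCMP)": 0.7887,
    "Test 5 (FCMP, then 4 CCMP)": 0.5996,
    "Test 6 (ANDS, then 7 CCMP)": 0.5562,
}

for name, cpi in reported.items():
    print(f"{name}: {instructions_per_cycle(cpi):.3f} counted instructions per cycle")
```

Note that only the CCMP instructions are counted, so the extra flag-setting instruction in each sequence consumes cycles without appearing in the divisor.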