Apple Microarchitecture Research by Dougall Johnson

CMN (sxtw, 32-bit)
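
For orientation, `cmn w0, w1, sxtw` is the architectural alias of `adds wzr, w0, w1, sxtw`: it adds the two operands, discards the sum, and keeps only the NZCV flags. The Python sketch below models that behaviour. It is an illustration rather than part of the original measurements, and the function names are hypothetical. In the 32-bit form the `sxtw` extension leaves the low 32 bits of the second operand unchanged.

```python
MASK32 = 0xFFFFFFFF

def sign_extend_word(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def cmn_w_sxtw(w0, w1):
    """Model the NZCV flags of `cmn w0, w1, sxtw` (hypothetical helper).

    CMN is ADDS with the result discarded, so only the flags are returned.
    """
    op1 = w0 & MASK32
    op2 = sign_extend_word(w1) & MASK32  # sxtw in a 32-bit operation keeps the low word
    unsigned_sum = op1 + op2
    result = unsigned_sum & MASK32
    signed_sum = sign_extend_word(op1) + sign_extend_word(op2)
    return {
        "N": result >> 31,                    # sign bit of the 32-bit result
        "Z": int(result == 0),                # result is zero
        "C": int(unsigned_sum > MASK32),      # unsigned carry out of bit 31
        "V": int(not -(1 << 31) <= signed_sum < (1 << 31)),  # signed overflow
    }

flags = cmn_w_sxtw(1, 2)  # register values taken from the mov lines in the tests
print(flags)
```

Assuming the `mov x0, 1` and `mov x1, 2` lines set the values the tests run with, the carry flag is clear, so the `cset ..., cc` used in the latency tests writes 1.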

Test 1: uops

Code:

  cmn w0, w1, sxtw
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

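retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 526 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 395 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 393 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 397 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 391 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 506 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 391 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 395 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 391 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 393 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001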

Test 2: Latency 3->1

Chain cycles: 1

Code:

  cmn w0, w1, sxtw
  cset x0, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519338 | 20108 | 20214 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30296 | 20015 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
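
The result line above can be reproduced from the cycle (02) counter, which reads 20030 in all ten runs. The sketch below assumes the cycle count is normalized by unrolls × iterations before the chain cycle is subtracted; that normalization is inferred from the numbers, not stated by the source, and the function name is hypothetical.

```python
from statistics import median

def latency_result(cycle_counts, unrolls, iterations, chain_cycles):
    """Median cycles per repetition of the test code, minus the chain cycles.

    Assumption: one repetition = one copy of the unrolled test code.
    """
    repetitions = unrolls * iterations
    return median(cycle_counts) / repetitions - chain_cycles

# cycle (02) values of the ten runs (100 unrolls, 100 iterations)
cycles = [20030] * 10
result = latency_result(cycles, unrolls=100, iterations=100, chain_cycles=1)
print(f"{result:.4f}")
```

Under these assumptions each `cmn` + `cset` pair takes about 2.003 cycles, leaving roughly 1 cycle for the `cmn` itself once the 1-cycle `cset` is removed.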

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519476 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010

Test 3: Latency 3->2

Chain cycles: 1

Code:

  cmn w0, w1, sxtw
  cset x1, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519331 | 20109 | 20216 | 30227 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20107 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30220 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519454 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519454 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30110 | 20015 | 10010

Test 4: throughput

Count: 8

Code:

  cmn w0, w1, sxtw
  cmn w0, w1, sxtw
  cmn w0, w1, sxtw
  cmn w0, w1, sxtw
  cmn w0, w1, sxtw
  cmn w0, w1, sxtw
  cmn w0, w1, sxtw
  cmn w0, w1, sxtw
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.3636

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 29206 | 80113 | 80113 | 80117 | 240351 | 80117 | 80218 | 160234 | 80013 | 100
80204 | 29058 | 80114 | 80114 | 80119 | 240354 | 80118 | 80220 | 160240 | 80015 | 100
80204 | 29114 | 80114 | 80114 | 80119 | 240351 | 80117 | 80218 | 160240 | 80013 | 100
80204 | 29034 | 80115 | 80115 | 80119 | 240354 | 80118 | 80220 | 160312 | 80052 | 100
80204 | 29049 | 80113 | 80113 | 80118 | 240360 | 80120 | 80220 | 160240 | 80013 | 100
80204 | 29167 | 80115 | 80115 | 80119 | 240357 | 80119 | 80220 | 160240 | 80014 | 100
80204 | 29049 | 80113 | 80113 | 80118 | 240360 | 80120 | 80220 | 160240 | 80013 | 100
80204 | 29167 | 80115 | 80115 | 80119 | 240354 | 80118 | 80220 | 160240 | 80015 | 100
80204 | 29071 | 80113 | 80113 | 80118 | 240360 | 80120 | 80220 | 160240 | 80012 | 100
80204 | 29163 | 80113 | 80113 | 80118 | 240354 | 80118 | 80220 | 160240 | 80012 | 100
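
The throughput figure above can be approximated from the cycle (02) counter of the ten runs. The sketch below assumes the median cycle count is divided by count × unrolls × iterations. That normalization is an inference, the function name is hypothetical, and the outcome agrees with the reported 0.3636 only to within about 0.0001, so the original tool may round or normalize slightly differently.

```python
from statistics import median

def throughput_result(cycle_counts, count, unrolls, iterations):
    """Median cycles per instruction for a block of `count` independent copies.

    Assumption: every run executes count * unrolls * iterations instructions.
    """
    instructions = count * unrolls * iterations
    return median(cycle_counts) / instructions

# cycle (02) values of the ten runs (100 unrolls, 100 iterations, count 8)
cycles = [29206, 29058, 29114, 29034, 29049, 29167, 29049, 29167, 29071, 29163]
cycles_per_instruction = throughput_result(cycles, count=8, unrolls=100, iterations=100)
instructions_per_cycle = 1 / cycles_per_instruction

print(f"{cycles_per_instruction:.4f} cycles per instruction")
print(f"{instructions_per_cycle:.2f} instructions per cycle")
```

Read the other way round, about 0.36 cycles per instruction corresponds to roughly 2.75 `cmn` instructions completing per cycle in this loop.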

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.3632

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 30009 | 80035 | 80035 | 80039 | 240084 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29378 | 80174 | 80174 | 80173 | 240090 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 28968 | 80021 | 80021 | 80020 | 240109 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29077 | 80021 | 80021 | 80020 | 240106 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 28950 | 80021 | 80021 | 80020 | 240118 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29051 | 80021 | 80021 | 80020 | 240092 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 28932 | 80021 | 80021 | 80020 | 240117 | 80020 | 80020 | 160322 | 80162 | 10
80024 | 29051 | 80021 | 80021 | 80020 | 240106 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29165 | 80021 | 80021 | 80020 | 240098 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 28939 | 80021 | 80021 | 80020 | 240107 | 80020 | 80020 | 160020 | 80011 | 10