Apple Microarchitecture Research by Dougall Johnson

CMP (register, 32-bit)

Test 1: uops

Code:

  cmp w0, w1
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 1.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

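retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9)
1004 | 504 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 392 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 392 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 389 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 396 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 389 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 391 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 394 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 393 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001
1004 | 390 | 1001 | 1001 | 1000 | 3000 | 1000 | 1000 | 2000 | 1001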

Test 2: Latency 3->1

Chain cycles: 1

Code:

  cmp w0, w1
  cset x0, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 519338 | 20108 | 20214 | 30221 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 519548 | 20108 | 20216 | 30224 | 20001 | 10100
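
The latency figure reported for this run can be reproduced from the cycle (02) counts of the ten samples above. The following is a minimal sketch, not the author's harness: it assumes the second field of each sample is the cycle count (20030 in every sample), that the code was executed 100 x 100 times, and that the "Chain cycles: 1" value is subtracted once per copy. The function name `latency_per_copy` is hypothetical.

```python
from statistics import median

# Cycle (02) counts read from the ten samples above (assumed field split).
cycles = [20030] * 10

unrolls, iterations = 100, 100  # the code is executed 100 * 100 times
chain_cycles = 1                # "Chain cycles: 1" from the test header


def latency_per_copy(cycles, copies, chain_cycles):
    """Median cycles per copy of the code, minus the chain's own latency."""
    return median(cycles) / copies - chain_cycles


result = latency_per_copy(cycles, unrolls * iterations, chain_cycles)
print(f"{result:.4f}")  # 1.0030
```

Taking the median rather than the mean keeps a single disturbed sample (for example a warm-up run) from shifting the result.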

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519476 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30111 | 20015 | 10010
20025 | 20060 | 20025 | 20025 | 20059 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20019 | 519516 | 20018 | 20034 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010

Test 3: Latency 3->2

Chain cycles: 1

Code:

  cmp w0, w1
  cset x1, cc
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

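retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 20030 | 20101 | 20101 | 20107 | 0 | 519464 | 0 | 0 | 20110 | 20218 | 0 | 0 | 30221 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30291 | 20023 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 520010 | 0 | 0 | 20151 | 20259 | 0 | 0 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30224 | 20001 | 10100
20204 | 20030 | 20101 | 20101 | 20108 | 0 | 519548 | 0 | 0 | 20108 | 20216 | 0 | 0 | 30224 | 20001 | 10100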

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 1.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 20030 | 20011 | 20011 | 20018 | 519476 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010
20024 | 20030 | 20011 | 20011 | 20010 | 519598 | 20010 | 20020 | 30020 | 20001 | 10010

Test 4: throughput

Count: 8

Code:

  cmp w0, w1
  cmp w0, w1
  cmp w0, w1
  cmp w0, w1
  cmp w0, w1
  cmp w0, w1
  cmp w0, w1
  cmp w0, w1
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.3635

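retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
8020429154801158011580119499239080155988719949189297459185116023680013100 (first sample; field boundaries not recoverable)
80204 | 29132 | 80114 | 80114 | 80119 | 0 | 240354 | 0 | 0 | 80118 | 80220 | 0 | 0 | 160240 | 80012 | 100
80204 | 29082 | 80113 | 80113 | 80118 | 0 | 240351 | 0 | 0 | 80117 | 80218 | 0 | 0 | 160240 | 80015 | 100
80204 | 29066 | 80113 | 80113 | 80118 | 0 | 240354 | 0 | 0 | 80118 | 80220 | 0 | 0 | 160312 | 80052 | 100
80204 | 29080 | 80114 | 80114 | 80119 | 0 | 240357 | 0 | 0 | 80119 | 80220 | 0 | 0 | 160240 | 80015 | 100
80204 | 29120 | 80113 | 80113 | 80118 | 0 | 240351 | 0 | 0 | 80117 | 80220 | 0 | 0 | 160240 | 80013 | 100
80204 | 29082 | 80115 | 80115 | 80119 | 0 | 240357 | 0 | 0 | 80119 | 80220 | 0 | 0 | 160240 | 80015 | 100
80204 | 29082 | 80113 | 80113 | 80118 | 0 | 240354 | 0 | 0 | 80118 | 80220 | 0 | 0 | 160240 | 80012 | 100
80204 | 29082 | 80115 | 80115 | 80119 | 0 | 240354 | 0 | 0 | 80118 | 80220 | 0 | 0 | 160240 | 80015 | 100
80204 | 29087 | 80115 | 80115 | 80119 | 0 | 240354 | 0 | 0 | 80118 | 80220 | 0 | 0 | 160240 | 80015 | 100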
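
The throughput figure reported for this run follows from the same cycle (02) counts, this time divided by the number of CMP instructions executed. The sketch below is illustrative rather than the author's tooling: it assumes the second field of each sample is the cycle count (for the first sample only the leading fields are read this way), and the function name `cycles_per_instruction` is hypothetical.

```python
from statistics import median

# Cycle (02) counts read from the ten samples above (assumed field split).
cycles = [29154, 29132, 29082, 29066, 29080,
          29120, 29082, 29082, 29082, 29087]

unrolls, iterations = 100, 100  # the code is executed 100 * 100 times
count = 8                       # "Count: 8": CMP instructions per copy


def cycles_per_instruction(cycles, copies, count):
    """Median cycles for the code, divided by the instructions measured."""
    return median(cycles) / (copies * count)


cpi = cycles_per_instruction(cycles, unrolls * iterations, count)
print(f"{cpi:.4f}")      # 0.3635
print(f"{1 / cpi:.2f}")  # 2.75
```

The reciprocal, roughly 2.75 CMP instructions per cycle, is just the same measurement restated; it still includes whatever overhead the loop and the two trailing MOVs contribute.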

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.3631

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80024 | 30519 | 80034 | 80034 | 80038 | 240101 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29156 | 80021 | 80021 | 80020 | 240114 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29155 | 80021 | 80021 | 80020 | 240112 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29102 | 80021 | 80021 | 80020 | 240115 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29129 | 80021 | 80021 | 80020 | 240112 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29129 | 80021 | 80021 | 80020 | 240096 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29113 | 80021 | 80021 | 80020 | 240096 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29155 | 80021 | 80021 | 80020 | 240112 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29103 | 80021 | 80021 | 80020 | 240103 | 80020 | 80020 | 160020 | 80011 | 10
80024 | 29122 | 80021 | 80021 | 80020 | 240103 | 80020 | 80020 | 160020 | 80011 | 10