Apple Microarchitecture Research by Dougall Johnson

NEGS (register, lsl, 64-bit)

Test 1: uops

Code:

  negs x0, x0, lsl #17
  mov x0, 1
  mov x1, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 2.000

Integer unit issues: 2.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000
1004 | 2030 | 2001 | 2001 | 1000 | 52265 | 1000 | 1000 | 1000 | 2001 | 1000

Test 2: Latency 1->2

Code:

  negs x0, x0, lsl #17
  mov x0, 1
  mov x1, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 2.0030

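retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529001 | 10104 | 10206 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100
10204 | 20030 | 20101 | 20101 | 0 | 10104 | 529087 | 10104 | 10208 | 10208 | 0 | 0 | 20001 | 0 | 0 | 10100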

1000 unrolls and 10 iterations

Result (median cycles for code): 2.0030

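retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
10024 | 20030 | 20021 | 20021 | 10025 | 529215 | 10025 | 10030 | 10030 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010
10024 | 20030 | 20021 | 20021 | 10025 | 529272 | 10025 | 10032 | 10032 | 0 | 0 | 20011 | 0 | 0 | 10010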
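The reported 2.0030 is consistent with taking the median of the cycle (02) counter over the ten runs and dividing by the number of NEGS instances executed (unrolls times iterations). The following is a minimal Python sketch of that arithmetic, not the original test harness: it assumes the cycle counter reads 20030 in every run of both configurations, and the helper name `cycles_per_instance` is hypothetical.

```python
from statistics import median

def cycles_per_instance(cycle_counts, unrolls, iterations):
    """Median cycle count divided by the number of unrolled copies executed."""
    return median(cycle_counts) / (unrolls * iterations)

# Assumed reading of the cycle (02) counter: 20030 in all ten runs.
runs = [20030] * 10

lat_100x100 = cycles_per_instance(runs, unrolls=100, iterations=100)
lat_1000x10 = cycles_per_instance(runs, unrolls=1000, iterations=10)

print(f"{lat_100x100:.4f} {lat_1000x10:.4f}")
```

Both configurations execute 10000 copies of the instruction, which is why they yield the same per-instance figure.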

Test 3: Latency 3->2

Chain cycles: 1

Code:

  negs x0, x1, lsl #17
  cset x1, cc
  mov x0, 1
  mov x1, 2
  mov x2, 3
  mov x3, 4
  mov x4, 5

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 1 chain cycle): 2.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20204 | 30030 | 30101 | 30101 | 20106 | 789116 | 20106 | 20214 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100
20204 | 30030 | 30101 | 30101 | 20105 | 789369 | 20105 | 20212 | 20212 | 30001 | 20100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 1 chain cycle): 2.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
20024 | 30030 | 30011 | 30011 | 20015 | 0 | 789378 | 0 | 20015 | 20032 | 0 | 20032 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
20024 | 30030 | 30011 | 30011 | 20010 | 0 | 789438 | 0 | 20010 | 20020 | 0 | 20020 | 30001 | 20010
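In Test 3 the dependency loops through `cset`, so each unrolled copy costs the NEGS latency plus the stated 1 chain cycle. A short Python sketch of the subtraction behind the reported 2.0030 follows. It assumes the cycle (02) counter reads 30030 for the measured code; `latency_minus_chain` is a hypothetical helper name, not something from the original tooling.

```python
def latency_minus_chain(total_cycles, unrolls, iterations, chain_cycles):
    """Cycles per unrolled copy, with the known chain latency subtracted."""
    per_copy = total_cycles / (unrolls * iterations)
    return per_copy - chain_cycles

# Assumed reading of the cycle (02) counter: 30030 for 100 x 100 copies.
negs_flag_latency = latency_minus_chain(30030, unrolls=100, iterations=100,
                                        chain_cycles=1)

print(f"{negs_flag_latency:.4f}")
```

The same inputs apply to the 1000 unrolls and 10 iterations configuration, since it also executes 10000 copies.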

Test 4: throughput

Count: 8

Code:

  negs x0, x8, lsl #17
  negs x1, x8, lsl #17
  negs x2, x8, lsl #17
  negs x3, x8, lsl #17
  negs x4, x8, lsl #17
  negs x5, x8, lsl #17
  negs x6, x8, lsl #17
  negs x7, x8, lsl #17
  mov x8, 9
  mov x9, 10
  mov x10, 11

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.6675

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
80204 | 53404 | 160115 | 160115 | 80122 | 1100076 | 80123 | 80224 | 80224 | 160018 | 80100
80204 | 53417 | 160122 | 160122 | 80127 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1099988 | 80127 | 80228 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100
80204 | 53404 | 160114 | 160114 | 80123 | 1100076 | 80123 | 80224 | 80224 | 160014 | 80100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.6671

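retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80024 | 53386 | 160042 | 160042 | 0 | 0 | 80048 | 0 | 1107564 | 80020 | 80020 | 80059 | 0 | 0 | 160078 | 0 | 0 | 80010
80024 | 53371 | 160021 | 160021 | 0 | 0 | 80020 | 0 | 1107732 | 80020 | 80020 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80024 | 53371 | 160021 | 160021 | 0 | 0 | 80020 | 0 | 1107425 | 80057 | 80057 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80024 | 53371 | 160021 | 160021 | 0 | 0 | 80020 | 0 | 1107732 | 80020 | 80020 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80025 | 53410 | 160084 | 160084 | 0 | 0 | 80088 | 0 | 1107732 | 80020 | 80020 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80024 | 53371 | 160021 | 160021 | 0 | 0 | 80020 | 0 | 1109213 | 80057 | 80057 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80024 | 53371 | 160021 | 160021 | 0 | 0 | 80020 | 0 | 1107216 | 80096 | 80096 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80024 | 53420 | 160085 | 160085 | 0 | 0 | 80057 | 0 | 1107732 | 80020 | 80020 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80024 | 53371 | 160021 | 160021 | 0 | 0 | 80020 | 0 | 1108021 | 80095 | 80095 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010
80024 | 53371 | 160021 | 160021 | 0 | 0 | 80020 | 0 | 1107732 | 80020 | 80020 | 80020 | 0 | 0 | 160011 | 0 | 0 | 80010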
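The throughput results divide the median cycle count by the total number of NEGS instructions executed: unrolls times iterations times the count of 8 per unrolled block. The Python sketch below reproduces that arithmetic under stated assumptions: the per-run cycle values are read from the cycle (02) counter of the two configurations as 53404/53417 and 53371/53386/53410/53420, and `reciprocal_throughput` is a hypothetical helper name.

```python
from statistics import median

def reciprocal_throughput(cycle_counts, unrolls, iterations, count):
    """Median cycles divided by the total number of instructions executed."""
    return median(cycle_counts) / (unrolls * iterations * count)

# Assumed per-run readings of the cycle (02) counter.
cycles_100x100 = [53404, 53417] + [53404] * 8
cycles_1000x10 = [53386, 53371, 53371, 53371, 53410,
                  53371, 53371, 53420, 53371, 53371]

rt_100x100 = reciprocal_throughput(cycles_100x100, 100, 100, count=8)
rt_1000x10 = reciprocal_throughput(cycles_1000x10, 1000, 10, count=8)

# Inverting the reciprocal throughput gives instructions per cycle.
per_cycle = 1 / rt_1000x10

print(rt_100x100, rt_1000x10, per_cycle)
```

Both configurations come out at roughly two-thirds of a cycle per instruction, which corresponds to about 1.5 of these NEGS instructions completing per cycle.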