Apple Microarchitecture Research by Dougall Johnson

MOV (vector, 16B)

Test 1: uops

Code:

  mov v0.16b, v1.16b
  nop ; nop ; nop ; nop ; nop ; nop ; nop
  movi v0.16b, 1
  movi v1.16b, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires (minus 7 nops): 1.000

Issues: 0.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

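retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
8004 | 3876 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2173 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2052 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2043 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2043 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2043 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2043 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2043 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2043 | 1 | 1 | 1000 | 2000 | 1 | 1000
8004 | 2043 | 1 | 1 | 1000 | 2000 | 1 | 1000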

Test 2: Latency 1->2

Chain cycles: 14

Code:

  mov v0.16b, v1.16b
  add v1.16b, v1.16b, v0.16b
  add v1.16b, v1.16b, v0.16b
  add v1.16b, v1.16b, v0.16b
  add v1.16b, v1.16b, v0.16b
  add v1.16b, v1.16b, v0.16b
  add v1.16b, v1.16b, v0.16b
  add v1.16b, v1.16b, v0.16b
  movi v0.16b, 1
  movi v1.16b, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 14 chain cycles): 0.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80006 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80004 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80004 | 200 | 160008 | 1 | 80000 | 100
80205 | 140066 | 70109 | 101 | 70008 | 100 | 70034 | 300 | 3569248 | 70100 | 200 | 80006 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80004 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80004 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80004 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569580 | 70134 | 200 | 80052 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80004 | 200 | 160008 | 1 | 80000 | 100
80204 | 140033 | 70101 | 101 | 70000 | 100 | 70000 | 300 | 3569248 | 70100 | 200 | 80004 | 200 | 160008 | 1 | 80000 | 100
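
The "Result" figure for the 100 unrolls and 100 iterations run can be reproduced from the cycle (02) counts of the ten runs above. The following is a minimal Python sketch, assuming the figure is the median cycle count divided by the total number of unrolled copies, minus the stated chain cycles. The function and variable names are illustrative and are not taken from the original test harness.

```python
from statistics import median

# cycle (02) counts from the ten runs of the 100 unrolls x 100 iterations test
cycles = [140033, 140033, 140033, 140066, 140033,
          140033, 140033, 140033, 140033, 140033]

UNROLLS = 100
ITERATIONS = 100
CHAIN_CYCLES = 14  # "Chain cycles: 14" as stated for this test


def latency_estimate(cycles, unrolls, iterations, chain_cycles):
    """Median cycles per unrolled copy, minus the known chain cycles.

    Assumption: this mirrors how the reported figure is derived;
    the original harness may differ in detail.
    """
    per_copy = median(cycles) / (unrolls * iterations)
    return per_copy - chain_cycles


result = latency_estimate(cycles, UNROLLS, ITERATIONS, CHAIN_CYCLES)
print(f"{result:.4f}")  # prints 0.0033
```

A remainder this close to zero means the MOV adds essentially nothing to the dependency chain beyond the 14 cycles attributed to the ADDs.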

1000 unrolls and 10 iterations

Result (median cycles for code, minus 14 chain cycles): 0.0033

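retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 70 | 3569247 | 70020 | 20 | 80006 | 20 | 160000 | 1 | 80000 | 10
80025 | 140066 | 70029 | 21 | 70008 | 20 | 70034 | 70 | 3569247 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 70 | 3569248 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 70 | 3569248 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 140086 | 70033 | 21 | 70012 | 20 | 70036 | 70 | 3569248 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 70 | 3569248 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 66 | 3569580 | 70054 | 20 | 80056 | 20 | 160000 | 1 | 80000 | 10
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 70 | 3569248 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 70 | 3569248 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 140033 | 70021 | 21 | 70000 | 20 | 70000 | 70 | 3569248 | 70020 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10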

Test 3: throughput

Count: 8

Code:

  mov v0.16b, v8.16b
  mov v1.16b, v8.16b
  mov v2.16b, v8.16b
  mov v3.16b, v8.16b
  mov v4.16b, v8.16b
  mov v5.16b, v8.16b
  mov v6.16b, v8.16b
  mov v7.16b, v8.16b
  movi v8.16b, 9

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.2511

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80204 | 20295 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160052 | 1 | 80000 | 100
80204 | 20098 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
80204 | 20086 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
80204 | 20086 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
80204 | 20086 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
80204 | 20086 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
80204 | 20086 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
80204 | 20086 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160052 | 1 | 80000 | 100
80204 | 20120 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
80204 | 20097 | 40009 | 101 | 39908 | 100 | 39912 | 300 | 159652 | 40012 | 200 | 80024 | 200 | 160048 | 1 | 80000 | 100
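
The throughput "Result" for the 100 unrolls and 100 iterations run can be checked the same way from the cycle (02) counts of the ten runs above. This is a small Python sketch, assuming the figure is the median cycle count divided by the total number of unrolled copies and then by the instruction count of 8. The names used here are illustrative and do not come from the original harness.

```python
from statistics import median

# cycle (02) counts from the ten runs of the 100 unrolls x 100 iterations test
cycles = [20295, 20098, 20086, 20086, 20086,
          20086, 20086, 20086, 20120, 20097]

UNROLLS = 100
ITERATIONS = 100
COUNT = 8  # "Count: 8", the number of MOV instructions per unrolled copy


def cycles_per_instruction(cycles, unrolls, iterations, count):
    """Median cycles per unrolled copy, divided by instructions per copy.

    Assumption: this mirrors how the reported figure is derived;
    the original harness may differ in detail.
    """
    per_copy = median(cycles) / (unrolls * iterations)
    return per_copy / count


result = cycles_per_instruction(cycles, UNROLLS, ITERATIONS, COUNT)
print(f"{result:.4f}")  # prints 0.2511
```

A value of about 0.25 cycles per instruction corresponds to roughly four of these MOVs completing per cycle.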

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.2508

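retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80024 | 22068 | 40021 | 21 | 40000 | 20 | 40004 | 70 | 160020 | 40024 | 20 | 80028 | 20 | 160000 | 1 | 80000 | 10
80024 | 20173 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 20072 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 20063 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159952 | 40008 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 20061 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 20061 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160204 | 1 | 80000 | 10
80024 | 20092 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 20070 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 20068 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10
80024 | 20061 | 40011 | 21 | 39990 | 20 | 39990 | 70 | 159960 | 40010 | 20 | 80000 | 20 | 160000 | 1 | 80000 | 10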