Apple Microarchitecture Research by Dougall Johnson

AESE + AESMC

Test 1: uops

Code:

  aese v0.16b, v1.16b
  aesmc v0.16b, v0.16b

Initialization (run once, outside the measured code):

  movi v0.16b, 1
  movi v1.16b, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 2.000

Issues: 1.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 1.000

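retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
2004 | 3037 | 1003 | 1 | 1002 | 1002 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000
2004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 2000 | 3000 | 1 | 2000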
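
As a cross-check on the summary figures, the counter totals can be divided by the number of unrolls. The sketch below is an illustration under assumptions: the counter values are read from one steady-state sample using an assumed column split of the raw row, and the pairing of each counter with a summary line is a guess, not a documented method.

```python
UNROLLS = 1000  # "1000 unrolls and 1 iteration"

# One steady-state sample (assumed column split of the raw counter row).
sample = {
    "schedule int uop (53)": 1,
    "schedule simd uop (54)": 1000,
    "dispatch uop (78)": 1000,
    "? simd retires (ee)": 2000,
}

# Normalize each counter to a per-unroll figure.
per_unroll = {name: value / UNROLLS for name, value in sample.items()}
for name, value in per_unroll.items():
    print(f"{name}: {value:.3f}")

# Retired SIMD instructions per scheduled SIMD uop.
retired_per_issue = sample["? simd retires (ee)"] / sample["schedule simd uop (54)"]
print(retired_per_issue)  # 2.0
```

Two retired instructions per scheduled SIMD uop is consistent with the AESE + AESMC pair issuing as a single fused uop, which matches "Retires: 2.000" against "Issues: 1.000".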

Test 2: Latency 1->1

Code:

  aese v0.16b, v1.16b
  aesmc v0.16b, v0.16b

Initialization (run once, outside the measured code):

  movi v0.16b, 1
  movi v1.16b, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 3.0033

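retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
20204 | 30037 | 10103 | 101 | 10002 | 100 | 10002 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20006 | 200 | 30006 | 1 | 20000 | 100
20205 | 30066 | 10107 | 101 | 10006 | 100 | 10032 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20205 | 30068 | 10108 | 101 | 10007 | 100 | 10033 | 0 | 300 | 0 | 768905 | 10100 | 200 | 0 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 0 | 300 | 0 | 769247 | 10132 | 200 | 0 | 20048 | 200 | 30006 | 1 | 20000 | 100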
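
The reported 3.0033 is consistent with taking the median of the cycle (02) samples and dividing by the number of executed copies of the test code (unrolls × iterations). The following is a minimal sketch of that arithmetic, assuming the cycle values below are read correctly from the ten samples; the function name `cycles_per_copy` is hypothetical.

```python
from statistics import median

def cycles_per_copy(cycle_samples, unrolls, iterations):
    """Median cycle count divided by the number of executed copies of the code."""
    return median(cycle_samples) / (unrolls * iterations)

# cycle (02) values of the ten samples for 100 unrolls and 100 iterations
samples = [30037, 30066, 30033, 30033, 30033, 30033, 30033, 30033, 30068, 30033]

latency = cycles_per_copy(samples, unrolls=100, iterations=100)
print(f"{latency:.4f}")  # 3.0033
```

Because each AESE + AESMC pair depends on the previous pair through v0, this per-copy figure reads as the latency of the pair, about 3 cycles, with the remainder attributable to loop and measurement overhead.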

1000 unrolls and 10 iterations

Result (median cycles for code): 3.0033

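retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
20024 | 30037 | 10013 | 11 | 10002 | 10 | 10002 | 30 | 768905 | 10010 | 20 | 20006 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20025 | 30066 | 10017 | 11 | 10006 | 10 | 10032 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10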

Test 3: Latency 1->2

Code:

  aese v0.16b, v0.16b
  aesmc v0.16b, v0.16b

Initialization (run once, outside the measured code):

  movi v0.16b, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 3.0033

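retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30069 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100
20204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 20004 | 200 | 30006 | 1 | 20000 | 100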

1000 unrolls and 10 iterations

Result (median cycles for code): 3.0033

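retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30072 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10
20024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 20000 | 20 | 30000 | 1 | 20000 | 10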

Test 4: throughput

Count: 8

Code:

  movi v0.16b, 0
  aese v0.16b, v8.16b
  aesmc v0.16b, v0.16b
  movi v1.16b, 0
  aese v1.16b, v8.16b
  aesmc v1.16b, v1.16b
  movi v2.16b, 0
  aese v2.16b, v8.16b
  aesmc v2.16b, v2.16b
  movi v3.16b, 0
  aese v3.16b, v8.16b
  aesmc v3.16b, v3.16b
  movi v4.16b, 0
  aese v4.16b, v8.16b
  aesmc v4.16b, v4.16b
  movi v5.16b, 0
  aese v5.16b, v8.16b
  aesmc v5.16b, v5.16b
  movi v6.16b, 0
  aese v6.16b, v8.16b
  aesmc v6.16b, v6.16b
  movi v7.16b, 0
  aese v7.16b, v8.16b
  aesmc v7.16b, v7.16b

Initialization (run once, outside the measured code):

  movi v8.16b, 9

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 1.0005

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
240204 | 80457 | 80131 | 101 | 80030 | 100 | 80032 | 300 | 320030 | 80107 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80056 | 80107 | 101 | 80006 | 100 | 80008 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80036 | 80105 | 101 | 80004 | 100 | 80006 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80036 | 80105 | 101 | 80004 | 100 | 80006 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80036 | 80105 | 101 | 80004 | 100 | 80006 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240205 | 80071 | 80134 | 101 | 80033 | 100 | 80047 | 300 | 320030 | 80107 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80041 | 80105 | 101 | 80004 | 100 | 80006 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80036 | 80105 | 101 | 80004 | 100 | 80006 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80036 | 80105 | 101 | 80004 | 100 | 80006 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100
240204 | 80036 | 80105 | 101 | 80004 | 100 | 80006 | 300 | 320026 | 80106 | 200 | 160012 | 200 | 240018 | 1 | 240000 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 1.0230

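retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
240024 | 84010 | 80276 | 11 | 80265 | 0 | 10 | 80267 | 30 | 320809 | 80210 | 20 | 160012 | 20 | 240000 | 1 | 240000 | 10
240024 | 82716 | 80201 | 11 | 80190 | 0 | 10 | 80190 | 30 | 320526 | 80140 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240024 | 81780 | 80139 | 11 | 80128 | 0 | 10 | 80128 | 30 | 320508 | 80136 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240025 | 82545 | 80210 | 11 | 80199 | 0 | 10 | 80216 | 30 | 320509 | 80134 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240024 | 83174 | 80235 | 11 | 80224 | 0 | 10 | 80224 | 30 | 320553 | 80146 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240024 | 81778 | 80137 | 11 | 80126 | 0 | 10 | 80126 | 30 | 320492 | 80132 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240024 | 82676 | 80184 | 11 | 80173 | 0 | 10 | 80173 | 30 | 320524 | 80140 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240024 | 81843 | 80134 | 11 | 80123 | 0 | 10 | 80123 | 30 | 320514 | 80137 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240024 | 81839 | 80140 | 11 | 80129 | 0 | 10 | 80129 | 30 | 320497 | 80133 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10
240024 | 81833 | 80136 | 11 | 80125 | 0 | 10 | 80125 | 30 | 320500 | 80134 | 20 | 160000 | 20 | 240000 | 1 | 240000 | 10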

Test 5: throughput

Count: 16

Code:

  aese v0.16b, v16.16b
  aesmc v0.16b, v0.16b
  aese v1.16b, v16.16b
  aesmc v1.16b, v1.16b
  aese v2.16b, v16.16b
  aesmc v2.16b, v2.16b
  aese v3.16b, v16.16b
  aesmc v3.16b, v3.16b
  aese v4.16b, v16.16b
  aesmc v4.16b, v4.16b
  aese v5.16b, v16.16b
  aesmc v5.16b, v5.16b
  aese v6.16b, v16.16b
  aesmc v6.16b, v6.16b
  aese v7.16b, v16.16b
  aesmc v7.16b, v7.16b
  aese v8.16b, v16.16b
  aesmc v8.16b, v8.16b
  aese v9.16b, v16.16b
  aesmc v9.16b, v9.16b
  aese v10.16b, v16.16b
  aesmc v10.16b, v10.16b
  aese v11.16b, v16.16b
  aesmc v11.16b, v11.16b
  aese v12.16b, v16.16b
  aesmc v12.16b, v12.16b
  aese v13.16b, v16.16b
  aesmc v13.16b, v13.16b
  aese v14.16b, v16.16b
  aesmc v14.16b, v14.16b
  aese v15.16b, v16.16b
  aesmc v15.16b, v15.16b

Initialization (run once, outside the measured code):

  movi v16.16b, 17

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5006

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
320204 | 80921 | 160110 | 101 | 160009 | 100 | 160013 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480039 | 1 | 320000 | 100
320204 | 80121 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480036 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480036 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480039 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640056 | 160113 | 200 | 320026 | 200 | 480036 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480036 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480036 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480036 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480036 | 1 | 320000 | 100
320204 | 80089 | 160109 | 101 | 160008 | 100 | 160012 | 300 | 640052 | 160112 | 200 | 320024 | 200 | 480036 | 1 | 320000 | 100
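
For the throughput tests the result is additionally divided by Count, the number of independent AESE + AESMC pairs in one copy of the code. The sketch below reproduces 0.5006 from the cycle (02) samples under that assumption; `cycles_per_pair` is a hypothetical helper name, and the sample values assume the cycle column is read correctly from the rows above.

```python
from statistics import median

def cycles_per_pair(cycle_samples, unrolls, iterations, count):
    """Median cycles divided by the total number of AESE + AESMC pairs executed."""
    return median(cycle_samples) / (unrolls * iterations * count)

# cycle (02) values of the ten samples for 100 unrolls and 100 iterations
samples = [80921, 80121, 80089, 80089, 80089, 80089, 80089, 80089, 80089, 80089]

per_pair = cycles_per_pair(samples, unrolls=100, iterations=100, count=16)
print(f"{per_pair:.4f}")      # 0.5006
print(f"{1 / per_pair:.2f}")  # 2.00 pairs per cycle
```

Taking the reciprocal gives roughly two independent pairs completed per cycle in this test.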

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5425

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
320024 | 87760 | 160020 | 11 | 160009 | 10 | 160013 | 30 | 640052 | 160022 | 20 | 320024 | 20 | 480036 | 1 | 320000 | 10
320024 | 87412 | 160019 | 11 | 160008 | 10 | 160012 | 30 | 640000 | 160010 | 20 | 320000 | 20 | 480000 | 1 | 320000 | 10
320025 | 87262 | 160065 | 11 | 160054 | 10 | 160061 | 30 | 640373 | 160084 | 20 | 320092 | 20 | 480000 | 1 | 320000 | 10
320024 | 86802 | 160011 | 11 | 160000 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 320000 | 20 | 480000 | 1 | 320000 | 10
320024 | 86884 | 160011 | 11 | 160000 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 320000 | 20 | 480000 | 1 | 320000 | 10
320024 | 86900 | 160011 | 11 | 160000 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 320000 | 20 | 480000 | 1 | 320000 | 10
320024 | 86919 | 160011 | 11 | 160000 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 320000 | 20 | 480000 | 1 | 320000 | 10
320024 | 87248 | 160011 | 11 | 160000 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 320000 | 20 | 480000 | 1 | 320000 | 10
320024 | 86811 | 160011 | 11 | 160000 | 10 | 160000 | 30 | 640369 | 160083 | 20 | 320090 | 20 | 480000 | 1 | 320000 | 10
320024 | 86812 | 160011 | 11 | 160000 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 320000 | 20 | 480000 | 1 | 320000 | 10