Apple Microarchitecture Research by Dougall Johnson

AESE

Test 1: uops

Code:

  aese v0.16b, v1.16b
  movi v0.16b, 1
  movi v1.16b, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 1.000

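retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 2000 | 1 | 1000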

Test 2: Latency 1->1

Code:

  aese v0.16b, v1.16b
  movi v0.16b, 1
  movi v1.16b, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 3.0033

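retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 307 | 769162 | 10133 | 202 | 10041 | 200 | 0 | 20006 | 1 | 0 | 10000 | 100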

1000 unrolls and 10 iterations

Result (median cycles for code): 3.0033

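retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10003 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20086 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10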
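
The latency figure above appears to follow from the raw counters: dividing the median of the cycle (02) column by unrolls × iterations reproduces the reported 3.0033 for both configurations. The sketch below illustrates that arithmetic. The derivation is an assumption (the page does not spell it out), and the function name is hypothetical.

```python
from statistics import median

def cycles_per_instruction(cycle_samples, unrolls, iterations):
    """Median cycle count divided by the number of executed copies of the test code.

    Assumed derivation of "Result (median cycles for code)"; not stated by the source.
    """
    return median(cycle_samples) / (unrolls * iterations)

# cycle (02) reads 30033 in all ten runs of Test 2, in both configurations.
cycles = [30033] * 10

print(round(cycles_per_instruction(cycles, 100, 100), 4))  # 100 unrolls, 100 iterations -> 3.0033
print(round(cycles_per_instruction(cycles, 1000, 10), 4))  # 1000 unrolls, 10 iterations -> 3.0033
```

Because each AESE in this test depends on the result of the previous one, cycles per instruction here reads as the latency of the chain, about 3 cycles.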

Test 3: Latency 1->2

Code:

  aese v0.16b, v0.16b
  movi v0.16b, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 3.0033

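retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 307 | 771089 | 10234 | 202 | 10161 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10003 | 200 | 20006 | 1 | 10000 | 100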

1000 unrolls and 10 iterations

Result (median cycles for code): 3.0033

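retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10
10024 | 30033 | 10011 | 11 | 10000 | 10 | 10000 | 30 | 768905 | 10010 | 20 | 10000 | 20 | 20000 | 1 | 10000 | 10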

Test 4: throughput

Count: 8

Code:

  movi v0.16b, 0
  aese v0.16b, v8.16b
  movi v1.16b, 0
  aese v1.16b, v8.16b
  movi v2.16b, 0
  aese v2.16b, v8.16b
  movi v3.16b, 0
  aese v3.16b, v8.16b
  movi v4.16b, 0
  aese v4.16b, v8.16b
  movi v5.16b, 0
  aese v5.16b, v8.16b
  movi v6.16b, 0
  aese v6.16b, v8.16b
  movi v7.16b, 0
  aese v7.16b, v8.16b
  movi v8.16b, 9

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5011

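retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
160204 | 40517 | 80110 | 101 | 80009 | 100 | 80013 | 0 | 300 | 0 | 320056 | 80113 | 200 | 0 | 80013 | 200 | 160026 | 1 | 160000 | 100
160204 | 40115 | 80109 | 101 | 80008 | 100 | 80012 | 0 | 300 | 0 | 320056 | 80113 | 200 | 0 | 80013 | 200 | 160026 | 1 | 160000 | 100
160204 | 40091 | 80110 | 101 | 80009 | 100 | 80013 | 0 | 300 | 0 | 320056 | 80113 | 200 | 0 | 80013 | 200 | 160026 | 1 | 160000 | 100
160204 | 40091 | 80110 | 101 | 80009 | 100 | 80013 | 0 | 300 | 0 | 320056 | 80113 | 200 | 0 | 80013 | 200 | 160026 | 1 | 160000 | 100
160204 | 40115 | 80109 | 101 | 80008 | 100 | 80012 | 0 | 300 | 0 | 320056 | 80113 | 200 | 0 | 80013 | 200 | 160026 | 1 | 160000 | 100
160204 | 40511 | 80242 | 103 | 80139 | 102 | 80143 | 0 | 300 | 0 | 320056 | 80113 | 200 | 0 | 80013 | 200 | 160026 | 1 | 160000 | 100
160204 | 40091 | 80110 | 101 | 80009 | 100 | 80013 | 0 | 300 | 0 | 320316 | 80178 | 200 | 0 | 80078 | 200 | 160026 | 1 | 160000 | 100
160204 | 40091 | 80110 | 101 | 80009 | 100 | 80013 | 0 | 300 | 0 | 320320 | 80179 | 200 | 0 | 80079 | 200 | 160026 | 1 | 160000 | 100
160204 | 40091 | 80110 | 101 | 80009 | 100 | 80013 | 0 | 300 | 0 | 320188 | 80146 | 200 | 0 | 80046 | 200 | 160026 | 1 | 160000 | 100
160204 | 40091 | 80110 | 101 | 80009 | 100 | 80013 | 0 | 300 | 0 | 320056 | 80113 | 200 | 0 | 80013 | 200 | 160092 | 1 | 160000 | 100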

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5056

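retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
160024 | 44118 | 80019 | 11 | 80008 | 10 | 80012 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 41294 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 40451 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 40429 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 40449 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160060 | 1 | 160000 | 10
160024 | 40531 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 40453 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 40433 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 40433 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10
160024 | 40451 | 80011 | 11 | 80000 | 10 | 80000 | 30 | 320000 | 80010 | 20 | 80000 | 20 | 160000 | 1 | 160000 | 10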
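
The two throughput results for this test can be reproduced from the cycle (02) column by taking the median over the ten runs and dividing by unrolls × iterations × count (count is 8 here). The sketch below assumes that derivation, which the page implies through the phrase "median cycles for code divided by count" but does not define exactly; the function name is hypothetical.

```python
from statistics import median

def cycles_per_counted_instruction(cycle_samples, unrolls, iterations, count):
    """Median cycles divided by the number of counted instructions executed.

    Assumed derivation of "Result (median cycles for code divided by count)".
    """
    return median(cycle_samples) / (unrolls * iterations * count)

# cycle (02) values read from the ten runs of each Test 4 configuration.
cycles_100x100 = [40517, 40115, 40091, 40091, 40115, 40511, 40091, 40091, 40091, 40091]
cycles_1000x10 = [44118, 41294, 40451, 40429, 40449, 40531, 40453, 40433, 40433, 40451]

print(f"{cycles_per_counted_instruction(cycles_100x100, 100, 100, 8):.4f}")  # 0.5011
print(f"{cycles_per_counted_instruction(cycles_1000x10, 1000, 10, 8):.4f}")  # 0.5056
```

Using the median rather than the mean keeps the slow first runs (44118 and 41294 cycles in the second configuration) from skewing the result.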

Test 5: throughput

Count: 16

Code:

  aese v0.16b, v16.16b
  aese v1.16b, v16.16b
  aese v2.16b, v16.16b
  aese v3.16b, v16.16b
  aese v4.16b, v16.16b
  aese v5.16b, v16.16b
  aese v6.16b, v16.16b
  aese v7.16b, v16.16b
  aese v8.16b, v16.16b
  aese v9.16b, v16.16b
  aese v10.16b, v16.16b
  aese v11.16b, v16.16b
  aese v12.16b, v16.16b
  aese v13.16b, v16.16b
  aese v14.16b, v16.16b
  aese v15.16b, v16.16b
  movi v16.16b, 17

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5002

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
160204 | 80060 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160207 | 80202 | 160288 | 101 | 160187 | 100 | 160192 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100
160204 | 80040 | 160107 | 101 | 160006 | 100 | 160010 | 300 | 640044 | 160110 | 200 | 160015 | 200 | 320030 | 1 | 160000 | 100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5002

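retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
160024 | 80348 | 160017 | 11 | 160006 | 0 | 10 | 160010 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10
160024 | 80061 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320122 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640288 | 160081 | 20 | 160076 | 20 | 0 | 320122 | 1 | 0 | 160000 | 10
160024 | 80035 | 160011 | 11 | 160000 | 0 | 10 | 160000 | 30 | 640000 | 160010 | 20 | 160000 | 20 | 0 | 320000 | 1 | 0 | 160000 | 10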
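
Read together, the latency results (about 3.0033 cycles) and the throughput results (about 0.5002 cycles per instruction) suggest roughly two AESE instructions per cycle, and therefore about six independent dependency chains needed to keep the units busy. The sketch below shows that back-of-the-envelope estimate (in-flight operations = latency × throughput). It is an interpretation of the numbers rather than something the measurements state directly, and the function names are hypothetical.

```python
def instructions_per_cycle(reciprocal_throughput):
    """Convert cycles per instruction into instructions per cycle."""
    return 1.0 / reciprocal_throughput

def chains_to_saturate(latency_cycles, reciprocal_throughput):
    """Estimate how many independent chains must be in flight to hide the latency.

    Uses latency * throughput, rounded to the nearest whole chain.
    """
    return round(latency_cycles / reciprocal_throughput)

latency = 3.0033  # Tests 2 and 3: median cycles for code
recip = 0.5002    # Test 5: median cycles for code divided by count

print(f"{instructions_per_cycle(recip):.2f}")  # 2.00
print(chains_to_saturate(latency, recip))      # 6
```

This is consistent with Test 4 (8 independent chains) and Test 5 (16 independent chains) both reaching about 0.5 cycles per instruction.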