Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

AESD + AESIMC

Test 1: uops

Code:

  aesd v0.16b, v1.16b
  aesimc v0.16b, v0.16b
  movi v0.16b, 1
  movi v1.16b, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 2.000

Issues: 1.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 1.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
2004303710031100210027590510002000143651210231776397615220898978
20043039100411003100375905100020000030001020000
20043033100111000100075905100020000030001020000
20043033100111000100075905100020000030001020000
20043033100111000100075905100020000030001020000
20043033100111000100075905100020000030001020000
20043033100111000100075905100020000030001020000
20043033100111000100075905100020000030001020000
20043033100111000100075905100020000030001020000
20043033100111000100075905100020000030001020000
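
The Retires and Issues figures above can be read together: two retires per unroll against one issue. The short Python sketch below makes that comparison explicit. The function name is hypothetical, and the reading that the AESD/AESIMC pair issues as a single fused uop is an assumption drawn from the numbers, not something the page states.

```python
def retires_per_issue(retires_per_unroll, issues_per_unroll):
    """Ratio of retire events to issue events for one unroll of the test code."""
    if issues_per_unroll <= 0:
        raise ValueError("issues_per_unroll must be positive")
    return retires_per_unroll / issues_per_unroll

# Values reported for Test 1 (Retires: 2.000, Issues: 1.000).
ratio = retires_per_issue(2.000, 1.000)

# Assumption: more than one retire per issue is taken as a hint that
# the two instructions share a single issued uop.
looks_fused = ratio > 1.0
print(ratio, looks_fused)
```

A ratio of exactly 1 would instead suggest that each instruction issues separately.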

Test 2: Latency 1->1

Code:

  aesd v0.16b, v1.16b
  aesimc v0.16b, v0.16b
  movi v0.16b, 1
  movi v1.16b, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 3.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
2020430037101031011000210010002300768905101002002000620030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100

1000 unrolls and 10 iterations

Result (median cycles for code): 3.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
2002430037100131110002101000230768905100102020006203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
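
Both runs of this test report 3.0033 median cycles for the code. The Python sketch below splits such a figure into a whole-cycle latency and a leftover fixed overhead. It assumes the reported result is total cycles divided by (unrolls x iterations), which the page does not spell out; the function name is hypothetical.

```python
def split_latency(cycles_per_unroll, unrolls, iterations):
    """Return (whole-cycle latency, leftover cycles over the entire run).

    Assumption: cycles_per_unroll = total cycles / (unrolls * iterations).
    """
    executions = unrolls * iterations
    total_cycles = cycles_per_unroll * executions
    latency = round(cycles_per_unroll)
    overhead = total_cycles - latency * executions
    return latency, overhead

# Reported results for Test 2: 3.0033 in both configurations.
lat_a, over_a = split_latency(3.0033, unrolls=100, iterations=100)
lat_b, over_b = split_latency(3.0033, unrolls=1000, iterations=10)
print(lat_a, over_a)
print(lat_b, over_b)
```

Under that assumption the dependent AESD + AESIMC chain costs 3 cycles per pair, with a few tens of cycles of measurement overhead spread over 10000 executions.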

Test 3: Latency 1->2

Code:

  aesd v0.16b, v0.16b
  aesimc v0.16b, v0.16b
  movi v0.16b, 1

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 3.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100
2020430033101011011000010010000300768905101002002000420030006120000100

1000 unrolls and 10 iterations

Result (median cycles for code): 3.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
2002430033100111110000101000030768905100102020004203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010
2002430033100111110000101000030768905100102020000203000012000010

Test 4: Throughput

Count: 8

Code:

  movi v0.16b, 0
  aesd v0.16b, v8.16b
  aesimc v0.16b, v0.16b
  movi v1.16b, 0
  aesd v1.16b, v8.16b
  aesimc v1.16b, v1.16b
  movi v2.16b, 0
  aesd v2.16b, v8.16b
  aesimc v2.16b, v2.16b
  movi v3.16b, 0
  aesd v3.16b, v8.16b
  aesimc v3.16b, v3.16b
  movi v4.16b, 0
  aesd v4.16b, v8.16b
  aesimc v4.16b, v4.16b
  movi v5.16b, 0
  aesd v5.16b, v8.16b
  aesimc v5.16b, v5.16b
  movi v6.16b, 0
  aesd v6.16b, v8.16b
  aesimc v6.16b, v6.16b
  movi v7.16b, 0
  aesd v7.16b, v8.16b
  aesimc v7.16b, v7.16b
  movi v8.16b, 9

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 1.0005

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
24020480496801341018003310080035300320026801062001600122002400841240000100
24020580095801351018003410080048300320026801062001600122002400181240000100
24020480036801051018000410080006300320026801062001600122002400181240000100
24020480036801051018000410080006300320030801072001600122002400181240000100
24020480036801051018000410080006300320026801062001600122002400181240000100
24020480036801051018000410080006300320026801062001600122002400181240000100
24020480036801051018000410080006300320026801062001600122002400181240000100
24020480036801051018000410080006300320026801062001600122002400181240000100
24020580071801351018003410080047300320026801062001600122002400181240000100
24020480036801051018000410080006300320026801062001600122002400181240000100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 1.0216

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
24002484144802711180260108026230320770802022016000020240000124000010
24002482855801971180186108018630320489801302016000020240000124000010
24002481676801271180116108011630320464801252016000020240000124000010
24002582191801861180175108019230320469801272016000020240000124000010
24002481844801291180118108011830320767801852016005620240000124000010
24002481782801331180122108012230320489801292016000020240000124000010
24002481695801281180117108011730320493801302016000020240000124000010
24002481689801281180117108011730320476801282016000020240000124000010
24002481679801291180118108011830320476801272016000020240000124000010
24002481690801281180117108011730320464801262016000020240000124000010

Test 5: Throughput

Count: 16

Code:

  aesd v0.16b, v16.16b
  aesimc v0.16b, v0.16b
  aesd v1.16b, v16.16b
  aesimc v1.16b, v1.16b
  aesd v2.16b, v16.16b
  aesimc v2.16b, v2.16b
  aesd v3.16b, v16.16b
  aesimc v3.16b, v3.16b
  aesd v4.16b, v16.16b
  aesimc v4.16b, v4.16b
  aesd v5.16b, v16.16b
  aesimc v5.16b, v5.16b
  aesd v6.16b, v16.16b
  aesimc v6.16b, v6.16b
  aesd v7.16b, v16.16b
  aesimc v7.16b, v7.16b
  aesd v8.16b, v16.16b
  aesimc v8.16b, v8.16b
  aesd v9.16b, v16.16b
  aesimc v9.16b, v9.16b
  aesd v10.16b, v16.16b
  aesimc v10.16b, v10.16b
  aesd v11.16b, v16.16b
  aesimc v11.16b, v11.16b
  aesd v12.16b, v16.16b
  aesimc v12.16b, v12.16b
  aesd v13.16b, v16.16b
  aesimc v13.16b, v13.16b
  aesd v14.16b, v16.16b
  aesimc v14.16b, v14.16b
  aesd v15.16b, v16.16b
  aesimc v15.16b, v15.16b
  movi v16.16b, 17

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5006

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
3202048090816011010116000901001600130300064005216011220003200242004800361320000100
3202048013916010910116000801001600120300064005216011220003200242004800361320000100
3202048008916010910116000801001600120300064005216011220003200242004800391320000100
3202048008916010910116000801001600120300064005216011220003200242004801381320000100
3202048010016011010116000901001600130300064005616011320003200262004800361320000100
3202048013616011010116000901001600130300064005616011320003200262004801381320000100
3202048010216011010116000901001600130300064005216011220003200242004800361320000100
3202048008916010910116000801001600120300064005216011220003200242004800361320000100
3202048008916010910116000801001600120300064005216011220003200242004800361320000100
3202048008916010910116000801001600120300064005216011220003200242004800361320000100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5418

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
32002487525160020111600091016001330640056160023203200262004800001032000010
32002487451160011111600001016000030640000160010203200002004800001032000010
32002486688160011111600001016000030640000160010203200002004800001032000010
32002486686160011111600001016000030640000160010203200002004800001032000010
32002486697160011111600001016000030640000160010203200002004801381032000010
32002486720160011111600001016000030640000160010203200002004800001032000010
32002486794160011111600001016000030640000160010203200002004800001032000010
32002486937160011111600001016000030640000160010203200002004800001032000010
32002486801160011111600001016000030640000160010203200002004800001032000010
32002486836160011111600001016000030640000160010203200002004800001032000010
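
The two throughput tests report cycles per AESD + AESIMC pair (median cycles for the code divided by the count). Taking the reciprocal gives pairs per cycle, which is often easier to compare. The Python sketch below does this for the four reported results; the dictionary layout and function name are illustrative assumptions, while the numbers are the ones listed in Tests 4 and 5.

```python
def pairs_per_cycle(cycles_per_pair):
    """Convert cycles per AESD+AESIMC pair into pairs completed per cycle."""
    if cycles_per_pair <= 0:
        raise ValueError("cycles_per_pair must be positive")
    return 1.0 / cycles_per_pair

# Reported results, keyed by (test number, unrolls, iterations).
reported = {
    (4, 100, 100): 1.0005,
    (4, 1000, 10): 1.0216,
    (5, 100, 100): 0.5006,
    (5, 1000, 10): 0.5418,
}

rates = {key: pairs_per_cycle(value) for key, value in reported.items()}
for (test, unrolls, iterations), rate in rates.items():
    print(f"Test {test}, {unrolls} unrolls x {iterations} iterations: "
          f"{rate:.3f} pairs/cycle")
```

In the 100 x 100 runs this comes out at roughly one pair per cycle for Test 4 and roughly two pairs per cycle for Test 5; the 1000-unroll runs are slightly lower in both cases.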