Apple Microarchitecture Research by Dougall Johnson


TBL (four register table, 8B)
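For orientation, the lookup performed by this instruction form can be sketched in Python. This is a minimal model of the architectural behavior (a 64-byte table built from four 16-byte registers, eight index bytes, out-of-range indices producing 0), not a description of how the hardware implements it. The function name `tbl_8b` and the index values in `v4` are hypothetical; the table contents mirror the `movi` values used in the tests below.

```python
def tbl_8b(table_regs, index_reg):
    """Model TBL Vd.8B, {Vn.16B, ...}, Vm.8B on lists of byte values."""
    # Concatenate the 16-byte table registers into one lookup table.
    table = [b for reg in table_regs for b in reg]
    # The 8B form only consumes the low 8 bytes of the index register.
    indices = index_reg[:8]
    # An index past the end of the table yields 0 for that byte.
    return [table[i] if i < len(table) else 0 for i in indices]

# Table registers filled as in the test code: movi v0..v3 with 1..4.
v0, v1, v2, v3 = ([n] * 16 for n in (1, 2, 3, 4))
# Hypothetical index bytes (the listing does not show v4's contents).
v4 = [0, 15, 16, 31, 32, 47, 48, 64] + [0] * 8

result = tbl_8b([v0, v1, v2, v3], v4)
print(result)  # [1, 1, 2, 2, 3, 3, 4, 0]
```

The last index (64) falls outside the 64-byte table, so its result byte is 0.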

Test 1: uops

Code:

  tbl v0.8b, { v0.16b, v1.16b, v2.16b, v3.16b }, v4.8b
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 3.000

Issues: 3.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 3.000

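retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9333 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 4033 | 3001 | 1 | 3000 | 3000 | 100179 | 3000 | 3000 | 9000 | 1 | 3000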

Test 2: Latency 1->2

Code:

  tbl v0.8b, { v0.16b, v1.16b, v2.16b, v3.16b }, v4.8b
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 4.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009148 | 30202 | 200 | 30010 | 200 | 90030 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 668 | 1009427 | 30249 | 200 | 30063 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
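The reported latency can be cross-checked against the raw counters. The sketch below assumes that the result is the median of the cycle (02) counter divided by the number of executed copies of the instruction (unrolls × iterations), and that cycle (02) reads 40033 in each of the ten runs above; the helper name `cycles_per_instruction` is hypothetical.

```python
from statistics import median

def cycles_per_instruction(cycle_samples, unrolls, iterations):
    """Median cycle count divided by the number of executed instructions."""
    return median(cycle_samples) / (unrolls * iterations)

# cycle (02) values for the ten runs with 100 unrolls and 100 iterations.
cycle_samples = [40033] * 10

latency = cycles_per_instruction(cycle_samples, unrolls=100, iterations=100)
print(f"{latency:.4f}")  # 4.0033
```

Because each tbl writes v0 and the next tbl reads v0 as a table register, the instructions form a dependency chain, so cycles per instruction here approximates the latency of that operand path: about 4 cycles.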

1000 unrolls and 10 iterations

Result (median cycles for code): 4.0033

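retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009399 | 30059 | 20 | 30062 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 34 | 1009422 | 30060 | 20 | 30057 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30025 | 40066 | 30034 | 11 | 30023 | 10 | 30049 | 30 | 1009618 | 30062 | 20 | 30055 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10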

Test 3: Latency 1->3

Code:

  tbl v1.8b, { v0.16b, v1.16b, v2.16b, v3.16b }, v4.8b
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 4.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 696 | 1009384 | 30248 | 200 | 30062 | 200 | 90030 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30002 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90030 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90030 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009179 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 4.0033

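retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009170 | 30011 | 20 | 30010 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009179 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10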

Test 4: Latency 1->4

Code:

  tbl v2.8b, { v0.16b, v1.16b, v2.16b, v3.16b }, v4.8b
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 4.0033

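retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30002 | 0 | 700 | 0 | 1009164 | 30202 | 200 | 0 | 30010 | 200 | 90030 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 650 | 0 | 1009421 | 30250 | 200 | 0 | 30064 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 0 | 200 | 30000 | 0 | 700 | 0 | 1009173 | 30200 | 200 | 0 | 30008 | 200 | 90204 | 101 | 30000 | 100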

1000 unrolls and 10 iterations

Result (median cycles for code): 4.0033

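retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009162 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10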

Test 5: Latency 1->5

Code:

  tbl v3.8b, { v0.16b, v1.16b, v2.16b, v3.16b }, v4.8b
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 4.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009153 | 30202 | 200 | 30010 | 200 | 90030 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40033 | 30201 | 201 | 30000 | 200 | 30000 | 700 | 1009173 | 30200 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 4.0033

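retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90192 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40033 | 30011 | 11 | 30000 | 0 | 10 | 30000 | 30 | 1009173 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10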

Test 6: Latency 1->6

Code:

  tbl v4.8b, { v0.16b, v1.16b, v2.16b, v3.16b }, v4.8b
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 4.0034

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007088 | 30202 | 200 | 30010 | 200 | 90030 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30002 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100
30204 | 40034 | 30201 | 201 | 30000 | 200 | 30001 | 700 | 1007125 | 30201 | 200 | 30008 | 200 | 90024 | 101 | 30000 | 100

1000 unrolls and 10 iterations

Result (median cycles for code): 4.0034

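retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007103 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007453 | 30048 | 20 | 30055 | 20 | 90000 | 1 | 30000 | 10
30024 | 40034 | 30011 | 11 | 30000 | 10 | 30000 | 30 | 1007120 | 30010 | 20 | 30000 | 20 | 90000 | 1 | 30000 | 10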

Test 7: throughput

Count: 8

Code:

  tbl v0.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  tbl v1.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  tbl v2.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  tbl v3.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  tbl v4.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  tbl v5.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  tbl v6.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  tbl v7.8b, { v8.16b, v9.16b, v10.16b, v11.16b }, v12.8b
  movi v8.16b, 9
  movi v9.16b, 10
  movi v10.16b, 11
  movi v11.16b, 12
  movi v12.16b, 13

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 1.5005

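retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
240204 | 120058 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200055 | 240212 | 200 | 0 | 240018 | 200 | 720054 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100
240204 | 120036 | 240205 | 201 | 240004 | 0 | 200 | 240010 | 0 | 700 | 0 | 1200049 | 240210 | 200 | 0 | 240016 | 200 | 720048 | 101 | 240000 | 100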
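The throughput figure can be reproduced from the counters in the same way. The sketch below assumes the result is the median of cycle (02) divided by unrolls × iterations × count, with cycle (02) read as 120058 in the first run and 120036 in the other nine; the 3 uops per tbl figure is taken from Test 1, and the variable names are hypothetical.

```python
from statistics import median

# cycle (02) values for the ten runs with 100 unrolls and 100 iterations.
cycle_samples = [120058] + [120036] * 9
unrolls, iterations, count = 100, 100, 8
uops_per_tbl = 3  # SIMD/FP uops per instruction, as reported by Test 1

# Each loop body holds `count` independent tbl instructions per unroll.
cycles_per_tbl = median(cycle_samples) / (unrolls * iterations * count)
# Sustained SIMD uop rate implied by that throughput.
uops_per_cycle = uops_per_tbl / cycles_per_tbl

print(f"{cycles_per_tbl:.5f}")  # 1.50045
print(f"{uops_per_cycle:.2f}")  # 2.00
```

Under these assumptions, one four-register tbl completes roughly every 1.5 cycles, which corresponds to about 2 of its uops executing per cycle.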

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 1.5005

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
240024 | 120056 | 240014 | 11 | 240003 | 10 | 240008 | 30 | 1199996 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10
240025 | 120073 | 240054 | 11 | 240043 | 10 | 240058 | 30 | 1199996 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10
240024 | 120036 | 240011 | 11 | 240000 | 10 | 240000 | 30 | 1199998 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10
240024 | 120036 | 240011 | 11 | 240000 | 10 | 240000 | 30 | 1199998 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10
240024 | 120036 | 240011 | 11 | 240000 | 10 | 240000 | 30 | 1199998 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10
240025 | 120072 | 240050 | 11 | 240039 | 10 | 240054 | 30 | 1199998 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10
240024 | 120036 | 240011 | 11 | 240000 | 10 | 240000 | 30 | 1200266 | 240068 | 20 | 240068 | 20 | 720000 | 1 | 240000 | 10
240024 | 120036 | 240011 | 11 | 240000 | 10 | 240000 | 30 | 1199998 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10
240024 | 120036 | 240011 | 11 | 240000 | 10 | 240000 | 30 | 1200039 | 240018 | 20 | 240012 | 20 | 720000 | 1 | 240000 | 10
240024 | 120036 | 240011 | 11 | 240000 | 10 | 240000 | 30 | 1199998 | 240010 | 20 | 240000 | 20 | 720000 | 1 | 240000 | 10