Apple Microarchitecture Research by Dougall Johnson

TBX (three register table, 16B)
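For reference, a minimal Python sketch of what the measured instruction computes, based on the architectural definition of TBX: the three table registers form one 48-byte table, each index byte selects a table byte, and an out-of-range index leaves the destination byte unchanged. The function name tbx_16b and the list-of-bytes representation are illustrative assumptions, not part of the test harness; the register values are the ones the tests load with movi.

```python
def tbx_16b(dest, table_regs, indices):
    """Model TBX Vd.16B, { Vn.16B, ... }, Vm.16B on lists of 16 byte values."""
    # Concatenate the table registers: 48 bytes for a three-register table.
    table = [byte for reg in table_regs for byte in reg]
    # In-range indices pick a table byte; out-of-range indices keep the old
    # destination byte (this is what distinguishes TBX from TBL).
    return [table[i] if i < len(table) else d for d, i in zip(dest, indices)]


# Register contents as set up by the movi instructions in the tests.
v0 = [1] * 16
v1 = [2] * 16
v2 = [3] * 16
v3 = [4] * 16
v4 = [5] * 16

result = tbx_16b(v0, [v1, v2, v3], v4)
print(result)
```

With every index byte equal to 5, each result byte is table byte 5, which comes from v1, so the result is sixteen bytes of value 2.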

Test 1: uops

Code:

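  // Measured instruction (repeated per unroll)
  tbx v0.16b, { v1.16b, v2.16b, v3.16b }, v4.16b
  // Register setup (not part of the measured code; the counts below show 3 uops per unroll)
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4
  movi v4.16b, 5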

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 3.000

Issues: 3.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 3.000

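retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000
3004 | 6033 | 3001 | 1 | 3000 | 3000 | 152248 | 3000 | 3000 | 9000 | 1 | 3000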

Test 2: Latency 1->1

Code:

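  // Measured instruction (repeated per unroll); v0 is both an input and the output, which forms the dependency chain
  tbx v0.16b, { v1.16b, v2.16b, v3.16b }, v4.16b
  // Register setup (not part of the measured code)
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4
  movi v4.16b, 5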

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 6.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
3020460033302012013000020030000700152924730200200300062009001210130000100
3020460033302012013000020030000700152924730200200300062009001210130000100
3020460033302012013000020030000700152924830200200300042009014410030000100
3020460033302012013000020030000700152924830200200300042009001210130000100
3020460033302012013000020030000700152924830200200300042009001210130000100
3020460033302012013000020030000700152924830200200300042009001210130000100
3020460033302012013000020030000700152924830200200300042009001210130000100
3020460033302012013000020030000700152924830200200300042009001210130000100
3020460033302012013000020030000700152924830200200300042009001210130000100
3020460033302012013000020030000700152958030234200300482009001210130000100

1000 unrolls and 10 iterations

Result (median cycles for code): 6.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010
3002460033300111130000103000030152924830010203000020090000103000010

Test 3: Latency 1->2

Chain cycles: 2

Code:

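  // Measured sequence (repeated per unroll): the add feeds the tbx result back into operand 2 (v1)
  movi v0.16b, 0
  tbx v0.16b, { v1.16b, v2.16b, v3.16b }, v4.16b
  add v1.16b, v0.16b, v0.16b
  // Register setup (not part of the measured code)
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4
  movi v4.16b, 5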

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 6.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
502058006640109101400081004003430020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020395804013420040044200110124150000100
502048003340101101400001004000030020392484010020040003200110011150000100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 6.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
500248023240059114004801040144030020392484001020040000200110000105000010
500248003340011114000001040000030020392484001020040000200110000105000010
500248003340011114000001040000030020392484001020040000200110000105000010
500248023440059114004801040144030020392484001020040000200110000105000010
500248003340011114000001040000030020392484001020040000200110000105000010
500248003340011114000001040000030020392484001020040000200110000105000010
500248003340011114000001040000030020392484001020040000200110000105000010
500248003340011114000001040000030020392484001020040000200110000105000010
500248003340011114000001040000030020392484001020040000200110000105000010
500248023040059114004801040144030020392484001020040000200110000105000010
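In the latency tests the add feeds the tbx result (operand 1, v0) back into one of the input operands (operand 2, v1, in Test 3), so each copy of the code costs the tbx latency plus the rest of the chain. The following is a minimal sketch of how the reported figure appears to follow from the raw counters. It assumes that the second field of each counter row is the cycle (02) count, read here as 80066 for the first sample and 80033 for the other nine in the 100 unrolls and 100 iterations run, and that the 2 chain cycles correspond to the add. The function name latency_cycles is illustrative.

```python
from statistics import median


def latency_cycles(cycle_samples, unrolls, iterations, chain_cycles):
    """Median cycles per copy of the code, minus the rest of the dependency chain."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy - chain_cycles


# Assumed cycle (02) values for the ten samples of Test 3 (100 unrolls, 100 iterations).
test3_cycles = [80066] + [80033] * 9

latency = latency_cycles(test3_cycles, unrolls=100, iterations=100, chain_cycles=2)
print(round(latency, 4))  # 6.0033
```

Under the same assumptions, the cycle counts in Tests 4 and 5 (60035 and 40037) give 4.0035 and 2.0037, matching the results reported there.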

Test 4: Latency 1->3

Chain cycles: 2

Code:

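  // Measured sequence (repeated per unroll): the add feeds the tbx result back into operand 3 (v2)
  movi v0.16b, 0
  tbx v0.16b, { v1.16b, v2.16b, v3.16b }, v4.16b
  add v2.16b, v0.16b, v0.16b
  // Register setup (not part of the measured code)
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4
  movi v4.16b, 5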

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 4.0035

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
502046003540101101400001004000030015192404010020040008200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100
502046003540101101400001004000030015192544010020040006200110017150000100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 4.0035

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
500246003540011114000010400000300151925040010200400062011000015000010
500246003540011114000010400000300151925440010200400002011000015000010
500246003540011114000010400000300151925440010200400002011000015000010
500246003540011114000010400000340151958140056200400552011000015000010
500246003540011114000010400000300151925440010200400002011000015000010
500246003540011114000010400000300151925440010200400002011000015000010
500246003540011114000010400000300151925440010200400002011000015000010
500246003540011114000010400000300151925440010200400002011000015000010
500246003540011114000010400000300151925440010200400002011000015000010
500246003540011114000010400000300151925440010200400002011000015000010

Test 5: Latency 1->4

Chain cycles: 2

Code:

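  // Measured sequence (repeated per unroll): the add feeds the tbx result back into operand 4 (v3)
  movi v0.16b, 0
  tbx v0.16b, { v1.16b, v2.16b, v3.16b }, v4.16b
  add v3.16b, v0.16b, v0.16b
  // Register setup (not part of the measured code)
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4
  movi v4.16b, 5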

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 2.0037

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
50204400374010110140000100400010300099919740102200040008200110017150000100
50204400374010110140000100400010300099927040101200040006200110017150000100
50204400374010110140000100400010300099927040101200040006200110017150000100
50204400374010110140000100400010300099927040101200040006200110017150000100
50205400744014010140039100400660300099927040101200040006200110017150000100
50204400374010110140000100400010300099927040101200040006200110017150000100
50204400374010110140000100400010300099927040101200040006200110017150000100
50205400744013910140038100400641585159895380551001206445513968185540088200110022150000100
50204400374010110140000100400020300099927040101200040006200110017150000100
50204400374010110140000100400010300099927040101200040006200110017150000100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 2.0037

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
50024400374001111400001040001309992294001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010
50024400374001111400001040000309992654001020400002011000015000010

Test 6: Latency 1->5

Chain cycles: 2

Code:

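  // Measured sequence (repeated per unroll): the add feeds the tbx result back into operand 5 (v4, the index register)
  movi v0.16b, 0
  tbx v0.16b, { v1.16b, v2.16b, v3.16b }, v4.16b
  add v4.16b, v0.16b, v0.16b
  // Register setup (not part of the measured code)
  movi v0.16b, 1
  movi v1.16b, 2
  movi v2.16b, 3
  movi v3.16b, 4
  movi v4.16b, 5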

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 6.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
502048003340101101400001004000030020392484010020040003200110011150000100
502048003340101101400001004000030020392484010020040004200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502058006640111103400081024003430020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502048003340101101400001004000030020392484010020040003200110009150000100
502058006640109101400081004003430020392484010020040003200110009150000100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 6.0033

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
50024800334001111400001040000302039248400102040003200110000105000010
50024800334001111400001040000302039248400102040000200110000105000010
50024800334001111400001040000302039248400102040000200110000105000010
50024800334001111400001040000302039248400102040000200110000105000010
50024800334001111400001040000302039248400102040000200110000105000010
50024800334001111400001040000302039248400102040000200110000105000010
50024800334001111400001040000302039248400102040000200110124105000010
50024800334001111400001040000302039248400102040000200110000105000010
50024800334001111400001040000302039248400102040000200110000105000010
50024800334001111400001040000302039248400102040000200110000105000010

Test 7: throughput

Count: 8

Code:

  // Measured sequence (repeated per unroll): 8 independent tbx, each writing a freshly zeroed destination
  movi v0.16b, 0
  tbx v0.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  movi v1.16b, 0
  tbx v1.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  movi v2.16b, 0
  tbx v2.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  movi v3.16b, 0
  tbx v3.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  movi v4.16b, 0
  tbx v4.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  movi v5.16b, 0
  tbx v5.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  movi v6.16b, 0
  tbx v6.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  movi v7.16b, 0
  tbx v7.16b, { v8.16b, v9.16b, v10.16b }, v11.16b
  // Register setup (not part of the measured code)
  movi v8.16b, 9
  movi v9.16b, 10
  movi v10.16b, 11
  movi v11.16b, 12

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 1.5005

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
32020412004424010310124000210024000930019194542401102002400132007200391320000100
32020412003924010110124000010024000830019197582401102002400132007200361320000100
32020412003924010110124000010024000830019198942401082002400122007200361320000100
32020412003924010110124000010024000830019198942401082002400122007200361320000100
32020412003924010110124000010024000830019198942401082002400122007200361320000100
32020412003924010110124000010024000830019198942401082002400122007200361320000100
32020412003924010110124000010024000830019198942401082002400122007200361320000100
32020412003924010110124000010024000830019198942401082002400122007200361320000100
32020412003924010110124000010024000830019201202401622002400692007200361320000100
32020412003924010110124000010024000830019198942401082002400122007200361320000100

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 1.5007

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
3200241200872400111124000010240008301919309240074202400732007200361032000010
3200241200612400111124000010240000301919674240010202400002007200001032000010
3200241200522400111124000010240000301919674240010202400002007200001032000010
3200241200532400111124000010240000301919674240010202400002007200001032000010
3200241200532400111124000010240000301919674240010202400002007200001032000010
3200241200532400111124000010240000301919982240077202400752007200001032000010
3200241200532400111124000010240000301919674240010202400002007200001032000010
3200241200532400111124000010240000301919674240010202400002007200001032000010
3200241200532400111124000010240000301919674240010202400002007200001032000010
3200241200532400111124000010240000301919674240010202400002007200001032000010
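The throughput figure can be reproduced in the same way. This is a minimal sketch, assuming the second field of each counter row is the cycle (02) count, read here as 120044 for the first sample and 120039 for the other nine in the 100 unrolls and 100 iterations run. The function name cycles_per_instruction is illustrative.

```python
from statistics import median


def cycles_per_instruction(cycle_samples, unrolls, iterations, count):
    """Median cycles per copy of the code, divided by the instructions of interest per copy."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy / count


# Assumed cycle (02) values for the ten samples of Test 7 (100 unrolls, 100 iterations).
test7_cycles = [120044] + [120039] * 9

throughput = cycles_per_instruction(test7_cycles, unrolls=100, iterations=100, count=8)
print(round(throughput, 4))  # 1.5005
```

Each copy of the code contains 8 tbx instructions (Count: 8) and takes about 12 cycles, so the reciprocal throughput is about 1.5 cycles per tbx.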