Apple Microarchitecture Research by Dougall Johnson

UDIV (medium, 64-bit)

Test 1: uops

Code:

  udiv x0, x1, x2

  // operand setup; the retire counts show these two movs are not part of the unrolled sequence
  mov x1, #0xffffffff
  mov x2, #3

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 2.000

Integer unit issues: 2.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000

Test 2: Latency 1->2

Chain cycles: 2

Code:

  udiv x0, x1, x2
  eor x1, x1, x0
  eor x1, x1, x0

  // operand setup; the retire counts show these two movs are not part of the unrolled sequence
  mov x1, #0xffffffff
  mov x2, #3

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 150030 | 40201 | 40201 | 30203 | 4018372 | 30230 | 30246 | 60220 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60294 | 40104 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60296 | 40104 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 40101 | 30100
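
The latency figure for this run can be reproduced from the cycle (02) counter. The sketch below is an assumption about how the result is derived, not the author's harness: it takes the median cycle count, divides by the number of unrolled copies executed (100 unrolls times 100 iterations), and subtracts the 2 chain cycles, which presumably correspond to the two dependent eor instructions. The function name and the sample list are hypothetical; the value 150030 is my reading of the cycle column in the ten rows above.

```python
from statistics import median


def latency_cycles(cycle_samples, unrolls, iterations, chain_cycles):
    """Median cycles per unrolled copy, minus the helper-chain cycles."""
    copies = unrolls * iterations
    return median(cycle_samples) / copies - chain_cycles


# cycle (02) readings for the 100-unroll, 100-iteration run (assumed column split)
cycle_samples = [150030] * 10

result = latency_cycles(cycle_samples, unrolls=100, iterations=100, chain_cycles=2)
print(f"{result:.4f}")  # 13.0030
```

Under these assumptions the output agrees with the reported 13.0030.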

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 150030 | 40011 | 40011 | 30013 | 0 | 4018298 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30025 | 150060 | 40014 | 40014 | 30040 | 0 | 4018298 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30025 | 150060 | 40014 | 40014 | 30042 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 30010 | 30020 | 0 | 60020 | 40001 | 30010

Test 3: Latency 1->3

Chain cycles: 2

Code:

  udiv x0, x1, x2
  eor x2, x2, x0
  eor x2, x2, x0

  // operand setup; the retire counts show these two movs are not part of the unrolled sequence
  mov x1, #0xffffffff
  mov x2, #3

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

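retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018042 | 0 | 0 | 30203 | 30210 | 0 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100
30205 | 150060 | 40204 | 40204 | 0 | 0 | 30230 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60298 | 40104 | 30100
30205 | 150060 | 40206 | 40206 | 0 | 0 | 30232 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 0 | 4018070 | 0 | 0 | 30203 | 30212 | 0 | 0 | 60224 | 40101 | 30100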

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30013 | 0 | 4018648 | 30042 | 30066 | 60040 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 4018329 | 30010 | 30020 | 60116 | 40004 | 30010
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
34821 | 171129 | 44096 | 42487 | 51 | 1558 | 32343 | 41 | 4018298 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30025 | 150060 | 40014 | 40014 | 0 | 0 | 30041 | 0 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
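
The fifth sample in this run is an outlier (about 171129 cycles against roughly 150030 for the others, with nonzero SIMD and load/store scheduling counts), which is presumably why the result is reported as a median. The sketch below compares a median-based and a mean-based latency estimate for the ten cycle (02) readings. It is an illustration only: the cycle values are my reading of the cycle column, and the variable names are hypothetical.

```python
from statistics import mean, median

# cycle (02) readings for the 1000-unroll, 10-iteration run (assumed column split)
cycles = [150030, 150030, 150030, 150030, 171129,
          150030, 150030, 150030, 150060, 150030]

COPIES = 1000 * 10   # unrolls * iterations
CHAIN_CYCLES = 2     # chain cycles stated for this test

by_median = median(cycles) / COPIES - CHAIN_CYCLES
by_mean = mean(cycles) / COPIES - CHAIN_CYCLES

print(f"median-based: {by_median:.4f}")  # 13.0030
print(f"mean-based:   {by_mean:.4f}")    # 13.2143
```

The median reproduces the reported 13.0030, while a single disturbed sample shifts the mean by about 0.2 cycles.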

Test 4: throughput

Code:

  udiv x0, x1, x2

  // operand setup; the retire counts show these two movs are not part of the unrolled sequence
  mov x1, #0xffffffff
  mov x2, #3

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 13.0030

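retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10206 | 130090 | 20107 | 20107 | 10118 | 0 | 1159832 | 0 | 0 | 10100 | 10206 | 0 | 0 | 20216 | 20001 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 0 | 1159832 | 0 | 0 | 10100 | 10206 | 0 | 0 | 20216 | 20001 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 0 | 1159832 | 0 | 0 | 10100 | 10208 | 0 | 0 | 20216 | 20001 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 0 | 1159832 | 0 | 0 | 10100 | 10208 | 0 | 0 | 20216 | 20001 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 0 | 1159948 | 0 | 0 | 10109 | 10224 | 0 | 0 | 20216 | 20001 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 0 | 1159832 | 0 | 0 | 10100 | 10208 | 0 | 0 | 20216 | 20001 | 10100
10205 | 130060 | 20104 | 20104 | 10109 | 0 | 1159948 | 0 | 0 | 10109 | 10223 | 0 | 0 | 20216 | 20001 | 10100
10205 | 130060 | 20104 | 20104 | 10109 | 0 | 1160064 | 0 | 0 | 10118 | 10240 | 0 | 0 | 20248 | 20004 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 0 | 1159832 | 0 | 0 | 10100 | 10208 | 0 | 0 | 20216 | 20001 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 0 | 1159832 | 0 | 0 | 10100 | 10208 | 0 | 0 | 20216 | 20001 | 10100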

1000 unrolls and 10 iterations

Result (median cycles for code): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10026 | 0 | 20070 | 20014 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159708 | 0 | 10029 | 10044 | 0 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20068 | 20014 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20068 | 20014 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20062 | 20014 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 0 | 1159592 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010