Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

UDIV (slow, 32-bit)

Test 1: uops

Code:

  udiv w0, w1, w2
  mov w1, #0xffffffff
  mov w2, #3

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 2.000

Integer unit issues: 2.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
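
The per-instruction figures for this test can be approximated by dividing the raw counters by the number of unrolled copies. The Python sketch below does this for three of the counters in the rows above. The page does not show how it removes measurement overhead, so the small residue (for example 1.004 rather than 1.000 retires) is left in, and the helper name `per_unroll` is hypothetical.

```python
# Raw counter values read from the Test 1 rows (identical in all ten runs).
UNROLLS = 1000
ITERATIONS = 1

counters = {
    "retire uop (01)": 1004,
    "schedule uop (52)": 2001,
    "schedule int uop (53)": 2001,
}

def per_unroll(raw, unrolls=UNROLLS, iterations=ITERATIONS):
    """Average counter value per copy of the measured instruction."""
    return raw / (unrolls * iterations)

normalized = {name: per_unroll(value) for name, value in counters.items()}
for name, value in normalized.items():
    print(f"{name}: {value:.3f}")
```

Rounded, this gives about one retired uop and about two issued integer uops per udiv, in line with the summary lines of this test.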

Test 2: Latency 1->2

Chain cycles: 2

Code:

  udiv w0, w1, w2
  eor x1, x1, x0
  eor x1, x1, x0
  mov w1, #0xffffffff
  mov w2, #3

(fused SUBS/B.cc loop)
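
The two eor instructions appear to act as a dependency chain: they feed the udiv result back into its first input without changing that input's value, because XORing with the same value twice cancels out. The Python sketch below models the registers as plain integers to show this. It assumes the two mov lines run once as setup rather than in every unrolled copy, which the retire count of roughly 3 per copy suggests; the helper name `udiv32` is hypothetical.

```python
MASK64 = (1 << 64) - 1

def udiv32(n, d):
    """AArch64 32-bit UDIV: unsigned quotient; division by zero yields 0."""
    n &= 0xFFFFFFFF
    d &= 0xFFFFFFFF
    return 0 if d == 0 else n // d

x1 = 0xFFFFFFFF  # mov w1, #0xffffffff
x2 = 3           # mov w2, #3

x0 = udiv32(x1, x2)                         # udiv w0, w1, w2 (w0 zero-extends into x0)
after_first = (x1 ^ x0) & MASK64            # eor x1, x1, x0
after_second = (after_first ^ x0) & MASK64  # eor x1, x1, x0

print(hex(x0), hex(after_first), hex(after_second))
# prints: 0x55555555 0xaaaaaaaa 0xffffffff
```

Since x1 ends up unchanged, every udiv in the unrolled sequence divides the same operands, yet each one has to wait for the previous result to pass through the two eor instructions. Those two instructions account for the 2 chain cycles subtracted from the result.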

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

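retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018042 | 30203 | 30210 | 60224 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30205 | 150060 | 40204 | 40204 | 0 | 30231 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30205 | 150060 | 40204 | 40204 | 0 | 30232 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 0 | 30100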
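
The reported latency can be reproduced from the cycle counter (02) of the ten runs above: take the median, divide by the number of executed copies, and subtract the chain cycles. The Python sketch below follows that recipe; the function name `latency` is hypothetical, and the page's own tooling may compute the figure differently.

```python
from statistics import median

# Cycle counter (02) from the ten runs at 100 unrolls x 100 iterations.
cycles = [150030, 150030, 150030, 150060, 150030,
          150030, 150030, 150060, 150030, 150030]

UNROLLS = 100
ITERATIONS = 100
CHAIN_CYCLES = 2  # the two eor instructions, as stated by the test

def latency(cycle_samples, unrolls, iterations, chain_cycles):
    """Median cycles per unrolled copy, minus the helper chain's cycles."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy - chain_cycles

udiv_latency = latency(cycles, UNROLLS, ITERATIONS, CHAIN_CYCLES)
print(f"{udiv_latency:.4f}")  # prints: 13.0030
```

Using the median rather than the mean keeps the two slower runs (150060 cycles) from shifting the result.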

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 150030 | 40011 | 40011 | 30013 | 4018317 | 30013 | 30030 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018679 | 30042 | 30068 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018670 | 30040 | 30065 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010

Test 3: Latency 1->3

Chain cycles: 2

Code:

  udiv w0, w1, w2
  eor x2, x2, x0
  eor x2, x2, x0
  mov w1, #0xffffffff
  mov w2, #3

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

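retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? int retires (ef)
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60556 | 222 | 40216 | 88 | 30230
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30205 | 150060 | 40204 | 40204 | 0 | 0 | 30230 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30205 | 150060 | 40206 | 40206 | 0 | 0 | 30234 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 0 | 0 | 30203 | 0 | 4018070 | 30203 | 30212 | 60224 | 0 | 40101 | 0 | 30100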

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 150030 | 40011 | 40011 | 30013 | 4018298 | 30010 | 30020 | 60124 | 40004 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60116 | 40004 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60116 | 40004 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010

Test 4: throughput

Code:

  udiv w0, w1, w2
  mov w1, #0xffffffff
  mov w2, #3

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
10204 | 130030 | 20101 | 20101 | 10100 | 1159832 | 10100 | 10206 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159948 | 10109 | 10222 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159948 | 10109 | 10224 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159948 | 10109 | 10222 | 20212 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 10100 | 1159948 | 10109 | 10224 | 20216 | 0 | 20001 | 0 | 0 | 10100

1000 unrolls and 10 iterations

Result (median cycles for code): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 130030 | 20021 | 20021 | 10020 | 1159708 | 10029 | 10043 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10025 | 130060 | 20024 | 20024 | 10029 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 10020 | 1159592 | 10020 | 10020 | 20062 | 20014 | 10010
10025 | 130060 | 20024 | 20024 | 10029 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
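
The throughput result can be set against the 13.0030-cycle latency reported by Tests 2 and 3. The Python sketch below derives the cycles per udiv from the cycle counter (02) of the ten runs above and compares the two figures. Reading the ratio as the number of divisions in flight is an interpretation added here, not something the page states, and the variable names are hypothetical.

```python
from statistics import median

# Cycle counter (02) from the ten runs at 1000 unrolls x 10 iterations.
cycles = [130030, 130030, 130030, 130030, 130030,
          130060, 130030, 130030, 130030, 130060]

UNROLLS = 1000
ITERATIONS = 10
LATENCY = 13.003  # result reported by Tests 2 and 3

# Reciprocal throughput: cycles spent per independent udiv.
cycles_per_udiv = median(cycles) / (UNROLLS * ITERATIONS)

# Latency divided by reciprocal throughput estimates how many
# divisions overlap on average.
in_flight = LATENCY / cycles_per_udiv

print(f"{cycles_per_udiv:.4f} cycles per udiv, about {in_flight:.2f} in flight")
```

A ratio of about 1 means the independent udiv instructions here complete no faster than dependent ones. That would be consistent with these slow-path divisions not overlapping in the divider, though the counters alone do not prove it.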