Apple Microarchitecture Research by Dougall Johnson

UDIV (fast, 32-bit)

Test 1: uops

Code:

  udiv w0, w1, w2
  mov w1, #0
  mov w2, #0

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 2.000

Integer unit issues: 2.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000

Test 2: Latency 1->2

Chain cycles: 2

Code:

  udiv w0, w1, w2
  eor x1, x1, x0
  eor x1, x1, x0
  mov w1, #0
  mov w2, #0

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 90030 | 40201 | 40201 | 30203 | 2398635 | 30203 | 30210 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398853 | 30234 | 30250 | 60220 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398677 | 30203 | 30210 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30206 | 90090 | 40212 | 40212 | 30264 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
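The result line above can be reproduced from the cycle (02) counter. A minimal Python sketch, assuming the second value in each run is the cycle count and that the result is the median cycle count divided by (unrolls × iterations), minus the stated chain cycles. The function name `latency_estimate` is hypothetical and not part of the original test harness:

```python
from statistics import median

# cycle (02) readings for the ten runs, as read from the rows above.
# Assumption: the second value of each run is the cycle count.
cycles = [90030, 90030, 90030, 90030, 90030,
          90030, 90030, 90030, 90030, 90090]

UNROLLS = 100
ITERATIONS = 100
CHAIN_CYCLES = 2  # "Chain cycles: 2" as stated for this test


def latency_estimate(cycle_samples, unrolls, iterations, chain_cycles):
    """Median cycles per unrolled copy, minus the dependency-chain overhead."""
    per_copy = median(cycle_samples) / (unrolls * iterations)
    return per_copy - chain_cycles


result = latency_estimate(cycles, UNROLLS, ITERATIONS, CHAIN_CYCLES)
print(f"{result:.4f}")
```

Under these assumptions the value agrees with the reported 7.0030; the one slower run (90090 cycles) does not move the median.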

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 90030 | 40011 | 40011 | 30013 | 2398954 | 30013 | 30030 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30027 | 90120 | 40030 | 40030 | 30107 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010

Test 3: Latency 1->3

Chain cycles: 2

Code:

  udiv w0, w1, w2
  eor x2, x2, x0
  eor x2, x2, x0
  mov w1, #0
  mov w2, #0

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 90030 | 40201 | 40201 | 30203 | 2398676 | 30203 | 30210 | 60224 | 40101 | 30100
30205 | 90061 | 40206 | 40206 | 30233 | 2398677 | 30203 | 30210 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2399075 | 30233 | 30252 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30013 | 0 | 2398935 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2398977 | 30010 | 30020 | 60642 | 40085 | 30010
30025 | 90060 | 40018 | 40018 | 0 | 0 | 30045 | 0 | 2398935 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2399344 | 30043 | 30072 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 0 | 0 | 30010 | 0 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010

Test 4: throughput

Code:

  udiv w0, w1, w2
  mov w1, #0
  mov w2, #0

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620048 | 10100 | 10206 | 20212 | 20001 | 10100
10204 | 70204 | 20127 | 20127 | 0 | 0 | 10146 | 0 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
15001 | 90605 | 24480 | 22721 | 60 | 1699 | 12606 | 46 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620724 | 10144 | 10282 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
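The third run above is a clear outlier: it shows more retired uops and cycles than the others, plus SIMD and load/store activity that the test code itself does not contain (the source does not state the cause). A short Python sketch of why a median is a sensible summary here, assuming the second value in each run is the cycle (02) count; the variable names are illustrative only:

```python
from statistics import mean, median

# cycle (02) readings for the ten runs, as read from the rows above.
# Assumption: the second value of each run is the cycle count.
cycles = [70030, 70204, 90605, 70030, 70030,
          70030, 70030, 70030, 70030, 70030]

COPIES = 100 * 100  # 100 unrolls x 100 iterations

# The median ignores the single disturbed run; the mean is pulled up by it.
median_cycles = median(cycles) / COPIES
mean_cycles = mean(cycles) / COPIES

print(f"median cycles per instruction: {median_cycles:.4f}")
print(f"mean cycles per instruction:   {mean_cycles:.4f}")
```

With these readings the median works out to the reported 7.0030 cycles, while the mean comes out noticeably higher because of the one disturbed run.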

1000 unrolls and 10 iterations

Result (median cycles for code): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10025 | 70060 | 20025 | 20025 | 10030 | 0 | 619808 | 0 | 10020 | 10026 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20068 | 20015 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 0 | 619808 | 0 | 10020 | 10020 | 0 | 20020 | 20011 | 10010