Apple Microarchitecture Research by Dougall Johnson

UDIV (fast, 64-bit)

Test 1: uops

Code:

  udiv x0, x1, x2
  mov x1, #0
  mov x2, #0

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 2.000

Integer unit issues: 2.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 7030 | 2001 | 2001 | 1000 | 61748 | 1000 | 1000 | 2000 | 2001 | 1000

Test 2: Latency 1->2

Chain cycles: 2

Code:

  udiv x0, x1, x2
  eor x1, x1, x0
  eor x1, x1, x0
  mov x1, #0
  mov x2, #0

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398969 | 0 | 30235 | 30250 | 0 | 60220 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60304 | 40107 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 0 | 2398718 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
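
The reported latency can be reproduced from the raw cycle counter with a little arithmetic. The sketch below is illustrative only: it assumes the cycle (02) counter for this run reads 90030 in total across all iterations, and that the two dependent EOR instructions account for the 2 chain cycles. The function names are hypothetical and not part of the original test harness.

```python
def cycles_per_unroll(total_cycles: int, unrolls: int, iterations: int) -> float:
    """Average cycles spent on one unrolled copy of the test code."""
    return total_cycles / (unrolls * iterations)


def chain_adjusted_latency(total_cycles: int, unrolls: int,
                           iterations: int, chain_cycles: int) -> float:
    """Subtract the cycles attributed to the helper instructions in the chain."""
    return cycles_per_unroll(total_cycles, unrolls, iterations) - chain_cycles


# Assumed reading of the counters: 90030 cycles, 100 unrolls, 100 iterations,
# and 2 chain cycles for the two EORs that feed the result back into x1.
latency = chain_adjusted_latency(90030, unrolls=100, iterations=100, chain_cycles=2)
print(f"{latency:.4f}")  # 7.0030
```

Under these assumptions the result matches the 7.0030 figure quoted for this run; the extra 0.003 is the 30-cycle remainder spread over 10000 unrolled copies.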

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 90030 | 40011 | 40011 | 30013 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2399338 | 30044 | 30072 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60124 | 40008 | 30010

Test 3: Latency 1->3

Chain cycles: 2

Code:

  udiv x0, x1, x2
  eor x2, x2, x0
  eor x2, x2, x0
  mov x1, #0
  mov x2, #0

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 90030 | 40201 | 40201 | 30203 | 2398634 | 30203 | 30210 | 60220 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60304 | 40106 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30205 | 90060 | 40206 | 40206 | 30233 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100
30204 | 90030 | 40201 | 40201 | 30203 | 2398718 | 30203 | 30212 | 60224 | 40101 | 30100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 90030 | 40011 | 40011 | 30013 | 2398935 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 90030 | 40011 | 40011 | 30010 | 2398977 | 30010 | 30020 | 60020 | 40001 | 30010

Test 4: throughput

Code:

  udiv x0, x1, x2
  mov x1, #0
  mov x2, #0

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10206 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620164 | 10110 | 10224 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
10204 | 70030 | 20101 | 20101 | 10100 | 620048 | 10100 | 10208 | 20216 | 20001 | 10100
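
The throughput figure follows from the same kind of division, this time with no chain cycles to subtract because the UDIVs in this test do not depend on each other. The sketch below is a rough illustration, assuming the cycle (02) counter reads 70030 for this run and reusing the 7.003-cycle latency reported in Tests 2 and 3; the function and variable names are hypothetical.

```python
def reciprocal_throughput(total_cycles: int, unrolls: int, iterations: int) -> float:
    """Average cycles between the starts of consecutive independent instructions."""
    return total_cycles / (unrolls * iterations)


# Assumed reading of the counters for the 100-unroll, 100-iteration run.
rt = reciprocal_throughput(70030, unrolls=100, iterations=100)

# Latency as reported by the dependency-chain tests.
reported_latency = 7.003

# Ratio of latency to reciprocal throughput: an estimate of how many
# of these instructions overlap in execution on average.
overlap = reported_latency / rt

print(f"{rt:.4f} cycles per UDIV, about {overlap:.2f} in flight")
```

A ratio close to 1 is consistent with roughly one of these divides executing at a time for the operands used here, although the counters alone do not show why.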

1000 unrolls and 10 iterations

Result (median cycles for code): 7.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10028 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619924 | 10030 | 10044 | 20020 | 20011 | 10010
10025 | 70060 | 20025 | 20025 | 10030 | 619808 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10020 | 20072 | 20015 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 70030 | 20021 | 20021 | 10020 | 619808 | 10020 | 10020 | 20020 | 20011 | 10010