Apple Microarchitecture Research by Dougall Johnson

SDIV (medium, 64-bit)

Test 1: uops

Code:

  sdiv x0, x1, x2
  mov x1, #0xffffffff
  mov x2, #3

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 2.000

Integer unit issues: 2.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 0.000

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000
1004 | 13030 | 2001 | 2001 | 1000 | 115532 | 1000 | 1000 | 2000 | 2001 | 1000

Test 2: Latency 1->2

Chain cycles: 2

Code:

  sdiv x0, x1, x2
  eor x1, x1, x0
  eor x1, x1, x0
  mov x1, #0xffffffff
  mov x2, #3

(fused SUBS/B.cc loop)
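
The two EOR instructions are what make this a latency test: each one reads x0, so x1 depends on the divide result, and XORing the same value in twice restores x1 so that every SDIV sees the same operands. The Python sketch below models only that data flow. It assumes the two MOVs are one-time setup (the listing does not say so explicitly, although the retire counts seem consistent with it), and `sdiv64` is a hypothetical helper imitating AArch64 SDIV semantics, not part of the measurement harness.

```python
MASK64 = (1 << 64) - 1

def to_signed(v):
    """Reinterpret a 64-bit register value as a signed integer."""
    return v - (1 << 64) if v & (1 << 63) else v

def sdiv64(n, d):
    """Hypothetical model of AArch64 SDIV on 64-bit registers."""
    n, d = to_signed(n), to_signed(d)
    if d == 0:
        return 0                  # SDIV yields 0 on division by zero
    q = abs(n) // abs(d)          # quotient truncates toward zero
    if (n < 0) != (d < 0):
        q = -q
    return q & MASK64

# Assumed one-time setup (the two MOVs in the listing).
x1 = 0xFFFFFFFF
x2 = 3

for _ in range(5):                # a few unrolled copies of the test body
    x0 = sdiv64(x1, x2)           # sdiv x0, x1, x2
    x1 = (x1 ^ x0) & MASK64       # eor x1, x1, x0  (x1 now depends on x0)
    x1 = (x1 ^ x0) & MASK64       # eor x1, x1, x0  (restores the original x1)

print(hex(x0), hex(x1))           # 0x55555555 0xffffffff
```

Because x1 returns to 0xffffffff after every copy, each divide runs on the same "medium" operands while still having to wait for the previous result.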

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018014 | 0 | 30203 | 30210 | 0 | 60220 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30205 | 150060 | 40204 | 40204 | 30231 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60304 | 40104 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30205 | 150060 | 40204 | 40204 | 30230 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 0 | 4018070 | 0 | 30203 | 30212 | 0 | 60224 | 40101 | 30100

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

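retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 150030 | 40011 | 40011 | 30013 | 0 | 4018298 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60116 | 40004 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60188 | 40007 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60046 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 0 | 4018329 | 0 | 0 | 30010 | 30020 | 0 | 0 | 60020 | 40001 | 30010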
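
The reported result appears to follow from the raw counters: take the median cycle count, divide by the number of copies of the test body executed (unrolls times iterations), and subtract the chain cycles. The Python sketch below checks that reading against the run with 100 unrolls and 100 iterations. The formula is an assumption inferred from the "Result" label rather than something the page states, and the cycle values assume the second field of each counter row is the "cycle (02)" count.

```python
from statistics import median

# Assumed "cycle (02)" values from the ten runs with 100 unrolls and 100 iterations.
cycles = [150030, 150030, 150060, 150030, 150030,
          150030, 150060, 150030, 150030, 150030]

unrolls, iterations = 100, 100
chain_cycles = 2   # taken from "Chain cycles: 2" (the two EORs)

# Cycles spent per copy of the test body, including the dependency chain.
per_copy = median(cycles) / (unrolls * iterations)

# Removing the chain leaves the SDIV latency for this operand.
latency = per_copy - chain_cycles
print(f"{latency:.4f}")   # 13.0030
```

The run with 1000 unrolls and 10 iterations executes the same 10000 copies and shows the same cycle count in every row, so under this reading it yields the same figure.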

Test 3: Latency 1->3

Chain cycles: 2

Code:

  sdiv x0, x1, x2
  eor x2, x2, x0
  eor x2, x2, x0
  mov x1, #0xffffffff
  mov x2, #3

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

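retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
30204 | 150030 | 40201 | 40201 | 30203 | 4018014 | 30203 | 30210 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018428 | 30230 | 30247 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018405 | 30232 | 30248 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018402 | 30231 | 30248 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100
30204 | 150030 | 40201 | 40201 | 30203 | 4018070 | 30203 | 30212 | 60224 | 0 | 0 | 40101 | 0 | 0 | 30100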

1000 unrolls and 10 iterations

Result (median cycles for code, minus 2 chain cycles): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
30024 | 150030 | 40011 | 40011 | 30013 | 4018298 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60116 | 40004 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4019019 | 30069 | 30106 | 60116 | 40004 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010
30024 | 150030 | 40011 | 40011 | 30010 | 4018329 | 30010 | 30020 | 60020 | 40001 | 30010

Test 4: throughput

Code:

  sdiv x0, x1, x2
  mov x1, #0xffffffff
  mov x2, #3

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 13.0030

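retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | map ldst uop inputs (80) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159948 | 10109 | 10226 | 20212 | 0 | 20001 | 0 | 0 | 10100
10205 | 130060 | 20104 | 20104 | 0 | 0 | 10109 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100
10204 | 130030 | 20101 | 20101 | 0 | 0 | 10100 | 0 | 1159832 | 10100 | 10208 | 20216 | 0 | 20001 | 0 | 0 | 10100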

1000 unrolls and 10 iterations

Result (median cycles for code): 13.0030

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule ldst uop (55) | dispatch int uop (56) | int uops in schedulers (59) | dispatch uop (78) | map int uop (7c) | map int uop inputs (7f) | ? int output thing (e9) | ? int retires (ef)
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10028 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10025 | 130060 | 20024 | 20024 | 0 | 10029 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159592 | 10020 | 10020 | 20020 | 20011 | 10010
10024 | 130030 | 20021 | 20021 | 0 | 10020 | 1159708 | 10029 | 10044 | 20020 | 20011 | 10010
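
One way to read the throughput figure is to set it beside the latency from Tests 2 and 3. The Python sketch below does that with the median cycle count of the throughput run (100 unrolls, 100 iterations). The "overlap factor" is a name introduced here for illustration, not a metric from the page, and the interpretation in the closing comment is an inference, not a conclusion the source states.

```python
# Assumed median "cycle (02)" value for the throughput run (100 unrolls x 100 iterations).
cycles = 130030
instances = 100 * 100            # independent SDIVs executed

# Average cycles between completions of independent SDIVs.
recip_throughput = cycles / instances

# Latency as reported by Tests 2 and 3.
measured_latency = 13.0030

# Hypothetical "overlap factor": roughly how many divides are in flight at once.
overlap = measured_latency / recip_throughput

print(f"{recip_throughput:.4f} cycles per SDIV, overlap factor {overlap:.2f}")
# prints "13.0030 cycles per SDIV, overlap factor 1.00"

# A factor close to 1 would suggest that, for these operand values, a new
# divide does not start until the previous one finishes.
```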