Apple Microarchitecture Research by Dougall Johnson

UQSHRN (D)
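
For orientation, here is a minimal Python sketch of what the scalar 64-bit to 32-bit form measured on this page computes, based on the architectural definition of UQSHRN (unsigned saturating shift right narrow, truncating). The helper name `uqshrn_d_to_s` is hypothetical and used only for illustration.

```python
def uqshrn_d_to_s(value: int, shift: int) -> int:
    """Illustrative model of `uqshrn s<d>, d<n>, #shift` (hypothetical helper)."""
    if not 1 <= shift <= 32:
        raise ValueError("shift must be in 1..32 for the D-to-S form")
    value &= (1 << 64) - 1               # treat the source as unsigned 64-bit
    shifted = value >> shift             # truncating (non-rounding) right shift
    return min(shifted, (1 << 32) - 1)   # saturate to the unsigned 32-bit range


# `movi v0.16b, 1` leaves d0 = 0x0101010101010101; shifted right by 3 it still
# does not fit in 32 bits, so the result saturates.
print(hex(uqshrn_d_to_s(0x0101010101010101, 3)))  # 0xffffffff
```

In the tests below the result is written back to s0 and read again as d0, so each instruction depends on the one before it.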

Test 1: uops

Code:

  uqshrn s0, d0, #3
  movi v0.16b, 1
  movi v1.16b, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 1.000

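retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 3033 | 1001 | 1 | 1000 | 1000 | 75905 | 1000 | 1000 | 1000 | 1 | 1000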

Test 2: Latency 1->2

Code:

  uqshrn s0, d0, #3
  movi v0.16b, 1
  movi v1.16b, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 3.0033

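retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10006 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10044 | 1 | 10000 | 100
10204 | 30033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 768905 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100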

1000 unrolls and 10 iterations

Result (median cycles for code): 3.0033

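retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 66 | 769247 | 10051 | 20 | 10044 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 30033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 768905 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10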
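
The latency figure for Test 2 can be recomputed from the counter dump. The sketch below assumes that the reported result is the median of the cycle (02) readings divided by the number of executed copies of the instruction (unrolls times iterations), and that the cycle counter reads 30033 in every run of both loop shapes. The function name is hypothetical.

```python
from statistics import median

# cycle (02) readings for the ten runs, as read from the dump above
# (assumed to be 30033 in every run for both loop shapes).
cycle_samples = [30033] * 10


def cycles_per_instruction(samples, unrolls, iterations):
    """Median cycle count divided by the number of executed instruction copies."""
    return median(samples) / (unrolls * iterations)


# Each UQSHRN reads d0 and writes s0, so the copies form one dependency chain
# and the per-instruction time approximates the latency.
for unrolls, iterations in [(100, 100), (1000, 10)]:
    latency = cycles_per_instruction(cycle_samples, unrolls, iterations)
    print(f"{unrolls} x {iterations}: {latency:.4f}")  # 3.0033 in both cases
```

Under these assumptions both loop shapes give 3.0033, which suggests a latency of about 3 cycles plus a small loop overhead.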

Test 3: throughput

Count: 8

Code:

  uqshrn s0, d8, #3
  uqshrn s1, d8, #3
  uqshrn s2, d8, #3
  uqshrn s3, d8, #3
  uqshrn s4, d8, #3
  uqshrn s5, d8, #3
  uqshrn s6, d8, #3
  uqshrn s7, d8, #3
  movi v8.16b, 9

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5004

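retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80204 | 40055 | 80107 | 101 | 80006 | 100 | 80010 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 0 | 80014 | 1 | 0 | 80000 | 100
80204 | 40035 | 80107 | 101 | 80006 | 100 | 80010 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 0 | 80012 | 1 | 0 | 80000 | 100
80204 | 40035 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 0 | 80012 | 1 | 0 | 80000 | 100
80204 | 40035 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320440 | 80209 | 200 | 80113 | 202 | 0 | 80161 | 2 | 0 | 80000 | 100
80204 | 40286 | 80255 | 101 | 80154 | 100 | 80158 | 300 | 320844 | 80310 | 200 | 80214 | 200 | 0 | 80158 | 1 | 0 | 80000 | 100
80204 | 40294 | 80252 | 101 | 80151 | 100 | 80155 | 307 | 320840 | 80311 | 202 | 80213 | 200 | 0 | 80167 | 1 | 0 | 80000 | 100
80204 | 40366 | 80155 | 101 | 80054 | 100 | 80058 | 300 | 320244 | 80160 | 200 | 80064 | 200 | 0 | 80012 | 1 | 0 | 80000 | 100
80204 | 40108 | 80155 | 101 | 80054 | 100 | 80058 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 0 | 80012 | 1 | 0 | 80000 | 100
80204 | 40035 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 0 | 80012 | 1 | 0 | 80000 | 100
80204 | 40035 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 0 | 80012 | 1 | 0 | 80000 | 100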

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5004

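retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
80024 | 40176 | 80027 | 21 | 80006 | 20 | 80010 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80025 | 40070 | 80065 | 21 | 80044 | 20 | 80056 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10
80024 | 40035 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 320000 | 80020 | 20 | 80000 | 20 | 0 | 80000 | 11 | 0 | 80000 | 10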
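
The throughput figure for the 1000-unroll run can be recomputed the same way. This is a sketch under two assumptions: that the result is the median cycle (02) reading divided by unrolls times iterations and then by Count, and that the cycle readings of the ten runs are the values listed in the code (taken from the dump above). The function name is hypothetical.

```python
from statistics import median

# cycle (02) readings for the ten runs of the 1000-unroll, 10-iteration loop,
# as read from the counter dump above (an interpretation of the fused digits).
cycle_samples = [40176, 40035, 40035, 40035, 40070,
                 40035, 40035, 40035, 40035, 40035]

unrolls, iterations = 1000, 10   # loop shape stated for this run
count = 8                        # independent UQSHRN instructions per code copy


def cycles_per_uqshrn(samples, unrolls, iterations, count):
    """Median cycles per copy of the code block, divided by instructions per copy."""
    per_copy = median(samples) / (unrolls * iterations)
    return per_copy / count


result = cycles_per_uqshrn(cycle_samples, unrolls, iterations, count)
print(f"{result:.4f}")       # 0.5004
print(f"{1 / result:.2f}")   # 2.00
```

The eight instructions write different destination registers and share one source, so they can execute in parallel. A reciprocal throughput of about 0.5 cycles suggests that roughly two UQSHRN instructions can issue per cycle on this core.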