Apple Microarchitecture Research by Dougall Johnson

SQSHLU (vector, 2S)

Test 1: uops

Code:

  sqshlu v0.2s, v0.2s, #3
  movi v0.16b, 1
  movi v1.16b, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 1.000

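retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 2033 | 1001 | 1 | 1000 | 1000 | 50248 | 1000 | 1000 | 1000 | 1 | 1000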

Test 2: Latency 1->2

Code:

  sqshlu v0.2s, v0.2s, #3
  movi v0.16b, 1
  movi v1.16b, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 2.0033

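retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10006 | 202 | 10048 | 2 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10006 | 200 | 10004 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10004 | 200 | 10006 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10006 | 200 | 10004 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 307 | 509580 | 10136 | 202 | 10046 | 200 | 10004 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100
10204 | 20033 | 10101 | 101 | 10000 | 100 | 10000 | 300 | 509248 | 10100 | 200 | 10004 | 200 | 10004 | 1 | 10000 | 100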
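The reported latency can be reproduced from the cycle counter. The following is a minimal Python sketch, not the original harness: it assumes the cycle (02) counter reads 20033 in each of the ten runs above, and that the result is the median cycle count divided by unrolls times iterations. The function name `cycles_per_instruction` is illustrative only.

```python
from statistics import median

def cycles_per_instruction(cycle_samples, unrolls, iterations, count=1):
    """Median cycle count divided by the number of measured instructions."""
    return median(cycle_samples) / (unrolls * iterations * count)

# Assumed readings of the cycle (02) counter for the ten runs.
cycles = [20033] * 10

latency = cycles_per_instruction(cycles, unrolls=100, iterations=100)
print(f"{latency:.4f}")  # 2.0033
```

Because each sqshlu reads and writes v0, consecutive copies form a dependency chain, so cycles per instruction here approximates the instruction's latency: about 2 cycles.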

1000 unrolls and 10 iterations

Result (median cycles for code): 2.0033

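retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10024 | 20135 | 10047 | 23 | 10024 | 22 | 10072 | 66 | 510312 | 10092 | 20 | 10085 | 20 | 10132 | 11 | 10000 | 10
10024 | 20033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 509248 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 20135 | 10045 | 21 | 10024 | 20 | 10072 | 70 | 509780 | 10056 | 20 | 10040 | 20 | 10125 | 11 | 10000 | 10
10024 | 20137 | 10045 | 21 | 10024 | 20 | 10072 | 69 | 510844 | 10128 | 20 | 10124 | 20 | 10123 | 11 | 10000 | 10
10024 | 20136 | 10045 | 21 | 10024 | 20 | 10072 | 75 | 510844 | 10130 | 22 | 10124 | 20 | 10082 | 11 | 10000 | 10
10024 | 20137 | 10045 | 21 | 10024 | 20 | 10072 | 67 | 510296 | 10092 | 20 | 10081 | 20 | 10082 | 11 | 10000 | 10
10024 | 20084 | 10033 | 21 | 10012 | 20 | 10036 | 75 | 510844 | 10130 | 22 | 10126 | 20 | 10082 | 11 | 10000 | 10
10024 | 20135 | 10045 | 21 | 10024 | 20 | 10072 | 66 | 510312 | 10092 | 20 | 10083 | 20 | 10126 | 11 | 10000 | 10
10024 | 20189 | 10057 | 21 | 10036 | 20 | 10108 | 68 | 510312 | 10092 | 20 | 10085 | 22 | 10123 | 12 | 10000 | 10
10024 | 20188 | 10057 | 21 | 10036 | 20 | 10108 | 72 | 510844 | 10130 | 22 | 10122 | 20 | 10124 | 11 | 10000 | 10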

Test 3: throughput

Count: 8

Code:

  sqshlu v0.2s, v8.2s, #3
  sqshlu v1.2s, v8.2s, #3
  sqshlu v2.2s, v8.2s, #3
  sqshlu v3.2s, v8.2s, #3
  sqshlu v4.2s, v8.2s, #3
  sqshlu v5.2s, v8.2s, #3
  sqshlu v6.2s, v8.2s, #3
  sqshlu v7.2s, v8.2s, #3
  movi v8.16b, 9

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 0.5004

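retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80204 | 40057 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80107 | 101 | 80006 | 100 | 80010 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80204 | 40034 | 80105 | 101 | 80004 | 100 | 80008 | 300 | 320036 | 80108 | 200 | 80012 | 200 | 80012 | 1 | 80000 | 100
80205 | 40069 | 80143 | 101 | 80042 | 100 | 80054 | 300 | 320044 | 80110 | 200 | 80014 | 200 | 80012 | 1 | 80000 | 100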
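The throughput figure follows from the same counter, this time divided by the count of eight independent instructions per unroll. The Python sketch below is illustrative rather than the original harness: the ten cycle values are an assumed reading of the cycle (02) column for the runs above, and the variable names are hypothetical.

```python
from statistics import median

# Assumed readings of the cycle (02) counter for the ten runs above.
cycles = [40057, 40034, 40034, 40034, 40034,
          40034, 40034, 40034, 40034, 40069]

unrolls, iterations, count = 100, 100, 8
instructions = unrolls * iterations * count  # 80000 sqshlu instructions

# The median discards the two slower outlier runs.
cycles_per_instr = median(cycles) / instructions
instrs_per_cycle = 1 / cycles_per_instr

print(f"{cycles_per_instr:.4f} cycles/instruction")   # 0.5004
print(f"{instrs_per_cycle:.2f} instructions/cycle")   # 2.00
```

A reciprocal throughput of about 0.5 cycles means roughly two of these instructions can start per cycle when they do not depend on each other.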

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 0.5004

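retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | dispatch ldst uop (58) | int uops in schedulers (59) | simd uops in schedulers (5a) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map ldst uop (7d) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80024 | 40120 | 80029 | 21 | 80008 | 20 | 80012 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80025 | 40069 | 80063 | 21 | 80042 | 20 | 80054 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 40034 | 80021 | 21 | 80000 | 20 | 80000 | 0 | 70 | 0 | 320000 | 80020 | 20 | 0 | 80000 | 20 | 80064 | 11 | 80000 | 10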