Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

FSQRT (scalar, S)

Test 1: uops

Code:

  fsqrt s0, s0
  movi v0.16b, 1
  movi v1.16b, 2

(no loop instructions)

1000 unrolls and 1 iteration

Retires: 1.000

Issues: 1.000

Integer unit issues: 0.001

Load/store unit issues: 0.000

SIMD/FP unit issues: 1.000

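retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch simd uop (57) | ldst uops in schedulers (5b) | dispatch uop (78) | map simd uop (7e) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee)
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000
1004 | 10033 | 1001 | 1 | 1000 | 1000 | 128069 | 1000 | 1000 | 1000 | 1 | 1000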

Test 2: Latency 1->2

Code:

  fsqrt s0, s0
  movi v0.16b, 1
  movi v1.16b, 2

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code): 10.0033

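retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | schedule ldst uop (55) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map ldst uop inputs (80) | map simd uop inputs (81) | ? int output thing (e9) | ? ldst retires (ed) | ? simd retires (ee) | ? int retires (ef)
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10006 | 200 | 0 | 10006 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289241 | 10115 | 200 | 10029 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10205 | 100066 | 10103 | 101 | 10002 | 0 | 100 | 10015 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100
10204 | 100033 | 10101 | 101 | 10000 | 0 | 100 | 10000 | 300 | 1289069 | 10100 | 200 | 10004 | 200 | 0 | 10004 | 1 | 0 | 10000 | 100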
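The reported latency appears to be the median cycle count divided by the number of FSQRT instructions executed (unrolls times iterations). The following is a minimal sketch of that arithmetic, not the author's measurement tool: the cycle (02) values are read from the ten rows above (the digit split is an assumption), and the helper name `cycles_per_instruction` is hypothetical.

```python
from statistics import median

# cycle (02) readings for the ten runs of the 100-unroll, 100-iteration test,
# as read from the rows above (assumed digit split: nine runs of 100033 and
# one outlier of 100066 in the eighth run).
cycles = [100033] * 7 + [100066] + [100033] * 2

UNROLLS = 100
ITERATIONS = 100


def cycles_per_instruction(samples, unrolls, iterations):
    """Median cycle count divided by the number of dependent instructions."""
    return median(samples) / (unrolls * iterations)


latency = cycles_per_instruction(cycles, UNROLLS, ITERATIONS)
print(f"{latency:.4f}")  # prints 10.0033
```

Using the median rather than the mean keeps the single 100066-cycle run from shifting the result.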

1000 unrolls and 10 iterations

Result (median cycles for code): 10.0033

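retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10028 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10025 | 100066 | 10023 | 21 | 10002 | 20 | 10015 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10
10024 | 100033 | 10021 | 21 | 10000 | 20 | 10000 | 70 | 1289069 | 10020 | 20 | 10000 | 20 | 10000 | 11 | 10000 | 10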

Test 3: throughput

Count: 8

Code:

  fsqrt s0, s8
  fsqrt s1, s8
  fsqrt s2, s8
  fsqrt s3, s8
  fsqrt s4, s8
  fsqrt s5, s8
  fsqrt s6, s8
  fsqrt s7, s8
  movi v8.16b, 9

(fused SUBS/B.cc loop)

100 unrolls and 100 iterations

Result (median cycles for code divided by count): 2.0005

retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80004 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80004 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80004 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 2000007 | 80125 | 200 | 80036 | 200 | 80004 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80004 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80004 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80004 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80036 | 1 | 80000 | 100
80204 | 160041 | 80101 | 101 | 80000 | 100 | 80000 | 300 | 1999769 | 80100 | 200 | 80004 | 200 | 80004 | 1 | 80000 | 100
80205 | 160082 | 80113 | 101 | 80012 | 100 | 80025 | 300 | 1999769 | 80100 | 200 | 80006 | 200 | 80004 | 1 | 80000 | 100
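The throughput figure is consistent with taking the median cycle count per pass over the code block and dividing by the count of 8 independent FSQRT instructions. The sketch below reproduces that arithmetic under stated assumptions: the cycle (02) values are read from the ten rows above (the digit split is an assumption), and the function name `cycles_per_instruction` is hypothetical rather than part of the original harness.

```python
from statistics import median

# cycle (02) readings for the ten runs of the 100-unroll, 100-iteration test,
# as read from the rows above (assumed digit split: nine runs of 160041 and
# one outlier of 160082 in the last run).
cycles = [160041] * 9 + [160082]

UNROLLS = 100
ITERATIONS = 100
COUNT = 8  # independent FSQRT instructions per copy of the code block


def cycles_per_instruction(samples, unrolls, iterations, count):
    """Median cycles for one copy of the code block, divided by count."""
    cycles_for_code = median(samples) / (unrolls * iterations)
    return cycles_for_code / count


reciprocal_throughput = cycles_per_instruction(cycles, UNROLLS, ITERATIONS, COUNT)
print(f"{reciprocal_throughput:.4f}")      # prints 2.0005
print(f"{1 / reciprocal_throughput:.2f}")  # prints 0.50 (instructions per cycle)
```

A result of about 2 cycles per instruction corresponds to roughly one scalar single-precision FSQRT completing every two cycles when the instructions are independent, against a dependent-chain latency of about 10 cycles in Test 2.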

1000 unrolls and 10 iterations

Result (median cycles for code divided by count): 2.0005

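retire uop (01) | cycle (02) | schedule uop (52) | schedule int uop (53) | schedule simd uop (54) | dispatch int uop (56) | dispatch simd uop (57) | int uops in schedulers (59) | ldst uops in schedulers (5b) | dispatch uop (78) | map int uop (7c) | map simd uop (7e) | map int uop inputs (7f) | map simd uop inputs (81) | ? int output thing (e9) | ? simd retires (ee) | ? int retires (ef)
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 2000007 | 80045 | 20 | 80036 | 20 | 80000 | 11 | 80000 | 10
80025 | 160082 | 80033 | 21 | 80012 | 20 | 80025 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80036 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10
80024 | 160041 | 80021 | 21 | 80000 | 20 | 80000 | 70 | 1999769 | 80020 | 20 | 80000 | 20 | 80000 | 11 | 80000 | 10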