SVE Instruction List by Dougall Johnson

See "FMMLA (widening, FP8 to FP16)" in the exploration tools

FMMLA (widening, FP8 to FP16): 8-bit floating-point matrix multiply-accumulate to half-precision

FMMLA Zda.H, Zn.B, Zm.B (SVE2+F8F16MM+NS

128-bit SVE

Within each 64-bit segment, interpreting the 8-bit floating-point values from (1) and (2) as 2-by-4 and 4-by-2 matrices respectively, and the 16-bit floats from (3) as a 2-by-2 matrix, multiply (1) by (2), add the resulting 2-by-2 matrix to (3), and write the result to (4). This is equivalent to a 4-way dot product per destination element. The FP8 format for each 8-bit source operand is selected independently by FPMR. See the documentation for the exact order of operations.
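The per-segment arithmetic described above can be sketched as a small reference model. This is only an illustration under stated assumptions, not the architectural pseudocode: the names `fmmla_segment` and `fmmla` are hypothetical, the FP8 inputs are assumed to have already been decoded to ordinary Python floats (so FPMR format selection is not modelled), (1) is assumed to be stored row by row and (2) column by column, and intermediate rounding and the exact order of operations are ignored.

```python
def fmmla_segment(a, b, c):
    """Hypothetical model of one 64-bit segment.

    a: 8 decoded FP8 values from (1), read as a 2-by-4 matrix, row by row (assumed)
    b: 8 decoded FP8 values from (2), read as a 4-by-2 matrix, column by column (assumed)
    c: 4 accumulator values from (3), read as a 2-by-2 matrix, row by row (assumed)
    Returns the 4 values that would be written to (4).
    """
    assert len(a) == 8 and len(b) == 8 and len(c) == 4
    out = []
    for i in range(2):        # row of the 2-by-4 matrix
        for j in range(2):    # column of the 4-by-2 matrix
            # 4-way dot product for destination element (i, j)
            dot = sum(a[4 * i + k] * b[4 * j + k] for k in range(4))
            out.append(c[2 * i + j] + dot)
    return out


def fmmla(zn, zm, zda):
    """Apply fmmla_segment independently to every segment of a vector."""
    out = []
    for s in range(len(zda) // 4):
        out += fmmla_segment(zn[8 * s:8 * s + 8],
                             zm[8 * s:8 * s + 8],
                             zda[4 * s:4 * s + 4])
    return out
```

For example, with a first matrix of rows (1, 2, 3, 4) and (5, 6, 7, 8), a second matrix whose columns are (1, 0, 0, 0) and (0, 1, 0, 0), and an accumulator of (10, 20, 30, 40), each destination element adds one selected entry of the first matrix to the accumulator.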

256-bit, 512-bit, 1024-bit and 2048-bit SVE

The operation is the same as for 128-bit SVE. It is applied independently within each 64-bit segment, so longer vectors only add more segments.

Inspired by and based on the x86/x64 SIMD Instruction List by Daytime.