# BFMMLA: BFloat16 floating-point matrix multiply-accumulate into 2×2 matrices

BFMMLA Zda.S, Zn.H, Zm.H (SVE+BF16+NS

svfloat32_t svbfmmla[_f32](svfloat32_t op1, svbfloat16_t op2, svbfloat16_t op3)

## 128-bit SVE

Within each 128-bit segment, interpreting the BFloat16 values from (1) and (2) as 2-by-4 and 4-by-2 matrices respectively, and the 32-bit floats from (3) as a 2-by-2 matrix, multiply (1) by (2), add the resulting 2-by-2 matrix to (3), and write the result to (4). See the documentation for the exact order of operations.
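The per-segment arithmetic described above can be sketched as a scalar reference model. This is a minimal sketch under stated assumptions, not the architectural definition: it assumes (1) and (2) are the `Zn`/`op2` and `Zm`/`op3` operands and (3) is the `Zda`/`op1` accumulator; it assumes the 2-by-4 matrix is stored row by row and the 4-by-2 matrix column by column, so that each output element is a dot product of four consecutive BFloat16 values from each source; and it uses ordinary `float` arithmetic, so it does not reproduce the instruction's exact rounding or order of operations. The names `bf16_to_f32` and `bfmmla_segment_ref` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen a raw BFloat16 bit pattern to float.
   BFloat16 occupies the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_f32(uint16_t bits)
{
    uint32_t wide = (uint32_t)bits << 16;
    float f;
    memcpy(&f, &wide, sizeof f);
    return f;
}

/* Hypothetical scalar model of one 128-bit segment.
   a:   8 BFloat16 values read as a 2x4 matrix, a[4*i + k] = row i, column k
   b:   8 BFloat16 values read as a 4x2 matrix, b[4*j + k] = row k, column j
        (assumed storage layout)
   acc: 4 floats read as a 2x2 matrix, acc[2*i + j], updated in place */
static void bfmmla_segment_ref(float acc[4],
                               const uint16_t a[8],
                               const uint16_t b[8])
{
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            float sum = acc[2 * i + j];
            /* Plain float accumulation; the real instruction's
               rounding and summation order may differ. */
            for (int k = 0; k < 4; k++) {
                sum += bf16_to_f32(a[4 * i + k]) * bf16_to_f32(b[4 * j + k]);
            }
            acc[2 * i + j] = sum;
        }
    }
}
```

With small integer inputs, which BFloat16 represents exactly, the model gives the same values as an ordinary 2-by-4 times 4-by-2 matrix product added to the accumulator.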

## 256-bit SVE

Within each 128-bit segment, interpreting the BFloat16 values from (1) and (2) as 2-by-4 and 4-by-2 matrices respectively, and the 32-bit floats from (3) as a 2-by-2 matrix, multiply (1) by (2), add the resulting 2-by-2 matrix to (3), and write the result to (4). See the documentation for the exact order of operations.
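For vectors wider than 128 bits, the same operation is applied to each 128-bit segment independently. The sketch below illustrates that for an arbitrary vector length, under the same assumptions as any scalar model of this instruction: (1) and (2) are taken to be `Zn` and `Zm`, (3) and (4) are taken to be `Zda`, the second matrix is assumed to be stored column by column, and plain `float` arithmetic stands in for the instruction's exact rounding behaviour. The names `bf16_bits_to_float` and `bfmmla_vector_ref` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: BFloat16 bit pattern -> float
   (BFloat16 is the top half of a binary32 value). */
static float bf16_bits_to_float(uint16_t bits)
{
    uint32_t wide = (uint32_t)bits << 16;
    float f;
    memcpy(&f, &wide, sizeof f);
    return f;
}

/* Hypothetical whole-vector model.
   vl_bits: vector length in bits, assumed to be a multiple of 128
   zn, zm:  vl_bits / 16 BFloat16 bit patterns each
   zda:     vl_bits / 32 floats, updated in place */
static void bfmmla_vector_ref(float *zda,
                              const uint16_t *zn,
                              const uint16_t *zm,
                              size_t vl_bits)
{
    size_t segments = vl_bits / 128;

    for (size_t s = 0; s < segments; s++) {
        /* Each segment holds 8 BFloat16 inputs per source and 4 floats
           of accumulator; no data crosses segment boundaries. */
        const uint16_t *a = zn + 8 * s;
        const uint16_t *b = zm + 8 * s;
        float *acc = zda + 4 * s;

        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                float sum = acc[2 * i + j];
                for (int k = 0; k < 4; k++) {
                    sum += bf16_bits_to_float(a[4 * i + k]) *
                           bf16_bits_to_float(b[4 * j + k]);
                }
                acc[2 * i + j] = sum;
            }
        }
    }
}
```

Calling `bfmmla_vector_ref` with `vl_bits` set to 256 processes two segments; the second segment's result depends only on elements 8 to 15 of each source and elements 4 to 7 of the accumulator.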

## 512-bit SVE

Within each 128-bit segment, interpreting the BFloat16 values from (1) and (2) as 2-by-4 and 4-by-2 matrices respectively, and the 32-bit floats from (3) as a 2-by-2 matrix, multiply (1) by (2), add the resulting 2-by-2 matrix to (3), and write the result to (4). See the documentation for the exact order of operations.

## 1024-bit SVE

Within each 128-bit segment, interpreting the BFloat16 values from (1) and (2) as 2-by-4 and 4-by-2 matrices respectively, and the 32-bit floats from (3) as a 2-by-2 matrix, multiply (1) by (2), add the resulting 2-by-2 matrix to (3), and write the result to (4). See the documentation for the exact order of operations.

## 2048-bit SVE

Within each 128-bit segment, interpreting the BFloat16 values from (1) and (2) as 2-by-4 and 4-by-2 matrices respectively, and the 32-bit floats from (3) as a 2-by-2 matrix, multiply (1) by (2), add the resulting 2-by-2 matrix to (3), and write the result to (4). See the documentation for the exact order of operations.

Inspired by and based on the x86/x64 SIMD Instruction List by Daytime.