Firestorm Overview

Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

This is an attempt at microarchitecture documentation for the CPU in the Apple M1, inspired by and building on the amazing work of Andreas Abel, Andrei Frumusanu, @Veedrac, Travis Downs, Henry Wong, Agner Fog and Maynard Handley. This documentation is my best effort, but it is based on black-box reverse engineering, and there are definitely mistakes. No warranty of any kind (and not just as a legal technicality). To make it easier to verify the information and/or identify such errors, entries in the instruction tables link to the experiments and results (~35k tables of counter values).

Firestorm is the high-performance microarchitecture used by the four P-cores in the M1.

Firestorm Pipeline Overview

As instructions are executed, they are mapped to operations inside the processor. Typically, an instruction is decoded to one or more uops, each uop is mapped and renamed and placed into a dispatch queue. These uops are then dispatched from the queue to a scheduler. When the uop's input operands are ready, it issues from the scheduler, and is executed. After execution, it is marked as completed in the reorder buffer. Completed entries in the reorder buffer are then released in order.

This work focuses on measuring each instruction's uops and issues. Uops count towards the "pipeline width" limit of 8 uops per cycle, and are measured using the RETIRE_UOP counter. Issues count towards the "execution unit" limits (one issue per unit per cycle), and can be measured using the documented SCHEDULE_UOP counter. Three undocumented counters measure issues to the integer, load/store, and simd units separately, so these values are provided in the instruction tables.

All instructions have at least one uop, and most instructions have the same number of issues as uops, but some have fewer (e.g. nop, ldp, and fused instructions). On the other hand, some instructions have more issues than uops, e.g ALU operations with an optional shift or extend (which allows Firestorm to sustain 11 issues per cycle in contrived cases).

Firestorm Units

I use the name "units", but these may also be refered to as "ports" or "pipes".

Integer units:

  1:  alu + ubfm/sbfm + flags + branch + adr + msr/mrs nzcv + mrs
  2:  alu + ubfm/sbfm + flags + branch + adr + msr/mrs nzcv + indirect branch + ptrauth
  3:  alu + ubfm/sbfm + flags + mov-from-simd/fp?
  4:  alu + ubfm/sbfm + mov-from-simd/fp?
  5:  alu + ubfm/sbfm + mul + div
  6:  alu + ubfm/sbfm + mul + madd + crc + bfm/extr

Load/store units (up to 128-bit loads and stores, including address generation with shifts up to LSL #3):

  7:  store + amx
  8:  load/store + amx
  9:  load
  10: load

SIMD units:

  11: fp/simd
  12: fp/simd
  13: fp/simd + fcsel + to-gpr
  14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + frecpe + frsqrte + fjcvtzs + ursqrte + urecpe + sha

Instruction Fusion

Certain instructions are able to issue as one uop if they appear consecutively in the instruction stream.

adds/subs/ands/cmp/tst + b.cc (complete fusion when fused instructions read no more than 4 registers per 6 instructions)
add/sub/and/orr/eor/bic/etc. + cbnz/cbz (usually fused, if destination matches cbz operand. also works with instruction variants that set flags)
aese + aesmc (usually fused if operands match pattern "A, B ; A, A")
aesd + aesimc (usually fused if operands match pattern "A, B ; A, A")
aese/aesd + eor (usually fused if operands match pattern "A, B ; A, A, C" or "A, B ; A, C, A")
pmull + eor (usually fused if operands match pattern "A, B, C ; A, A, D" or "A, B, C ; A, D, A")
amx + amx (excluding loads and stores - probably fuses to something like a stp)

Branch fusion does not work with implicit shift or extend, nor instructions that read flags (like adc)

Other tested patterns are not fused, including adrp + add, mov + movk, mul + umulh, and udiv + msub.

Elimination

Certain instructions do not need to issue:

mov x0, #0 (handled by renaming, as is 32-bit version)
mov x0, x1 (usually handled by renaming, 64-bit version only)
mov x0, #123 (handled by renaming at a max of 2 per 8 instructions, both 32-bit and 64-bit. Includes all tested immediate "mov" aliases e.g. bitwise/movz/movn)
movi v0.16b, #0 (handled by renaming, including other types)
mov v0.16b, v1.16b (usually handled by renaming. Includes other full-width types e.g. v0.8h but not v0.8b)
nop (never issues)
b (unconditional branch never issues)

Other tested instructions are not eliminated, including adr/adrp, mov w0, w1 and mov x0, xzr.

Complex Latencies

Several instructions have latencies that aren't adequately described in the instruction tables:

MADD's output can be passed to its third operand (the addend) with 1c latency, but if it's chained with other instructions it has 3c latency.
Loads may be passed to the base address of other loads with 3c latency (which is nice for linked lists), but chaining with ALU operations gives a latency of 4c. (Although the second destination register in an LDP always has 4c latency.)
Loads may have an extra 1-cycle of latency when the index register is the output of a shift-like instruction. Addition and bitwise operations do not have this penalty. Other latency chains have not yet been tested.
Integer to SIMD/FP roundtrip latency can be as low as 7c (e.g. when chaining flags operations).

Other limits

Firestorm has a pipeline width of eight instructions per cycle.

Uops per cycle: 8
Taken branches per cycle: 1
Coalesced retire queue size: ~330
In-flight renames: ~623
Integer physical register file size: ~380-394?
FP/SIMD physical register file size: ~432?
Flags physical register file size: ~128
In-flight branches: ~144
In-flight loads: ~130
In-flight stores: ~60

These numbers mostly come from my M1 buffer size measuring tool. See also my blog post Apple M1: Load and Store Queue Measurements.

Completion, and the Reorder Buffer (ROB)

Firestorm uses an unconventional reorder buffer, which I described as a ~330 entry "coalesced retire queue" and a ~623 "rename retire queue" (equivalent to a "Physical Register Reclaim Table").

Firestorm coalesces uops into retire groups, which all retire together. A retire group may contain up to seven uops. Uops which can fail before retiring (such as memory accesses) must appear at the start of a group, and uops that can fail after retiring (such as conditional branches) must appear at the end of a group. Groups of seven uops are only observed for eliminated instructions, such as nop and mov with issuing uops limited to roughly four per group. The coalesced retire queue consists of ~330 such groups. This allows an out-of-order window of just over 1000 (contrived) instructions that issue, or over 2200 nop instructions.

Any time an architectural register is written, that write must be retired (regardless of whether the instruction is eliminated). This is tracked in a separate structure, called the "rename retire queue", which allows for up to ~623 renames. (For some examples: cbz or str do not require entries, add and mov need one entry, and adds or ldp need two entries.)

I use retirement to refer to an instruction's entry in the reorder buffer being released. However, as I've described in a blog post, loads and stores may commit out-of-order if they are non-speculative. This has some interesting implications. In particular, once a store has "completed", the architectural state must have advanced past that point.

Retirement-rate is measured at up to eight coalesced retire queue entries per cycle, or up to sixteen rename retire queue entries per cycle.