Apple M1 Microarchitecture Research by Dougall Johnson
This is an early attempt at microarchitecture documentation for the CPU in the Apple M1, inspired by and building on the amazing work of Andreas Abel, Andrei Frumusanu, @Veedrac, Travis Downs, Henry Wong and Agner Fog. This documentation is my best effort, but it is based on black-box reverse engineering, and there are definitely mistakes. No warranty of any kind (and not just as a legal technicality). To make it easier to verify the information and/or identify such errors, entries in the instruction tables link to the experiments and results (~35k tables of counter values).
Firestorm is the high-performance microarchitecture used by the four P-cores in the M1.
The execution resources listed below are referred to as "units", to try to avoid confusion if Apple releases official documentation, since Apple probably refers to them as "ports" or "pipes" and orders them differently. (If this just causes more confusion, I apologise.)
Integer units:
1: alu + flags + branch + adr + msr/mrs nzcv + mrs
2: alu + flags + branch + adr + msr/mrs nzcv + ptrauth
3: alu + flags + mov-from-simd/fp?
4: alu + mov-from-simd/fp?
5: alu + mul + div
6: alu + mul + madd + crc + bfm/extr
Load and store units (up to 128-bit loads and stores, including address generation with shifts up to LSL #3):
7: store + amx
8: load/store + amx
9: load
10: load
FP/SIMD units:
11: fp/simd
12: fp/simd
13: fp/simd + fcsel + to-gpr
14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + frecpe + frsqrte + fjcvtzs + ursqrte + urecpe + sha
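As a rough illustration, the unit list above can be transcribed into a lookup table to ask which units accept a given operation class. This is only a sketch: the names `UNITS`, `units_for` and `peak_per_cycle` are hypothetical, the entries marked "?" in the list are left out, "load/store" on unit 8 is split into separate "load" and "store" capabilities, and the peak figure assumes each unit accepts at most one uop per cycle.

```python
# Unit capabilities transcribed from the list above.
# Assumptions: "?" entries are omitted; unit 8's "load/store" is split in two.
UNITS = {
    1: {"alu", "flags", "branch", "adr", "msr/mrs nzcv", "mrs"},
    2: {"alu", "flags", "branch", "adr", "msr/mrs nzcv", "ptrauth"},
    3: {"alu", "flags"},
    4: {"alu"},
    5: {"alu", "mul", "div"},
    6: {"alu", "mul", "madd", "crc", "bfm/extr"},
    7: {"store", "amx"},
    8: {"load", "store", "amx"},
    9: {"load"},
    10: {"load"},
    11: {"fp/simd"},
    12: {"fp/simd"},
    13: {"fp/simd", "fcsel", "to-gpr"},
    14: {"fp/simd", "fcsel", "to-gpr", "fcmp/e", "fdiv", "frecpe",
         "frsqrte", "fjcvtzs", "ursqrte", "urecpe", "sha"},
}


def units_for(op):
    """Return the sorted unit numbers whose capability set contains `op`."""
    return sorted(unit for unit, caps in UNITS.items() if op in caps)


def peak_per_cycle(op):
    """Upper bound on uops of class `op` per cycle, assuming one uop per unit per cycle."""
    return len(units_for(op))


print(units_for("mul"))        # units listed with "mul"
print(peak_per_cycle("load"))  # number of units listed with "load"
```

A table like this only reflects what the list says; it does not capture scheduler or dispatch constraints.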
Certain instructions are able to issue as one uop if they appear consecutively in the instruction stream.
b.cc (complete fusion when fused instructions read no more than 4 registers per 6 instructions)
cbz (usually fused, if destination matches cbz operand; also works with instruction variants that set flags)
aesmc (always fused if operands match pattern "A, B ; A, A")
aesimc (always fused if operands match pattern "A, B ; A, A")
eor (usually fused if operands match pattern "A, B, C ; A, A, D" or "A, B, C ; A, D, A")
amx (excluding loads and stores - probably fuses to something like a
Branch fusion does not work with implicit shift or extend, nor instructions that read flags (like
Other tested patterns are not fused, including
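The operand patterns quoted in the fusion list (for example "A, B ; A, A") can be checked mechanically. The sketch below matches the operand lists of two consecutive instructions against such a pattern. All names are hypothetical, and it assumes that distinct letters are allowed to bind to the same register, which the notes do not state either way. It only tests operand shape; it says nothing about whether the hardware actually fuses the pair.

```python
def matches_pattern(first_ops, second_ops, pattern):
    """Check two operand lists against a pattern such as "A, B ; A, A".

    Each letter must bind to the same register everywhere it appears.
    Assumption: different letters may bind to the same register.
    """
    first_pat, second_pat = (
        [tok.strip() for tok in half.split(",")] for half in pattern.split(";")
    )
    if len(first_ops) != len(first_pat) or len(second_ops) != len(second_pat):
        return False
    binding = {}
    for letter, reg in zip(first_pat + second_pat, list(first_ops) + list(second_ops)):
        # setdefault records the first register seen for this letter.
        if binding.setdefault(letter, reg) != reg:
            return False
    return True


# Patterns copied from the list above.
AES_PATTERN = "A, B ; A, A"
EOR_PATTERNS = ("A, B, C ; A, A, D", "A, B, C ; A, D, A")


def operands_allow_fusion(first_ops, second_ops, patterns):
    """True if the operand lists match at least one of the given patterns."""
    return any(matches_pattern(first_ops, second_ops, p) for p in patterns)
```

For instance, `operands_allow_fusion(("v0", "v1", "v2"), ("v0", "v3", "v0"), EOR_PATTERNS)` matches the second eor pattern, while a second instruction that writes a different destination register matches neither.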
Certain instructions do not need to issue:
mov x0, #0 (handled by renaming)
mov x0, x1 (usually handled by renaming)
movi v0.16b, #0 (handled by renaming)
mov v0.16b, v1.16b (usually handled by renaming)
mov x0, #123 (handled by renamer at a max of 2 per 8 instructions, includes all tested immediate "mov" aliases e.g. bitwise/movz/movn)
b (unconditional branch never issues)
Other tested instructions are not eliminated, including mov x0, xzr.
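To make the "max of 2 per 8 instructions" limit on immediate moves concrete, here is a toy count of how many of them would still need to issue. It is a sketch under a strong assumption that the notes do not confirm: the instruction stream is split into consecutive, aligned groups of eight, and the limit applies to each group independently. The function name and parameters are hypothetical.

```python
def issued_imm_movs(is_imm_mov, group=8, free_per_group=2):
    """Count immediate moves that still issue under the assumed grouping.

    is_imm_mov: list of booleans, one per instruction in program order,
                True where the instruction is an immediate "mov" alias.
    Assumption: groups are aligned, consecutive blocks of `group` instructions,
    and up to `free_per_group` immediate moves per block are eliminated.
    """
    issued = 0
    for start in range(0, len(is_imm_mov), group):
        in_group = sum(is_imm_mov[start:start + group])
        issued += max(0, in_group - free_per_group)
    return issued
```

Under this model a block of eight back-to-back immediate moves leaves six that issue, while two immediate moves mixed with six other instructions leave none.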
Several instructions have latencies that aren't adequately described in the instruction tables:
Firestorm can retire eight instructions per cycle, but can issue more uops than that (by using implicit shifts or extends on ALU operations; in every other case tested so far, the additional uops retire separately).
These numbers mostly come from my M1 buffer size measuring tool. The M1 seems to use something other than an entirely conventional reorder buffer, which complicates measurements a bit. So these may or may not be accurate. (This paragraph previously said "it seems to use something along the lines of a validation buffer". I think the VB hypothesis has since been disproven. Various attempts to measure ROB size have yielded values of 623, 853, and 2295 (see the previous link). My uninformed hypothesis is that this may imply a kind of distributed/coalesced reorder buffer, where only the structures that need to know about a given operation track it.)
My current ROB theory is the "Coalesced Retire Queue". Each "entry" can describe up to ~7 uops-that-retire (although this rate may only be hit for "nops" and eliminated mov instructions). Only one of these may be a load/store, and only one of these may be a branch instruction (probably the first and last respectively?). A separate "Rename Retire Queue" tracks in-flight renames, and each coalesced retire queue entry probably records how many entries to retire from the rename queue. Retirement rate is up to eight coalesced entries per cycle, and up to sixteen renames per cycle. This theory is probably not complete, but predicts ROB limits quite well. (Load and store buffers may be released before retirement, making it easy to observe the ~330 limit. Some noise is observed around these limits, which might be explained by variation in retire-group size/alignment, or might indicate problems with this theory.)
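The coalescing rule in this theory can be sketched as a greedy packer that counts how many entries a sequence of retiring operations would occupy. This is an illustration of the hypothesis only, not a measured model. It assumes a hard cap of 7 per entry (the text says "~7"), enforces at most one load/store and one branch per entry, and does not model the speculated first/last position constraints. The function name and the "mem"/"branch"/"other" labels are hypothetical.

```python
def count_retire_entries(kinds, max_per_entry=7):
    """Greedily pack retiring operations into coalesced entries.

    kinds: iterable of "mem", "branch" or "other", one per operation,
           in program order.
    Assumptions: at most `max_per_entry` operations per entry, at most one
    "mem" and at most one "branch" per entry, no position constraints.
    """
    entries = 0
    size = 0
    has_mem = False
    has_branch = False
    for kind in kinds:
        needs_new_entry = (
            entries == 0
            or size == max_per_entry
            or (kind == "mem" and has_mem)
            or (kind == "branch" and has_branch)
        )
        if needs_new_entry:
            entries += 1
            size = 0
            has_mem = False
            has_branch = False
        size += 1
        has_mem = has_mem or kind == "mem"
        has_branch = has_branch or kind == "branch"
    return entries
```

With this packing, a run of back-to-back loads consumes one entry each, while a run of nops consumes one entry per seven, which is the kind of difference that would make measured "ROB size" depend heavily on the instruction mix.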
I've posted to Twitter a WIP diagram with scheduler/dispatch queue sizes.
As instructions are executed, they are mapped to operations inside the processor. This work describes two kinds, "operations that retire" (which I call "retires", but should maybe be called "retirement slots"), and "operations that issue" (uops). Operations that issue are limited by how many ports are available that can execute that operation in a given cycle. Retirement slots are limited to eight per cycle, which I have called the "retires per cycle" limit, but likely corresponds to a frontend "decode width", rather than a limit related to retirement itself (which I suspect could be either out-of-order, or much wider). All instructions have at least one retire, but some instructions have more uops than retires (e.g. ADD (shift)), and others have fewer (e.g. NOP, LDP).
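The two limits described here combine into a simple lower bound on cycles for a block of code: it can go no faster than the eight-per-cycle retire limit allows, and no faster than the units of each type can accept its uops. The sketch below uses the 6 integer, 4 load/store and 4 FP/SIMD units from the unit list. The function name and the example counts are hypothetical, and the bound is optimistic because it pretends every unit of a type can run every uop of that type and ignores dependencies.

```python
from fractions import Fraction

RETIRES_PER_CYCLE = 8
INT_UNITS, MEM_UNITS, FP_UNITS = 6, 4, 4  # counts taken from the unit list


def min_cycles(retires, int_uops=0, mem_uops=0, fp_uops=0):
    """Optimistic lower bound on cycles for a block of code.

    Assumes one uop per unit per cycle and that any unit of a type can
    execute any uop of that type (not true for e.g. mul or fdiv).
    """
    return max(
        Fraction(retires, RETIRES_PER_CYCLE),
        Fraction(int_uops, INT_UNITS),
        Fraction(mem_uops, MEM_UNITS),
        Fraction(fp_uops, FP_UNITS),
    )


# Hypothetical block: 8 retires that expand to 12 integer uops
# is bound by the integer units rather than by retirement.
print(min_cycles(retires=8, int_uops=12))
```

Exact fractions are used so that bounds like 4/3 of a cycle per iteration are not blurred by floating-point rounding.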
These two types of operations can be measured, using the retire counter (counter 01, undocumented, shown in the Retire column in the tables), and the issue counter (counter 52, documented). The following three undocumented counters (53, 54, and 55) count the same kind of uops as the issue counter, but at a different point in the pipeline. These correspond to the Int, Mem and FP columns in the tables, and count uops that issue to units of the given type.
Finally, the Units column is based on finding conflicts when measuring throughput - if two uops block each other, we can tell they both use the same unit. (Unfortunately these experiments are not yet automated nor included in the documentation, so there may be some mistakes.)
To access the uop counters, I used a kernel module to bypass an allow-list in xnu. I do not recommend or support this process, but my code is available for reference.