Apple Microarchitecture Research by Dougall Johnson
M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm): Overview | Base Instructions | SIMD and FP Instructions
This is an attempt at microarchitecture documentation for the CPU in the Apple M1, inspired by and building on the amazing work of Andreas Abel, Andrei Frumusanu, @Veedrac, Travis Downs, Henry Wong, Agner Fog and Maynard Handley. This documentation is my best effort, but it is based on black-box reverse engineering, and there are definitely mistakes. No warranty of any kind (and not just as a legal technicality). To make it easier to verify the information and/or identify such errors, entries in the instruction tables link to the experiments and results (~35k tables of counter values).
Icestorm is the high-efficiency microarchitecture used by the four E-cores in the M1. Low-power ARM cores are generally a bit less novel, so the notes here are a bit less thorough.
These are referred to as "units" to try to avoid confusion if Apple releases official documentation, as Apple probably refers to them as "ports" or "pipes" and orders them differently. (If this just causes more confusion, I apologise.)
Integer units:
1: alu + br + mrs
2: alu + br + div + ptrauth
3: alu + mul + bfm + crc

Load and store units (up to 128-bit loads and stores, including address generation with shifts up to LSL #3):
4: load/store + amx
5: load

FP/SIMD units:
6: fp/simd
7: fp/simd + fdiv + to-int + div + recp + sqrt + sha + jcvtzs
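The unit list above can be read as a small lookup table. The sketch below is only an illustration: the dictionary layout and the helper name `units_for` are assumptions, and "load/store" on unit 4 is split into separate `load` and `store` entries as one reading of the list. The number of units that accept an operation is at best an upper bound on how many such operations can start per cycle; the measured throughputs in the instruction tables remain the authority.

```python
# Unit numbers and capability names are copied from the list above.
# The grouping into a nested dict and the helper below are illustrative
# assumptions, not part of the original documentation.
UNITS = {
    "integer": {
        1: {"alu", "br", "mrs"},
        2: {"alu", "br", "div", "ptrauth"},
        3: {"alu", "mul", "bfm", "crc"},
    },
    "load/store": {
        4: {"load", "store", "amx"},  # "load/store + amx" split into two ops
        5: {"load"},
    },
    "fp/simd": {
        6: {"fp/simd"},
        7: {"fp/simd", "fdiv", "to-int", "div", "recp", "sqrt", "sha", "jcvtzs"},
    },
}


def units_for(group, op):
    """Return the sorted unit numbers in `group` that list `op`."""
    return sorted(unit for unit, ops in UNITS[group].items() if op in ops)


if __name__ == "__main__":
    for group, op in [("integer", "alu"), ("integer", "br"),
                      ("load/store", "load"), ("load/store", "store"),
                      ("fp/simd", "fp/simd")]:
        print(f"{group:10s} {op:8s} -> units {units_for(group, op)}")
```

Read this way, the list suggests at most three simple ALU operations, two branches, two loads, one store and two FP/SIMD operations per cycle, before any other limit applies.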
Icestorm eliminates movz + movk (pair only, but any shift on both) and adr/adrp as well as the Firestorm things.
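For readers less familiar with the instructions involved, here is a minimal sketch of what a movz + movk pair computes architecturally. It models the A64 semantics only and says nothing about how the hardware implements elimination; the function names and the example constants are made up for illustration.

```python
MASK64 = (1 << 64) - 1
VALID_SHIFTS = (0, 16, 32, 48)


def movz(imm16, shift=0):
    """movz Xd, #imm16, lsl #shift: imm16 placed at `shift`, zero elsewhere."""
    assert 0 <= imm16 <= 0xFFFF and shift in VALID_SHIFTS
    return (imm16 << shift) & MASK64


def movk(old, imm16, shift=0):
    """movk Xd, #imm16, lsl #shift: replace one 16-bit field, keep the rest."""
    assert 0 <= imm16 <= 0xFFFF and shift in VALID_SHIFTS
    field = 0xFFFF << shift
    return ((old & ~field) | (imm16 << shift)) & MASK64


# movz x0, #0x5678 ; movk x0, #0x1234, lsl #16
low_pair = movk(movz(0x5678), 0x1234, 16)

# "Any shift on both": movz x0, #0xBEEF, lsl #32 ; movk x0, #0xDEAD, lsl #48
high_pair = movk(movz(0xBEEF, 32), 0xDEAD, 48)

if __name__ == "__main__":
    print(hex(low_pair))   # 0x12345678
    print(hex(high_pair))  # 0xdeadbeef00000000
```

The result of the pair depends only on the two immediates and shifts, never on an earlier register value, because the movk overwrites a field of the value the movz has just produced.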
Mostly the same as Firestorm. Icestorm also has movz + movk elimination, but still does not have adrp + add fusion (although the pair is one uop on account of adr/adrp elimination).
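As background, adrp + add is the usual two-instruction sequence for forming the address of a symbol: adrp produces the 4 KiB page base relative to the current PC, and add supplies the low 12 bits. The sketch below models that arithmetic only. The helper names and the example addresses are hypothetical, and nothing here reflects how Icestorm handles the pair internally.

```python
MASK64 = (1 << 64) - 1


def split_address(pc, target):
    """Split `target` into the page delta for adrp and the low 12 bits for add."""
    page_delta = (target >> 12) - (pc >> 12)
    # adrp encodes a signed 21-bit page offset, i.e. roughly +/- 4 GiB.
    assert -(1 << 20) <= page_delta < (1 << 20), "target out of adrp range"
    return page_delta, target & 0xFFF


def adrp(pc, page_delta):
    """adrp Xd, label: page base of pc plus a page-granular offset."""
    return ((pc & ~0xFFF) + (page_delta << 12)) & MASK64


def add_imm(value, imm12):
    """add Xd, Xn, #imm12 with an unshifted 12-bit immediate."""
    assert 0 <= imm12 <= 0xFFF
    return (value + imm12) & MASK64


def materialize(pc, target):
    """Model the pair `adrp x0, sym@PAGE ; add x0, x0, sym@PAGEOFF`."""
    page_delta, lo12 = split_address(pc, target)
    return add_imm(adrp(pc, page_delta), lo12)


if __name__ == "__main__":
    pc, sym = 0x100003F54, 0x100008123  # hypothetical addresses
    print(hex(adrp(pc, split_address(pc, sym)[0])))  # page base of sym
    print(hex(materialize(pc, sym)))                 # full address of sym
```

In this model the adrp result depends only on the PC and an immediate, which is consistent with the note above that the pair can end up as a single uop once the adrp is eliminated, leaving only the add to execute.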
Several instructions have latencies that aren't adequately described in the instruction tables:
These numbers mostly come from my M1 buffer size measuring tool. The M1 seems to use something other than an entirely conventional reorder buffer, which complicates measurements a bit. So these may or may not be accurate. (This paragraph previously said "it seems to use something along the lines of a validation buffer". I think the VB hypothesis has since been disproven.)