Apple Microarchitecture Research by Dougall Johnson

M1/A14 P-core (Firestorm): Overview | Base Instructions | SIMD and FP Instructions
M1/A14 E-core (Icestorm):  Overview | Base Instructions | SIMD and FP Instructions

This is an attempt at microarchitecture documentation for the CPU in the Apple M1, inspired by and building on the amazing work of Andreas Abel, Andrei Frumusanu, @Veedrac, Travis Downs, Henry Wong, Agner Fog and Maynard Handley. This documentation is my best effort, but it is based on black-box reverse engineering, and there are definitely mistakes. No warranty of any kind (and not just as a legal technicality). To make it easier to verify the information and/or identify such errors, entries in the instruction tables link to the experiments and results (~35k tables of counter values).

Icestorm is the high-efficiency microarchitecture used by the four E-cores in the M1. Low-power ARM cores are generally a bit less novel, so the notes here are a bit less thorough.

Icestorm Units (ports)

These are referred to as "units" to try to avoid confusion if Apple releases official documentation, as they will probably call them "ports" or "pipes" and order them differently. (If this just causes more confusion, I apologise.)

Integer units:

  1:  alu + br + mrs
  2:  alu + br + div + ptrauth
  3:  alu + mul + bfm + crc

Load and store units (up to 128-bit loads and stores, including address generation with shifts up to LSL #3):

  4: load/store + amx
  5: load

FP/SIMD units:

  6: fp/simd
  7: fp/simd + fdiv + to-int + div + recp + sqrt + sha + jcvtzs
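
As a rough illustration of how to read the unit lists above, here is a minimal Python sketch that transcribes them into a table and counts how many units accept a given operation. The dictionary layout, the helper names (units_for, issue_bound) and the splitting of "load/store" into separate "load" and "store" entries are my own assumptions for the example. The count is only a crude upper bound on per-cycle issue; the measured throughputs in the instruction tables are the numbers to trust.

```python
# Unit table transcribed from the lists above. The layout and helper names
# are illustrative assumptions, not part of the original notes.
ICESTORM_UNITS = {
    1: {"alu", "br", "mrs"},
    2: {"alu", "br", "div", "ptrauth"},
    3: {"alu", "mul", "bfm", "crc"},
    4: {"load", "store", "amx"},  # "load/store" split into two entries (assumption)
    5: {"load"},
    6: {"fp/simd"},
    7: {"fp/simd", "fdiv", "to-int", "div", "recp", "sqrt", "sha", "jcvtzs"},
}


def units_for(op):
    """Return the sorted unit numbers whose list mentions `op`."""
    return sorted(unit for unit, ops in ICESTORM_UNITS.items() if op in ops)


def issue_bound(op):
    """Crude upper bound on issues per cycle: the number of units listing `op`.

    This ignores every other limit (decode width, scheduling, dependencies),
    so it can only over-estimate real throughput.
    """
    return len(units_for(op))
```

For example, units_for("alu") gives units 1, 2 and 3, while issue_bound("mul") is 1 because only unit 3 lists mul. Note that "div" appears in both the integer list and the FP/SIMD list, so a lookup by that name returns a unit from each group.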

Elimination

In addition to everything Firestorm eliminates, Icestorm eliminates movz + movk (as a pair only, but with any shift on both instructions) and adr/adrp.
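
For readers less familiar with the pair, the following Python sketch models the architectural result of a 64-bit movz followed by a movk. It only shows what value the two instructions produce together; it says nothing about how Icestorm detects or eliminates the pair. The function names and the MASK64 constant are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1
VALID_SHIFTS = (0, 16, 32, 48)  # LSL amounts allowed for 64-bit movz/movk


def movz(imm16, shift=0):
    """MOVZ Xd, #imm16, LSL #shift: place imm16 at `shift`, zero all other bits."""
    assert 0 <= imm16 <= 0xFFFF and shift in VALID_SHIFTS
    return (imm16 << shift) & MASK64


def movk(reg, imm16, shift=0):
    """MOVK Xd, #imm16, LSL #shift: replace one 16-bit field, keep the rest."""
    assert 0 <= imm16 <= 0xFFFF and shift in VALID_SHIFTS
    field = 0xFFFF << shift
    return ((reg & ~field) | (imm16 << shift)) & MASK64


# movz x0, #0x5678 ; movk x0, #0x1234, lsl #16
low_pair = movk(movz(0x5678), 0x1234, 16)

# Shifts on both instructions: movz x0, #0xAAAA, lsl #48 ; movk x0, #0xBBBB
shifted_pair = movk(movz(0xAAAA, 48), 0xBBBB, 0)
```

Here low_pair is 0x12345678 and shifted_pair is 0xAAAA00000000BBBB, which is the kind of two-instruction constant the "any shift on both" remark refers to.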

Instruction Fusion

Mostly the same as Firestorm. Icestorm also has movz + movk elimination, but still does not fuse adrp + add (although the pair is one uop on account of adr/adrp elimination).
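
As a reminder of what the adrp + add pair computes, here is a small Python sketch of its architectural effect: adrp produces a 4 KiB page base relative to the program counter, and add supplies the low 12 bits. The helper names (adrp, add_imm, materialize) and the example addresses are hypothetical, and this models only the result, not the uop behaviour described above.

```python
MASK64 = (1 << 64) - 1


def adrp(pc, page_delta):
    """ADRP Xd, label: page base of `pc` plus a signed offset counted in 4 KiB pages."""
    return ((pc & ~0xFFF) + (page_delta << 12)) & MASK64


def add_imm(reg, imm12):
    """ADD Xd, Xn, #imm12 with an unshifted 12-bit immediate."""
    assert 0 <= imm12 <= 0xFFF
    return (reg + imm12) & MASK64


def materialize(pc, target):
    """Compute `target` the way an adrp + add pair located at `pc` would."""
    page_delta = (target >> 12) - (pc >> 12)  # what the assembler encodes in adrp
    lo12 = target & 0xFFF                     # what the assembler encodes in add
    return add_imm(adrp(pc, page_delta), lo12)


# Hypothetical addresses, chosen only to exercise a forward and a backward reference.
forward = materialize(0x100003F58, 0x100008010)
backward = materialize(0x100008010, 0x100003F58)
```

Both calls return their target address unchanged, showing that the pair splits one address into a page part (adrp) and an in-page part (add).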

Complex Latencies

Several instructions have latencies that aren't adequately described in the instruction tables:

Other limits

These numbers mostly come from my M1 buffer size measuring tool. The M1 seems to use something other than an entirely conventional reorder buffer, which complicates measurements a bit. So these may or may not be accurate. (This paragraph previously said "it seems to use something along the lines of a validation buffer". I think the VB hypothesis has since been disproven.)