Apple’s AFM 3 Core Advanced makes 20B on-device models work by routing per prompt

The flash-based architecture breaks the DRAM bottleneck, but offload visibility is still a compliance question.

By Yousef Al-Zahrani, Technology Correspondent, The Executives Brief

about 2 months ago · 5 min read

Executive summary

Apple’s third-generation foundation models announced at WWDC26 include AFM 3 Core Advanced, a 20-billion-parameter on-device model that stores weights in NAND flash instead of DRAM. For enterprise teams evaluating agentic workloads, the limiting factor shifts from memory-bound capability to architecture, privacy boundaries, and offload transparency.

If you have ever tried to plan serious on-device AI, you’ve hit the same hard wall: weights have to sit in DRAM, and that caps practical model sizes far below what you can deploy on servers. Apple’s WWDC26 answer is AFM 3 Core Advanced, a 20-billion-parameter model that stores its full weight set in NAND flash rather than DRAM, and then makes routing decisions per prompt.

That “per prompt” detail is the entire point. Apple says standard architectures cannot swap weights token by token because NAND-to-DRAM bandwidth is too slow to keep up with generation. So AFM 3 Core Advanced routes once at prompt time, picks which expert set to load, moves only what’s needed into DRAM, and then generates all tokens using the same configuration. In other words, Apple is working around the memory problem by changing when the router runs and how frequently weights move, not by pretending consumer hardware can magically behave like a datacenter.

To understand why this matters, zoom out to how on-device models typically get constrained. The basic expectation has been: if you want a large model, the active weights must be available quickly. DRAM is fast, so models store weights there. But DRAM capacity is limited in consumer devices, which pushes practical parameter counts down. Meanwhile, server-side deployments can lean on different hardware realities, so the industry’s big “agentic” ambitions have often defaulted to cloud dependence, where latency and throughput trade-offs can be engineered at scale.

Apple’s AFM 3 family is structured to straddle both worlds. The lineup spans five models: two on-device and three server-based, all running within Apple’s Private Cloud Compute boundary. The server-side models include AFM 3 Cloud Pro, which handles agentic tool use and complex reasoning and runs on Nvidia GPUs in Google Cloud. The on-device tier uses Apple’s own architecture. AFM 3 Core Advanced is the spotlight here, and it stores the entire parameter set in NAND flash rather than active memory.

Apple’s research team laid out the logic directly: “Instead of forcing the entire model into DRAM, the full model is stored in flash memory.” They add that NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, which is why AFM 3 Core Advanced makes routing decisions per prompt. External researchers also described the core issue in simpler terms. Awni Hannun, a researcher at Anthropic and former Apple research scientist, posted on X that you cannot put 20B parameters in RAM at any reasonable precision, and that the approach uses a small model to predict from the query which experts to load from NAND into RAM.
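The sizing argument can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 20B total comes from the article, while the weight precisions, the 1B-parameter swap size, and the 3 GB/s flash read rate are assumptions chosen for the example, not figures Apple has published.

```python
# Back-of-envelope sizing. The 20B total comes from the article;
# the precisions and the flash bandwidth below are illustrative assumptions.

TOTAL_PARAMS = 20e9  # from the article


def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Storage needed for `params` weights at the given precision."""
    return params * bits_per_weight / 8


def transfer_seconds(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Time to move `num_bytes` at a sustained rate in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gigabytes = weight_bytes(TOTAL_PARAMS, bits) / 1e9
        print(f"{bits}-bit weights: {gigabytes:.0f} GB")

    # Hypothetical swap: 1B parameters at 4-bit over an assumed 3 GB/s flash read.
    swap = transfer_seconds(weight_bytes(1e9, 4), bandwidth_gb_s=3.0)
    print(f"one 1B-parameter swap: {swap:.3f} s")
```

Under those assumed numbers, the full weight set needs 40 GB, 20 GB, or 10 GB at 16, 8, or 4 bits, and a single 1B-parameter swap costs roughly a sixth of a second. Paying that once per prompt is tolerable. Paying it on every generated token is not.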

Technically, the mechanism has a few moving parts shaped by the hardware constraints. First, the full 20B weight set lives in flash, not DRAM. Second, DRAM becomes a working buffer for whichever experts a prompt requires. Third, expert routing happens once per prompt, not per token. That last part is the sharp break from typical Mixture of Experts behavior: a conventional MoE router selects different experts for each token generated. Doing that continuously would require ongoing weight movement between flash and DRAM at inference speed, which the NAND-to-DRAM bandwidth cannot support. Apple’s approach selects a fixed expert set at prompt time, loads it into DRAM alongside always-active shared experts, and then generates all tokens from that same configuration.
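The prompt-time routing pattern described above can be sketched as a toy model. This is not Apple’s implementation: the class name, the pool size of 16 experts, the working set of 4, and the byte-sum “router” are all hypothetical stand-ins, used only to show that flash reads stay fixed no matter how many tokens are generated.

```python
from dataclasses import dataclass, field


@dataclass
class FlashBackedMoE:
    """Toy model: every expert lives in flash; DRAM holds only a working set."""

    num_experts: int = 16          # hypothetical pool size
    experts_per_prompt: int = 4    # hypothetical working-set size
    dram: set = field(default_factory=set)
    flash_reads: int = 0

    def route(self, prompt: str) -> list:
        """Stand-in router: a deterministic pick derived from the prompt bytes."""
        seed = sum(prompt.encode("utf-8"))
        picks = {(seed + i) % self.num_experts for i in range(self.experts_per_prompt)}
        return sorted(picks)

    def load_from_flash(self, expert_ids) -> None:
        """Copy experts into DRAM, counting each flash read."""
        for expert_id in expert_ids:
            if expert_id not in self.dram:
                self.dram.add(expert_id)
                self.flash_reads += 1

    def generate(self, prompt: str, num_tokens: int) -> list:
        """Route once at prompt time, then decode every token from the same experts."""
        self.dram.clear()
        self.dram.add("shared")        # always-active shared experts stay resident
        experts = self.route(prompt)   # the only routing decision for this prompt
        self.load_from_flash(experts)
        return [(step, tuple(experts)) for step in range(num_tokens)]


model = FlashBackedMoE()
trace = model.generate("summarize this contract", num_tokens=50)
print(model.flash_reads)  # 4: one read per selected expert, independent of token count
```

A conventional per-token router would call `route` inside the decoding loop, so the number of flash reads could grow with the length of the output. Here it is bounded by `experts_per_prompt`.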

AFM 3 Core Advanced also adjusts active parameter count to task complexity. Rather than activating a fixed number of parameters, it draws between 1 billion (for simpler operations) and 4 billion (for harder ones) from the 20-billion-parameter pool stored in flash. That matters for performance planning, because it suggests on-device compute demand is not constant across agentic workloads. For enterprises, the shift is from “choose a model size your device can store” to “predict which tasks will trigger heavier activation and where routing sends the workload.”
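For capacity planning, the range can be turned into a rough estimator. The 1B, 4B, and 20B figures are from the article. The complexity score and the linear interpolation between the two endpoints are assumptions made for illustration, since Apple has not described how the active count is chosen.

```python
POOL_PARAMS = 20e9   # from the article
MIN_ACTIVE = 1e9     # from the article
MAX_ACTIVE = 4e9     # from the article


def active_params(complexity: float) -> float:
    """Map a hypothetical complexity score in [0, 1] to an active parameter count.

    Linear interpolation is an assumption made for illustration only.
    """
    clamped = min(1.0, max(0.0, complexity))
    return MIN_ACTIVE + clamped * (MAX_ACTIVE - MIN_ACTIVE)


def active_fraction(complexity: float) -> float:
    """Share of the flash-resident pool that is active for this request."""
    return active_params(complexity) / POOL_PARAMS


for label, score in [("simple", 0.0), ("moderate", 0.5), ("hard", 1.0)]:
    billions = active_params(score) / 1e9
    print(f"{label}: {billions:.1f}B active, {active_fraction(score):.1%} of pool")
```

Whatever the real selection rule turns out to be, the endpoints are fixed by the published numbers: a request activates between 5% and 20% of the pool, a fourfold spread in on-device compute that workload estimates need to account for.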

There is also a gap in what Apple has publicly disclosed, and it is the kind of gap that turns into paperwork later. The architecture is detailed on memory design and the sparse activation mechanism, but it is less forthcoming on practical deployment constraints. Apple’s profiling tools expose timing but not the metrics that decide production viability. Marco Abis, who is building Ziraph, a profiler for local AI on Apple silicon, posted on X: “Energy, memory bandwidth, thermal? Not in the docs.” He also found no statement, across the Core AI docs, the Foundation Models docs, or the Private Cloud Compute security post, of when an on-device request transparently offloads, or of whether that routing is visible to the developer or the user.

For regulated industries, that is not a minor detail. If inference location affects compliance, policy documentation can become impossible without clarity on offload behavior. Apple has indicated a full technical report with benchmarks is coming later this summer, which should help close at least the performance side. What is clear already is the architectural decision enterprises now have to make: simpler requests could stay on-device, while complex agentic tasks route to AFM 3 Cloud Pro running on Nvidia GPUs in Google Cloud via Private Cloud Compute. Private Cloud Compute provides a privacy guarantee, but it does not remove Google Cloud as a dependency for server-side inference. So the DRAM bottleneck is no longer the only limiting factor. Hardware, routing rules, and transparency around offload now sit at the center of deployment decisions.
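One way a compliance team might write that line down is as an explicit policy table that it maintains itself. The sketch below is entirely hypothetical: the task classes, the `Location` values, and the `check` helper are illustrative names, not part of any Apple API. It also presumes the inference location can be observed at all, which, as noted above, Apple’s documentation does not currently state.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    ON_DEVICE = "on_device"
    PRIVATE_CLOUD = "private_cloud"


@dataclass(frozen=True)
class PolicyRule:
    task_class: str      # hypothetical label, e.g. "summarization"
    allowed: frozenset   # locations the compliance team has approved


def check(rules: dict, task_class: str, observed: Location) -> bool:
    """Return True if the observed inference location is approved for the task class.

    Unknown task classes fail closed.
    """
    rule = rules.get(task_class)
    return rule is not None and observed in rule.allowed


# Hypothetical policy: summaries must stay local, agentic work may offload.
RULES = {
    rule.task_class: rule
    for rule in [
        PolicyRule("summarization", frozenset({Location.ON_DEVICE})),
        PolicyRule(
            "agentic_tool_use",
            frozenset({Location.ON_DEVICE, Location.PRIVATE_CLOUD}),
        ),
    ]
}

print(check(RULES, "summarization", Location.PRIVATE_CLOUD))  # False
```

The table is only as useful as the signal feeding it. Without a documented way to learn where a given request ran, the `observed` argument has no reliable source, which is the gap regulated teams would need Apple to close.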

In other words, WWDC26 did not just “make models bigger.” It changed the boundary. The on-device story is now about flash-backed sparsity and prompt-time routing. The enterprise story is now about where the line is drawn between local inference and private cloud inference, and whether teams can document that line well enough to pass internal and external scrutiny. If you are building or underwriting agentic systems, that is the new reckoning: capability no longer ends at DRAM. It begins with architecture you can explain.
