Investigating the GPU Neural Accelerators on Apple A19/M5

Taras Zakharko, 15. October 2025

On 19. September 2025, Apple released the iPhone 17 family. A particular highlight was the new A19 chips with "GPU Neural Accelerators", advertised as delivering up to 4x higher GPU AI compute compared to earlier hardware. This marks the first appearance of dedicated matrix multiplication acceleration hardware on Apple GPUs and raises the question of what these units are capable of and what they mean for the future of machine learning applications on Apple platforms.

I was able to run some initial microbenchmarks on an iPhone 17 (A19 chip with 5 GPU cores) and will present my findings here. These results should also translate to the new M5 series chips. Note that my focus is on the peak performance of the Neural Accelerator hardware, without considering the problem of moving the data between the memory and the GPU. Real-world application performance might differ significantly. Apple also offers additional software frameworks that might make use of Neural Accelerators (e.g., Core ML), which I do not take into account here — I am only concerned with what is achievable using Metal Shading Language.

Key Takeaways:

| Operation | Measured (A19, 5 GPU cores) | Speedup over previous gen. (iso-clock) | Estimated M5 Max (40 cores, ~1750 MHz) |
| --- | --- | --- | --- |
| SIMD FP32 | 1880 GFLOPS | 1x | 18 TFLOPS |
| SIMD FP16 | 3200 GFLOPS | ~1.7x | 30 TFLOPS |
| Matrix FP16 (accumulating into FP16/FP32) | 7500 GFLOPS | ~4x | 70 TFLOPS |
| Matrix INT8 (accumulating into INT32) | 13500 GFLOPS | ~7.5x | 130 TFLOPS |
  • Estimated 1024 FLOPS per GPU core per cycle for FP16 matrix-multiply accumulate
  • Estimated up to ~2048 OPS per GPU core per cycle for INT8 (with caveats)
  • Optimal matrix size appears to be 32 \(\times\) 32 or larger for all tested data formats
  • Matrix transpose appears to be directly supported by the hardware and incurs no additional performance cost

A note on terminology

GPU matrix acceleration technology was first popularized by Nvidia, which dubbed its accelerator "Tensor Cores". Since then, the term became a colloquial reference for any kind of GPU matrix multiplication acceleration hardware. For the sake of clarity, I will talk about "matrix multiplication units" in general or "Neural Accelerators" to refer specifically to Apple's implementation.

Another common point of confusion is how GPU specs are advertised. Most GPU vendors count the number of individual scalar FP32 arithmetic pipelines (e.g., "CUDA cores", "shader units", or similar). Apple instead uses "GPU cores", which refer to minimal replicable building blocks. A single Apple GPU core contains 128 FP32 "shader cores" and is roughly comparable to a "Streaming Multiprocessor" (SM) in an Nvidia Blackwell GPU. Similarly, the GPU performance is typically advertised as the theoretically attainable peak throughput using FP32 multiply-add operations. Real-world performance is often substantially different (usually lower) and depends on the workload and the details of the specific GPU architecture.

Apple GPU Architecture

Modern GPUs are massively parallel processors that rely on wide SIMD (single instruction, multiple data) execution to maximize the compute density, and Apple GPUs are no exception. The native execution width of Apple hardware is 32, with arithmetic instructions operating on 32 data elements in parallel. The programming model presents this as multiple threads running the same shader code but potentially different data, with groups of 32 threads executing in lockstep and able to communicate (e.g., read each other's data) directly. Apple Metal refers to these execution groups as SIMD-groups (other mainstream designations include "wavefronts" and "warps"). Multiple SIMD-groups can be organized into threadgroups that have access to fast shared memory but require explicit synchronization. Finally, multiple threadgroups can be launched as part of the same compute kernel, but cannot directly synchronize their work and need to communicate using global memory and atomic memory operations.
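As a concrete illustration of this execution hierarchy, here is a minimal, self-contained MSL kernel (not taken from the article's benchmark code; the kernel name and buffer layout are made up for the example) that combines direct SIMD-group communication with threadgroup shared memory:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void hierarchical_sum(device const float* input       [[buffer(0)]],
                             device float*       partialSums [[buffer(1)]],
                             uint gid     [[thread_position_in_grid]],
                             uint tgid    [[threadgroup_position_in_grid]],
                             uint lane    [[thread_index_in_simdgroup]],
                             uint sgIdx   [[simdgroup_index_in_threadgroup]],
                             uint sgCount [[simdgroups_per_threadgroup]])
{
    // Step 1: the 32 threads of a SIMD-group reduce their values in lockstep,
    // exchanging data directly without touching memory.
    float value = simd_sum(input[gid]);

    // Step 2: one partial result per SIMD-group goes through fast threadgroup
    // (shared) memory; this requires an explicit barrier.
    threadgroup float scratch[32];            // at most 32 SIMD-groups per threadgroup
    if (lane == 0) { scratch[sgIdx] = value; }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Step 3: the first SIMD-group combines the per-SIMD-group results.
    if (sgIdx == 0) {
        float partial = (lane < sgCount) ? scratch[lane] : 0.0f;
        float total = simd_sum(partial);
        if (lane == 0) { partialSums[tgid] = total; }
    }
}
```

Note that different threadgroups cannot synchronize with each other, which is why the kernel writes one partial sum per threadgroup to device memory rather than a single total.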

Figure 1. Apple Silicon GPU Architecture Diagram (simplified)

This logical organization is directly reflected in the hardware organization (Figure 1). The basic building block is a compute partition, responsible for executing shader programs. Each compute partition consists of instruction dispatch circuitry, execution circuitry (SIMD units), a register cache, and additional supporting hardware and state (e.g., raytracing and texture sampling, omitted here for brevity). The execution circuitry consists of multiple physically distinct SIMD execution datapaths that handle instructions of different types. Experimental evidence and Apple WWDC presentations suggest that these include at least a full-precision floating-point FMA pipe, a half-precision floating-point FMA pipe, and an integer ADD pipe, plus potentially others. Blocks of four compute partitions are organized into GPU cores and share the fast local cache memory. A threadgroup is executed on a single GPU core, using the four compute partitions to provide within-threadgroup parallelism. Furthermore, each GPU core will execute instructions from several different threadgroups concurrently, switching between threads every cycle to maximize hardware occupancy and hide execution latencies. Therefore, the GPU relies on constantly being fed work in order to maintain high execution efficiency.

From the compute perspective, each partition therefore contains 32 FP32 multiply-accumulate units and is capable of performing 64 FLOPS per cycle (32 combined multiplications+additions). For a GPU core, this means 128 FP32 FMA units or 256 FLOPS per cycle. Note that this is only a theoretical maximum for FP32 performance, and actual shader performance can differ significantly depending on the instruction mix and memory access patterns. With newer Apple GPUs, the achievable performance can also occasionally be higher due to the ability to issue multiple instructions simultaneously (see below).
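As a quick sanity check of these per-cycle numbers, the SIMD FP32 figure from the key takeaways table can be reproduced from 5 GPU cores, 128 FMA units per core, 2 FLOPS per FMA, and the estimated 1.46 GHz clock (see footnote 1):

\[ 5 \times 128 \times 2 \times 1.46\,\text{GHz} \approx 1870\ \text{GFLOPS}, \]

which matches the measured ~1880 GFLOPS.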

While the basic GPU core organization remained unchanged at least since A14/M1 (perhaps even earlier), there have been significant changes to the compute units and memory hierarchy over the years. Apple Family 9 GPUs (featured in A17 Pro/M3/M4 series of chips) introduced two major improvements to shader execution. The first is the ability to issue two instructions per clock to different datapaths from a single compute partition. For example, full-precision (FP32) instructions can be issued simultaneously with half-precision (FP16) or integer (INT) instructions, effectively improving runtime performance. The second is the introduction of dynamic shader memory management (which Apple dubs "Dynamic Caching") that allows all memory resources, including registers, shared memory, texture memory, and more, to be backed by the L2 cache and allocated dynamically. This feature can improve hardware occupancy for complex shaders that have different resource needs depending on some runtime parameter (e.g., a branch condition). For more information, refer to the Explore GPU advancements in M3 and A17 Pro (Apple Tech Talk).

Apple A19/M5 GPUs introduce additional major changes to the compute units. Here is an incomplete list of new features I know of:

  • New matrix multiplication datapath with support for FP16 and INT8 data formats (Neural Accelerators)
  • Two half-precision (FP16) instructions can be issued concurrently, doubling the FP16 compute
  • The throughput of INT32 multiplication has been doubled compared to earlier generations
  • The throughput of some special operations (log/exp/popcnt) has been doubled

At the same time, the clock frequency of the GPU has only moderately increased over the years. The original M1 GPU was clocked at around 1300 MHz, while I estimate the A19 GPU frequency to be around 1460 MHz (and the M5 will probably be around 1500 - 1600 MHz)1. This means that the majority of the substantial performance improvements stem from architectural changes and increased execution efficiency.

Neural Accelerators

Apple does not disclose any information about the architecture of the Neural Accelerator hardware. Some information might, however, be inferred from patents published in 2025.

In particular, Fig. 2 and 3 from the WO2025071810 patent describe a dot product circuit that operates on four pairs of values per data lane, and the other patent describes a routing network for efficiently transposing matrix elements in order to supply such a circuit with data. The benchmark results are consistent with the idea of a 32-wide 4-way hardware dot product datapath that has an effective throughput of 128 half-precision multiply-accumulate operations each cycle. The dot product throughput for each GPU core is therefore 512 FP16 FMAs per cycle or 1024 FLOPS per cycle.
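Spelled out as a back-of-the-envelope calculation (assuming the 32-wide, 4-way dot product interpretation holds):

\[ 32\ \text{lanes} \times 4\ \text{MACs per lane} = 128\ \text{FMAs per partition per cycle}, \qquad 128 \times 4\ \text{partitions} \times 2\ \text{FLOPS per FMA} = 1024\ \text{FLOPS per core per cycle}. \]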

Regardless of the implementation details, the matrix multiplication hardware is not directly exposed to the developers writing Metal Shading Language kernels. As of Xcode 26.1, the only supported way to access it is by using the high-level Metal Performance Primitives (MPP) framework in combination with the new Metal Tensor APIs. These frameworks provide C++ templates for configuring and executing accelerated operations on tensors, currently including matrix multiplication and convolution. The matrix multiplication operation is performed on matrices of dimensions \(M \times K\) and \(K \times N\), producing a matrix of dimensions \(M \times N\)2. Compile-time parameters can be used to configure the behavior of matrix multiplication; an overview is provided in Table 1.

| Parameter | Explanation | Notes |
| --- | --- | --- |
| Matrix dimensions | Constants \(M\), \(N\), and \(K\) | \(K\) can be inferred at runtime from the tensor shape |
| Matrix transpose | Flags indicating whether to transpose the input matrices | |
| Precision | Flag indicating whether reduced precision is acceptable | Appears to have no effect on A19 |
| Operation type | Multiply with or without accumulation | Accumulation updates the output tensor in place |
| Execution scope | Thread, SIMD-Group, or multiple SIMD-Groups | A single SIMD-group appears to work best on current implementation |

Table 1. MPP Matrix Multiplication Configuration as Compile-Time Constants
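To make this configuration concrete, below is a minimal kernel sketch loosely modeled on Apple's Metal 4 / MPP sample material. All MPP identifiers here (matmul2d_descriptor, matmul2d, execution_simdgroups, the tensor/dextents argument syntax, and the descriptor's parameter order) are written from memory and should be treated as assumptions to verify against the MetalPerformancePrimitives headers, not as confirmed signatures.

```metal
#include <metal_stdlib>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace metal;
using namespace mpp;

// Assumed compile-time configuration mirroring Table 1: a 32x32 output tile,
// K fixed at 32 (per Table 1 it could alternatively be inferred at runtime
// from the tensor shape), no transposes, full precision.
constexpr tensor_ops::matmul2d_descriptor kMatMulDesc(
    /* M */ 32, /* N */ 32, /* K */ 32,
    /* transpose left  */ false,
    /* transpose right */ false,
    /* reduced precision acceptable */ false);

// One SIMD-group cooperatively computes C = A * B for FP16 inputs and an
// FP32 output tile. The tensor argument syntax and the [[buffer(n)]]
// bindings are assumptions to check against the current Metal 4 documentation.
kernel void mpp_matmul_example(
    tensor<device half,  dextents<int, 2>> A [[buffer(0)]],
    tensor<device half,  dextents<int, 2>> B [[buffer(1)]],
    tensor<device float, dextents<int, 2>> C [[buffer(2)]])
{
    tensor_ops::matmul2d<kMatMulDesc, execution_simdgroups<1>> op;
    op.run(A, B, C);
}
```

The execution scope is chosen as a single SIMD-group, which, per the measurements discussed below, is currently the best-performing configuration.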

Configurable execution scope is a particularly interesting feature, as it allows the programmer to choose between different execution modes. There is a thread-scope, where each thread performs its own matrix multiplication — probably most useful for small matrices with divergent execution flow, e.g., for graphical tasks. The SIMD-groups scope is a classical hardware-accelerated cooperative matrix multiplication mode, where all threads in a compute partition concurrently work towards producing the result. This scope can also be configured to involve multiple parallel SIMD-groups, presumably pooling the hardware resources of several partitions, but in my experiments, this always resulted in a significant performance drop. This will likely be fixed in a future update.

The MPP framework requires the output matrix size (tile size) to be fixed at compile time by providing the constants \(M\) and \(N\). The remaining input matrix dimension can be either a constant or determined dynamically from the input tensors' shape. While the framework is quite flexible in setting up the dimensions, it seems that only a few choices work well in practice. Using too small or too large matrices generally resulted in poor performance or a shader compilation error. Some dimensions produced wrong results (which appears to be a bug with masking out unused lanes in the dot product hardware). Performance figures for different tile sizes are presented further down.

The framework supports several data types as inputs and outputs for matrix operations. Hardware-accelerated data formats include FP16 (with FP16 or FP32 output) and INT8 (with INT32 output). Although undocumented, FP32 is supported as well, but appears to run using the general-purpose SIMD pipe on the A19. A notable omission is the bfloat16 format — it is unclear whether the first-generation Neural Accelerator hardware lacks dedicated support for this format or whether it is not yet exposed in the Metal Shading Language.

Finally, the output of the matrix multiplication operation can be either a tensor object (usually stored in device memory) or a special cooperative_tensor object that partitions the tensor storage among all participating threads in an implementation-defined way (this is similar to "matrix fragments" in Nvidia PTX). Cooperative tensors are useful when performing multiple operations in a sequence or post-processing the output, since the data resides in local registers and is very fast to access. Perhaps surprisingly, cooperative tensors currently cannot be used as inputs for matrix multiplication.

Experimental Setup

To measure the peak performance of Neural Accelerators, I have set up a simple testing framework using Swift and Metal Shading Language. The code is available in the project repository. A helper function is used to generate and build specialized matrix multiplication kernels for the given matrix dimensions and data types. We then launch a very large number of threadgroups and measure the elapsed time. Since we are interested in peak performance, special care is taken to ensure that memory access overhead is negligible:

  • We use the same inputs for each kernel, ensuring that the data will be in the fast GPU-local cache instead of the slow device memory
  • We repeat the multiply-accumulate operation multiple times (64 repetitions) and use a cooperative tensor to ensure high performance
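A highly simplified sketch of the kernel structure these two points describe follows. It omits the cooperative_tensor destination used by the actual benchmark (whose exact API I have not verified here) and simply re-runs the accumulating multiplication on the same cache-resident inputs; the accumulate mode enum and the other MPP identifiers are, as before, assumptions rather than confirmed signatures.

```metal
#include <metal_stdlib>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace metal;
using namespace mpp;

// Same assumed MPP identifiers as in the earlier sketch; the accumulate mode
// value (mode::multiply_accumulate) is likewise an assumption.
constexpr tensor_ops::matmul2d_descriptor kBenchDesc(
    32, 32, 32, false, false, false,
    tensor_ops::matmul2d_descriptor::mode::multiply_accumulate);

kernel void mpp_matmul_bench(
    tensor<device half,  dextents<int, 2>> A [[buffer(0)]],
    tensor<device half,  dextents<int, 2>> B [[buffer(1)]],
    tensor<device float, dextents<int, 2>> C [[buffer(2)]])
{
    tensor_ops::matmul2d<kBenchDesc, execution_simdgroups<1>> op;
    // Reuse the same inputs 64 times so that arithmetic throughput, not
    // memory traffic, dominates the measured kernel time.
    for (int i = 0; i < 64; ++i) {
        op.run(A, B, C);   // C += A * B (accumulation in place)
    }
}
```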

For given matrix dimensions \(M\), \(K\), and \(N\) and the elapsed time \(t\), we can calculate the kernel throughput as follows3:

\[ \text{OPS} = \frac{64\times(2MNK)}{t} \]

These tests are performed for the supported input data types (FP16 and INT8) with supported accumulator types. We also test a variety of matrix dimensions for \(M, N \in \{8, 16, 32\}\) and \(K \in \{8, 16, 32, 64, 128\}\). As choosing certain matrix dimensions produces invalid results on the A19, we also validate the kernel output by comparing it with a reference implementation. Invalid results are discarded. To account for variance in measurements, we repeat each test multiple times and take the highest measured value. The tests were performed on an A19 chip with 5 GPU cores (iPhone 17), with reference data provided by an M3 Max MacBook Pro and an M4 iPad Pro. The A19 tests were performed outdoors during a particularly cold Swiss autumn evening to help with the thermal performance, which might have been more of a placebo than not.

Results for FP16

MPP supports matrix multiplication for FP16 inputs using an FP16 or an FP32 accumulator matrix. The performance remains the same for both accumulator types. This is in contrast to some other implementations, which offer higher performance when an FP16 accumulator is used (e.g., Nvidia Tensor Cores). The performance is highly dependent on the matrix dimensions (see the following sections for a more detailed discussion). The results are summarized in Figure 2 below. The output matrix (tile) dimensions are \(M\) (rows) and \(N\) (columns). The number \(K\) is the other dimension of the input matrices (columns for the first matrix and rows for the second matrix). We only show the results for the FP32 accumulator, since they are identical to those for the FP16 accumulator. There is no difference in measured performance if transposed matrices are used, suggesting that the hardware is able to access row and column data at no additional cost.

Figure 2. A19 Neural Accelerator Performance for FP16 Matrix Multiplication with FP32 Accumulator

The peak measured throughput is approximately 7.4 TFLOPS for the 5-core A19 smartphone chip. Given the estimated GPU clock frequency of 1460 MHz, each core provides around 1000 FLOPS of matrix multiplication throughput per cycle. This number is very close to 1024 FLOPS, or 128 dot product FMA operations per cycle per compute partition. As mentioned previously, this is very similar to the 4-way 32-wide dot product accelerator described in a published patent, and too close a number to be a coincidence. Even if the hardware implementation details differ, I believe it is safe to conclude that we have \(128\) matrix FMAs per partition.
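The arithmetic behind this estimate:

\[ \frac{7.4\ \text{TFLOPS}}{5\ \text{cores} \times 1.46\ \text{GHz}} \approx 1014\ \text{FLOPS per core per cycle} \approx 1024. \]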

Results for INT8

MPP supports matrix multiplication for INT8 inputs using an INT32 accumulator matrix. As previously mentioned, the performance is highly dependent on the matrix dimensions. The results are summarized in Figure 3 below. The output matrix (tile) dimensions are \(M\) (rows) and \(N\) (columns). The number \(K\) is the other dimension of the input matrices (columns for the first matrix and rows for the second matrix). Just as in the FP16 case, transposing the matrices does not affect the performance.

Figure 3. A19 Neural Accelerator Performance for INT8 Matrix Multiplication with INT32 Accumulator

The peak measured throughput is approximately 13.4 TOPS, just short of double the FP16 figure. Given the estimated GPU clock frequency of 1460 MHz, this translates to approximately 1900 OPS per GPU core per cycle. Unlike before, this does not land neatly on a power of two, although it is reasonably close to 2048. It would be convenient to assume 2048 OPS per core per cycle, since that would correspond to a plausible 256 dot product MACs (integer multiply-accumulates) per partition per cycle, double the FP16 rate. Still, I am a bit uneasy about making this jump, given the larger discrepancy between this plausible value and the measured one. It is possible that the GPU throttles slightly while running the Neural Accelerator hardware in INT mode, which would explain the slightly lower performance. I did obtain a measurement of over 14 TOPS during some preliminary benchmark runs, but was not able to replicate it using the final version of the benchmark suite. At any rate, we do observe a noticeable improvement over the FP16 performance, suggesting that Apple has actually undersold the capability of these chips in their marketing materials.

Optimal Tile Size

The experimental results show that the matrix multiplication performance highly depends on the chosen matrix tile size. It appears that the optimal matrix size for both FP16 and INT8 data is around 32 \(\times\) 32 for both input matrices. Doubling the matrix size (not shown in the graphs) appears to decrease the performance again, and increasing the dimension beyond certain limits seems to crash the Metal compiler on the A19 target.

With 128 presumed matrix FMAs per partition, 256 cycles are required in total to compute the matrix product for \(M=N=K=32\), which seems like a surprisingly large amount of work just to hide execution latencies. As Apple does not provide any architectural or performance tuning details at this moment, I find it difficult to speculate on the exact reason. It might be a latency issue, or maybe we are observing the operand routing network in action, which could require a specific data layout to perform well. Regardless, pending any official guidance from Apple, I would go for at least 32 \(\times\) 32 or even 32 \(\times\) 64 matrix chunks. What's interesting is that matrix transposes appear to be free, so no extra data shuffling is required when loading the data.
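For reference, the 256-cycle figure follows directly from the presumed per-partition rate:

\[ \frac{32 \times 32 \times 32\ \text{FMAs}}{128\ \text{FMAs per cycle}} = 256\ \text{cycles per } 32 \times 32 \text{ tile}. \]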

Conclusions and Outlook

While not without limitations, the first generation of Neural Accelerators is a significant step forward for programmable ML applications on Apple Silicon. We observe:

  • 1024 FLOPS per GPU core per cycle for FP16 matrix operations
  • ~2048 OPS per GPU core per cycle for INT8 matrix operations
  • Optimal matrix tile size of approximately 32 \(\times\) 32

In particular, I'd like to highlight the strong FP16 performance with FP32 accumulators, which compares favorably with current state-of-the-art implementations. On the other hand, INT8 performance lags behind, and support for bfloat16 and additional reduced-precision ML-optimized data types is lacking. Regardless, the performance improvements are significant (the A19 iPhone GPU matches my M3 Max MacBook Pro for FP16 matrix multiplication!), which should make M5-based machines considerably more attractive for ML applications, especially when paired with unified memory pools that are larger than those of other consumer systems.

The results presented here focused on hardware capability and peak performance. Real-world applications are more complex because they also need to move data to and from the device memory. Based on simple napkin math, the required data bandwidth far exceeds the capabilities of current memory interfaces. A single half-precision 32 \(\times\) 32 matrix is 2 KB of data, so 4 KB needs to be fetched per compute partition every 256 cycles to keep the Neural Accelerators fully utilized. That's 16 bytes per cycle per partition, or 64 bytes per cycle for a single GPU core. At the GPU clock frequency of 1460 MHz, this translates to 93.44 GB/s of inbound bandwidth per GPU core. In comparison, the new M5 Mac has ~150 GB/s of memory bandwidth for 10 GPU cores. State-of-the-art algorithms for Nvidia hardware rely on sophisticated data prefetching and caching strategies to mitigate this. It would be interesting to explore how these ideas can be adapted to the Apple Silicon architecture and the new Metal 4 Tensor APIs.
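The napkin math above, spelled out per GPU core (assuming 32 \(\times\) 32 FP16 tiles, full Neural Accelerator utilization, and no data reuse):

\[ \frac{2 \times 2\,\text{KB}}{256\ \text{cycles}} = 16\ \text{B/cycle per partition}, \qquad 4 \times 16\ \text{B/cycle} \times 1.46\ \text{GHz} \approx 93\ \text{GB/s per core}. \]

Scaled to the 10 GPU cores of the M5, that is nearly 1 TB/s of ideal demand against roughly 150 GB/s of available memory bandwidth, which underscores why the data reuse strategies mentioned above are essential in practice.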

References


  1. The frequency was estimated by measuring the maximal achievable throughput of an FMA chain kernel, which tops out at around \(1880\) GFLOPS. With \(128\) FMA pipelines per core (\(640\) pipelines in total) and two FLOPS per pipeline per cycle, this results in \(1880/(640 \times 2) \approx 1.46\) GHz. Note that this is a non-Pro A19; the A19 Pro reportedly runs at a ~10% higher clock frequency. 

  2. Metal stores matrices in row-major order, which means that the matrix dimensions are reversed for a tensor. The first dimension specifies the number of columns (elements per row), and the second dimension specifies the number of rows. This does not appear to be currently documented and can be somewhat confusing. 

  3. Each matrix multiplication requires \(MNK\) FMA operations, and an FMA counts as two operations (multiplication and addition). We repeat this 64 times.