TensorOps is a Metal Shading Language library that accelerates tensor operations on the GPU. In iOS/macOS 27 it adds support for new quantized data types (FP8, INT2, MX scaling formats) and lets cooperative tensors be used directly as matmul inputs, which enables custom operations like FlashAttention without round-tripping through threadgroup memory.
• Added FP8 (E4M3, E5M2), FP8 E8M0 scale factors, and INT2 quantized data types (INT4/INT8 were added in iOS/macOS 26)
• MTLTensor now supports an auxiliary scale plane, letting a single tensor object carry quantized data and block-wise FP8 E8M0 scales together
• Cooperative tensors can now be passed directly as inputs to matmul2d operations via get_left_input_cooperative_tensor, eliminating threadgroup memory round-trips in fused kernels
• The M5 neural accelerator (a new hardware block in each shader core) is now targeted automatically by TensorOps for dense, compute-bound prefill workloads
• New FP8/INT2/MX-format quantized tensor types let you compress model weights aggressively and feed them directly into TensorOps without manual dequantization, cutting memory bandwidth usage for LLM inference
• Cooperative tensors can now be passed directly as inputs to subsequent matmul operations, enabling fused operations like FlashAttention without expensive threadgroup memory round-trips
• A single MTLTensor can now carry an FP8 E8M0 block-wise scale plane alongside quantized data, simplifying quantized weight management in custom Metal kernels
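To make the block-wise scaling concrete, here is a minimal CPU-side sketch in plain Swift of what "dequantizing" an FP8 E4M3 value with an E8M0 scale means numerically. It assumes the OCP encodings (E4M3 with bias 7, no infinities, and a single NaN pattern per sign; E8M0 as a pure power-of-two exponent with bias 127); whether Metal's FP8 types match these bit layouts exactly is an assumption here. The names `decodeE4M3`, `decodeE8M0`, and `dequantizeRow` are hypothetical helpers for illustration, not Metal or TensorOps API.

```swift
/// Returns 2^e as a Float (stdlib only, no Foundation needed).
func pow2(_ e: Int) -> Float {
    Float(sign: .plus, exponent: e, significand: 1)
}

/// Decodes one FP8 E4M3 byte: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7.
/// Assumption: OCP-style encoding where exponent 1111 with mantissa 111 is NaN.
func decodeE4M3(_ byte: UInt8) -> Float {
    let sign: Float = (byte & 0x80) != 0 ? -1 : 1
    let exponent = Int((byte >> 3) & 0x0F)
    let mantissa = Float(byte & 0x07)
    if exponent == 0x0F && mantissa == 7 { return .nan }
    if exponent == 0 {
        // Subnormal: no implicit leading 1, exponent fixed at 1 - bias = -6.
        return sign * (mantissa / 8) * pow2(-6)
    }
    return sign * (1 + mantissa / 8) * pow2(exponent - 7)
}

/// Decodes one E8M0 scale byte: an unsigned exponent with bias 127.
/// Assumption: 0xFF is reserved for NaN, as in the OCP MX specification.
func decodeE8M0(_ byte: UInt8) -> Float {
    byte == 0xFF ? .nan : pow2(Int(byte) - 127)
}

/// Dequantizes a row of E4M3 bytes where each run of `blockSize`
/// consecutive values shares one E8M0 scale.
func dequantizeRow(_ data: [UInt8], scales: [UInt8], blockSize: Int) -> [Float] {
    precondition(blockSize > 0 && data.count == scales.count * blockSize,
                 "each block of `blockSize` values needs exactly one scale")
    return data.enumerated().map { index, byte in
        decodeE4M3(byte) * decodeE8M0(scales[index / blockSize])
    }
}

// 0x38 encodes 1.0 and 0xC0 encodes -2.0; scale byte 130 means 2^3 = 8.
let values = dequantizeRow([0x38, 0xC0], scales: [130], blockSize: 2)
print(values) // [8.0, -16.0]
```

On the GPU this arithmetic is what the source describes TensorOps doing on the fly; the sketch is only useful for checking quantized weight files or building reference outputs on the CPU.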
Shows how to create an MTLTensor with an FP8 quantized data plane and an attached E8M0 block-wise scale plane in Swift, then dispatch a TensorOps-based matrix multiplication kernel that dequantizes on the fly.
import Metal
import MetalPerformancePrimitives

/// Demonstrates creating a quantized MTLTensor with an FP8 E8M0 scale plane
/// and dispatching a custom TensorOps matmul kernel in iOS 27 / macOS 27.
func runQuantizedMatMul(device: MTLDevice, commandQueue: MTLCommandQueue) throws {
    // --- 1. Build the scale-plane descriptor (FP8 E8M0, block size 32 along columns) ---
    let scalePlaneDescriptor = MTLTensorAuxiliaryPlaneDescriptor()
    scalePlaneDescriptor.dataType = .float8E8M0 // FP8 E8M0 scale format (new in iOS 27)
    scalePlaneDescriptor.blockFactors = MTLTensorBlockFactors(width: 32, height: 1)

    // Map this auxiliary plane as a "scales" plane
    let auxPlanesMap = MTLTensorAuxiliaryPlanesMap()
    auxPlanesMap.setPlane(scalePlaneDescriptor, for: .scales)

    // --- 2. Build the main tensor descriptor (FP8 E4M3 quantized weights, 1024 x 2048) ---
    let tensorDescriptor = MTLTensorDescriptor()
    tensorDescriptor.dataType = .float8E4M3 // FP8 quantized data type (new in iOS 27)
    tensorDescriptor.shape = [1024, 2048]
    tensorDescriptor.auxiliaryPlanesMap = auxPlanesMap // attach scale plane

    // --- 3. Allocate the tensor on the device ---
    // `try?` keeps the custom error below even if makeTensor(descriptor:) is a throwing call.
    guard let quantizedTensor = try? device.makeTensor(descriptor: tensorDescriptor) else {
        throw NSError(domain: "TensorOps", code: -1,
                      userInfo: [NSLocalizedDescriptionKey: "Failed to allocate quantized tensor"])
    }

    // --- 4. Fill weight data and scale factors (normally loaded from model file) ---
    // quantizedTensor.plane(.data): fill with FP8 E4M3 bytes
    // quantizedTensor.plane(.scales): fill with FP8 E8M0 scale bytes (one per 32-element block)

    // --- 5. Allocate a standard FP16 activation tensor (M x K = 1 x 1024) ---
    let activationDescriptor = MTLTensorDescriptor()
    activationDescriptor.dataType = .half
    activationDescriptor.shape = [1, 1024]
    guard let activationTensor = try? device.makeTensor(descriptor: activationDescriptor) else {
        throw NSError(domain: "TensorOps", code: -2,
                      userInfo: [NSLocalizedDescriptionKey: "Failed to allocate activation tensor"])
    }

    // --- 6. Allocate FP16 output tensor (1 x 2048) ---
    let outputDescriptor = MTLTensorDescriptor()
    outputDescriptor.dataType = .half
    outputDescriptor.shape = [1, 2048]
    guard let outputTensor = try? device.makeTensor(descriptor: outputDescriptor) else {
        throw NSError(domain: "TensorOps", code: -3,
                      userInfo: [NSLocalizedDescriptionKey: "Failed to allocate output tensor"])
    }

    // --- 7. Load and dispatch the custom TensorOps MSL kernel ---
    let library = try device.makeDefaultLibrary(bundle: .main)
    // makeFunction(name:) returns an optional instead of throwing, so unwrap it explicitly.
    guard let function = library.makeFunction(name: "quantized_matmul_fp8") else {
        throw NSError(domain: "TensorOps", code: -4,
                      userInfo: [NSLocalizedDescriptionKey: "Kernel quantized_matmul_fp8 not found"])
    }
    let pipeline = try device.makeComputePipelineState(function: function)

    // Report encoder creation failures instead of returning silently.
    guard let cmdBuffer = commandQueue.makeCommandBuffer(),
          let encoder = cmdBuffer.makeComputeCommandEncoder() else {
        throw NSError(domain: "TensorOps", code: -5,
                      userInfo: [NSLocalizedDescriptionKey: "Failed to create command buffer or encoder"])
    }
    encoder.setComputePipelineState(pipeline)

    // Bind the MTLTensor objects: the kernel sees quantized weights + scales together
    encoder.setTensor(activationTensor, index: 0)
    encoder.setTensor(quantizedTensor, index: 1) // carries data + scale plane
    encoder.setTensor(outputTensor, index: 2)

    // Grid: one threadgroup per output tile (e.g. 64-wide tiles across 2048 columns)
    let tileWidth = 64
    let gridWidth = (2048 + tileWidth - 1) / tileWidth
    encoder.dispatchThreadgroups(
        MTLSize(width: gridWidth, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: 32, height: 4, depth: 1) // 4 simdgroups
    )
    encoder.endEncoding()
    cmdBuffer.commit()
    cmdBuffer.waitUntilCompleted()

    print("Quantized FP8 matmul complete. Output tensor shape:", outputTensor.shape)
}

New quantized data types (FP8, INT2) have stricter alignment requirements than larger types; check the Metal documentation before allocating buffers. Not every cooperative tensor layout is compatible as a direct matmul input; always call is_compatible_as_left_input / is_compatible_as_right_input before reuse, and fall back to threadgroup memory if incompatible. Scale plane block factors must evenly divide tensor dimensions.
Full hardware acceleration on the neural accelerator requires the M5 chip family; on earlier Apple silicon, TensorOps automatically falls back to the regular GPU pipelines.
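The rule that scale plane block factors must evenly divide the tensor dimensions can be checked on the CPU before any allocation. The sketch below is a hypothetical plain-Swift helper (`ScalePlaneLayout` is not a Metal type) that validates the block factors, counts the E8M0 scale entries, and compares the resulting footprint against unquantized FP16. It assumes a `[rows, columns]` shape, one byte per FP8 element, one byte per scale, and no padding or alignment overhead.

```swift
/// Hypothetical helper for illustration; not part of Metal or TensorOps.
struct ScalePlaneLayout {
    let rows: Int
    let columns: Int
    let blockWidth: Int   // elements along a row that share one scale
    let blockHeight: Int  // elements along a column that share one scale

    /// True when both block factors are positive and divide the shape evenly.
    var isValid: Bool {
        blockWidth > 0 && blockHeight > 0
            && columns % blockWidth == 0
            && rows % blockHeight == 0
    }

    /// Number of scale entries, or nil if the block factors do not fit.
    var scaleCount: Int? {
        isValid ? (columns / blockWidth) * (rows / blockHeight) : nil
    }

    /// Bytes for FP8 data plus E8M0 scales (1 byte each, padding ignored).
    var quantizedByteCount: Int? {
        scaleCount.map { rows * columns + $0 }
    }

    /// Bytes for the same shape stored as FP16 (2 bytes per element).
    var fp16ByteCount: Int { rows * columns * 2 }
}

// The 1024 x 2048 weight tensor from the sample, with 32-wide blocks.
let weights = ScalePlaneLayout(rows: 1024, columns: 2048, blockWidth: 32, blockHeight: 1)
print(weights.scaleCount ?? -1)         // 65536
print(weights.quantizedByteCount ?? -1) // 2162688
print(weights.fp16ByteCount)            // 4194304

// 2050 columns cannot be split into 32-wide blocks, so the layout is rejected.
let badLayout = ScalePlaneLayout(rows: 1024, columns: 2050, blockWidth: 32, blockHeight: 1)
print(badLayout.isValid)                // false
```

Under these assumptions the quantized tensor, scales included, takes roughly 52% of the FP16 footprint; actual allocation sizes may differ once alignment requirements are applied.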