So here's the thing - everyone keeps saying "you need NVIDIA for deep learning" but I just got my PyTorch models running 3x faster on my M2 MacBook Pro using Metal kernels. And honestly? The setup was way easier than I expected.
The Problem: PyTorch Defaults to CPU on Mac
If you install PyTorch on Apple Silicon and just run your model, it defaults to CPU. Which is... fine? But you're leaving massive performance on the table. The M-series chips have a seriously capable integrated GPU that PyTorch can tap into through Metal Performance Shaders (MPS). (The Neural Engine is a separate story - the MPS backend only uses the GPU; the Neural Engine is reachable through Core ML, not PyTorch.)
Quick benchmark I ran yesterday:
- CPU inference: ~340ms per batch
- MPS (Metal) inference: ~95ms per batch
- That's 3.5x speedup for literally changing one line of code
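That one line, if you want to skip ahead, is just moving the model (and inputs) to the mps device. A minimal sketch - model choice and batch size here are arbitrary:
import torch
import torchvision.models as models

# prefer MPS when it's available, otherwise fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
model = model.to(device)  # <-- the one line that matters
batch = torch.randn(8, 3, 224, 224, device=device)

with torch.no_grad():
    logits = model(batch)
print(logits.shape, logits.device)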
Why Metal Backend Matters (and Why I Started Experimenting)
Apple's Metal framework is their answer to CUDA. When PyTorch 1.12 dropped MPS support, I was skeptical tbh. But after spending a weekend benchmarking, I'm convinced this is production-ready for inference workloads.
The real magic? torch.compile can now optimize the whole model graph for the MPS backend automatically. You don't need to write custom kernels or mess with low-level GPU code.
Setting Up PyTorch with Metal Support
First, make sure you're on a recent PyTorch version. This bit me hard when I tried this on 1.11 - MPS support only landed in 1.12, and it took a few more releases to stabilize.
# check your current setup
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"MPS built: {torch.backends.mps.is_built()}")
# if MPS shows False, upgrade to a recent build (the default macOS arm64 wheels include MPS):
# pip install --upgrade torch torchvision torchaudio
The tricky part (that took me an hour to figure out): you need macOS 12.3+ for MPS. I was on 12.2 and got cryptic errors about missing Metal libraries.
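A quick way to sanity-check both the OS version and MPS in one go (just a sketch - platform.mac_ver() returns an empty string off macOS):
import platform
import torch

macos_version = platform.mac_ver()[0]  # e.g. '14.2' on a Mac, '' elsewhere
print(f"macOS: {macos_version or 'not macOS'}")
print(f"MPS built: {torch.backends.mps.is_built()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

# is_built() means your wheel was compiled with MPS support;
# is_available() additionally checks the OS/hardware can actually use it
if torch.backends.mps.is_built() and not torch.backends.mps.is_available():
    print("PyTorch has MPS support but this machine can't use it - check macOS >= 12.3")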
Performance Testing: CPU vs MPS vs torch.compile
Okay so I ran these benchmarks on a ResNet50 model doing inference on 224x224 images. My setup: M2 Pro, 16GB RAM, macOS 14.2.
import torch
import torchvision.models as models
import time
# load pretrained model
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # pretrained=True is deprecated
model.eval()
# dummy input batch
batch_size = 32
dummy_input = torch.randn(batch_size, 3, 224, 224)
def benchmark_inference(model, input_tensor, device, warmup=10, iterations=100):
    """
    my go-to performance testing setup
    warmup runs are critical - learned this the hard way when
    first runs were 10x slower due to Metal shader compilation
    """
    model = model.to(device)
    input_tensor = input_tensor.to(device)
    batch = input_tensor.shape[0]
    # warmup phase (SUPER important for Metal)
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(input_tensor)
    # actual benchmark
    if device.type == 'mps':
        torch.mps.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iterations):
            output = model(input_tensor)
        if device.type == 'mps':
            torch.mps.synchronize()  # wait for the GPU to finish
    end = time.perf_counter()
    avg_time = (end - start) / iterations * 1000  # convert to ms
    print(f"{device} - Average: {avg_time:.2f}ms, Throughput: {batch/avg_time*1000:.1f} img/s")
    return avg_time
# test 1: CPU baseline
cpu_time = benchmark_inference(model, dummy_input, torch.device('cpu'))
# test 2: MPS (Metal)
mps_time = benchmark_inference(model, dummy_input, torch.device('mps'))
# test 3: MPS + torch.compile (this is where it gets interesting)
model_compiled = torch.compile(model, backend='aot_eager')
compiled_time = benchmark_inference(model_compiled, dummy_input, torch.device('mps'))
print(f"\nSpeedup: MPS vs CPU = {cpu_time/mps_time:.2f}x")
print(f"Speedup: Compiled vs CPU = {cpu_time/compiled_time:.2f}x")
My Results (your mileage may vary):
- CPU: 342ms (93 img/s)
- MPS: 98ms (326 img/s) - 3.5x faster
- MPS + compile: 87ms (367 img/s) - 3.9x faster
The compiled version was only ~11% faster than raw MPS. Honestly expected more, but I'll take it.
The Unexpected Discovery: Memory Management is Different
So here's something that caught me off guard. Metal's memory management works differently than CUDA. On CUDA, you can usually get away with lazy memory cleanup. On Metal? Not so much.
I kept hitting this error when processing large batches:
RuntimeError: MPS backend out of memory
Even though Activity Monitor showed plenty of RAM available. Turns out the MPS allocator manages its own pool of unified memory and hangs on to cached blocks instead of releasing them as aggressively as you'd expect.
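You can actually watch this happen - torch.mps exposes a couple of memory counters (current_allocated_memory and driver_allocated_memory; available in the versions I've used, but double-check yours). A small sketch:
import torch

def mps_memory_report(tag=''):
    allocated = torch.mps.current_allocated_memory() / 1e6  # live tensors on MPS
    driver = torch.mps.driver_allocated_memory() / 1e6      # everything the Metal driver holds, incl. cache
    print(f"[{tag}] allocated: {allocated:.1f} MB, driver: {driver:.1f} MB")

x = torch.randn(4096, 4096, device='mps')
mps_memory_report('after alloc')
del x
mps_memory_report('after del')          # driver memory tends to stay high here...
torch.mps.empty_cache()
mps_memory_report('after empty_cache')  # ...and drops once the cache is cleared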
The fix that saved my sanity:
def process_large_dataset(dataloader, model, device):
    """
    Process data with proper Metal memory management
    btw this pattern also works for CUDA but Metal is stricter about it
    """
    model = model.to(device)
    results = []
    for batch_idx, (images, _) in enumerate(dataloader):
        images = images.to(device)
        with torch.no_grad():
            outputs = model(images)
        # move results back to CPU immediately
        results.append(outputs.cpu())
        # critical for Metal - clear cache every N batches
        if batch_idx % 10 == 0 and device.type == 'mps':
            torch.mps.empty_cache()
            # this is like hitting the reset button on GPU memory
    return torch.cat(results, dim=0)
After adding empty_cache() calls, I could process 5x larger datasets without OOM errors.
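One more knob worth knowing about (I haven't needed it in production yet, so treat this as a pointer rather than a recipe): recent PyTorch lets you cap how much unified memory the MPS allocator may claim, so a runaway batch fails with a catchable error instead of grinding the whole machine:
import torch

# cap the MPS allocator at ~60% of the recommended maximum working set
torch.mps.set_per_process_memory_fraction(0.6)

try:
    too_big = torch.randn(50_000, 50_000, device='mps')  # deliberately huge
except RuntimeError as e:
    print(f"Caught MPS OOM cleanly: {e}")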
torch.compile Deep Dive: What's Actually Happening?
When you call torch.compile(), TorchDynamo traces your model and hands the captured graph to a backend for optimization. The process is pretty cool:
- Traces your Python code into a computation graph
- Analyzes which operations can be fused or simplified
- Lowers the graph through the backend you picked
- Compiles on the first run and reuses the result afterwards (so the first inference is slow)
One caveat: TorchInductor is the backend that generates custom fused kernels, and its Metal support is still limited - that's why I'm using aot_eager here. With aot_eager you get graph capture and less Python overhead rather than hand-generated Metal shaders, which lines up with the modest gain over raw MPS.
import torch._dynamo as dynamo
# see what torch.compile is doing under the hood
dynamo.config.verbose = True

model_compiled = torch.compile(
    model,
    mode='reduce-overhead',  # reduces Python/runtime overhead (good for inference)
    backend='aot_eager'
)
# first inference triggers tracing/compilation (slow)
# subsequent runs reuse the compiled graph (fast)
The mode parameter matters more than I thought:
- default: balanced optimization
- reduce-overhead: minimize Python overhead (best for inference)
- max-autotune: tries multiple kernel configs, picks fastest (slow compile, fast runtime)
I tested all three modes. max-autotune took 3 minutes to compile but only gave 5% speedup over reduce-overhead. Not worth it for my use case.
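If you want to run that comparison yourself, something like this works - it reuses benchmark_inference from above and calls torch._dynamo.reset() between runs so each mode compiles from scratch (a sketch, not exactly the script I used):
import torch._dynamo

device = torch.device('mps')
for mode in ['default', 'reduce-overhead', 'max-autotune']:
    torch._dynamo.reset()  # throw away previously compiled graphs
    compiled = torch.compile(model, mode=mode, backend='aot_eager')
    print(f"mode={mode}")
    benchmark_inference(compiled, dummy_input, device)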
Production-Ready Inference Pipeline
Here's the full pipeline I'm using in production now. It handles edge cases I discovered through painful debugging sessions:
import torch
import torch.nn.functional as F
from pathlib import Path
class MetalInferencePipeline:
    def __init__(self, model_path, use_compile=True):
        """
        Initialize inference pipeline with Metal optimization
        Args:
            model_path: path to saved model weights
            use_compile: whether to use torch.compile (recommended)
        """
        self.device = self._get_device()
        self.model = self._load_model(model_path)
        if use_compile and self.device.type == 'mps':
            print("Compiling model for Metal... (first run will be slow)")
            self.model = torch.compile(
                self.model,
                mode='reduce-overhead',
                backend='aot_eager'
            )
        # warmup compilation
        self._warmup()

    def _get_device(self):
        """
        Get best available device
        Fallback order: MPS -> CUDA -> CPU
        """
        if torch.backends.mps.is_available():
            print("Using Metal Performance Shaders (MPS)")
            return torch.device('mps')
        elif torch.cuda.is_available():
            return torch.device('cuda')
        else:
            print("Warning: Running on CPU, inference will be slow")
            return torch.device('cpu')

    def _load_model(self, model_path):
        """Load model with error handling"""
        try:
            model = torch.load(model_path, map_location='cpu')
            model = model.to(self.device)
            model.eval()
            return model
        except Exception as e:
            raise RuntimeError(f"Failed to load model: {e}")

    def _warmup(self, num_runs=5):
        """
        Warmup runs to compile Metal shaders
        This prevents slow first inference in production
        """
        dummy_input = torch.randn(1, 3, 224, 224).to(self.device)
        with torch.no_grad():
            for _ in range(num_runs):
                _ = self.model(dummy_input)
        if self.device.type == 'mps':
            torch.mps.synchronize()

    @torch.no_grad()
    def predict(self, images):
        """
        Run inference on batch of images
        Args:
            images: torch.Tensor of shape (B, C, H, W)
        Returns:
            predictions: torch.Tensor on CPU
        """
        # move to device
        images = images.to(self.device)
        # inference
        outputs = self.model(images)
        # move back to CPU and apply softmax
        predictions = F.softmax(outputs.cpu(), dim=1)
        # cleanup Metal memory
        if self.device.type == 'mps':
            torch.mps.empty_cache()
        return predictions

# usage
pipeline = MetalInferencePipeline('resnet50.pth')
predictions = pipeline.predict(image_batch)
Edge Cases and Gotchas I Learned the Hard Way
1. Not all PyTorch ops are Metal-optimized
Not every PyTorch op has an MPS kernel. By default an unsupported op raises an error telling you to set PYTORCH_ENABLE_MPS_FALLBACK=1; with that set, the op falls back to CPU - which works, but quietly eats your speedup. Use the profiler to check what's actually running where:
from torch.profiler import profile, ProfilerActivity
# no CUDA on a Mac - profile CPU activity; fallback ops show up with big CPU time
with profile(activities=[ProfilerActivity.CPU]) as prof:
    output = model(input_tensor)
print(prof.key_averages().table(sort_by="cpu_time_total"))
If you see high CPU time for GPU tensors, that op isn't Metal-accelerated.
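If you'd rather have unsupported ops fall back to CPU automatically instead of erroring out, the opt-in is an environment variable that has to be set before torch is imported (exporting it in your shell works just as well):
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # must come before `import torch`

import torch

x = torch.randn(16, 32, device="mps")
# any op without an MPS kernel now runs on CPU with a one-time UserWarning
# instead of raising - convenient, but those silent CPU round-trips are
# exactly what the profiler output above helps you catch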
2. Mixed precision (fp16) is weird on Metal
CUDA handles fp16 beautifully. Metal? Not so much. I got worse performance with fp16:
# this actually made things SLOWER on M2
model = model.half()              # convert weights to fp16
dummy_input = dummy_input.half()  # inputs have to match the dtype
# MPS fp16 support is still maturing as of PyTorch 2.1
Stick with fp32 for now on Apple Silicon.
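If you want to check fp16 on your own chip, here's a quick sketch reusing benchmark_inference and the ResNet50/dummy_input from earlier - results seem to vary a lot by model and PyTorch version, so treat mine as one data point:
device = torch.device('mps')

# fp32 baseline
fp32_time = benchmark_inference(model.float(), dummy_input.float(), device)

# fp16 - both weights and inputs converted
fp16_time = benchmark_inference(model.half(), dummy_input.half(), device)

print(f"fp32/fp16 latency ratio: {fp32_time / fp16_time:.2f}x (below 1.0 means fp16 is slower)")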
3. DataLoader num_workers strikes again
Set num_workers=0 when using MPS. Multi-process data loading conflicts with Metal's memory model:
from torch.utils.data import DataLoader

# this causes random crashes with MPS
train_loader = DataLoader(dataset, num_workers=4)  # don't do this
# use single-process loading instead
train_loader = DataLoader(dataset, num_workers=0)  # stable
Took me 3 hours of debugging to figure this out. Save yourself the pain.
When NOT to Use Metal Backend
Real talk - Metal isn't always faster. Here's when to stick with CPU:
- Small models (< 10M parameters): overhead isn't worth it
- Batch size = 1: GPUs win through parallelism, so single-image latency is often no better than CPU
- Training (as of Dec 2024): MPS training has rough edges, use CPU or cloud GPU
- Non-vision models: NLP models often hit unsupported ops
I benchmarked a small LSTM model and CPU was actually 20% faster than MPS due to memory transfer overhead.
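That test looked roughly like this (the layer sizes here are placeholders, not my exact config, and benchmark_inference is the helper from earlier):
import torch.nn as nn

# a small recurrent model - the kind of workload where MPS overhead dominates
lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
lstm.eval()
seq_batch = torch.randn(32, 100, 64)  # (batch, seq_len, features)

cpu_t = benchmark_inference(lstm, seq_batch, torch.device('cpu'))
mps_t = benchmark_inference(lstm, seq_batch, torch.device('mps'))
print(f"MPS vs CPU on a small LSTM: {cpu_t / mps_t:.2f}x")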
Comparing to Other Solutions
vs Cloud GPUs (A100, V100)
- MPS: great for development/small-scale inference
- Cloud: necessary for training or high-throughput serving
- Cost: local is free after hardware purchase
vs ONNX Runtime with CoreML
- ONNX: slightly faster for pure inference (~10-15%)
- PyTorch: easier debugging, more flexible
- I use ONNX for production deployment, PyTorch for experimentation
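For reference, that deployment path is the standard torch.onnx.export plus onnxruntime's CoreML execution provider - a minimal sketch, assuming the ResNet50 from earlier and an onnxruntime build that ships CoreMLExecutionProvider:
import torch
import onnxruntime as ort

model_cpu = model.float().cpu().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model_cpu, dummy, "resnet50.onnx",
                  input_names=["input"], output_names=["logits"],
                  dynamic_axes={"input": {0: "batch"}})

# prefer Core ML, fall back to CPU for unsupported nodes
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
logits = session.run(None, {"input": dummy.numpy()})[0]
print(logits.shape)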
vs TensorFlow on Mac
- TF also has Metal support via PluggableDevice
- PyTorch's torch.compile feels more mature imo
- Both are viable options tbh
Monitoring Performance in Production
Here's my monitoring setup to catch Metal performance regressions:
import time
from collections import deque
class PerformanceMonitor:
    def __init__(self, window_size=100):
        self.latencies = deque(maxlen=window_size)
        self.errors = 0

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def get_stats(self):
        if not self.latencies:
            return None
        latencies = sorted(self.latencies)
        return {
            'p50': latencies[len(latencies) // 2],
            'p95': latencies[int(len(latencies) * 0.95)],
            'mean': sum(latencies) / len(latencies),
            'error_rate': self.errors / len(latencies)
        }
# integrate with inference
monitor = PerformanceMonitor()
def monitored_inference(pipeline, images):
    start = time.perf_counter()
    try:
        result = pipeline.predict(images)
        latency = (time.perf_counter() - start) * 1000
        monitor.record(latency)
        return result
    except Exception:
        monitor.errors += 1
        raise
I log these stats every hour to catch performance degradation.
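The hourly logging itself is nothing fancy - a re-arming timer that dumps the stats dict (sketch below; in a real deployment wire this into whatever logging/metrics stack you already run):
import json
import logging
import threading

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("inference-metrics")

def log_stats_periodically(monitor, interval_seconds=3600):
    def _tick():
        stats = monitor.get_stats()
        if stats:
            logger.info("latency stats: %s", json.dumps(stats))
        timer = threading.Timer(interval_seconds, _tick)
        timer.daemon = True  # don't keep the process alive just for logging
        timer.start()
    _tick()

log_stats_periodically(monitor)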
Final Thoughts and Next Steps
After weeks of experimentation, here's my Metal + PyTorch workflow:
- Develop on Mac with MPS (3-4x faster than CPU)
- Use torch.compile with reduce-overhead mode
- Profile carefully - not all ops are optimized
- Deploy to cloud GPUs for training/heavy workloads
The biggest surprise? How stable the MPS backend has become. A year ago this was experimental. Now I'm running production inference on it.
Things I want to explore next:
- Quantization (int8) on Metal - theoretically 4x faster
- Multi-model inference pipeline
- Combining Metal with Core ML for Neural Engine access
If you're doing ML on Apple Silicon, give Metal backend a shot. Worst case, you fall back to CPU. Best case, you get my laptop to stop sounding like a jet engine during inference.
btw if you hit issues, check PyTorch's GitHub issues - the Metal backend team is super responsive.