MLX on Apple Silicon: Running Open-Source Models Locally


Apple’s Foundation Models framework gives you a polished, on-device LLM for iOS 26 and macOS Tahoe — but it ships one model, you cannot swap it, and it does not run on older hardware. If you need to run Llama 3, Mistral, Phi-3, or a fine-tuned model of your own on any Apple Silicon Mac, MLX is the escape hatch.

This post covers what MLX is, how to load and run open-source LLMs from Swift using the mlx-swift bindings, how to stream tokens, and where MLX fits alongside Core ML and Foundation Models. We will not cover model training or fine-tuning — those deserve their own dedicated treatment.


The Problem

You are building a macOS tool that summarizes movie scripts. The on-device Foundation Models framework works great on macOS Tahoe, but your team has two requirements it cannot meet: first, the model must be a fine-tuned Llama 3 variant trained on Pixar screenplay conventions; second, the app needs to ship on macOS Ventura machines in the studio’s editing bays.

A server-side API would solve both problems, but the scripts are under NDA and cannot leave the local machine. You need local inference with a model you control.

Core ML can run arbitrary models, but converting a multi-billion-parameter LLM into the Core ML format is a project in itself — quantization, attention-head mapping, and token-by-token generation loops are all on you. Here is a rough sketch of the manual work involved:

import CoreML

enum InferenceError: Error {
    case missingLogits
}

// Simplified for clarity
func generateNextToken(using model: MLModel, inputIDs: MLMultiArray) throws -> Int {
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["input_ids": inputIDs]
    )
    let prediction = try model.prediction(from: input)

    guard let logits = prediction.featureValue(for: "logits")?.multiArrayValue else {
        throw InferenceError.missingLogits
    }

    // Manual argmax over the vocabulary dimension
    var maxIndex = 0
    var maxValue = logits[0].floatValue
    for i in 1..<logits.count {
        if logits[i].floatValue > maxValue {
            maxValue = logits[i].floatValue
            maxIndex = i
        }
    }
    return maxIndex
}

That is a single token. You still need a tokenizer, a sampling strategy, KV-cache management, and a generation loop. MLX collapses all of that into a handful of calls.
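To get a feel for what "a sampling strategy" alone entails, here is a framework-free sketch of temperature sampling over raw logits. This is illustrative plain Swift, not Core ML or MLX API:

```swift
import Foundation

// Softmax with temperature over raw logits.
// Lower temperature sharpens the distribution toward the argmax.
func probabilities(from logits: [Float], temperature: Float) -> [Float] {
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max() ?? 0  // subtract max for numerical stability
    let exps = scaled.map { expf($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

// Draw a token index from the distribution given a uniform random value.
func sampleToken(from probs: [Float], uniform: Float) -> Int {
    var cumulative: Float = 0
    for (index, p) in probs.enumerated() {
        cumulative += p
        if uniform < cumulative { return index }
    }
    return probs.count - 1
}
```

MLX's generation API handles this, plus top-p filtering and KV-cache reuse, internally.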

What Is MLX?

MLX is an open-source array framework created by Apple’s machine learning research team. It is purpose-built for Apple Silicon and uses a unified memory architecture — the same physical memory is shared between CPU and GPU with zero-copy transfers. If you have used NumPy or PyTorch, the programming model will feel familiar: lazy evaluation, automatic differentiation, and composable function transforms.

The ecosystem splits into several packages:

  • mlx — The core C++ library with Python bindings. Most community models target this.
  • mlx-swift — Native Swift bindings published as a Swift package. This is what you use in Xcode.
  • mlx-swift-examples — Reference apps from Apple Research showing LLM inference, image generation, and speech recognition.
  • mlx-community — A Hugging Face organization hosting hundreds of models pre-converted to MLX’s weight format.

Note: MLX is a macOS-only framework. It requires Apple Silicon (M1 or later) and does not run on iOS, iPadOS, or Intel Macs. If you need on-device inference on iPhone, use Foundation Models or Core ML.

Setting Up MLX Swift

Add mlx-swift-examples as a Swift package dependency. This meta-package brings in the core MLX library, the MLXLLM module for language model inference, and the MLXRandom module for sampling.

In Xcode, go to File > Add Package Dependencies and enter:

https://github.com/ml-explore/mlx-swift-examples

Pin to the latest stable release. Then add MLXLLM and MLX to your target’s frameworks.

Your Package.swift dependency block looks like this if you are building a Swift package instead:

dependencies: [
    .package(
        url: "https://github.com/ml-explore/mlx-swift-examples",
        from: "1.0.0"
    )
],
targets: [
    .executableTarget(
        name: "ScriptSummarizer",
        dependencies: [
            .product(name: "MLXLLM", package: "mlx-swift-examples"),
        ]
    )
]

Tip: The first build takes a few minutes because MLX compiles its Metal shaders. Subsequent builds are fast thanks to shader caching.

Loading and Running a Model

The MLXLLM module provides a ModelContainer that handles downloading, caching, and loading model weights. You point it at a Hugging Face repository in MLX format and let it do the rest.

Here is a minimal inference pipeline that loads a quantized Llama model and generates a single completion:

import MLX
import MLXLLM
import MLXLMCommon

struct ScriptAnalyzer {
    let modelContainer: ModelContainer

    init() async throws {
        // Load a 4-bit quantized Llama 3 from mlx-community
        let configuration = ModelConfiguration(
            id: "mlx-community/Meta-Llama-3-8B-Instruct-4bit"
        )
        self.modelContainer = try await ModelContainer(
            configuration: configuration
        )
    }

    func summarize(script: String) async throws -> String {
        let prompt = """
        You are a Pixar story analyst. Summarize the following script \
        in three bullet points, focusing on character arcs:

        \(script)
        """

        let result = try await modelContainer.perform { context in
            let input = try await context.processor.prepare(input: .init(prompt: prompt))
            return try MLXLMCommon.generate(
                input: input,
                parameters: .init(temperature: 0.6, topP: 0.9),
                context: context
            )
        }
        return result.output
    }
}

A few things to notice:

  • ModelConfiguration accepts a Hugging Face repo ID. The mlx-community organization hosts pre-quantized versions of most popular models. You can also pass a local file path if you bundle weights with your app.
  • modelContainer.perform gives you a context with the loaded model and tokenizer. This closure is thread-safe — the container serializes access internally.
  • GenerateParameters controls sampling. temperature governs randomness (lower is more deterministic) and topP restricts sampling to the smallest set of candidate tokens whose cumulative probability reaches the threshold.
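To make topP concrete, here is what nucleus sampling does under the hood. This is a pure-Swift sketch of the filtering step, not MLX's actual implementation:

```swift
import Foundation

// Keep the smallest set of tokens whose cumulative probability
// reaches topP, then renormalize. Sampling then draws only from
// this "nucleus" of likely candidates.
func nucleusFilter(probs: [Float], topP: Float) -> [(index: Int, p: Float)] {
    let sorted = probs.enumerated()
        .map { (index: $0.offset, p: $0.element) }
        .sorted { $0.p > $1.p }
    var kept: [(index: Int, p: Float)] = []
    var cumulative: Float = 0
    for entry in sorted {
        kept.append(entry)
        cumulative += entry.p
        if cumulative >= topP { break }
    }
    let total = kept.reduce(Float(0)) { $0 + $1.p }
    return kept.map { (index: $0.index, p: $0.p / total) }
}
```

With topP: 0.9, a long tail of implausible tokens is cut off entirely, which is why nucleus sampling tends to read more coherently than raw temperature sampling.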

Choosing a Model

Not every model on Hugging Face works out of the box. Look for repositories under mlx-community — these have weights in the .safetensors format that MLX expects, along with a config.json describing the architecture.

| Model | Parameters | Quantization | RAM Required | Use Case |
| --- | --- | --- | --- | --- |
| Llama 3 8B Instruct 4-bit | 8B | 4-bit | ~6 GB | General instruction following |
| Mistral 7B Instruct 4-bit | 7B | 4-bit | ~5 GB | Fast, strong at structured output |
| Phi-3 Mini 4-bit | 3.8B | 4-bit | ~3 GB | Lightweight tasks, code generation |
| Llama 3 70B Instruct 4-bit | 70B | 4-bit | ~40 GB | Maximum quality, M2 Ultra+ only |

Warning: Model weights are large. A 4-bit 8B model is roughly 4-5 GB on disk. Plan your download and caching strategy carefully — ModelContainer caches to ~/Library/Caches by default, but you may want to manage this explicitly for shipping apps.
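The RAM figures in the table follow from simple arithmetic: at 4-bit quantization each parameter takes half a byte, plus headroom for the KV cache and activations. A rough estimator (the ~1.3x overhead factor is an assumption for illustration, not an MLX constant):

```swift
// Estimate resident memory for a quantized model.
// bitsPerWeight: 4 for 4-bit, 8 for 8-bit, 16 for float16.
// overheadFactor is a loose allowance for KV cache and activations.
func estimatedMemoryGB(parametersBillions: Double,
                       bitsPerWeight: Double,
                       overheadFactor: Double = 1.3) -> Double {
    let weightsGB = parametersBillions * bitsPerWeight / 8.0
    return weightsGB * overheadFactor
}
```

An 8B model at 4 bits works out to ~4 GB of weights and roughly 5 GB resident, in line with the table above.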

Streaming Token Generation

Generating the entire response before returning it is fine for batch processing, but a user-facing app needs to stream tokens as they are produced. The MLXLLM module supports this through an AsyncStream-based API.

import MLX
import MLXLLM
import MLXLMCommon

func streamSummary(
    container: ModelContainer,
    script: String
) -> AsyncStream<String> {
    AsyncStream { continuation in
        Task {
            let prompt = """
            Summarize this Pixar screenplay in one paragraph:

            \(script)
            """

            do {
                try await container.perform { context in
                    let input = try await context.processor.prepare(
                        input: .init(prompt: prompt)
                    )
                    // Generate tokens one at a time
                    try MLXLMCommon.generate(
                        input: input,
                        parameters: .init(temperature: 0.7),
                        context: context
                    ) { token in
                        continuation.yield(token)
                        return .more // Return .stop to halt generation early
                    }
                }
                continuation.finish()
            } catch {
                continuation.finish()
            }
        }
    }
}

The callback receives each decoded token string as it is generated. Return .more to continue or .stop to halt early — useful for implementing a cancel button or a maximum-length guard.
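A maximum-length guard on top of that callback is just a counter. The Disposition enum below is a local stand-in that mirrors the .more/.stop return values, so the logic can be shown without MLX:

```swift
// Local stand-in for the generation callback's return value.
enum Disposition { case more, stop }

// Wraps a token handler so generation halts after maxTokens tokens.
func lengthLimited(maxTokens: Int,
                   handler: @escaping (String) -> Void) -> (String) -> Disposition {
    var count = 0
    return { token in
        handler(token)
        count += 1
        return count >= maxTokens ? .stop : .more
    }
}
```

The same shape works for a cancel button: capture an isCancelled flag instead of a counter and return .stop when it flips.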

On the SwiftUI side, consuming the stream is straightforward:

import SwiftUI

struct SummaryView: View {
    @State private var output = ""
    @State private var isGenerating = false
    let container: ModelContainer

    var body: some View {
        ScrollView {
            Text(output)
                .font(.body)
                .padding()
                .frame(maxWidth: .infinity, alignment: .leading)
        }
        .task {
            isGenerating = true
            let stream = streamSummary(
                container: container,
                script: sampleScript
            )
            for await token in stream {
                output += token
            }
            isGenerating = false
        }
        .overlay {
            if isGenerating && output.isEmpty {
                ProgressView("Loading model...")
            }
        }
    }
}

Tip: The first call to modelContainer.perform triggers model loading into GPU memory. On an 8B model this takes 2-4 seconds on an M1 and under 1 second on an M3. Show a progress indicator during this window.

Advanced Usage

Custom Model Configurations

When a model is not hosted on mlx-community, or you have fine-tuned your own, you can point ModelConfiguration at a local directory containing the weights and tokenizer:

let localConfig = ModelConfiguration(
    directory: URL(filePath: "/Users/woody/Models/pixar-script-llama-4bit")
)
let container = try await ModelContainer(configuration: localConfig)

The directory must contain config.json, tokenizer.json, tokenizer_config.json, and one or more .safetensors weight files. If you are converting from Hugging Face format, the mlx-lm Python package provides a one-liner:

mlx_lm.convert --hf-path meta-llama/Meta-Llama-3-8B-Instruct -q --q-bits 4

This outputs an MLX-compatible directory with quantized weights ready for Swift consumption.

Memory Management

MLX uses a memory pool that grows as needed but does not automatically shrink. After running a large model, you may want to reclaim memory explicitly:

import MLX

// After generation is complete and the model is no longer needed
MLX.GPU.set(cacheLimit: 0)

This clears the GPU memory cache. The next inference call will reallocate as needed, so only do this when you are done with the model or switching to a different one.

Multi-Turn Conversations

For chat-style interactions, you accumulate messages and re-encode the full history on each turn. The tokenizer applies the model’s chat template automatically:

func chat(
    container: ModelContainer,
    history: [(role: String, content: String)]
) async throws -> String {
    let messages = history.map { message in
        ["role": message.role, "content": message.content]
    }

    let result = try await container.perform { context in
        let input = try await context.processor.prepare(
            input: .init(messages: messages)
        )
        return try MLXLMCommon.generate(
            input: input,
            parameters: .init(temperature: 0.7, topP: 0.9),
            context: context
        )
    }
    return result.output
}

// Usage
let history: [(role: String, content: String)] = [
    (role: "system", content: "You are a Pixar story consultant."),
    (role: "user", content: "What makes Woody a compelling protagonist?"),
]
let response = try await chat(container: container, history: history)

Warning: Each turn re-encodes the full conversation history. For long conversations, token counts grow quickly and will eventually exceed the model’s context window (typically 4,096 or 8,192 tokens for most 7-8B models). Implement a sliding window or summarization strategy for production use.
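A minimal sliding-window trim might look like the sketch below. It assumes messages are the (role, content) tuples used in the chat function, and it approximates token counts as characters divided by four, a common rule of thumb; a production app should count with the model's real tokenizer:

```swift
typealias Message = (role: String, content: String)

// Keep the system message plus the most recent turns that fit
// within an approximate token budget (content length / 4).
func trimmedHistory(_ history: [Message], budget: Int) -> [Message] {
    let system = history.filter { $0.role == "system" }
    let turns = history.filter { $0.role != "system" }
    var kept: [Message] = []
    var used = system.reduce(0) { $0 + $1.content.count / 4 }
    for message in turns.reversed() {
        let cost = message.content.count / 4
        if used + cost > budget { break }
        kept.insert(message, at: 0)
        used += cost
    }
    return system + kept
}
```

Dropping whole turns from the front (while pinning the system prompt) preserves the chat template's alternating structure, which matters because most instruct models were trained on well-formed role sequences.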

Performance Considerations

MLX’s unified memory architecture means there is no CPU-to-GPU transfer overhead — a significant advantage over CUDA-based pipelines on discrete GPUs. But performance still varies dramatically based on hardware, model size, and quantization level.

Rough token generation rates on common hardware (4-bit Llama 3 8B, greedy decoding):

| Hardware | Prompt Processing | Token Generation | Peak Memory |
| --- | --- | --- | --- |
| M1 (8 GB) | ~35 tokens/s | ~10 tokens/s | ~6 GB |
| M1 Pro (16 GB) | ~55 tokens/s | ~15 tokens/s | ~6 GB |
| M2 Max (32 GB) | ~90 tokens/s | ~25 tokens/s | ~6 GB |
| M3 Max (48 GB) | ~120 tokens/s | ~35 tokens/s | ~6 GB |
| M4 Max (64 GB) | ~150 tokens/s | ~40 tokens/s | ~6 GB |

A few things affect throughput:

  • Quantization level — 4-bit models are roughly twice as fast as 8-bit and four times as fast as unquantized float16. Quality degradation at 4-bit is minimal for most instruction-following tasks.
  • Memory bandwidth — LLM inference is memory-bandwidth bound, not compute bound. The M3 Max’s 400 GB/s bandwidth is why it generates tokens 3-4x faster than the base M1’s 68.25 GB/s.
  • Context length — Prompt processing scales linearly with input length. A 2,000-token prompt takes roughly twice as long as a 1,000-token prompt.
  • KV-cache growth — The key-value cache for attention grows with each generated token. On an 8 GB machine, you may hit memory pressure after ~2,000 generated tokens with an 8B model.
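The bandwidth point can be sanity-checked with back-of-the-envelope math: each generated token streams essentially the entire weight set through the memory system once, so decode speed has a hard ceiling of bandwidth divided by model size. A sketch of that estimate (real throughput lands below the ceiling due to compute and cache overhead):

```swift
// Theoretical upper bound on decode speed for a memory-bandwidth-bound
// LLM: every generated token reads the full weight set once.
func tokensPerSecondCeiling(bandwidthGBps: Double, modelSizeGB: Double) -> Double {
    bandwidthGBps / modelSizeGB
}
```

For the base M1's 68.25 GB/s and a ~4.5 GB 4-bit 8B model, the ceiling is about 15 tokens/s; the measured ~10 tokens/s in the table sits plausibly below it.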

Use the MLX.GPU.snapshot() API to profile memory usage during inference:

import MLX
import MLXLLM
import MLXLMCommon

let before = MLX.GPU.snapshot()
let result = try await container.perform { context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(
        input: input,
        parameters: .init(temperature: 0.7),
        context: context
    )
}
let after = MLX.GPU.snapshot()

print("Peak memory: \(after.peakMemory / 1_048_576) MB")
print("Active memory: \(after.activeMemory / 1_048_576) MB")
print("Cache memory: \(after.cacheMemory / 1_048_576) MB")

Apple Docs: Metal Performance Shaders — MLX compiles custom Metal kernels for matrix operations, building on the same GPU compute infrastructure that Metal Performance Shaders uses.

When to Use (and When Not To)

MLX occupies a specific niche. Here is a decision framework for choosing between the three on-device ML options on Apple platforms:

| Scenario | Recommendation |
| --- | --- |
| iOS/iPadOS app, Apple Intelligence available | Use Foundation Models — native API, no model management |
| macOS Tahoe app, standard LLM tasks | Use Foundation Models — same rationale, plus system-level optimizations |
| macOS app needing a specific open-source model | Use MLX — full model choice, quantization control |
| macOS app targeting pre-Tahoe (Ventura/Sonoma) | Use MLX — Foundation Models requires macOS Tahoe |
| App needing Vision or NLP classifiers | Use Core ML — purpose-built for task-specific models |
| Cross-platform model deployment (Apple + Linux) | Use MLX Python on both platforms, MLX Swift on Mac |
| Production iOS app with custom LLM | Use Core ML with a converted model — MLX does not run on iOS |
| Fine-tuning or training on device | Use MLX Python — the Swift bindings focus on inference |

The honest trade-off: MLX gives you model freedom at the cost of app size, memory management complexity, and the responsibility of keeping model weights updated. Foundation Models gives you zero model management at the cost of zero model choice.

For most production iOS apps, Foundation Models is the right default. MLX shines in three scenarios: macOS developer tools, internal enterprise apps that cannot send data to the cloud, and research prototypes that need to iterate on different models quickly.

Summary

  • MLX is Apple Research’s open-source array framework for Apple Silicon, offering zero-copy CPU/GPU memory sharing and native Swift bindings.
  • The MLXLLM module provides a high-level API for loading, running, and streaming open-source LLMs like Llama 3, Mistral, and Phi-3.
  • Models from the mlx-community Hugging Face organization work out of the box with ModelConfiguration — just pass the repo ID.
  • Performance is memory-bandwidth bound. 4-bit quantized models offer the best speed-to-quality ratio on consumer Apple Silicon.
  • MLX is macOS-only and complements, rather than replaces, Foundation Models (for standard on-device AI) and Core ML (for task-specific classifiers).

If you are building an iOS app and want Apple’s managed on-device AI experience, start with Foundation Models. If your use case demands model flexibility on Mac hardware, MLX is the tool to reach for.