MLX on Apple Silicon: Running Open-Source Models Locally
Apple’s Foundation Models framework gives you a polished, on-device LLM for iOS 26 and macOS Tahoe — but it ships one model, you cannot swap it, and it does not run on older hardware. If you need to run Llama 3, Mistral, Phi-3, or a fine-tuned model of your own on any Apple Silicon Mac, MLX is the escape hatch.
This post covers what MLX is, how to load and run open-source LLMs from Swift using the mlx-swift bindings, how to
stream tokens, and where MLX fits alongside Core ML and Foundation Models. We will not cover model training or
fine-tuning — those deserve their own dedicated treatment.
Contents
- The Problem
- What Is MLX?
- Setting Up MLX Swift
- Loading and Running a Model
- Streaming Token Generation
- Advanced Usage
- Performance Considerations
- When to Use (and When Not To)
- Summary
The Problem
You are building a macOS tool that summarizes movie scripts. The on-device Foundation Models framework works great on macOS Tahoe, but your team has two requirements it cannot meet: first, the model must be a fine-tuned Llama 3 variant trained on Pixar screenplay conventions; second, the app needs to ship on macOS Ventura machines in the studio’s editing bays.
A server-side API would solve both problems, but the scripts are under NDA and cannot leave the local machine. You need local inference with a model you control.
Core ML can run arbitrary models, but converting a multi-billion-parameter LLM into the Core ML format is a project in itself — quantization, attention-head mapping, and token-by-token generation loops are all on you. Here is a rough sketch of the manual work involved:
import CoreML

// Simplified for clarity. InferenceError is a small error type you
// would define yourself.
enum InferenceError: Error {
    case missingLogits
}

func generateNextToken(using model: MLModel, inputIDs: MLMultiArray) throws -> Int {
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["input_ids": inputIDs]
    )
    let prediction = try model.prediction(from: input)
    guard let logits = prediction.featureValue(for: "logits")?.multiArrayValue else {
        throw InferenceError.missingLogits
    }

    // Manual argmax over the vocabulary dimension
    var maxIndex = 0
    var maxValue = logits[0].floatValue
    for i in 1..<logits.count {
        if logits[i].floatValue > maxValue {
            maxValue = logits[i].floatValue
            maxIndex = i
        }
    }
    return maxIndex
}
That is a single token. You still need a tokenizer, a sampling strategy, KV-cache management, and a generation loop. MLX collapses all of that into a handful of calls.
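To make "sampling strategy" concrete: after the model produces a logits vector, something has to turn it into a token choice. The sketch below is illustrative only, not MLX's implementation; it shows the temperature and top-p (nucleus) machinery that MLX handles for you, and that you will later tune through generation parameters.

```swift
import Foundation

// Illustrative sketch, not MLX's implementation: how temperature and
// top-p (nucleus) sampling narrow a logits vector to candidate tokens.
func softmax(_ logits: [Double]) -> [Double] {
    let maxLogit = logits.max() ?? 0
    // Subtract the max for numerical stability before exponentiating.
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

func topPCandidates(logits: [Double], temperature: Double, topP: Double) -> [Int] {
    // Lower temperature sharpens the distribution; higher flattens it.
    let probs = softmax(logits.map { $0 / temperature })
    // Keep the smallest set of tokens whose cumulative probability
    // reaches topP, scanning from most to least likely.
    let ranked = probs.indices.sorted { probs[$0] > probs[$1] }
    var cumulative = 0.0
    var kept: [Int] = []
    for index in ranked {
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= topP { break }
    }
    return kept
}
```

A real generation loop would then sample from the kept candidates and feed the chosen token back into the model, once per output token.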
What Is MLX?
MLX is an open-source array framework created by Apple’s machine learning research team. It is purpose-built for Apple Silicon and uses a unified memory architecture — the same physical memory is shared between CPU and GPU with zero-copy transfers. If you have used NumPy or PyTorch, the programming model will feel familiar: lazy evaluation, automatic differentiation, and composable function transforms.
The ecosystem splits into several packages:
- mlx — The core C++ library with Python bindings. Most community models target this.
- mlx-swift — Native Swift bindings published as a Swift package. This is what you use in Xcode.
- mlx-swift-examples — Reference apps from Apple Research showing LLM inference, image generation, and speech recognition.
- mlx-community — A Hugging Face organization hosting hundreds of models pre-converted to MLX’s weight format.
Note: MLX is a macOS-only framework. It requires Apple Silicon (M1 or later) and does not run on iOS, iPadOS, or Intel Macs. If you need on-device inference on iPhone, use Foundation Models or Core ML.
Setting Up MLX Swift
Add mlx-swift-examples as a Swift package dependency. This meta-package brings in the core MLX library, the MLXLLM
module for language model inference, and the MLXRandom module for sampling.
In Xcode, go to File > Add Package Dependencies and enter:
https://github.com/ml-explore/mlx-swift-examples
Pin to the latest stable release. Then add MLXLLM and MLX to your target’s frameworks.
Your Package.swift dependency block looks like this if you are building a Swift package instead:
dependencies: [
    .package(
        url: "https://github.com/ml-explore/mlx-swift-examples",
        from: "1.0.0"
    )
],
targets: [
    .executableTarget(
        name: "ScriptSummarizer",
        dependencies: [
            .product(name: "MLXLLM", package: "mlx-swift-examples"),
        ]
    )
]
Tip: The first build takes a few minutes because MLX compiles its Metal shaders. Subsequent builds are fast thanks to shader caching.
Loading and Running a Model
The MLXLLM module provides a ModelContainer that handles downloading, caching, and loading model weights. You point
it at a Hugging Face repository in MLX format and let it do the rest.
Here is a minimal inference pipeline that loads a quantized Llama model and generates a single completion:
import MLX
import MLXLLM
import MLXLMCommon

struct ScriptAnalyzer {
    let modelContainer: ModelContainer

    init() async throws {
        // Load a 4-bit quantized Llama 3 from mlx-community
        let configuration = ModelConfiguration(
            id: "mlx-community/Meta-Llama-3-8B-Instruct-4bit"
        )
        self.modelContainer = try await ModelContainer(
            configuration: configuration
        )
    }

    func summarize(script: String) async throws -> String {
        let prompt = """
        You are a Pixar story analyst. Summarize the following script \
        in three bullet points, focusing on character arcs:

        \(script)
        """

        let result = try await modelContainer.perform { context in
            let input = try await context.processor.prepare(input: .init(prompt: prompt))
            return try MLXLMCommon.generate(
                input: input,
                parameters: .init(temperature: 0.6, topP: 0.9),
                context: context
            )
        }
        return result.output
    }
}
A few things to notice:
- ModelConfiguration accepts a Hugging Face repo ID. The mlx-community organization hosts pre-quantized versions of most popular models. You can also pass a local file path if you bundle weights with your app.
- modelContainer.perform gives you a context with the loaded model and tokenizer. This closure is thread-safe — the container serializes access internally.
- GenerateParameters controls sampling. temperature governs randomness (lower is more deterministic) and topP limits the cumulative probability of candidate tokens.
Choosing a Model
Not every model on Hugging Face works out of the box. Look for repositories under
mlx-community — these have weights in the .safetensors format that MLX
expects, along with a config.json describing the architecture.
| Model | Parameters | Quantization | RAM Required | Use Case |
|---|---|---|---|---|
| Llama 3 8B Instruct 4-bit | 8B | 4-bit | ~6 GB | General instruction following |
| Mistral 7B Instruct 4-bit | 7B | 4-bit | ~5 GB | Fast, strong at structured output |
| Phi-3 Mini 4-bit | 3.8B | 4-bit | ~3 GB | Lightweight tasks, code generation |
| Llama 3 70B Instruct 4-bit | 70B | 4-bit | ~40 GB | Maximum quality, M2 Ultra+ only |
Warning: Model weights are large. A 4-bit 8B model is roughly 4-5 GB on disk. Plan your download and caching strategy carefully — ModelContainer caches to ~/Library/Caches by default, but you may want to manage this explicitly for shipping apps.
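If you do manage the cache yourself, a small FileManager walk is enough to report how much disk the cached weights occupy. The helper below is a hypothetical convenience for your own app code, not an MLX API:

```swift
import Foundation

// Hypothetical helper (not an MLX API): totals the size of all regular
// files under a directory, e.g. the model cache, so an app can surface
// or prune its disk usage.
func directorySizeBytes(at url: URL) -> Int {
    let fm = FileManager.default
    guard let enumerator = fm.enumerator(
        at: url,
        includingPropertiesForKeys: [.fileSizeKey]
    ) else { return 0 }
    var total = 0
    for case let file as URL in enumerator {
        // Directories report no fileSize; only regular files contribute.
        let size = (try? file.resourceValues(forKeys: [.fileSizeKey]))?.fileSize ?? 0
        total += size
    }
    return total
}
```

Point it at the cache directory to decide when to prompt the user before another multi-gigabyte download.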
Streaming Token Generation
Generating the entire response before returning it is fine for batch processing, but a user-facing app needs to stream
tokens as they are produced. The MLXLLM module supports this through an AsyncStream-based API.
import MLX
import MLXLLM
import MLXLMCommon

func streamSummary(
    container: ModelContainer,
    script: String
) -> AsyncStream<String> {
    AsyncStream { continuation in
        Task {
            let prompt = """
            Summarize this Pixar screenplay in one paragraph:

            \(script)
            """
            do {
                try await container.perform { context in
                    let input = try await context.processor.prepare(
                        input: .init(prompt: prompt)
                    )
                    // Generate tokens one at a time
                    try MLXLMCommon.generate(
                        input: input,
                        parameters: .init(temperature: 0.7),
                        context: context
                    ) { token in
                        continuation.yield(token)
                        return .more // Return .stop to halt generation early
                    }
                }
                continuation.finish()
            } catch {
                continuation.finish()
            }
        }
    }
}
The callback receives each decoded token string as it is generated. Return .more to continue or .stop to halt early
— useful for implementing a cancel button or a maximum-length guard.
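A maximum-length guard, for example, is just a counting closure. The sketch below is self-contained: GenerateDisposition is a stand-in enum for the framework's actual .more/.stop return type, so only the counting logic carries over.

```swift
// Self-contained sketch of a maximum-length guard for a streaming
// token callback. GenerateDisposition is a stand-in for the real
// .more/.stop type returned to the generator.
enum GenerateDisposition {
    case more, stop
}

// Returns a callback that allows at most maxTokens tokens through,
// then asks the generator to stop.
func makeTokenLimitGuard(maxTokens: Int) -> (String) -> GenerateDisposition {
    var count = 0
    return { _ in
        count += 1
        return count >= maxTokens ? .stop : .more
    }
}
```

Inside the real callback you would yield the token to your stream first, then return whatever the guard decides.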
On the SwiftUI side, consuming the stream is straightforward:
import SwiftUI

struct SummaryView: View {
    @State private var output = ""
    @State private var isGenerating = false
    let container: ModelContainer
    let sampleScript: String // the screenplay text to summarize

    var body: some View {
        ScrollView {
            Text(output)
                .font(.body)
                .padding()
                .frame(maxWidth: .infinity, alignment: .leading)
        }
        .task {
            isGenerating = true
            let stream = streamSummary(
                container: container,
                script: sampleScript
            )
            for await token in stream {
                output += token
            }
            isGenerating = false
        }
        .overlay {
            if isGenerating && output.isEmpty {
                ProgressView("Loading model...")
            }
        }
    }
}
Tip: The first call to modelContainer.perform triggers model loading into GPU memory. On an 8B model this takes 2-4 seconds on an M1 and under 1 second on an M3. Show a progress indicator during this window.
Advanced Usage
Custom Model Configurations
When a model is not hosted on mlx-community, or you have fine-tuned your own, you can point ModelConfiguration at a
local directory containing the weights and tokenizer:
let localConfig = ModelConfiguration(
    directory: URL(filePath: "/Users/woody/Models/pixar-script-llama-4bit")
)
let container = try await ModelContainer(configuration: localConfig)
The directory must contain config.json, tokenizer.json, tokenizer_config.json, and one or more .safetensors
weight files. If you are converting from Hugging Face format, the mlx-lm Python package provides a one-liner:
mlx_lm.convert --hf-path meta-llama/Meta-Llama-3-8B-Instruct -q --q-bits 4
This outputs an MLX-compatible directory with quantized weights ready for Swift consumption.
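Because a missing file only surfaces as a runtime load failure, it can be worth validating a local model directory before handing it to ModelConfiguration. This helper is a hypothetical convenience of our own, not part of MLX:

```swift
import Foundation

// Hypothetical pre-flight check (not an MLX API): reports which of the
// files an MLX model directory needs are absent, so you can fail fast
// with a clear message instead of a runtime load error.
func missingModelFiles(in directory: URL) -> [String] {
    let required = ["config.json", "tokenizer.json", "tokenizer_config.json"]
    let fm = FileManager.default
    var missing = required.filter { name in
        !fm.fileExists(atPath: directory.appendingPathComponent(name).path)
    }
    // At least one weights shard must also be present.
    let contents = (try? fm.contentsOfDirectory(atPath: directory.path)) ?? []
    if !contents.contains(where: { $0.hasSuffix(".safetensors") }) {
        missing.append("*.safetensors")
    }
    return missing
}
```

Run it once at startup and surface the missing names in your error message; an empty result means the directory at least has the expected shape.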
Memory Management
MLX uses a memory pool that grows as needed but does not automatically shrink. After running a large model, you may want to reclaim memory explicitly:
import MLX
// After generation is complete and the model is no longer needed
MLX.GPU.set(cacheLimit: 0)
This clears the GPU memory cache. The next inference call will reallocate as needed, so only do this when you are done with the model or switching to a different one.
Multi-Turn Conversations
For chat-style interactions, you accumulate messages and re-encode the full history on each turn. The tokenizer applies the model’s chat template automatically:
func chat(
    container: ModelContainer,
    history: [(role: String, content: String)]
) async throws -> String {
    let messages = history.map { message in
        ["role": message.role, "content": message.content]
    }

    let result = try await container.perform { context in
        let input = try await context.processor.prepare(
            input: .init(messages: messages)
        )
        return try MLXLMCommon.generate(
            input: input,
            parameters: .init(temperature: 0.7, topP: 0.9),
            context: context
        )
    }
    return result.output
}

// Usage
let history: [(role: String, content: String)] = [
    (role: "system", content: "You are a Pixar story consultant."),
    (role: "user", content: "What makes Woody a compelling protagonist?"),
]
let response = try await chat(container: container, history: history)
Warning: Each turn re-encodes the full conversation history. For long conversations, token counts grow quickly and will eventually exceed the model’s context window (typically 4,096 or 8,192 tokens for most 7-8B models). Implement a sliding window or summarization strategy for production use.
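One possible sliding-window strategy: drop the oldest non-system turns until an estimated token count fits a budget. The sketch below uses a rough 4-characters-per-token heuristic in place of the real tokenizer; the helper names and the heuristic are illustrative, not from MLX.

```swift
// Sketch of a sliding-window history trimmer. The 4-chars-per-token
// estimate is a rough heuristic, not the model's actual tokenizer.
typealias Message = (role: String, content: String)

func estimatedTokens(_ text: String) -> Int {
    max(1, text.count / 4)
}

// Drops the oldest non-system messages until the estimated total
// fits within the token budget. The system prompt is always kept.
func trimmedHistory(_ history: [Message], budget: Int) -> [Message] {
    var kept = history
    func total(_ messages: [Message]) -> Int {
        messages.reduce(0) { $0 + estimatedTokens($1.content) }
    }
    while total(kept) > budget,
          let dropIndex = kept.firstIndex(where: { $0.role != "system" }) {
        kept.remove(at: dropIndex)
    }
    return kept
}
```

A production version would count tokens with the model's own tokenizer and reserve headroom in the budget for the response; summarizing dropped turns into the system prompt is the usual next refinement.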
Performance Considerations
MLX’s unified memory architecture means there is no CPU-to-GPU transfer overhead — a significant advantage over CUDA-based pipelines on discrete GPUs. But performance still varies dramatically based on hardware, model size, and quantization level.
Rough token generation rates on common hardware (4-bit Llama 3 8B, greedy decoding):
| Hardware | Prompt Processing | Token Generation | Peak Memory |
|---|---|---|---|
| M1 (8 GB) | ~35 tokens/s | ~10 tokens/s | ~6 GB |
| M1 Pro (16 GB) | ~55 tokens/s | ~15 tokens/s | ~6 GB |
| M2 Max (32 GB) | ~90 tokens/s | ~25 tokens/s | ~6 GB |
| M3 Max (48 GB) | ~120 tokens/s | ~35 tokens/s | ~6 GB |
| M4 Max (64 GB) | ~150 tokens/s | ~40 tokens/s | ~6 GB |
A few things affect throughput:
- Quantization level — 4-bit models are roughly twice as fast as 8-bit and four times as fast as unquantized float16. Quality degradation at 4-bit is minimal for most instruction-following tasks.
- Memory bandwidth — LLM inference is memory-bandwidth bound, not compute bound. The M3 Max’s 400 GB/s bandwidth is why it generates tokens 3-4x faster than the base M1’s 68.25 GB/s.
- Context length — Prompt processing scales linearly with input length. A 2,000-token prompt takes roughly twice as long as a 1,000-token prompt.
- KV-cache growth — The key-value cache for attention grows with each generated token. On an 8 GB machine, you may hit memory pressure after ~2,000 generated tokens with an 8B model.
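The bandwidth-bound claim can be sanity-checked with back-of-the-envelope arithmetic: every weight must be read once per generated token, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The function name and the ~4.5 GB weight estimate below are our own illustration.

```swift
// Back-of-the-envelope ceiling on generation speed: because each token
// requires reading every weight once, tokens/s is bounded by roughly
// memory bandwidth / model size in bytes.
func tokensPerSecondCeiling(bandwidthGBps: Double, modelGB: Double) -> Double {
    bandwidthGBps / modelGB
}

// A 4-bit 8B model is roughly 4.5 GB of weights. On a base M1
// (68.25 GB/s), the ceiling is about 15 tokens/s; measured throughput
// (~10 tokens/s in the table) sits sensibly below it.
let m1Ceiling = tokensPerSecondCeiling(bandwidthGBps: 68.25, modelGB: 4.5)
```

The same arithmetic explains the table's scaling: an M3 Max with several times the base M1's bandwidth lands at a proportionally higher rate.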
Use the MLX.GPU.snapshot() API to profile memory usage during inference:
import MLX

let before = MLX.GPU.snapshot()

let result = try await container.perform { context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(
        input: input,
        parameters: .init(temperature: 0.7),
        context: context
    )
}

let after = MLX.GPU.snapshot()
print("Peak memory: \(after.peakMemory / 1_048_576) MB")
print("Active memory: \(after.activeMemory / 1_048_576) MB")
print("Cache memory: \(after.cacheMemory / 1_048_576) MB")
print("Growth during inference: \((after.activeMemory - before.activeMemory) / 1_048_576) MB")
Apple Docs: Metal Performance Shaders — MLX compiles custom Metal kernels for matrix operations, building on the same GPU compute infrastructure that Metal Performance Shaders uses.
When to Use (and When Not To)
MLX occupies a specific niche. Here is a decision framework for choosing between the three on-device ML options on Apple platforms:
| Scenario | Recommendation |
|---|---|
| iOS/iPadOS app, Apple Intelligence available | Use Foundation Models — native API, no model management |
| macOS Tahoe app, standard LLM tasks | Use Foundation Models — same rationale, plus system-level optimizations |
| macOS app needing a specific open-source model | Use MLX — full model choice, quantization control |
| macOS app targeting pre-Tahoe (Ventura/Sonoma) | Use MLX — Foundation Models requires macOS Tahoe |
| App needing Vision or NLP classifiers | Use Core ML — purpose-built for task-specific models |
| Cross-platform model deployment (Apple + Linux) | Use MLX Python on both platforms, MLX Swift on Mac |
| Production iOS app with custom LLM | Use Core ML with a converted model — MLX does not run on iOS |
| Fine-tuning or training on device | Use MLX Python — the Swift bindings focus on inference |
The honest trade-off: MLX gives you model freedom at the cost of app size, memory management complexity, and the responsibility of keeping model weights updated. Foundation Models gives you zero model management at the cost of zero model choice.
For most production iOS apps, Foundation Models is the right default. MLX shines in three scenarios: macOS developer tools, internal enterprise apps that cannot send data to the cloud, and research prototypes that need to iterate on different models quickly.
Summary
- MLX is Apple Research’s open-source array framework for Apple Silicon, offering zero-copy CPU/GPU memory sharing and native Swift bindings.
- The MLXLLM module provides a high-level API for loading, running, and streaming open-source LLMs like Llama 3, Mistral, and Phi-3.
- Models from the mlx-community Hugging Face organization work out of the box with ModelConfiguration — just pass the repo ID.
- Performance is memory-bandwidth bound. 4-bit quantized models offer the best speed-to-quality ratio on consumer Apple Silicon.
- MLX is macOS-only and complements, rather than replaces, Foundation Models (for standard on-device AI) and Core ML (for task-specific classifiers).
If you are building an iOS app and want Apple’s managed on-device AI experience, start with Foundation Models. If your use case demands model flexibility on Mac hardware, MLX is the tool to reach for.