Apple's Foundation Models Framework: Running AI On-Device in Your iOS App
Every time you send user-entered text to a cloud LLM, that text leaves the device, gets logged on a remote server, travels over a network connection you don’t control, and costs you fractions of a cent per token that add up fast at scale. Apple’s Foundation Models framework eliminates all four of those concerns at once — language model inference runs locally on the device, text never leaves, it works completely offline, and there’s no per-inference cost.
This post is a comprehensive guide to integrating the Foundation Models framework into a production iOS app. You’ll
learn how to check availability, generate text, use structured output with @Generable schemas, stream responses
token-by-token, and handle the error cases that matter. We won’t cover Core ML custom model files or third-party LLM
integrations — those have their own dedicated posts.
Note: Foundation Models requires iOS 26+ and macOS 26+. All code in this post requires
@available(iOS 26, *) annotations or an appropriate deployment target.
Contents
- The Problem: Cloud LLM Integration
- The Foundation Models Framework
- Basic Text Generation
- Managing Conversation Context
- Guided Generation with @Generable
- Streaming Responses
- Advanced Usage
- Performance Considerations
- When to Use (and When Not To)
- Summary
The Problem: Cloud LLM Integration
Before Foundation Models, adding LLM-powered features to an iOS app meant calling a cloud API. Here’s what that code typically looks like:
// Calling an OpenAI-style cloud API — the traditional approach
struct CloudLLMClient {
    private let apiKey: String
    private let session = URLSession.shared

    func generateFilmSynopsis(title: String) async throws -> String {
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")

        let body: [String: Any] = [
            "model": "gpt-4o",
            "messages": [
                ["role": "user", "content": "Write a synopsis for a Pixar film called '\(title)'"]
            ]
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)

        let (data, response) = try await session.data(for: request)
        guard let httpResponse = response as? HTTPURLResponse,
              httpResponse.statusCode == 200 else {
            throw LLMError.requestFailed
        }

        // Parse the response — error-prone and format-dependent
        let json = try JSONDecoder().decode(OpenAIResponse.self, from: data)
        return json.choices.first?.message.content ?? ""
    }
}
This approach has four problems that compound each other in production:
- Network dependency. No connection means no feature. A user on a plane, in a tunnel, or on a congested network gets a broken experience.
- Latency. A round-trip to a remote server adds hundreds of milliseconds at minimum, often over a second, before the first token arrives.
- Cost. At scale, per-token pricing adds up. A feature that works in testing at negligible cost can become a meaningful line item with real users.
- Privacy. The user typed text into your app. It now lives in a server log somewhere. For healthcare, finance, journaling, or any sensitive context, this is a serious liability.
The Foundation Models Framework
Apple Docs:
FoundationModels — Apple Developer Documentation
Foundation Models provides access to Apple’s on-device language model — the same model that powers Writing Tools and Siri’s intelligence features. Import the framework and check availability before attempting any inference:
import FoundationModels

@available(iOS 26, *)
func checkModelAvailability() {
    switch SystemLanguageModel.default.availability {
    case .available:
        print("Foundation Models ready")
    case .unavailable(let reason):
        switch reason {
        case .deviceNotEligible:
            print("Device does not support on-device LLM")
        case .appleIntelligenceNotEnabled:
            print("User has not enabled Apple Intelligence")
        case .modelNotReady:
            print("Model is downloading or warming up")
        @unknown default:
            print("Unknown availability issue")
        }
    }
}
SystemLanguageModel.default is the entry point to Apple’s on-device model. It’s not a model you choose or configure —
it’s the system model provisioned on that device. The availability property gives you a clear enum to act on:
gracefully degrade your feature rather than showing a blank UI when the model isn’t ready.
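As a sketch of that graceful degradation, a SwiftUI view can branch on availability before offering the feature at all. The view and message strings here are illustrative, not part of the framework:

```swift
import SwiftUI
import FoundationModels

// Hypothetical feature view that gates AI functionality on model availability
@available(iOS 26, *)
struct SynopsisFeatureView: View {
    private let model = SystemLanguageModel.default

    var body: some View {
        switch model.availability {
        case .available:
            Text("Feature UI goes here") // replace with the real feature view
        case .unavailable(.appleIntelligenceNotEnabled):
            Text("Enable Apple Intelligence in Settings to use AI features.")
        case .unavailable(.modelNotReady):
            ProgressView("Preparing the on-device model…")
        case .unavailable:
            EmptyView() // device ineligible or unknown reason: hide the feature
        }
    }
}
```

Matching on specific unavailability reasons lets you show actionable guidance (open Settings, wait for a download) instead of a generic failure state.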
Basic Text Generation
The minimum viable integration is three lines: create a session, call respond(to:), read the result.
import FoundationModels

@available(iOS 26, *)
func generateFilmSynopsis(title: String) async throws -> String {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Write a one-paragraph synopsis for a Pixar film called '\(title)'"
    )
    return response.content
}
LanguageModelSession is the workhorse of the framework — it manages model access, maintains conversation history, and
handles the underlying inference pipeline.
Apple Docs:
LanguageModelSession — FoundationModels
Calling respond(to:) is an async throws operation. It suspends the current task until inference completes and
returns a LanguageModelSession.Response whose content property holds the generated text. The call throws
LanguageModelSession.GenerationError for context overflow, guardrail violations, and other runtime failures — we'll
cover those in Advanced Usage.
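Beyond the bare prompt, the framework exposes a GenerationOptions type for tuning inference. A sketch, assuming the parameter names below match the iOS 26 API (worth verifying against the current FoundationModels documentation):

```swift
import FoundationModels

// Sketch: tuning generation with GenerationOptions. Parameter names are
// per my reading of the iOS 26 API — verify against current docs.
@available(iOS 26, *)
func generateTagline(for film: String) async throws -> String {
    let session = LanguageModelSession()
    let options = GenerationOptions(
        temperature: 0.7,           // lower values produce more deterministic output
        maximumResponseTokens: 60   // hard cap on response length
    )
    let response = try await session.respond(
        to: "Write a punchy tagline for the Pixar film '\(film)'",
        options: options
    )
    return response.content
}
```

Capping response tokens is also a cheap way to keep long-running sessions from filling the context window prematurely.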
Managing Conversation Context
A LanguageModelSession remembers every exchange. Each call to respond(to:) appends both the user prompt and the
model’s response to an internal transcript. This means multi-turn conversation works naturally:
@available(iOS 26, *)
final class FilmScriptAssistant {
    private let session: LanguageModelSession

    init() {
        // System prompt establishes the model's role and constraints for
        // the entire session lifetime
        self.session = LanguageModelSession(
            instructions: """
            You are a creative assistant for Pixar filmmakers.
            Help with brainstorming, plot analysis, and character development.
            Keep responses focused and concise — under 150 words unless asked for more.
            Do not invent real Pixar employee names.
            """
        )
    }

    func ask(_ question: String) async throws -> String {
        let response = try await session.respond(to: question)
        return response.content
    }
}
// Usage — each call builds on the previous
let assistant = FilmScriptAssistant()
let setup = try await assistant.ask("What are the core themes of 'Up'?")
let followUp = try await assistant.ask("How could a sequel explore those themes differently?")
// The model knows the context of the first question when answering the second
The instructions parameter sets a persistent system prompt — a set of constraints and persona instructions that apply
to every response in the session. This is your primary tool for shaping consistent output across a multi-turn
interaction.
One important implication: every call to respond(to:) on the same session includes the full conversation history in
the model's context window. Long sessions accumulate tokens. When the session's context fills up, the framework throws
LanguageModelSession.GenerationError.exceededContextWindowSize. Design for this — either create a fresh session when
appropriate, or summarize old context before it overflows.
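A reactive version of that recovery can be sketched as follows: catch the overflow error and retry with a fresh session. The function and error-case names reflect my reading of the iOS 26 SDK and should be checked against the current headers:

```swift
import FoundationModels

// Sketch: recover from context overflow by restarting with a fresh session.
// Prior turns are lost, so callers should re-provide anything essential.
@available(iOS 26, *)
func askWithRecovery(
    session: LanguageModelSession,
    prompt: String
) async throws -> (reply: String, session: LanguageModelSession) {
    do {
        return (try await session.respond(to: prompt).content, session)
    } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
        // The transcript overflowed — start over and re-send the prompt
        let fresh = LanguageModelSession()
        return (try await fresh.respond(to: prompt).content, fresh)
    }
}
```

Returning the (possibly replaced) session lets the caller keep using whichever session answered, so subsequent turns continue from a valid transcript.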
Guided Generation with @Generable
Free-text responses are useful but hard to parse. If your app needs structured output — a rating, a list of keywords, a
set of fields — you can constrain the model to produce a specific type using the @Generable macro.
Apple Docs:
Generable — FoundationModels
import FoundationModels

@available(iOS 26, *)
@Generable
struct FilmScriptAnalysis {
    @Guide(description: "Overall quality rating from 1 to 10")
    var rating: Int

    @Guide(description: "One sentence summarizing the script's central conflict")
    var centralConflict: String

    @Guide(description: "List of the main themes explored in the script")
    var themes: [String]

    @Guide(description: "Whether the script is appropriate for a family audience")
    var isFamilyFriendly: Bool
}

@available(iOS 26, *)
func analyzeScript(_ scriptExcerpt: String) async throws -> FilmScriptAnalysis {
    let session = LanguageModelSession(
        instructions: "You are a professional script analyst for an animation studio."
    )
    let response = try await session.respond(
        to: "Analyze this script excerpt:\n\n\(scriptExcerpt)",
        generating: FilmScriptAnalysis.self
    )
    // respond(to:generating:) returns a Response wrapper; the typed value
    // lives in its content property
    return response.content
}
The @Generable macro generates a GenerationSchema for the type, which the framework uses to constrain the model's
output. The @Guide attribute on each property provides a natural-language description that steers the model toward
appropriate values for that field. The respond(to:generating:) overload returns a response whose content is already
the concrete type — no JSON parsing, no decoding, no format guessing.
This is the pattern to reach for whenever your downstream code needs to act on the response programmatically. Parsing
free-text responses with string manipulation is fragile; @Generable is not.
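Guides can go beyond descriptions. The framework also ships programmatic generation guides — range constraints on numbers, count constraints on arrays — which a schema might use like this. The combined description-plus-guide form and the .range/.count spellings reflect my reading of the API; treat them as something to verify:

```swift
import FoundationModels

// Sketch: programmatic guides that constrain values rather than just
// describe them. Verify .range and .count against the @Guide docs.
@available(iOS 26, *)
@Generable
struct SequelPitch {
    @Guide(description: "Working title for the sequel")
    var title: String

    @Guide(description: "Audience appeal score", .range(1...10))
    var appealScore: Int

    @Guide(description: "Exactly three plot beats", .count(3))
    var plotBeats: [String]
}
```

Hard constraints are enforced during decoding, so a value like appealScore can never come back outside the stated range — a stronger guarantee than asking nicely in the prompt.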
Streaming Responses
For UI that shows generated text in real time — like a typing animation or progressive disclosure — use
streamResponse(to:) instead of respond(to:):
import FoundationModels
import Observation

@available(iOS 26, *)
@Observable
final class ScriptSuggestionViewModel {
    var streamedText: String = ""
    var isGenerating: Bool = false

    private let session = LanguageModelSession(
        instructions: "You are a Pixar story consultant. Suggest creative plot twists."
    )

    func suggestPlotTwists(for film: String) async {
        streamedText = ""
        isGenerating = true
        defer { isGenerating = false }

        do {
            let stream = session.streamResponse(
                to: "Suggest three plot twists for a sequel to '\(film)'"
            )
            for try await partialResponse in stream {
                // partialResponse.content holds the full accumulated text so far
                streamedText = partialResponse.content
            }
        } catch {
            streamedText = "Unable to generate suggestions."
        }
    }
}
Apple Docs:
streamResponse(to:) — FoundationModels
streamResponse(to:) returns an AsyncSequence. Each element in the sequence is a partial response whose content
reflects everything the model has generated so far — not just the new token, but the full accumulated string. Assigning
partialResponse.content directly to your @Observable property is safe and efficient.
Combining @Observable with streamResponse gives you reactive UI updates with no additional plumbing. SwiftUI
automatically re-renders any view that reads streamedText as each new partial response arrives.
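Streaming also composes with guided generation: streamResponse(to:generating:) yields partially generated values whose properties fill in as tokens arrive, each optional until the model has produced it. A sketch — the exact shape of the macro-generated PartiallyGenerated type is worth confirming against the @Generable docs:

```swift
import FoundationModels

@available(iOS 26, *)
@Generable
struct TwistIdea {
    @Guide(description: "A short title for the plot twist")
    var title: String
    @Guide(description: "One paragraph describing the twist")
    var summary: String
}

// Sketch: each property of the partial value is optional until generated
@available(iOS 26, *)
func streamTwist(for film: String) async throws {
    let session = LanguageModelSession()
    let stream = session.streamResponse(
        to: "Pitch one plot twist for a sequel to '\(film)'",
        generating: TwistIdea.self
    )
    for try await partial in stream {
        // Fields arrive in declaration order; show placeholders until then
        print(partial.content.title ?? "…", partial.content.summary ?? "")
    }
}
```

This lets a form-style UI populate field by field instead of waiting for the whole structured value.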
Advanced Usage
Error Handling
LanguageModelSession throws LanguageModelSession.GenerationError in several scenarios; three of them matter most in practice:
@available(iOS 26, *)
func handleModelErrors() async {
    let session = LanguageModelSession()
    do {
        let response = try await session.respond(
            to: "Suggest plot twists for a Pixar sequel"
        )
        print(response.content)
    } catch let error as LanguageModelSession.GenerationError {
        switch error {
        case .assetsUnavailable:
            // Apple Intelligence disabled, device ineligible, or model not downloaded
            // Show a UI prompt to enable Apple Intelligence in Settings
            showAppleIntelligencePrompt()
        case .exceededContextWindowSize:
            // The session transcript is too long — create a fresh session or
            // summarize previous exchanges before continuing
            await recoverFromContextOverflow()
        case .guardrailViolation:
            // The prompt or response triggered Apple's content policy
            // Log for review but don't expose details to the user
            logGuardrailEvent()
        default:
            // Remaining cases (e.g. rate limiting) and ones added in future OS versions
            logUnknownError(error)
        }
    } catch {
        // Non-GenerationError failures (e.g., task cancellation)
        handleGenericError(error)
    }
}
The guardrailViolation case deserves particular attention. Apple’s on-device model enforces content policies. Your app
should handle this gracefully — show a neutral error message, log the event for review, and avoid retrying the same
prompt unchanged.
System Prompt Engineering
The instructions parameter is your primary lever for shaping consistent model behavior across a session. Effective
system prompts for on-device models share a few characteristics:
@available(iOS 26, *)
func createProductionSession() -> LanguageModelSession {
    LanguageModelSession(
        instructions: """
        You are a film recommendation assistant embedded in the "Pixar Picks" app.

        Rules:
        - Only recommend films from Pixar Animation Studios
        - Always respond in the language the user writes in
        - Keep responses under 80 words
        - If asked about anything unrelated to Pixar films, politely redirect
        - Never make up film titles or release dates

        Format for recommendations:
        Title: [film title]
        Why: [one sentence reason]
        """
    )
}
On-device models are smaller and more instruction-sensitive than large cloud models. Explicit, short rules in the system
prompt outperform vague persona descriptions. Defining an output format directly in the instructions — or better, using
@Generable — removes the model’s latitude to invent formats.
Context Window Management
When your session grows long, you have two recovery strategies. The simple approach creates a new session and re-provides any essential context. The sophisticated approach summarizes the existing session before it overflows:
@available(iOS 26, *)
actor FilmConsultationSession {
    private var session: LanguageModelSession
    private var exchangeCount: Int = 0
    private let maxExchanges = 20 // Tune based on your prompt sizes

    init() {
        self.session = LanguageModelSession(
            instructions: "You are a Pixar story consultant."
        )
    }

    func respond(to prompt: String) async throws -> String {
        if exchangeCount >= maxExchanges {
            resetSession()
        }
        let response = try await session.respond(to: prompt)
        exchangeCount += 1
        return response.content
    }

    private func resetSession() {
        session = LanguageModelSession(
            instructions: "You are a Pixar story consultant."
        )
        exchangeCount = 0
    }
}
Wrapping the session in an actor ensures that exchangeCount and session resets are accessed safely from concurrent
callers. This pattern is appropriate for any long-running conversation feature.
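The summarization strategy mentioned earlier can be sketched the same way: before resetting, ask the current session to compress the conversation while it still has the context, then seed the replacement session with that summary. The threshold and prompt wording below are illustrative:

```swift
import FoundationModels

// Sketch: carry a compressed summary into a fresh session instead of
// dropping context entirely. Threshold and prompts are illustrative.
@available(iOS 26, *)
actor SummarizingConsultation {
    private var session = LanguageModelSession(
        instructions: "You are a Pixar story consultant."
    )
    private var exchangeCount = 0

    func respond(to prompt: String) async throws -> String {
        // Roll over well before the hard limit so the summary request fits
        if exchangeCount >= 15 {
            try await rollOver()
        }
        let response = try await session.respond(to: prompt)
        exchangeCount += 1
        return response.content
    }

    private func rollOver() async throws {
        // Ask the old session for a summary while it still holds the context
        let summary = try await session.respond(
            to: "Summarize our conversation so far in under 100 words."
        )
        session = LanguageModelSession(
            instructions: """
            You are a Pixar story consultant.
            Summary of the conversation so far: \(summary.content)
            """
        )
        exchangeCount = 0
    }
}
```

The trade-off is one extra inference per rollover in exchange for continuity the simple reset cannot provide.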
Performance Considerations
Foundation Models inference runs on the Apple Neural Engine, not the CPU or GPU. The practical implication: inference is fast and thermally efficient compared to a naive CPU implementation, but the Neural Engine is a shared resource.
Monitor thermal state before submitting long inference tasks:
import Foundation
import FoundationModels

@available(iOS 26, *)
func generateWithThermalCheck(_ prompt: String) async throws -> String? {
    let thermalState = ProcessInfo.processInfo.thermalState

    // Defer expensive inference when the device is hot
    guard thermalState != .critical && thermalState != .serious else {
        return nil // Signal to the caller to try again later
    }

    let session = LanguageModelSession()
    let response = try await session.respond(to: prompt)
    return response.content
}
Apple Docs:
ProcessInfo.thermalState — Foundation
For streaming use cases, the model returns the first token within a few hundred milliseconds on recent hardware. This is fast enough for interactive UI — the perceived latency is much lower than a cloud round-trip because the response begins immediately.
Avoid creating a new LanguageModelSession on every inference call. Session initialization has overhead. For features
that make many inference calls, maintain a session-level instance rather than creating one per request.
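One way to follow that advice: hold a single session for the feature's lifetime and warm it up before the first request. The framework provides a prewarm() method on LanguageModelSession for this purpose — confirm its exact signature in the docs:

```swift
import FoundationModels

// Sketch: one long-lived session per feature, warmed before first use.
// Service and method names here are illustrative.
@available(iOS 26, *)
final class SynopsisService {
    private let session = LanguageModelSession(
        instructions: "You write one-paragraph film synopses."
    )

    // Call early (e.g. when the feature's screen appears) so the model
    // is loaded before the user's first request arrives
    func warmUp() {
        session.prewarm()
    }

    func synopsis(for title: String) async throws -> String {
        try await session.respond(to: "Write a synopsis for '\(title)'").content
    }
}
```

Prewarming shifts model-load latency off the critical path, which matters most for the very first inference after app launch.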
When to Use (and When Not To)
| Scenario | Recommendation |
|---|---|
| User data is sensitive (health, journal, finance) | Use Foundation Models — data never leaves the device |
| Feature must work offline | Use Foundation Models — no network dependency |
| Output needs to be a typed Swift value | Use Foundation Models with @Generable |
| Response quality must match GPT-4 / Claude level | Use cloud API — on-device models are smaller |
| Task requires knowledge of events after the model's training cutoff | Use cloud API — on-device model has a fixed training cutoff |
| Very long document summarization (100k+ tokens) | Use cloud API — on-device context window is limited |
| Low-volume prototyping with flexible budget | Either works; on-device removes auth/key management overhead |
| High-volume user-generated content classification | Use Core ML with a fine-tuned classifier — faster and more reliable for constrained tasks |
The most important dimension is output quality vs. privacy. On-device models are capable within their scope — short to medium text, structured extraction, classification, summarization of reasonably sized inputs. They are not the right tool for tasks that demand frontier model capability or massive context windows.
Summary
- Foundation Models brings on-device LLM inference to iOS 26+. Text never leaves the device, inference works offline, and there is no per-token cost.
- Always check SystemLanguageModel.default.availability before attempting inference and provide a graceful fallback when the model is unavailable.
- LanguageModelSession maintains conversation history automatically — each respond(to:) call builds on the session transcript.
- Use @Generable with @Guide annotations when your code needs to act on the model's output programmatically. Structured output is more reliable than parsing free text.
- streamResponse(to:) returns an AsyncSequence of partial responses that integrates naturally with @Observable for progressive SwiftUI updates.
- Design for context window limits — monitor session length and reset or summarize when approaching the limit.
- Monitor ProcessInfo.thermalState before submitting expensive inference tasks on the Neural Engine.
With Foundation Models handling text generation, the natural next step is incorporating structured classification and image understanding — both of which Core ML handles with a different but complementary set of APIs. See Integrating Core ML Models in SwiftUI for that deep dive, and Designing Prompts for On-Device AI for the prompt engineering patterns that make Foundation Models integrations production-grade.