Apple's Foundation Models Framework: Running AI On-Device in Your iOS App


Every time you send user-entered text to a cloud LLM, that text leaves the device, gets logged on a remote server, travels over a network connection you don’t control, and costs you fractions of a cent per token that add up fast at scale. Apple’s Foundation Models framework eliminates all four of those concerns at once — language model inference runs locally on the device, text never leaves, it works completely offline, and there’s no per-inference cost.

This post is a comprehensive guide to integrating the Foundation Models framework into a production iOS app. You’ll learn how to check availability, generate text, use structured output with @Generable schemas, stream responses token-by-token, and handle the error cases that matter. We won’t cover Core ML custom model files or third-party LLM integrations — those have their own dedicated posts.

Note: Foundation Models requires iOS 26+ and macOS 26+. All code in this post requires @available(iOS 26, *) annotations or an appropriate deployment target.

The Problem: Cloud LLM Integration

Before Foundation Models, adding LLM-powered features to an iOS app meant calling a cloud API. Here’s what that code typically looks like:

// Calling an OpenAI-style cloud API — the traditional approach
import Foundation

enum LLMError: Error {
    case requestFailed
}

// Minimal decoding model for the chat completions response shape
struct OpenAIResponse: Decodable {
    struct Choice: Decodable {
        struct Message: Decodable {
            let content: String
        }
        let message: Message
    }
    let choices: [Choice]
}

struct CloudLLMClient {
    private let apiKey: String
    private let session = URLSession.shared

    func generateFilmSynopsis(title: String) async throws -> String {
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")

        let body: [String: Any] = [
            "model": "gpt-4o",
            "messages": [
                ["role": "user", "content": "Write a synopsis for a Pixar film called '\(title)'"]
            ]
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)

        let (data, response) = try await session.data(for: request)

        guard let httpResponse = response as? HTTPURLResponse,
              httpResponse.statusCode == 200 else {
            throw LLMError.requestFailed
        }

        // Parse the response — error-prone and format-dependent
        let json = try JSONDecoder().decode(OpenAIResponse.self, from: data)
        return json.choices.first?.message.content ?? ""
    }
}

This approach has four problems that compound each other in production:

  1. Network dependency. No connection means no feature. A user on a plane, in a tunnel, or on a congested network gets a broken experience.
  2. Latency. A round-trip to a remote server adds hundreds of milliseconds at minimum, often over a second, before the first token arrives.
  3. Cost. At scale, per-token pricing adds up. A feature that works in testing at negligible cost can become a meaningful line item with real users.
  4. Privacy. The user typed text into your app. It now lives in a server log somewhere. For healthcare, finance, journaling, or any sensitive context, this is a serious liability.

The Foundation Models Framework

Apple Docs: FoundationModels — Apple Developer Documentation

Foundation Models provides access to Apple’s on-device language model — the same model that powers Writing Tools and Siri’s intelligence features. Import the framework and check availability before attempting any inference:

import FoundationModels

@available(iOS 26, *)
func checkModelAvailability() {
    switch SystemLanguageModel.default.availability {
    case .available:
        print("Foundation Models ready")
    case .unavailable(let reason):
        switch reason {
        case .deviceNotEligible:
            print("Device does not support on-device LLM")
        case .appleIntelligenceNotEnabled:
            print("User has not enabled Apple Intelligence")
        case .modelNotReady:
            print("Model is downloading or warming up")
        @unknown default:
            print("Unknown availability issue")
        }
    }
}

SystemLanguageModel.default is the entry point to Apple’s on-device model. It’s not a model you choose or configure — it’s the system model provisioned on that device. The availability property gives you a clear enum to act on: gracefully degrade your feature rather than showing a blank UI when the model isn’t ready.
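As a sketch of that graceful degradation in SwiftUI — the view names here are illustrative, not part of the framework — a feature view can switch on availability and show the right fallback for each reason:

```swift
import SwiftUI
import FoundationModels

// Hypothetical feature UI shown when the model is available
@available(iOS 26, *)
struct SynopsisGeneratorView: View {
    var body: some View { Text("Feature UI goes here") }
}

// Gate the LLM-powered feature on model availability
@available(iOS 26, *)
struct SynopsisFeatureView: View {
    private let model = SystemLanguageModel.default

    var body: some View {
        switch model.availability {
        case .available:
            SynopsisGeneratorView()
        case .unavailable(.appleIntelligenceNotEnabled):
            Text("Enable Apple Intelligence in Settings to use this feature.")
        case .unavailable(.modelNotReady):
            Text("The on-device model is still getting ready. Try again shortly.")
        case .unavailable(_):
            // Device not eligible or an unknown reason — hide the feature entirely
            EmptyView()
        }
    }
}
```

Switching on the specific unavailable reason lets you distinguish "the user can fix this in Settings" from "this device will never support the feature."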

Basic Text Generation

The minimum viable integration is three lines: create a session, call respond(to:), read the result.

import FoundationModels

@available(iOS 26, *)
func generateFilmSynopsis(title: String) async throws -> String {
    let session = LanguageModelSession()

    let response = try await session.respond(
        to: "Write a one-paragraph synopsis for a Pixar film called '\(title)'"
    )
    return response.content
}

LanguageModelSession is the workhorse of the framework — it manages model access, maintains conversation history, and handles the underlying inference pipeline.

Apple Docs: LanguageModelSession — FoundationModels

Calling respond(to:) is an async throws operation. It suspends the current task until inference completes and returns a LanguageModelSession.Response whose content property holds the generated text. The call throws LanguageModelSession.GenerationError for availability problems, context overflow, and other runtime failures — we’ll cover those in Advanced Usage.

Managing Conversation Context

A LanguageModelSession remembers every exchange. Each call to respond(to:) appends both the user prompt and the model’s response to an internal transcript. This means multi-turn conversation works naturally:

@available(iOS 26, *)
final class FilmScriptAssistant {
    private let session: LanguageModelSession

    init() {
        // System prompt establishes the model's role and constraints for
        // the entire session lifetime
        self.session = LanguageModelSession(
            instructions: """
            You are a creative assistant for Pixar filmmakers.
            Help with brainstorming, plot analysis, and character development.
            Keep responses focused and concise — under 150 words unless asked for more.
            Do not invent real Pixar employee names.
            """
        )
    }

    func ask(_ question: String) async throws -> String {
        let response = try await session.respond(to: question)
        return response.content
    }
}

// Usage — each call builds on the previous
let assistant = FilmScriptAssistant()
let setup = try await assistant.ask("What are the core themes of 'Up'?")
let followUp = try await assistant.ask("How could a sequel explore those themes differently?")
// The model knows the context of the first question when answering the second

The instructions parameter sets a persistent system prompt — a set of constraints and persona instructions that apply to every response in the session. This is your primary tool for shaping consistent output across a multi-turn interaction.

One important implication: every call to respond(to:) on the same session includes the full conversation history in the model’s context window. Long sessions accumulate tokens. When the session’s context fills up, the framework throws LanguageModelSession.GenerationError.exceededContextWindowSize. Design for this — either create a fresh session when appropriate, or summarize old context before it overflows.

Guided Generation with @Generable

Free-text responses are useful but hard to parse. If your app needs structured output — a rating, a list of keywords, a set of fields — you can constrain the model to produce a specific type using the @Generable macro.

Apple Docs: Generable — FoundationModels

import FoundationModels

@available(iOS 26, *)
@Generable
struct FilmScriptAnalysis {
    @Guide(description: "Overall quality rating from 1 to 10")
    var rating: Int

    @Guide(description: "One sentence summarizing the script's central conflict")
    var centralConflict: String

    @Guide(description: "List of the main themes explored in the script")
    var themes: [String]

    @Guide(description: "Whether the script is appropriate for a family audience")
    var isFamilyFriendly: Bool
}

@available(iOS 26, *)
func analyzeScript(_ scriptExcerpt: String) async throws -> FilmScriptAnalysis {
    let session = LanguageModelSession(
        instructions: "You are a professional script analyst for an animation studio."
    )

    let response = try await session.respond(
        to: "Analyze this script excerpt:\n\n\(scriptExcerpt)",
        generating: FilmScriptAnalysis.self
    )
    return response.content
}

The @Generable macro generates a GenerationSchema for the type, which the framework uses to constrain the model’s output. The @Guide attribute on each property provides a natural-language description that steers the model toward appropriate values for that field. The respond(to:generating:) overload returns a response whose content property is the concrete type — no JSON parsing, no decoding, no format guessing.

This is the pattern to reach for whenever your downstream code needs to act on the response programmatically. Parsing free-text responses with string manipulation is fragile; @Generable is not.
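To make that concrete, here is a brief usage sketch in the style of the earlier snippets (the rating threshold is arbitrary):

```swift
// Using the typed result — no string parsing anywhere
let analysis = try await analyzeScript(scriptExcerpt)

if analysis.isFamilyFriendly && analysis.rating >= 7 {
    print("Greenlight candidate: \(analysis.centralConflict)")
}
for theme in analysis.themes {
    print("Theme: \(theme)")
}
```

Every field is already the right Swift type — Int, String, [String], Bool — so the compiler, not a regex, enforces the output contract.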

Streaming Responses

For UI that shows generated text in real time — like a typing animation or progressive disclosure — use streamResponse(to:) instead of respond(to:):

import FoundationModels

@available(iOS 26, *)
@Observable
final class ScriptSuggestionViewModel {
    var streamedText: String = ""
    var isGenerating: Bool = false

    private let session = LanguageModelSession(
        instructions: "You are a Pixar story consultant. Suggest creative plot twists."
    )

    func suggestPlotTwists(for film: String) async {
        streamedText = ""
        isGenerating = true
        defer { isGenerating = false }

        do {
            let stream = session.streamResponse(
                to: "Suggest three plot twists for a sequel to '\(film)'"
            )

            for try await partialResponse in stream {
                // partialResponse.content grows token by token
                streamedText = partialResponse.content
            }
        } catch {
            streamedText = "Unable to generate suggestions."
        }
    }
}

Apple Docs: streamResponse(to:) — FoundationModels

streamResponse(to:) returns an AsyncSequence. Each element in the sequence is a partial response whose content reflects everything the model has generated so far — not just the new token, but the full accumulated string. Assigning partialResponse.content directly to your @Observable property is safe and efficient.

Combining @Observable with streamResponse gives you reactive UI updates with no additional plumbing. SwiftUI automatically re-renders any view that reads streamedText as each new partial response arrives.
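A minimal consuming view might look like this (the view name and layout are illustrative):

```swift
import SwiftUI

@available(iOS 26, *)
struct PlotTwistView: View {
    // @State owns the @Observable view model for the view's lifetime
    @State private var viewModel = ScriptSuggestionViewModel()

    var body: some View {
        VStack(alignment: .leading, spacing: 12) {
            // Re-renders automatically as streamedText grows
            Text(viewModel.streamedText)
                .animation(.default, value: viewModel.streamedText)

            if viewModel.isGenerating {
                ProgressView()
            }

            Button("Suggest twists") {
                Task { await viewModel.suggestPlotTwists(for: "Up") }
            }
            .disabled(viewModel.isGenerating)
        }
        .padding()
    }
}
```

Note there is no manual observation code anywhere — reading streamedText inside body is enough for SwiftUI to track it.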

Advanced Usage

Error Handling

LanguageModelSession throws LanguageModelSession.GenerationError in three scenarios that matter in practice:

@available(iOS 26, *)
func handleModelErrors() async {
    let session = LanguageModelSession()

    do {
        let response = try await session.respond(
            to: "Suggest plot twists for a Pixar sequel"
        )
        print(response.content)
    } catch let error as LanguageModelSession.GenerationError {
        switch error {
        case .assetsUnavailable:
            // Apple Intelligence disabled, device ineligible, or model not downloaded
            // Show a UI prompt to enable Apple Intelligence in Settings
            showAppleIntelligencePrompt()
        case .exceededContextWindowSize:
            // The session transcript is too long — create a fresh session or
            // summarize previous exchanges before continuing
            await recoverFromContextOverflow()
        case .guardrailViolation:
            // The prompt or response triggered Apple's content policy
            // Log for review but don't expose details to the user
            logGuardrailEvent()
        default:
            // Other cases, plus new cases added in future OS versions
            logUnknownError(error)
        }
    } catch {
        // Non-LanguageModelError failures (e.g., task cancellation)
        handleGenericError(error)
    }
}

The guardrailViolation case deserves particular attention. Apple’s on-device model enforces content policies. Your app should handle this gracefully — show a neutral error message, log the event for review, and avoid retrying the same prompt unchanged.

System Prompt Engineering

The instructions parameter is your primary lever for shaping consistent model behavior across a session. Effective system prompts for on-device models share a few characteristics:

@available(iOS 26, *)
func createProductionSession() -> LanguageModelSession {
    LanguageModelSession(
        instructions: """
        You are a film recommendation assistant embedded in the "Pixar Picks" app.

        Rules:
        - Only recommend films from Pixar Animation Studios
        - Always respond in the language the user writes in
        - Keep responses under 80 words
        - If asked about anything unrelated to Pixar films, politely redirect
        - Never make up film titles or release dates

        Format for recommendations:
        Title: [film title]
        Why: [one sentence reason]
        """
    )
}

On-device models are smaller and more instruction-sensitive than large cloud models. Explicit, short rules in the system prompt outperform vague persona descriptions. Defining an output format directly in the instructions — or better, using @Generable — removes the model’s latitude to invent formats.

Context Window Management

When your session grows long, you have two recovery strategies. The simple approach creates a new session and re-provides any essential context; the sophisticated approach summarizes the existing session before it overflows. Here is the simple version:

@available(iOS 26, *)
actor FilmConsultationSession {
    private var session: LanguageModelSession
    private var exchangeCount: Int = 0
    private let maxExchanges = 20 // Tune based on your prompt sizes

    init() {
        self.session = LanguageModelSession(
            instructions: "You are a Pixar story consultant."
        )
    }

    func respond(to prompt: String) async throws -> String {
        if exchangeCount >= maxExchanges {
            await resetSession()
        }

        let response = try await session.respond(to: prompt)
        exchangeCount += 1
        return response.content
    }

    private func resetSession() async {
        session = LanguageModelSession(
            instructions: "You are a Pixar story consultant."
        )
        exchangeCount = 0
    }
}

Wrapping the session in an actor ensures that exchangeCount and session resets are accessed safely from concurrent callers. This pattern is appropriate for any long-running conversation feature.
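The summarization strategy mentioned above can be sketched as a variant of the same actor — ask the outgoing session to compress the conversation, then seed the new session’s instructions with that summary. This is an illustrative pattern, not a framework API:

```swift
import FoundationModels

@available(iOS 26, *)
actor SummarizingConsultationSession {
    private var session: LanguageModelSession
    private var exchangeCount = 0
    private let maxExchanges = 20 // Tune based on your prompt sizes

    init() {
        session = LanguageModelSession(
            instructions: "You are a Pixar story consultant."
        )
    }

    func respond(to prompt: String) async throws -> String {
        if exchangeCount >= maxExchanges {
            try await rolloverWithSummary()
        }
        let response = try await session.respond(to: prompt)
        exchangeCount += 1
        return response.content
    }

    private func rolloverWithSummary() async throws {
        // Ask the old session to compress its own transcript...
        let summary = try await session.respond(
            to: "Summarize our conversation so far in under 100 words, keeping key decisions and open questions."
        )
        // ...then carry the summary into a fresh session's instructions
        session = LanguageModelSession(
            instructions: """
            You are a Pixar story consultant.
            Summary of the conversation so far: \(summary.content)
            """
        )
        exchangeCount = 0
    }
}
```

One caveat: the summarization request itself consumes context, so roll over well before the hard limit rather than reacting to the overflow error.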

Performance Considerations

Foundation Models inference runs on the Apple Neural Engine, not the CPU or GPU. The practical implication: inference is fast and thermally efficient compared to a naive CPU implementation, but the Neural Engine is a shared resource.

Monitor thermal state before submitting long inference tasks:

import Foundation
import FoundationModels

@available(iOS 26, *)
func generateWithThermalCheck(_ prompt: String) async throws -> String? {
    let thermalState = ProcessInfo.processInfo.thermalState

    // Defer expensive inference when the device is hot
    guard thermalState != .critical && thermalState != .serious else {
        return nil // Signal to the caller to try again later
    }

    let session = LanguageModelSession()
    let response = try await session.respond(to: prompt)
    return response.content
}

Apple Docs: ProcessInfo.thermalState — Foundation

For streaming use cases, the model returns the first token within a few hundred milliseconds on recent hardware. This is fast enough for interactive UI — the perceived latency is much lower than a cloud round-trip because the response begins immediately.
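If you want to verify that first-token claim on your own hardware, you can time the stream’s first element with ContinuousClock — a measurement sketch; the numbers will vary by device, prompt length, and whether the model is already warm:

```swift
import FoundationModels

// Measures time until the first partial response arrives from a stream
@available(iOS 26, *)
func measureTimeToFirstToken(prompt: String) async throws -> Duration {
    let session = LanguageModelSession()
    let clock = ContinuousClock()
    let start = clock.now

    let stream = session.streamResponse(to: prompt)
    for try await _ in stream {
        // First partial response has arrived — stop timing
        return clock.now - start
    }
    throw CancellationError() // stream ended without producing output
}
```

Returning out of the for-await loop ends iteration early, so this measures only time-to-first-token, not full generation.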

Avoid creating a new LanguageModelSession on every inference call. Session initialization has overhead. For features that make many inference calls, maintain a session-level instance rather than creating one per request.
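One common shape for this is a single long-lived session plus prewarm(), which asks the framework to load model resources before the first request arrives. The wrapper type here is illustrative; prewarm() is a LanguageModelSession method:

```swift
import FoundationModels

// Illustrative: one shared session for the whole feature, warmed up early
@available(iOS 26, *)
final class SynopsisService {
    private let session = LanguageModelSession(
        instructions: "You are a concise film synopsis writer."
    )

    // Call when the feature's UI appears, before the user submits anything
    func warmUp() {
        session.prewarm()
    }

    func synopsis(for title: String) async throws -> String {
        try await session.respond(
            to: "Write a one-paragraph synopsis for a film called '\(title)'"
        ).content
    }
}
```

Calling warmUp() from the feature’s onAppear hides most of the first-request latency behind the user’s reading time.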

When to Use (and When Not To)

  • User data is sensitive (health, journal, finance) → Use Foundation Models — data never leaves the device
  • Feature must work offline → Use Foundation Models — no network dependency
  • Output needs to be a typed Swift value → Use Foundation Models with @Generable
  • Response quality must match GPT-4 / Claude level → Use a cloud API — on-device models are smaller
  • Task requires knowledge of events after the model’s training cutoff → Use a cloud API — the on-device model has a fixed training cutoff
  • Very long document summarization (100k+ tokens) → Use a cloud API — the on-device context window is limited
  • Low-volume prototyping with a flexible budget → Either works; on-device removes auth/key management overhead
  • High-volume user-generated content classification → Use Core ML with a fine-tuned classifier — faster and more reliable for constrained tasks

The most important dimension is output quality vs. privacy. On-device models are capable within their scope — short to medium text, structured extraction, classification, summarization of reasonably sized inputs. They are not the right tool for tasks that demand frontier model capability or massive context windows.

Summary

  • Foundation Models brings on-device LLM inference to iOS 26+. Text never leaves the device, inference works offline, and there is no per-token cost.
  • Always check SystemLanguageModel.default.availability before attempting inference and provide a graceful fallback when the model is unavailable.
  • LanguageModelSession maintains conversation history automatically — each respond(to:) call builds on the session transcript.
  • Use @Generable with @Guide annotations when your code needs to act on the model’s output programmatically. Structured output is more reliable than parsing free text.
  • streamResponse(to:) returns an AsyncSequence of partial responses that integrates naturally with @Observable for progressive SwiftUI updates.
  • Design for context window limits — monitor session length and reset or summarize when approaching the limit.
  • Monitor ProcessInfo.thermalState before submitting expensive inference tasks on the Neural Engine.

With Foundation Models handling text generation, the natural next step is incorporating structured classification and image understanding — both of which Core ML handles with a different but complementary set of APIs. See Integrating Core ML Models in SwiftUI for that deep dive, and Designing Prompts for On-Device AI for the prompt engineering patterns that make Foundation Models integrations production-grade.