Speech Framework — Live Transcription: Building Voice-Controlled Features


You have shipped a search bar, a text field, and maybe a chatbot — but your users keep asking for voice input. Apple has offered speech recognition APIs since iOS 10 (and fully on-device recognition since iOS 13), yet most teams still treat it as a black box they will “get to later.” The Speech framework is more capable than you think, and once you see word-level timestamps and alternative transcriptions in action, you will wonder why you waited.

This post covers SFSpeechRecognizer end-to-end: authorization, live audio recognition with AVAudioEngine, on-device versus server-based modes, word-level timing data, and alternative interpretations. We will not cover AVSpeechSynthesizer (text-to-speech) or the Natural Language framework’s text analysis pipeline — those deserve their own posts.

The Problem

Imagine you are building a movie trivia app. Players shout character names — “Woody!”, “Buzz Lightyear!”, “Dory!” — and the app needs to match their answers in real time. A text field kills the momentum. You need live audio transcription that streams partial results as the player speaks, gives you word-level timing for scoring, and works reliably even when the device is offline.

Here is the naive attempt most teams start with — a one-shot file-based recognition request:

import Speech

func transcribeAudioFile(url: URL) {
    // Note: in real code, keep a strong reference to the recognizer for
    // the lifetime of the task — a deallocated recognizer cancels it
    let recognizer = SFSpeechRecognizer()
    let request = SFSpeechURLRecognitionRequest(url: url)

    recognizer?.recognitionTask(with: request) { result, error in
        if let result {
            print(result.bestTranscription.formattedString)
        }
    }
}

This works for pre-recorded files, but it gives you no live microphone input: the recognizer reads the whole file on its own schedule, and you have no way to stream audio as it is captured. For a voice-controlled experience, you need SFSpeechAudioBufferRecognitionRequest paired with AVAudioEngine — and that is where the real engineering begins.

Setting Up Speech Authorization

Before any recognition can happen, you need two permissions: microphone access and speech recognition authorization. Both require Info.plist usage-description keys, and both prompt the user with a system dialog.

Add these keys to your Info.plist:

<key>NSSpeechRecognitionUsageDescription</key>
<string>PixarTrivia needs speech recognition to hear your answers.</string>
<key>NSMicrophoneUsageDescription</key>
<string>PixarTrivia needs microphone access for voice input.</string>

Then request authorization at runtime. The Speech framework provides a dedicated class method for this:

import AVFoundation
import Speech

func requestSpeechAuthorization() async -> Bool {
    let speechStatus = await withCheckedContinuation { continuation in
        SFSpeechRecognizer.requestAuthorization { status in
            continuation.resume(returning: status)
        }
    }

    guard speechStatus == .authorized else {
        print("Speech recognition not authorized: \(speechStatus)")
        return false
    }

    // Microphone permission: AVAudioApplication is iOS 17+. On earlier
    // versions, use AVAudioSession.sharedInstance().requestRecordPermission.
    let micGranted = await AVAudioApplication.requestRecordPermission()
    return micGranted
}

Warning: SFSpeechRecognizer.requestAuthorization dispatches its callback on an arbitrary queue — not the main queue. If you are updating UI from the result, dispatch back to @MainActor explicitly.

Apple Docs: SFSpeechRecognizer — Speech Framework

Live Transcription with AVAudioEngine

The core pattern for live transcription pairs AVAudioEngine (to capture microphone input) with SFSpeechAudioBufferRecognitionRequest (to stream audio buffers to the recognizer). Here is a production-grade TranscriptionEngine that encapsulates the full lifecycle:

import AVFoundation
import Speech

@MainActor
final class TranscriptionEngine: ObservableObject {
    @Published private(set) var transcript = ""
    @Published private(set) var isListening = false

    private let speechRecognizer =
        SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private let audioEngine = AVAudioEngine()
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?

    func startListening() throws {
        // Cancel any in-flight task before starting a new one
        stopListening()

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        request.addsPunctuation = true // iOS 16+

        recognitionRequest = request
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)

        inputNode.installTap(
            onBus: 0,
            bufferSize: 1024,
            format: recordingFormat
        ) { [weak request] buffer, _ in
            request?.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
        isListening = true

        recognitionTask = speechRecognizer.recognitionTask(
            with: request
        ) { [weak self] result, error in
            guard let self else { return }

            if let result {
                Task { @MainActor in
                    self.transcript =
                        result.bestTranscription.formattedString
                }
            }

            if error != nil || result?.isFinal == true {
                Task { @MainActor in
                    self.stopListening()
                }
            }
        }
    }

    func stopListening() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionRequest?.endAudio()
        recognitionTask?.cancel()
        recognitionRequest = nil
        recognitionTask = nil
        isListening = false
    }
}

A few things worth calling out:

  • shouldReportPartialResults = true gives you streaming transcription — the callback fires multiple times as the recognizer refines its hypothesis. This is what makes “live” transcription feel responsive.
  • addsPunctuation = true (available since iOS 16) inserts automatic punctuation. Without it, you get a raw stream of words with no periods or commas.
  • Always remove the tap on the input node when stopping. Failing to do so and then installing a new tap will crash with "required condition is false: nullptr == Tap()" — a notoriously unhelpful error message.
  • Cancel the previous task before starting a new one. SFSpeechRecognizer supports only one active task per instance.

Tip: If your app also plays audio (sound effects, music), configure your AVAudioSession category to .playAndRecord with the .defaultToSpeaker option. Otherwise the audio route will switch to the earpiece and your users will think the app went silent.
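A minimal session setup along those lines might look like this, called once before starting the engine (the helper name and the .measurement mode are our choices for illustration, not framework requirements):

```swift
import AVFoundation

// Hypothetical helper: configure the shared audio session for
// simultaneous playback and recording, routed to the speaker.
func configureAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(
        .playAndRecord,
        mode: .measurement,
        options: [.defaultToSpeaker]
    )
    try session.setActive(true, options: .notifyOthersOnDeactivation)
}
```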

Word-Level Timestamps and Segments

The real power of the Speech framework lives in SFTranscriptionSegment. Each segment represents a single word (or punctuation token) with precise timing data:

recognitionTask = speechRecognizer.recognitionTask(with: request) {
    [weak self] result, _ in
    guard let self, let result else { return }

    let transcription = result.bestTranscription

    for segment in transcription.segments {
        let word = segment.substring
        let timestamp = segment.timestamp    // TimeInterval from start
        let duration = segment.duration      // How long the word lasted
        let confidence = segment.confidence  // 0.0 to 1.0

        print("\(word) at \(timestamp)s (duration: \(duration)s, confidence: \(confidence))")
    }
}
Sample output for "To infinity and beyond":

To at 0.12s (duration: 0.18s, confidence: 0.92)
infinity at 0.30s (duration: 0.45s, confidence: 0.95)
and at 0.75s (duration: 0.12s, confidence: 0.89)
beyond at 0.87s (duration: 0.38s, confidence: 0.97)

This is invaluable for building karaoke-style highlight effects, calculating speaking pace (words per minute), or syncing transcription with video playback. In our trivia app, you could use timestamps to measure how quickly a player blurts out “Buzz Lightyear” after the question appears.
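As a quick sketch of the speaking-pace idea (the function name is ours; it treats the end of the last segment as the total speaking time):

```swift
import Speech

// Sketch: estimate speaking pace from a transcription's segments.
// Returns nil when there is nothing to measure.
func wordsPerMinute(in transcription: SFTranscription) -> Double? {
    guard let last = transcription.segments.last else { return nil }
    let elapsed = last.timestamp + last.duration // seconds from first word
    guard elapsed > 0 else { return nil }
    return Double(transcription.segments.count) / elapsed * 60
}
```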

Note: Confidence values are only available for final results. During partial results, confidence returns 0.0 for most segments. Design your UI accordingly — do not show confidence badges until isFinal is true.
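One way to apply that rule in code (the helper name and the 0.5 threshold are assumptions for illustration):

```swift
import Speech

// Sketch: surface low-confidence words for review, but only once the
// result is final, since partial results report 0.0 confidence.
func uncertainWords(in result: SFSpeechRecognitionResult) -> [String] {
    guard result.isFinal else { return [] }
    return result.bestTranscription.segments
        .filter { $0.confidence < 0.5 }
        .map(\.substring)
}
```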

Alternative Interpretations

When the recognizer is not fully confident in a word, it provides alternatives. Each SFTranscriptionSegment carries an alternativeSubstrings array:

for segment in transcription.segments {
    if !segment.alternativeSubstrings.isEmpty {
        print("'\(segment.substring)' alternatives: \(segment.alternativeSubstrings)")
    }
}
Sample output:

'Nemo' alternatives: ["Memo", "Nimo", "Neemo"]

This is gold for voice-command matching. If your app expects the user to say a character name like “Nemo,” you can check alternatives when the best transcription does not match. Here is a practical fuzzy matcher:

func matchesCharacterName(
    _ target: String,
    in transcription: SFTranscription
) -> Bool {
    let lowered = target.lowercased()

    for segment in transcription.segments {
        let candidates =
            [segment.substring] + segment.alternativeSubstrings
        if candidates.contains(where: { $0.lowercased() == lowered }) {
            return true
        }
    }

    return false
}

// Usage in our trivia game
let answeredNemo = matchesCharacterName(
    "Nemo",
    in: result.bestTranscription
)

By checking alternative interpretations, you dramatically reduce false negatives for voice commands — especially for proper nouns the recognizer has not seen before.
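Note that matchesCharacterName compares one segment at a time, so a two-word answer like "Buzz Lightyear" would never match. A sketch that slides a window of adjacent segments handles that case (the helper name is ours):

```swift
import Speech

// Sketch: match a multi-word phrase by comparing each word of the
// target against a window of adjacent segments, including each
// segment's alternative interpretations.
func matchesPhrase(
    _ target: String,
    in transcription: SFTranscription
) -> Bool {
    let words = target.lowercased().split(separator: " ").map(String.init)
    let segments = transcription.segments
    guard !words.isEmpty, words.count <= segments.count else { return false }

    for start in 0...(segments.count - words.count) {
        let window = segments[start ..< start + words.count]
        let matched = zip(window, words).allSatisfy { segment, word in
            ([segment.substring] + segment.alternativeSubstrings)
                .contains { $0.lowercased() == word }
        }
        if matched { return true }
    }
    return false
}
```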

On-Device vs. Server-Based Recognition

Starting in iOS 13, the Speech framework supports fully on-device recognition. This matters for privacy, latency, and offline availability. You control this with a single property:

let request = SFSpeechAudioBufferRecognitionRequest()
request.requiresOnDeviceRecognition = true

But before flipping that switch, check whether the device actually supports on-device recognition for your target locale:

let recognizer = SFSpeechRecognizer(
    locale: Locale(identifier: "en-US")
)

if recognizer?.supportsOnDeviceRecognition == true {
    request.requiresOnDeviceRecognition = true
} else {
    // Fall back to server-based. The device may not have
    // downloaded the on-device model for this locale yet.
    request.requiresOnDeviceRecognition = false
}

The trade-offs are real:

Factor        On-Device                Server-Based
Latency       Lower (no round-trip)    Higher on slow connections
Accuracy      Good for common words    Better for uncommon names
Privacy       Audio stays on device    Audio sent to Apple
Offline       Works without network    Requires connectivity
Languages     Subset of locales        Full locale catalog
Punctuation   iOS 16+                  iOS 16+

Tip: On-device models are downloaded lazily. A user who just set up their phone may not have the model yet. Always check supportsOnDeviceRecognition and handle the fallback gracefully instead of hard-failing.

Advanced Usage

Task-Level Control with Async/Await

The callback-based recognitionTask(with:resultHandler:) API works, but it does not compose well with structured concurrency. You can wrap recognition in an AsyncStream for cleaner integration:

func transcriptionStream(
    recognizer: SFSpeechRecognizer,
    request: SFSpeechAudioBufferRecognitionRequest
) -> AsyncStream<String> {
    AsyncStream { continuation in
        let task = recognizer.recognitionTask(with: request) { result, error in
            if let result {
                continuation.yield(
                    result.bestTranscription.formattedString
                )
                if result.isFinal {
                    continuation.finish()
                }
            }

            if error != nil {
                continuation.finish()
            }
        }

        // Propagate cancellation: when the consuming task is cancelled,
        // end the audio stream and cancel the recognition task.
        continuation.onTermination = { _ in
            request.endAudio()
            task.cancel()
        }
    }
}

This lets you consume transcription results in a for await loop, integrate with SwiftUI’s task modifier, and benefit from automatic cancellation:

// In a SwiftUI view's .task modifier
for await partialText in transcriptionStream(
    recognizer: recognizer,
    request: request
) {
    self.displayedText = partialText
}

Handling Interruptions and Audio Route Changes

Audio sessions get interrupted — phone calls, Siri activation, other apps claiming the audio route. Register for AVAudioSession.interruptionNotification and restart your engine when the interruption ends:

NotificationCenter.default.addObserver(
    forName: AVAudioSession.interruptionNotification,
    object: nil,
    queue: .main
) { notification in
    guard let info = notification.userInfo,
          let typeValue = info[
            AVAudioSessionInterruptionTypeKey
          ] as? UInt,
          let type = AVAudioSession.InterruptionType(
            rawValue: typeValue
          )
    else { return }

    switch type {
    case .began:
        // Audio was interrupted — pause recognition
        self.stopListening()
    case .ended:
        // Interruption ended — safe to restart if needed
        let options = info[
            AVAudioSessionInterruptionOptionKey
        ] as? UInt ?? 0
        if AVAudioSession.InterruptionOptions(
            rawValue: options
        ).contains(.shouldResume) {
            try? self.startListening()
        }
    @unknown default:
        break
    }
}

Warning: Do not attempt to restart the audio engine during an active interruption. The system will reject the attempt to reactivate the session, typically with AVAudioSessionErrorCodeCannotInterruptOthers (error code '!int'). Wait for the .ended notification.

Rate Limits

Apple imposes usage limits on speech recognition that are not documented with specific numbers but are enforced at the system level. Recognition requests are rate-limited per device and per app. If you hit the limit, recognitionTask will return an error with domain kAFAssistantErrorDomain. Server-based requests are more aggressively throttled than on-device ones.

For sustained use cases (long-form dictation, always-listening accessibility features), prefer on-device recognition to avoid hitting server-side rate limits.
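A defensive check along those lines might look like this (the domain string comes from the section above; the specific error codes are undocumented, so we branch only on the domain):

```swift
import Speech

// Sketch: detect a speech-service failure so the caller can retry
// with on-device recognition where supported.
func isSpeechServiceError(_ error: Error) -> Bool {
    (error as NSError).domain == "kAFAssistantErrorDomain"
}
```

On such an error, a reasonable fallback is to rebuild the request with requiresOnDeviceRecognition = true, provided supportsOnDeviceRecognition reports true.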

Performance Considerations

Live speech recognition is CPU and memory intensive. Here is what to watch for:

Audio buffer size. The bufferSize parameter in installTap(onBus:bufferSize:format:) controls how much audio data each callback delivers. A buffer size of 1024 frames works well for most cases. Larger buffers (4096) reduce callback frequency but increase latency. Smaller buffers (512) give faster partial results but increase CPU overhead from more frequent callbacks.

Memory. Each SFSpeechAudioBufferRecognitionRequest accumulates audio data internally. For long sessions (think: lecture transcription or meeting notes), memory grows linearly. Apple limits individual recognition requests to approximately one minute of audio for server-based recognition. On-device requests are more lenient but still accumulate memory. For long-running transcription, implement a rolling-window strategy:

func restartRecognitionPeriodically() {
    // Stop current session, capture transcript so far
    let partialTranscript = transcript
    stopListening()

    // Restart with a fresh request
    transcript = partialTranscript
    try? startListening()
}
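To drive that rotation, you could schedule it on a timer. This sketch assumes restartRecognitionPeriodically lives on TranscriptionEngine; the 50-second interval is our assumption, chosen to stay under the roughly one-minute server cap mentioned above:

```swift
import Foundation

extension TranscriptionEngine {
    // Sketch: restart the session every 50 seconds. Store the returned
    // timer and invalidate it when the user stops listening.
    func scheduleSessionRotation() -> Timer {
        Timer.scheduledTimer(withTimeInterval: 50, repeats: true) {
            [weak self] _ in
            Task { @MainActor in
                self?.restartRecognitionPeriodically()
            }
        }
    }
}
```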

Battery. Running AVAudioEngine plus on-device ML inference draws significant power. Profile with Instruments’ Energy Log template if your feature is expected to run for extended periods. Consider providing a visible “listening” indicator so users can stop recognition when they are done — do not leave the microphone running silently.

Thread usage. The recognition callback fires on an internal Speech framework queue. Avoid doing heavy work inside the callback — dispatch UI updates to @MainActor and offload any text processing to a background task.

Apple Docs: SFSpeechAudioBufferRecognitionRequest — Speech Framework

When to Use (and When Not To)

Scenario                  Recommendation
Voice commands            Ideal. On-device gives fast, private results.
Real-time transcription   Use with care. Rotate sessions for long audio.
Accessibility             Check Voice Control or Accessibility first.
Background transcription  Not supported without active audio session.
Offline dictation         Great fit with on-device recognition.
Multi-language            Check supportedLocales() first.

If your use case is primarily about understanding the meaning of what the user said (intent classification, entity extraction), pair the Speech framework’s raw transcription with the Natural Language framework for tokenization and classification. The Speech framework gives you text; NLP gives you structure.

Summary

  • SFSpeechRecognizer supports both file-based and live audio transcription via AVAudioEngine. For real-time voice features, always use SFSpeechAudioBufferRecognitionRequest.
  • Authorization requires both NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription in Info.plist, plus runtime permission requests.
  • SFTranscriptionSegment provides word-level timestamps, duration, confidence, and alternative interpretations — powerful for voice-command matching and timed text display.
  • On-device recognition (requiresOnDeviceRecognition = true) eliminates network latency and keeps audio private, but check supportsOnDeviceRecognition before enabling it.
  • For long-running transcription, implement session rotation to manage memory, and profile battery impact with Instruments.

For on-device language translation that pairs naturally with speech input, see Translation Framework. If you need to analyze the content of transcribed text — tokenization, sentiment, named entities — the Natural Language Framework picks up where the Speech framework leaves off.