Speech Framework — Live Transcription: Building Voice-Controlled Features
You have shipped a search bar, a text field, and maybe a chatbot — but your users keep asking for voice input. Apple has had on-device speech recognition since iOS 10, yet most teams still treat it as a black box they will “get to later.” The Speech framework is more capable than you think, and once you see word-level timestamps and alternative transcriptions in action, you will wonder why you waited.
This post covers SFSpeechRecognizer end-to-end: authorization, live audio recognition with AVAudioEngine, on-device
versus server-based modes, word-level timing data, and alternative interpretations. We will not cover
AVSpeechSynthesizer (text-to-speech) or the Natural Language framework’s text analysis pipeline — those deserve their
own posts.
Contents
- The Problem
- Setting Up Speech Authorization
- Live Transcription with AVAudioEngine
- Word-Level Timestamps and Segments
- Alternative Interpretations
- On-Device vs. Server-Based Recognition
- Advanced Usage
- Performance Considerations
- When to Use (and When Not To)
- Summary
The Problem
Imagine you are building a movie trivia app. Players shout character names — “Woody!”, “Buzz Lightyear!”, “Dory!” — and the app needs to match their answers in real time. A text field kills the momentum. You need live audio transcription that streams partial results as the player speaks, gives you word-level timing for scoring, and works reliably even when the device is offline.
Here is the naive attempt most teams start with — a one-shot file-based recognition request:
```swift
import Speech

func transcribeAudioFile(url: URL) {
    let recognizer = SFSpeechRecognizer()
    let request = SFSpeechURLRecognitionRequest(url: url)
    recognizer?.recognitionTask(with: request) { result, error in
        if let result {
            print(result.bestTranscription.formattedString)
        }
    }
}
```
This works for pre-recorded files, but it only handles audio that already exists on disk: there is no live microphone input, and you wait for the recognizer to chew through the whole file before you can act on the result. For a voice-controlled experience, you need SFSpeechAudioBufferRecognitionRequest paired with AVAudioEngine — and that is where the real engineering begins.
Setting Up Speech Authorization
Before any recognition can happen, you need two user permissions: microphone access and speech recognition authorization. Both require `Info.plist` usage-description keys, and both prompt the user with a system dialog.
Add these keys to your Info.plist:
```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>PixarTrivia needs speech recognition to hear your answers.</string>
<key>NSMicrophoneUsageDescription</key>
<string>PixarTrivia needs microphone access for voice input.</string>
```
Then request authorization at runtime. The Speech framework provides a dedicated class method for this:
```swift
import AVFoundation
import Speech

func requestSpeechAuthorization() async -> Bool {
    let speechStatus = await withCheckedContinuation { continuation in
        SFSpeechRecognizer.requestAuthorization { status in
            continuation.resume(returning: status)
        }
    }
    guard speechStatus == .authorized else {
        print("Speech recognition not authorized: \(speechStatus)")
        return false
    }

    // Microphone access uses AVFoundation's permission API.
    // AVAudioApplication is iOS 17+; on earlier targets, use
    // AVAudioSession.sharedInstance().requestRecordPermission(_:).
    return await AVAudioApplication.requestRecordPermission()
}
```
Warning: `SFSpeechRecognizer.requestAuthorization` dispatches its callback on an arbitrary queue — not the main queue. If you are updating UI from the result, dispatch back to `@MainActor` explicitly.
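For instance, here is a minimal sketch of that main-actor hop, using a hypothetical `PermissionsViewModel` (the type and property names are illustrative, not part of the Speech API):

```swift
import Speech

// Hypothetical view model; the type and property names are illustrative.
@MainActor
final class PermissionsViewModel: ObservableObject {
    @Published var authorizationLabel = "Unknown"

    func refreshAuthorization() {
        SFSpeechRecognizer.requestAuthorization { status in
            // This closure may run on an arbitrary queue, so hop
            // back to the main actor before touching published state.
            Task { @MainActor in
                self.authorizationLabel =
                    status == .authorized ? "Authorized" : "Denied"
            }
        }
    }
}
```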
Apple Docs:
`SFSpeechRecognizer` — Speech Framework
Live Transcription with AVAudioEngine
The core pattern for live transcription pairs AVAudioEngine (to capture microphone input) with
SFSpeechAudioBufferRecognitionRequest (to stream audio buffers to the recognizer). Here is a production-grade
TranscriptionEngine that encapsulates the full lifecycle:
```swift
import AVFoundation
import Speech

@MainActor
final class TranscriptionEngine: ObservableObject {
    @Published private(set) var transcript = ""
    @Published private(set) var isListening = false

    private let speechRecognizer =
        SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private let audioEngine = AVAudioEngine()
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?

    func startListening() throws {
        // Cancel any in-flight task before starting a new one
        stopListening()

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        request.addsPunctuation = true // iOS 16+
        recognitionRequest = request

        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(
            onBus: 0,
            bufferSize: 1024,
            format: recordingFormat
        ) { [weak request] buffer, _ in
            request?.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
        isListening = true

        recognitionTask = speechRecognizer.recognitionTask(
            with: request
        ) { [weak self] result, error in
            guard let self else { return }
            if let result {
                Task { @MainActor in
                    self.transcript =
                        result.bestTranscription.formattedString
                }
            }
            if error != nil || result?.isFinal == true {
                Task { @MainActor in
                    self.stopListening()
                }
            }
        }
    }

    func stopListening() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionRequest?.endAudio()
        recognitionTask?.cancel()
        recognitionRequest = nil
        recognitionTask = nil
        isListening = false
    }
}
```
A few things worth calling out:
- `shouldReportPartialResults = true` gives you streaming transcription — the callback fires multiple times as the recognizer refines its hypothesis. This is what makes "live" transcription feel responsive.
- `addsPunctuation = true` (available since iOS 16) inserts automatic punctuation. Without it, you get a raw stream of words with no periods or commas.
- Always remove the tap on the input node when stopping. Failing to do so and then installing a new tap will crash with `"required condition is false: googOutN == 2"` — a notoriously unhelpful error message.
- Cancel the previous task before starting a new one. `SFSpeechRecognizer` supports only one active task per instance.
Tip: If your app also plays audio (sound effects, music), configure your `AVAudioSession` category to `.playAndRecord` with the `.defaultToSpeaker` option. Otherwise the audio route will switch to the earpiece and your users will think the app went silent.
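A sketch of that session configuration, assuming it runs before the engine starts; the `.measurement` mode and `.duckOthers` option here are one reasonable choice on my part, not a requirement:

```swift
import AVFoundation

// Configure the shared audio session for simultaneous playback
// and microphone capture, routed to the main speaker.
func configureAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(
        .playAndRecord,
        mode: .measurement,           // illustrative choice
        options: [.defaultToSpeaker, .duckOthers]
    )
    try session.setActive(true, options: .notifyOthersOnDeactivation)
}
```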
Word-Level Timestamps and Segments
The real power of the Speech framework lives in SFTranscriptionSegment. Each segment represents a single word (or
punctuation token) with precise timing data:
```swift
recognitionTask = speechRecognizer.recognitionTask(with: request) {
    [weak self] result, _ in
    guard let self, let result else { return }

    let transcription = result.bestTranscription
    for segment in transcription.segments {
        let word = segment.substring
        let timestamp = segment.timestamp   // TimeInterval from start
        let duration = segment.duration     // How long the word lasted
        let confidence = segment.confidence // 0.0 to 1.0
        print("\(word) at \(timestamp)s (\(duration)s, \(confidence))")
    }
}
```
```
To at 0.12s (duration: 0.18s, confidence: 0.92)
infinity at 0.30s (duration: 0.45s, confidence: 0.95)
and at 0.75s (duration: 0.12s, confidence: 0.89)
beyond at 0.87s (duration: 0.38s, confidence: 0.97)
```
This is invaluable for building karaoke-style highlight effects, calculating speaking pace (words per minute), or syncing transcription with video playback. In our trivia app, you could use timestamps to measure how quickly a player blurts out “Buzz Lightyear” after the question appears.
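For example, speaking pace can be estimated directly from segment timing. This is a sketch, and the one-second minimum is an arbitrary threshold of mine:

```swift
import Speech

// Estimate speaking pace from a final transcription's segments.
// Returns nil when there is not enough audio to be meaningful.
func wordsPerMinute(of transcription: SFTranscription) -> Double? {
    let segments = transcription.segments
    guard let last = segments.last else { return nil }

    // Elapsed speech time: from zero to the end of the last word.
    let elapsed = last.timestamp + last.duration
    guard elapsed > 1.0 else { return nil } // arbitrary minimum

    return Double(segments.count) / (elapsed / 60.0)
}
```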
Note: Confidence values are only available for final results. During partial results, `confidence` returns `0.0` for most segments. Design your UI accordingly — do not show confidence badges until `isFinal` is `true`.
Alternative Interpretations
When the recognizer is not fully confident in a word, it provides alternatives. Each SFTranscriptionSegment carries an
alternativeSubstrings array:
```swift
for segment in transcription.segments {
    if !segment.alternativeSubstrings.isEmpty {
        print("'\(segment.substring)' alternatives: \(segment.alternativeSubstrings)")
    }
}
```
```
'Nemo' alternatives: ["Memo", "Nimo", "Neemo"]
```
This is gold for voice-command matching. If your app expects the user to say a character name like "Nemo," you can check alternatives when the best transcription does not match. Here is a practical matcher that falls back to the alternatives:
```swift
func matchesCharacterName(
    _ target: String,
    in transcription: SFTranscription
) -> Bool {
    let lowered = target.lowercased()
    for segment in transcription.segments {
        let candidates =
            [segment.substring] + segment.alternativeSubstrings
        if candidates.contains(where: { $0.lowercased() == lowered }) {
            return true
        }
    }
    return false
}

// Usage in our trivia game
let answeredNemo = matchesCharacterName(
    "Nemo",
    in: result.bestTranscription
)
```
By checking alternative interpretations, you dramatically reduce false negatives for voice commands — especially for proper nouns the recognizer has not seen before.
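One caveat: each segment holds a single word, so a segment-by-segment matcher can never match a multi-word name like "Buzz Lightyear." A hypothetical variant (the `matchesPhrase` name is mine) can additionally check the joined transcription:

```swift
import Speech

// Also checks the joined transcription so that multi-word
// targets like "Buzz Lightyear" can match across segments.
func matchesPhrase(
    _ target: String,
    in transcription: SFTranscription
) -> Bool {
    let lowered = target.lowercased()
    if transcription.formattedString.lowercased().contains(lowered) {
        return true
    }
    // Fall back to per-segment alternatives for single-word targets.
    return transcription.segments.contains { segment in
        ([segment.substring] + segment.alternativeSubstrings)
            .contains { $0.lowercased() == lowered }
    }
}
```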
On-Device vs. Server-Based Recognition
Starting in iOS 13, the Speech framework supports fully on-device recognition. This matters for privacy, latency, and offline availability. You control this with a single property:
```swift
let request = SFSpeechAudioBufferRecognitionRequest()
request.requiresOnDeviceRecognition = true
```
But before flipping that switch, check whether the device actually supports on-device recognition for your target locale:
```swift
let recognizer = SFSpeechRecognizer(
    locale: Locale(identifier: "en-US")
)

if recognizer?.supportsOnDeviceRecognition == true {
    request.requiresOnDeviceRecognition = true
} else {
    // Fall back to server-based. The device may not have
    // downloaded the on-device model for this locale yet.
    request.requiresOnDeviceRecognition = false
}
```
The trade-offs are real:
| Factor | On-Device | Server-Based |
|---|---|---|
| Latency | Lower (no round-trip) | Higher on slow connections |
| Accuracy | Good for common words | Better for uncommon names |
| Privacy | Audio stays on device | Audio sent to Apple |
| Offline | Works without network | Requires connectivity |
| Languages | Subset of locales | Full locale catalog |
| Punctuation | iOS 16+ | iOS 16+ |
Tip: On-device models are downloaded lazily. A user who just set up their phone may not have the model yet. Always check `supportsOnDeviceRecognition` and handle the fallback gracefully instead of hard-failing.
Advanced Usage
Task-Level Control with Async/Await
The callback-based recognitionTask(with:resultHandler:) API works, but it does not compose well with structured
concurrency. You can wrap recognition in an AsyncStream for cleaner integration:
```swift
func transcriptionStream(
    recognizer: SFSpeechRecognizer,
    request: SFSpeechAudioBufferRecognitionRequest
) -> AsyncStream<String> {
    AsyncStream { continuation in
        let task = recognizer.recognitionTask(with: request) { result, error in
            if let result {
                continuation.yield(
                    result.bestTranscription.formattedString
                )
                if result.isFinal {
                    continuation.finish()
                }
            }
            if error != nil {
                continuation.finish()
            }
        }
        continuation.onTermination = { _ in
            // Stop feeding audio and cancel the recognition task
            // when the consuming Task goes away.
            request.endAudio()
            task.cancel()
        }
    }
}
```
This lets you consume transcription results in a for await loop, integrate with SwiftUI’s task modifier, and benefit
from automatic cancellation:
```swift
// In a SwiftUI view's .task modifier
for await partialText in transcriptionStream(
    recognizer: recognizer,
    request: request
) {
    self.displayedText = partialText
}
```
Handling Interruptions and Audio Route Changes
Audio sessions get interrupted — phone calls, Siri activation, other apps claiming the audio route. Register for
AVAudioSession.interruptionNotification and restart your engine when the interruption ends:
```swift
// Capture self weakly: block-based observers are retained by
// NotificationCenter, so a strong capture would leak the engine.
NotificationCenter.default.addObserver(
    forName: AVAudioSession.interruptionNotification,
    object: nil,
    queue: .main
) { [weak self] notification in
    guard let self,
          let info = notification.userInfo,
          let typeValue = info[
              AVAudioSessionInterruptionTypeKey
          ] as? UInt,
          let type = AVAudioSession.InterruptionType(
              rawValue: typeValue
          )
    else { return }

    switch type {
    case .began:
        // Audio was interrupted — pause recognition
        self.stopListening()
    case .ended:
        // Interruption ended — safe to restart if needed
        let options = info[
            AVAudioSessionInterruptionOptionKey
        ] as? UInt ?? 0
        if AVAudioSession.InterruptionOptions(
            rawValue: options
        ).contains(.shouldResume) {
            try? self.startListening()
        }
    @unknown default:
        break
    }
}
```
Warning: Do not attempt to restart the audio engine during an active interruption. The system will reject the call and you will get a `-10878` (`AVAudioSessionErrorCodeCannotInterruptOthers`) error. Wait for the `.ended` notification.
Rate Limits
Apple imposes usage limits on speech recognition that are not documented with specific numbers but are enforced at the
system level. Recognition requests are rate-limited per device and per app. If you hit the limit, recognitionTask will
return an error with domain kAFAssistantErrorDomain. Server-based requests are more aggressively throttled than
on-device ones.
For sustained use cases (long-form dictation, always-listening accessibility features), prefer on-device recognition to avoid hitting server-side rate limits.
Performance Considerations
Live speech recognition is CPU and memory intensive. Here is what to watch for:
Audio buffer size. The bufferSize parameter in installTap(onBus:bufferSize:format:) controls how much audio data
each callback delivers. A buffer size of 1024 frames works well for most cases. Larger buffers (4096) reduce callback
frequency but increase latency. Smaller buffers (512) give faster partial results but increase CPU overhead from more
frequent callbacks.
Memory. Each SFSpeechAudioBufferRecognitionRequest accumulates audio data internally. For long sessions (think:
lecture transcription or meeting notes), memory grows linearly. Apple limits individual recognition requests to
approximately one minute of audio for server-based recognition. On-device requests are more lenient but still accumulate
memory. For long-running transcription, implement a rolling-window strategy:
```swift
func restartRecognitionPeriodically() {
    // Stop the current session, capturing the transcript so far
    let partialTranscript = transcript
    stopListening()

    // Restart with a fresh request
    transcript = partialTranscript
    try? startListening()
}
```
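One way to drive that rotation, assuming these members live inside `TranscriptionEngine`; the 50-second interval is an illustrative value chosen to stay under the roughly one-minute server-side ceiling:

```swift
import Foundation

// Assumed to live inside TranscriptionEngine.
private var rotationTimer: Timer?

func scheduleSessionRotation() {
    rotationTimer?.invalidate()
    rotationTimer = Timer.scheduledTimer(
        withTimeInterval: 50, // illustrative; stays under ~1 minute
        repeats: true
    ) { [weak self] _ in
        // Hop to the main actor before touching engine state.
        Task { @MainActor in
            self?.restartRecognitionPeriodically()
        }
    }
}
```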
Battery. Running AVAudioEngine plus on-device ML inference draws significant power. Profile with Instruments’
Energy Log template if your feature is expected to run for extended periods. Consider providing a visible “listening”
indicator so users can stop recognition when they are done — do not leave the microphone running silently.
Thread usage. The recognition callback fires on an internal Speech framework queue. Avoid doing heavy work inside
the callback — dispatch UI updates to @MainActor and offload any text processing to a background task.
Apple Docs:
`SFSpeechAudioBufferRecognitionRequest` — Speech Framework
When to Use (and When Not To)
| Scenario | Recommendation |
|---|---|
| Voice commands | Ideal. On-device gives fast, private results. |
| Real-time transcription | Use with care. Rotate sessions for long audio. |
| Accessibility | Check Voice Control or Accessibility first. |
| Background transcription | Not supported without active audio session. |
| Offline dictation | Great fit with on-device recognition. |
| Multi-language | Check supportedLocales() first. |
If your use case is primarily about understanding the meaning of what the user said (intent classification, entity extraction), pair the Speech framework’s raw transcription with the Natural Language framework for tokenization and classification. The Speech framework gives you text; NLP gives you structure.
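As a rough sketch of that hand-off, using the Natural Language framework's `NLTagger` to pull named entities out of a finished transcript (the helper function itself is illustrative, not part of either framework):

```swift
import NaturalLanguage

// Extract person, place, and organization names from a transcript.
func namedEntities(in transcript: String) -> [String] {
    let tagger = NLTagger(tagSchemes: [.nameType])
    tagger.string = transcript

    var entities: [String] = []
    tagger.enumerateTags(
        in: transcript.startIndex..<transcript.endIndex,
        unit: .word,
        scheme: .nameType,
        options: [.omitWhitespace, .omitPunctuation, .joinNames]
    ) { tag, range in
        if let tag,
           [.personalName, .placeName, .organizationName].contains(tag) {
            entities.append(String(transcript[range]))
        }
        return true // keep enumerating
    }
    return entities
}
```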
Summary
- `SFSpeechRecognizer` supports both file-based and live audio transcription via `AVAudioEngine`. For real-time voice features, always use `SFSpeechAudioBufferRecognitionRequest`.
- Authorization requires both `NSSpeechRecognitionUsageDescription` and `NSMicrophoneUsageDescription` in `Info.plist`, plus runtime permission requests.
- `SFTranscriptionSegment` provides word-level timestamps, duration, confidence, and alternative interpretations — powerful for voice-command matching and timed text display.
- On-device recognition (`requiresOnDeviceRecognition = true`) eliminates network latency and keeps audio private, but check `supportsOnDeviceRecognition` before enabling it.
- For long-running transcription, implement session rotation to manage memory, and profile battery impact with Instruments.
For on-device language translation that pairs naturally with speech input, see Translation Framework. If you need to analyze the content of transcribed text — tokenization, sentiment, named entities — the Natural Language Framework picks up where the Speech framework leaves off.