AVFoundation: Custom Camera and Audio Capture in iOS
The system camera is fine for snapshots, but the moment you need a custom viewfinder overlay, real-time frame
processing, or precise audio metering, you need to build your own capture pipeline.
AVFoundation gives you full control over every camera and
microphone on the device — from selecting lenses and configuring frame rates to writing spatial audio tracks with iOS
26’s new AVAssetWriter capabilities.
This post walks through the complete capture stack: session configuration, device discovery, preview layers, photo and
video output, AVAudioEngine for real-time audio processing, and the iOS 26 Spatial Audio recording API. We will not
cover playback (AVPlayer) or editing (AVMutableComposition) — those deserve their own deep dives.
Contents
- The Problem
- AVCaptureSession: The Pipeline Core
- Device Discovery and Configuration
- Building a Custom Camera View
- Capturing Photos
- Recording Video
- Audio Capture with AVAudioEngine
- iOS 26: Spatial Audio Capture
- Performance Considerations
- When to Use (and When Not To)
- Summary
The Problem
You are building a Pixar Storyboard Scanner — an app that lets animators photograph hand-drawn storyboard panels,
overlay alignment guides in real time, capture voice-over annotations, and export the result as a video with spatial
audio. The system UIImagePickerController cannot do any of this: no custom overlays, no frame-level processing, no
audio routing control.
Here is what a naive first attempt might look like:
import UIKit
// This gives you zero control over the viewfinder
let picker = UIImagePickerController()
picker.sourceType = .camera
picker.allowsEditing = true
// No custom overlay, no real-time processing, no audio metering
// No frame rate control, no lens selection, no spatial audio
UIImagePickerController hands you a photo after the fact. For anything beyond point-and-shoot, you need to assemble
the AVFoundation capture pipeline yourself. Let’s build it piece by piece.
AVCaptureSession: The Pipeline Core
AVCaptureSession is the central coordinator
that connects inputs (cameras, microphones) to outputs (photo capture, video recording, data buffers). Think of it as
the Pixar render pipeline: sources go in, processed frames come out.
import AVFoundation
final class CaptureService {
let session = AVCaptureSession()
private let sessionQueue = DispatchQueue(label: "com.pixar.storyboard.session")
func configure() throws {
session.beginConfiguration()
defer { session.commitConfiguration() }
// Set the session preset — determines resolution and quality trade-offs
session.sessionPreset = .photo
// Add video input
guard let camera = AVCaptureDevice.default(
.builtInWideAngleCamera, for: .video, position: .back
) else {
throw CaptureError.cameraUnavailable
}
let videoInput = try AVCaptureDeviceInput(device: camera)
guard session.canAddInput(videoInput) else {
throw CaptureError.inputNotSupported
}
session.addInput(videoInput)
}
func startSession() {
sessionQueue.async { [session] in
guard !session.isRunning else { return }
session.startRunning()
}
}
func stopSession() {
sessionQueue.async { [session] in
guard session.isRunning else { return }
session.stopRunning()
}
}
}
enum CaptureError: Error {
case cameraUnavailable
case inputNotSupported
case outputNotSupported
case recordingFailed
}
Warning: Always wrap configuration changes between beginConfiguration() and commitConfiguration(). This batches changes into a single atomic update, preventing the session from entering an inconsistent state where an input is removed before its replacement is added.
The session runs on its own serial dispatch queue internally, but startRunning() and stopRunning() are synchronous
and blocking. Never call them on the main thread — they can take 100-500 ms depending on the device and session
complexity.
Device Discovery and Configuration
Modern iPhones have multiple cameras — wide, ultra-wide, telephoto, and front-facing.
AVCaptureDevice.DiscoverySession
lets you enumerate available devices and pick the right one for your use case.
struct DeviceSelector {
/// Returns all back-facing cameras ordered by focal length.
static func availableBackCameras() -> [AVCaptureDevice] {
let discovery = AVCaptureDevice.DiscoverySession(
deviceTypes: [
.builtInWideAngleCamera,
.builtInUltraWideCamera,
.builtInTelephotoCamera
],
mediaType: .video,
position: .back
)
return discovery.devices
}
/// Returns the best available camera for document scanning.
static func storyboardCamera() -> AVCaptureDevice? {
AVCaptureDevice.default(
.builtInWideAngleCamera, for: .video, position: .back
)
}
}
Configuring Device Properties
Once you have a device, lock it for configuration to adjust focus, exposure, and frame rate:
extension CaptureService {
/// Configures the camera for storyboard scanning: autofocus, stable exposure.
func configureForScanning(device: AVCaptureDevice) throws {
try device.lockForConfiguration()
defer { device.unlockForConfiguration() }
// Enable continuous autofocus for document scanning
if device.isFocusModeSupported(.continuousAutoFocus) {
device.focusMode = .continuousAutoFocus
}
// Use continuous auto exposure so lighting changes are tracked smoothly
// (to truly lock exposure, set .locked after an initial adjustment)
if device.isExposureModeSupported(.continuousAutoExposure) {
device.exposureMode = .continuousAutoExposure
}
// Set the frame duration to 1/30 s (30 fps) for a smooth preview,
// but only if the active format supports it — setting an unsupported
// duration raises a runtime exception
let targetDuration = CMTime(value: 1, timescale: 30)
let supported = device.activeFormat.videoSupportedFrameRateRanges.contains {
$0.minFrameDuration <= targetDuration && targetDuration <= $0.maxFrameDuration
}
if supported {
device.activeVideoMinFrameDuration = targetDuration
device.activeVideoMaxFrameDuration = targetDuration
}
}
}
Tip: Hold lockForConfiguration() as briefly as possible. While locked, other apps and system processes cannot adjust the device. The defer pattern ensures you always unlock, even on early returns.
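Because the lock/unlock pairing is easy to get wrong across multiple call sites, it can help to centralize the pattern. A sketch of a convenience wrapper (withLockedConfiguration is a name introduced here, not part of AVFoundation):

```swift
import AVFoundation

extension AVCaptureDevice {
    /// Hypothetical helper: runs `body` with the device locked for
    /// configuration, guaranteeing the unlock even if `body` throws.
    func withLockedConfiguration<T>(
        _ body: (AVCaptureDevice) throws -> T
    ) throws -> T {
        try lockForConfiguration()
        defer { unlockForConfiguration() }
        return try body(self)
    }
}

// Usage sketch:
// try camera.withLockedConfiguration { device in
//     device.focusMode = .continuousAutoFocus
// }
```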
Camera Permissions
Before any of this works, you need the user’s permission. Declare NSCameraUsageDescription in your Info.plist and
request authorization:
extension CaptureService {
static func requestCameraAccess() async -> Bool {
let status = AVCaptureDevice.authorizationStatus(for: .video)
switch status {
case .authorized:
return true
case .notDetermined:
return await AVCaptureDevice.requestAccess(for: .video)
case .denied, .restricted:
return false
@unknown default:
return false
}
}
}
Building a Custom Camera View
SwiftUI does not have a native camera preview, so you bridge
AVCaptureVideoPreviewLayer
through UIViewRepresentable:
import SwiftUI
struct CameraPreviewView: UIViewRepresentable {
let session: AVCaptureSession
func makeUIView(context: Context) -> CameraPreviewUIView {
let view = CameraPreviewUIView()
view.previewLayer.session = session
view.previewLayer.videoGravity = .resizeAspectFill
return view
}
func updateUIView(_ uiView: CameraPreviewUIView, context: Context) {
// Session changes are handled externally
}
}
final class CameraPreviewUIView: UIView {
override class var layerClass: AnyClass {
AVCaptureVideoPreviewLayer.self
}
var previewLayer: AVCaptureVideoPreviewLayer {
layer as! AVCaptureVideoPreviewLayer
}
}
Overriding layerClass is preferable to adding a sublayer: the preview layer becomes the view's backing layer, so it resizes with the view automatically and there are no layout mismatches to manage.
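The preview layer also handles coordinate conversion between screen points and the camera's normalized space, which is what you need for tap-to-focus. A sketch (the focus(at:using:) helper is hypothetical; captureDevicePointConverted(fromLayerPoint:) is the real conversion API):

```swift
import AVFoundation

extension CameraPreviewUIView {
    /// Hypothetical tap-to-focus helper: converts a point in the
    /// preview's coordinate space to the device's normalized space
    /// and asks the camera to focus there.
    func focus(at layerPoint: CGPoint, using device: AVCaptureDevice) throws {
        let devicePoint = previewLayer.captureDevicePointConverted(
            fromLayerPoint: layerPoint
        )
        try device.lockForConfiguration()
        defer { device.unlockForConfiguration() }
        if device.isFocusPointOfInterestSupported,
           device.isFocusModeSupported(.autoFocus) {
            device.focusPointOfInterest = devicePoint
            device.focusMode = .autoFocus
        }
    }
}
```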
Now compose it with a custom overlay for the storyboard alignment guides:
struct StoryboardScannerView: View {
@State private var captureService = CaptureService()
@State private var cameraAuthorized = false
var body: some View {
ZStack {
if cameraAuthorized {
CameraPreviewView(session: captureService.session)
.ignoresSafeArea()
// Storyboard alignment overlay
StoryboardGuideOverlay()
} else {
ContentUnavailableView(
"Camera Access Required",
systemImage: "camera.fill",
description: Text(
"Allow camera access to scan storyboard panels."
)
)
}
}
.task {
cameraAuthorized = await CaptureService.requestCameraAccess()
if cameraAuthorized {
try? captureService.configure()
captureService.startSession()
}
}
}
}
struct StoryboardGuideOverlay: View {
var body: some View {
RoundedRectangle(cornerRadius: 12)
.stroke(Color.yellow, lineWidth: 2)
.padding(40)
.overlay {
Text("Align storyboard panel")
.font(.caption)
.foregroundStyle(.yellow)
.padding(.top, 48)
}
}
}
Capturing Photos
Add an AVCapturePhotoOutput to the
session and implement the delegate to receive captured images:
final class PhotoCaptureHandler: NSObject, AVCapturePhotoCaptureDelegate {
let photoOutput = AVCapturePhotoOutput()
var onPhotoCaptured: ((Data) -> Void)?
func addToSession(_ session: AVCaptureSession) throws {
guard session.canAddOutput(photoOutput) else {
throw CaptureError.outputNotSupported
}
session.addOutput(photoOutput)
}
/// Captures a high-quality photo of the current storyboard panel.
func captureStoryboardPanel() {
// Request HEVC/HEIF for smaller file sizes when the device supports it
let settings: AVCapturePhotoSettings
if photoOutput.availablePhotoCodecTypes.contains(.hevc) {
settings = AVCapturePhotoSettings(
format: [AVVideoCodecKey: AVVideoCodecType.hevc]
)
} else {
settings = AVCapturePhotoSettings()
}
settings.flashMode = .auto
settings.photoQualityPrioritization = .balanced
photoOutput.capturePhoto(with: settings, delegate: self)
}
func photoOutput(
_ output: AVCapturePhotoOutput,
didFinishProcessingPhoto photo: AVCapturePhoto,
error: Error?
) {
guard error == nil,
let data = photo.fileDataRepresentation() else {
return
}
onPhotoCaptured?(data)
}
}
Note: AVCapturePhotoOutput replaces the deprecated AVCaptureStillImageOutput. Always use AVCapturePhotoOutput for new code — it supports Live Photos, depth data, and multi-image bracketing.
Real-Time Frame Processing
For real-time overlays, barcode detection, or feeding frames to Core ML, use
AVCaptureVideoDataOutput:
final class FrameAnalyzer: NSObject,
AVCaptureVideoDataOutputSampleBufferDelegate
{
private let processingQueue = DispatchQueue(
label: "com.pixar.storyboard.frameAnalysis",
qos: .userInitiated
)
func configureOutput(for session: AVCaptureSession) throws {
let videoOutput = AVCaptureVideoDataOutput()
videoOutput.setSampleBufferDelegate(self, queue: processingQueue)
videoOutput.alwaysDiscardsLateVideoFrames = true
guard session.canAddOutput(videoOutput) else {
throw CaptureError.outputNotSupported
}
session.addOutput(videoOutput)
}
func captureOutput(
_ output: AVCaptureOutput,
didOutput sampleBuffer: CMSampleBuffer,
from connection: AVCaptureConnection
) {
guard let pixelBuffer = CMSampleBufferGetImageBuffer(
sampleBuffer
) else { return }
// Feed to Vision, Core ML, or Core Image
analyzeFrame(pixelBuffer)
}
private func analyzeFrame(_ buffer: CVPixelBuffer) {
// Run storyboard panel detection, alignment analysis, etc.
}
}
Set alwaysDiscardsLateVideoFrames = true so that if your processing takes longer than one frame interval, the pipeline
drops frames rather than building up a backlog that would cause memory pressure and increasing latency.
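You can also observe when drops happen: the same delegate protocol has a didDrop callback, and the dropped buffer carries an attachment naming the cause. A sketch added to FrameAnalyzer:

```swift
import AVFoundation

extension FrameAnalyzer {
    // Called instead of captureOutput(_:didOutput:from:) when the
    // pipeline discards a frame
    func captureOutput(
        _ output: AVCaptureOutput,
        didDrop sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        // The attachment names the cause — e.g. "OutOfBuffers" when
        // the processing callback retains buffers for too long
        let reason = CMGetAttachment(
            sampleBuffer,
            key: kCMSampleBufferAttachmentKey_DroppedFrameReason,
            attachmentModeOut: nil
        )
        print("Dropped frame: \(String(describing: reason))")
    }
}
```

Frequent drops mean your per-frame work exceeds the frame interval; either simplify the analysis or lower the requested frame rate.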
Recording Video
For video recording,
AVCaptureMovieFileOutput is the
simplest path — it writes directly to a file:
final class VideoRecorder: NSObject,
AVCaptureFileOutputRecordingDelegate
{
let movieOutput = AVCaptureMovieFileOutput()
func addToSession(_ session: AVCaptureSession) throws {
guard session.canAddOutput(movieOutput) else {
throw CaptureError.outputNotSupported
}
session.addOutput(movieOutput)
// Limit recording to 5 minutes for storyboard walkthroughs
movieOutput.maxRecordedDuration = CMTime(
seconds: 300, preferredTimescale: 600
)
}
/// Starts recording a storyboard walkthrough video.
func startRecording() {
let tempURL = FileManager.default.temporaryDirectory
.appendingPathComponent(
"storyboard_\(UUID().uuidString).mov"
)
movieOutput.startRecording(to: tempURL, recordingDelegate: self)
}
func stopRecording() {
movieOutput.stopRecording()
}
func fileOutput(
_ output: AVCaptureFileOutput,
didFinishRecordingTo outputFileURL: URL,
from connections: [AVCaptureConnection],
error: Error?
) {
if let error {
print("Recording failed: \(error.localizedDescription)")
return
}
handleRecordedVideo(at: outputFileURL)
}
private func handleRecordedVideo(at url: URL) {
// Save to Photos, upload, or pass to AVAssetWriter for post-processing
}
}
Adding Audio to Video
To record video with audio, add a microphone input to the session. Request microphone permission first
(NSMicrophoneUsageDescription in Info.plist):
extension CaptureService {
func addMicrophoneInput() throws {
guard let mic = AVCaptureDevice.default(for: .audio) else {
throw CaptureError.cameraUnavailable
}
let audioInput = try AVCaptureDeviceInput(device: mic)
guard session.canAddInput(audioInput) else {
throw CaptureError.inputNotSupported
}
session.addInput(audioInput)
}
}
Once both video and audio inputs are on the session, AVCaptureMovieFileOutput automatically muxes them into the output
file.
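Before adding the microphone input, request microphone authorization the same way as camera authorization. A parallel sketch of requestCameraAccess() from earlier (requestMicrophoneAccess() is a name this post introduces, not an AVFoundation API):

```swift
import AVFoundation

extension CaptureService {
    /// Mirrors requestCameraAccess() for the microphone.
    static func requestMicrophoneAccess() async -> Bool {
        switch AVCaptureDevice.authorizationStatus(for: .audio) {
        case .authorized:
            return true
        case .notDetermined:
            return await AVCaptureDevice.requestAccess(for: .audio)
        case .denied, .restricted:
            return false
        @unknown default:
            return false
        }
    }
}
```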
Audio Capture with AVAudioEngine
For standalone audio recording — voice-over annotations, sound effects, or audio-only capture —
AVAudioEngine gives you a real-time processing
graph with nodes for mixing, effects, and metering.
import AVFAudio
final class VoiceOverRecorder {
private let engine = AVAudioEngine()
private var audioFile: AVAudioFile?
/// Configures the audio session for recording.
func configureAudioSession() throws {
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(
.playAndRecord,
mode: .default,
options: [.defaultToSpeaker, .allowBluetooth]
)
try audioSession.setActive(true)
}
/// Starts recording voice-over to a file.
func startRecording(to url: URL) throws {
let inputNode = engine.inputNode
let format = inputNode.outputFormat(forBus: 0)
audioFile = try AVAudioFile(
forWriting: url,
settings: format.settings,
commonFormat: .pcmFormatFloat32,
interleaved: false
)
inputNode.installTap(
onBus: 0,
bufferSize: 4096,
format: format
) { [weak self] buffer, _ in
try? self?.audioFile?.write(from: buffer)
}
try engine.start()
}
/// Stops recording and cleans up.
func stopRecording() {
engine.inputNode.removeTap(onBus: 0)
engine.stop()
audioFile = nil
}
}
Warning: Always call removeTap(onBus:) before stopping the engine. Stopping the engine with an active tap can cause a crash on older iOS versions.
Use .playAndRecord if your app plays audio while recording (monitoring), or .record if you only capture. The
defaultToSpeaker option routes playback to the speaker instead of the earpiece — essential for monitoring voice-overs
during recording.
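The same tap that writes the file can also drive a level meter — the "precise audio metering" from the intro. The RMS math below is standard; feeding it from the tap's floatChannelData is sketched in the trailing comment (rmsPower is a helper defined here, not a framework API):

```swift
import Foundation

/// Computes the RMS power of a buffer of samples in decibels,
/// clamped to a -80 dB floor (a common metering convention).
func rmsPower(of samples: [Float]) -> Float {
    guard !samples.isEmpty else { return -80 }
    let meanSquare = samples.reduce(0) { $0 + $1 * $1 } / Float(samples.count)
    let rms = meanSquare.squareRoot()
    return max(20 * log10(rms), -80)
}

// Inside the installTap closure, alongside the file write:
// if let channelData = buffer.floatChannelData {
//     let samples = Array(UnsafeBufferPointer(
//         start: channelData[0], count: Int(buffer.frameLength)
//     ))
//     let level = rmsPower(of: samples)  // drive a level-meter UI
// }
```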
iOS 26: Spatial Audio Capture
iOS 26 introduces spatial audio recording through AVAssetWriter, enabling apps to capture immersive 3D audio that
responds to head tracking when played back on AirPods Pro or AirPods Max.
Note: Spatial Audio capture requires iOS 26, a device with multiple microphones (iPhone 12 or later), and the com.apple.developer.spatial-audio entitlement.
@available(iOS 26, *)
final class SpatialAudioRecorder {
private var assetWriter: AVAssetWriter?
private var audioWriterInput: AVAssetWriterInput?
private let engine = AVAudioEngine()
/// Configures AVAssetWriter for spatial audio recording.
func configureForSpatialAudio(outputURL: URL) throws {
let writer = try AVAssetWriter(
outputURL: outputURL, fileType: .wav
)
// Configure spatial audio output settings
let audioSettings: [String: Any] = [
AVFormatIDKey: kAudioFormatLinearPCM,
AVSampleRateKey: 48000,
AVNumberOfChannelsKey: 4, // Ambisonics (W, X, Y, Z)
AVLinearPCMBitDepthKey: 32,
AVLinearPCMIsFloatKey: true,
AVLinearPCMIsBigEndianKey: false,
AVLinearPCMIsNonInterleaved: false
]
let audioInput = AVAssetWriterInput(
mediaType: .audio,
outputSettings: audioSettings
)
audioInput.expectsMediaDataInRealTime = true
guard writer.canAdd(audioInput) else {
throw CaptureError.outputNotSupported
}
writer.add(audioInput)
self.assetWriter = writer
self.audioWriterInput = audioInput
}
/// Starts spatial audio recording.
func startRecording() throws {
guard let assetWriter, assetWriter.startWriting() else { return }
let inputNode = engine.inputNode
let format = inputNode.outputFormat(forBus: 0)
var sessionStarted = false
inputNode.installTap(
onBus: 0,
bufferSize: 4096,
format: format
) { [weak self] buffer, time in
guard let self,
let audioInput = self.audioWriterInput,
audioInput.isReadyForMoreMediaData,
let sampleBuffer =
buffer.asCMSampleBuffer(time: time)
else { return }
// Start the writer session at the first buffer's timestamp —
// tap sample times are not zero-based, so starting at .zero
// would prepend a long stretch of silence
if !sessionStarted {
assetWriter.startSession(
atSourceTime: sampleBuffer.presentationTimeStamp
)
sessionStarted = true
}
audioInput.append(sampleBuffer)
}
try engine.start()
}
/// Stops recording and finalizes the asset.
func stopRecording() async {
engine.inputNode.removeTap(onBus: 0)
engine.stop()
audioWriterInput?.markAsFinished()
await assetWriter?.finishWriting()
}
}
The asCMSampleBuffer(time:) helper converts the AVAudioPCMBuffer into a CMSampleBuffer that AVAssetWriterInput
accepts:
extension AVAudioPCMBuffer {
/// Converts a PCM buffer to a CMSampleBuffer for AVAssetWriter.
func asCMSampleBuffer(
time: AVAudioTime
) -> CMSampleBuffer? {
let audioBufferList = mutableAudioBufferList
let sampleRate = format.sampleRate
var timing = CMSampleTimingInfo(
duration: CMTime(
value: CMTimeValue(frameLength),
timescale: CMTimeScale(sampleRate)
),
presentationTimeStamp: CMTime(
value: CMTimeValue(time.sampleTime),
timescale: CMTimeScale(sampleRate)
),
decodeTimeStamp: .invalid
)
var formatDescription: CMAudioFormatDescription?
CMAudioFormatDescriptionCreate(
allocator: kCFAllocatorDefault,
asbd: format.streamDescription,
layoutSize: 0,
layout: nil,
magicCookieSize: 0,
magicCookie: nil,
extensions: nil,
formatDescriptionOut: &formatDescription
)
guard let desc = formatDescription else { return nil }
var sampleBuffer: CMSampleBuffer?
CMSampleBufferCreate(
allocator: kCFAllocatorDefault,
dataBuffer: nil,
dataReady: false,
makeDataReadyCallback: nil,
refcon: nil,
formatDescription: desc,
sampleCount: CMItemCount(frameLength),
sampleTimingEntryCount: 1,
sampleTimingArray: &timing,
sampleSizeEntryCount: 0,
sampleSizeArray: nil,
sampleBufferOut: &sampleBuffer
)
if let buffer = sampleBuffer {
CMSampleBufferSetDataBufferFromAudioBufferList(
buffer,
blockBufferAllocator: kCFAllocatorDefault,
blockBufferMemoryAllocator: kCFAllocatorDefault,
flags: 0,
bufferList: audioBufferList
)
}
return sampleBuffer
}
}
The key difference from standard audio recording is the 4-channel Ambisonic format (first-order: W, X, Y, Z channels). This captures the spatial characteristics of the sound field, which the system uses alongside head-tracking data during playback to position sounds in 3D space around the listener.
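For intuition about those four channels, first-order Ambisonics encodes a mono source at a given azimuth and elevation as weighted spherical-harmonic components. A sketch in the traditional B-format (FuMa) weighting — one of several channel conventions, so treat it as illustrative rather than what the system uses internally:

```swift
import Foundation

/// Encodes a mono sample into first-order Ambisonic B-format (FuMa
/// convention): W is omnidirectional, X/Y/Z are figure-of-eight
/// components along front, left, and up.
func encodeBFormat(
    sample: Float, azimuth: Float, elevation: Float
) -> (w: Float, x: Float, y: Float, z: Float) {
    let w = sample / Float(2).squareRoot()           // omni, -3 dB
    let x = sample * cos(azimuth) * cos(elevation)   // front-back
    let y = sample * sin(azimuth) * cos(elevation)   // left-right
    let z = sample * sin(elevation)                  // up-down
    return (w, x, y, z)
}

// A source directly ahead (azimuth 0, elevation 0) lands entirely in
// W and X; panning it left moves energy from X into Y.
```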
Apple Docs: AVAssetWriter — AVFoundation
Performance Considerations
Camera and audio capture are among the most resource-intensive operations on iOS. Here are the key bottlenecks and mitigations:
| Concern | Impact | Mitigation |
|---|---|---|
| Session startup | 100-500 ms blocking | Start on a background queue, show a loading state |
| Frame processing | Called 30-60 times/sec | Dedicated queue, set alwaysDiscardsLateVideoFrames |
| Memory from buffers | ~33 MB per 4K frame | Process and release buffers promptly |
| Thermal throttling | Sustained capture heats device | Monitor ProcessInfo.thermalState |
| Battery drain | Camera + GPS + processing | Use the lowest acceptable resolution and frame rate |
Monitor thermal state to degrade gracefully:
extension CaptureService {
func observeThermalState() {
NotificationCenter.default.addObserver(
forName: ProcessInfo.thermalStateDidChangeNotification,
object: nil,
queue: .main
) { [weak self] _ in
let state = ProcessInfo.processInfo.thermalState
switch state {
case .nominal, .fair:
self?.session.sessionPreset = .photo
case .serious:
self?.session.sessionPreset = .medium
case .critical:
self?.session.sessionPreset = .low
@unknown default:
break
}
}
}
}
Tip: Use Instruments’ Activity Monitor and Metal System Trace templates to profile capture pipelines. The Camera template shows frame delivery timing and dropped frame counts, helping you identify whether your processing callback is the bottleneck or the device simply cannot keep up at the requested frame rate.
When to Use (and When Not To)
| Scenario | Recommendation |
|---|---|
| Custom camera UI with overlays | AVFoundation capture pipeline is the right tool |
| Simple photo capture | Use UIImagePickerController — far less code |
| Real-time frame processing | AVCaptureVideoDataOutput with a delegate on a dedicated queue |
| Standard video recording | AVCaptureMovieFileOutput handles muxing for you |
| Custom video encoding | AVAssetWriter with AVCaptureVideoDataOutput for full control |
| Audio-only recording | AVAudioEngine for processing; AVAudioRecorder for simple files |
| Spatial audio (iOS 26) | AVAssetWriter with Ambisonic format on compatible hardware |
| Screen recording | Use ReplayKit, not AVFoundation capture |
Summary
- AVCaptureSession is the pipeline coordinator connecting inputs to outputs. Always configure atomically with beginConfiguration()/commitConfiguration() and run the session on a background queue.
- Device discovery via AVCaptureDevice.DiscoverySession lets you enumerate cameras by type and position. Lock devices for configuration changes and unlock immediately.
- Custom camera views in SwiftUI use UIViewRepresentable wrapping AVCaptureVideoPreviewLayer. Override layerClass for the cleanest integration.
- Photo capture uses AVCapturePhotoOutput with HEIF support. Video recording uses AVCaptureMovieFileOutput for simplicity or AVAssetWriter for full encoding control.
- AVAudioEngine provides a real-time audio processing graph with taps for recording and metering. Always remove taps before stopping the engine.
- Spatial Audio on iOS 26 captures 4-channel Ambisonic audio through AVAssetWriter, enabling head-tracked playback on compatible AirPods.
To process captured frames with filters and effects in real time, explore Core Image Filters. For extracting text from captured storyboard panels, check out Vision OCR Scanning.