AVFoundation: Custom Camera and Audio Capture in iOS


The system camera is fine for snapshots, but the moment you need a custom viewfinder overlay, real-time frame processing, or precise audio metering, you need to build your own capture pipeline. AVFoundation gives you full control over every camera and microphone on the device — from selecting lenses and configuring frame rates to writing spatial audio tracks with iOS 26’s new AVAssetWriter capabilities.

This post walks through the complete capture stack: session configuration, device discovery, preview layers, photo and video output, AVAudioEngine for real-time audio processing, and the iOS 26 Spatial Audio recording API. We will not cover playback (AVPlayer) or editing (AVMutableComposition) — those deserve their own deep dives.


The Problem

You are building a Pixar Storyboard Scanner — an app that lets animators photograph hand-drawn storyboard panels, overlay alignment guides in real time, capture voice-over annotations, and export the result as a video with spatial audio. The system UIImagePickerController cannot do any of this: no custom overlays, no frame-level processing, no audio routing control.

Here is what a naive first attempt might look like:

import UIKit

// This gives you zero control over the viewfinder
let picker = UIImagePickerController()
picker.sourceType = .camera
picker.allowsEditing = true
// No custom overlay, no real-time processing, no audio metering
// No frame rate control, no lens selection, no spatial audio

UIImagePickerController hands you a photo after the fact. For anything beyond point-and-shoot, you need to assemble the AVFoundation capture pipeline yourself. Let’s build it piece by piece.

AVCaptureSession: The Pipeline Core

AVCaptureSession is the central coordinator that connects inputs (cameras, microphones) to outputs (photo capture, video recording, data buffers). Think of it as the Pixar render pipeline: sources go in, processed frames come out.

import AVFoundation

final class CaptureService {
    let session = AVCaptureSession()
    private let sessionQueue = DispatchQueue(label: "com.pixar.storyboard.session")

    func configure() throws {
        session.beginConfiguration()
        defer { session.commitConfiguration() }

        // Set the session preset — determines resolution and quality trade-offs
        session.sessionPreset = .photo

        // Add video input
        guard let camera = AVCaptureDevice.default(
            .builtInWideAngleCamera, for: .video, position: .back
        ) else {
            throw CaptureError.cameraUnavailable
        }
        let videoInput = try AVCaptureDeviceInput(device: camera)
        guard session.canAddInput(videoInput) else {
            throw CaptureError.inputNotSupported
        }
        session.addInput(videoInput)
    }

    func startSession() {
        sessionQueue.async { [session] in
            guard !session.isRunning else { return }
            session.startRunning()
        }
    }

    func stopSession() {
        sessionQueue.async { [session] in
            guard session.isRunning else { return }
            session.stopRunning()
        }
    }
}

enum CaptureError: Error {
    case cameraUnavailable
    case inputNotSupported
    case outputNotSupported
    case recordingFailed
}

Warning: Always wrap configuration changes between beginConfiguration() and commitConfiguration(). This batches changes into a single atomic update, preventing the session from entering an inconsistent state where an input is removed before its replacement is added.
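To make the atomicity concrete, here is one way camera switching might look. The `switchCamera(to:)` helper is a hypothetical addition to `CaptureService` — the old input is removed and the new one added inside a single begin/commit pair, so the session never runs camera-less:

```swift
import AVFoundation

extension CaptureService {
    /// Hypothetical helper: swaps the active camera in one atomic update.
    func switchCamera(to device: AVCaptureDevice) throws {
        session.beginConfiguration()
        defer { session.commitConfiguration() }  // commit even on early throw

        // Remove the current video input. Nothing takes effect until
        // commitConfiguration(), so removal and addition apply together.
        if let current = session.inputs
            .compactMap({ $0 as? AVCaptureDeviceInput })
            .first(where: { $0.device.hasMediaType(.video) }) {
            session.removeInput(current)
        }

        let newInput = try AVCaptureDeviceInput(device: device)
        guard session.canAddInput(newInput) else {
            throw CaptureError.inputNotSupported
        }
        session.addInput(newInput)
    }
}
```

If the guard throws, the `defer` still commits, leaving the session with no video input — a caller would typically catch the error and restore the previous device.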

The session runs on its own serial dispatch queue internally, but startRunning() and stopRunning() are synchronous and blocking. Never call them on the main thread — they can take 100-500 ms depending on the device and session complexity.

Device Discovery and Configuration

Modern iPhones have multiple cameras — wide, ultra-wide, telephoto, and front-facing. AVCaptureDevice.DiscoverySession lets you enumerate available devices and pick the right one for your use case.

struct DeviceSelector {
    /// Returns all back-facing cameras ordered by focal length.
    static func availableBackCameras() -> [AVCaptureDevice] {
        let discovery = AVCaptureDevice.DiscoverySession(
            deviceTypes: [
                .builtInWideAngleCamera,
                .builtInUltraWideCamera,
                .builtInTelephotoCamera
            ],
            mediaType: .video,
            position: .back
        )
        return discovery.devices
    }

    /// Returns the best available camera for document scanning.
    static func storyboardCamera() -> AVCaptureDevice? {
        AVCaptureDevice.default(
            .builtInWideAngleCamera, for: .video, position: .back
        )
    }
}

Configuring Device Properties

Once you have a device, lock it for configuration to adjust focus, exposure, and frame rate:

extension CaptureService {
    /// Configures the camera for storyboard scanning: autofocus, stable exposure.
    func configureForScanning(device: AVCaptureDevice) throws {
        try device.lockForConfiguration()
        defer { device.unlockForConfiguration() }

        // Enable continuous autofocus for document scanning
        if device.isFocusModeSupported(.continuousAutoFocus) {
            device.focusMode = .continuousAutoFocus
        }

        // Keep exposure adapting to the scene (use .locked to freeze it entirely)
        if device.isExposureModeSupported(.continuousAutoExposure) {
            device.exposureMode = .continuousAutoExposure
        }

        // Target 30 fps: each frame lasts 1/30 second. In production, verify
        // the rate against activeFormat.videoSupportedFrameRateRanges first.
        let frameDuration = CMTime(value: 1, timescale: 30)
        device.activeVideoMinFrameDuration = frameDuration
        device.activeVideoMaxFrameDuration = frameDuration
    }
}

Tip: Call lockForConfiguration() as briefly as possible. While locked, other apps and system processes cannot adjust the device. The defer pattern ensures you always unlock, even on early returns.
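Setting a frame duration outside the active format's supported ranges raises a runtime exception, so it is worth validating first. A sketch — `clampedFrameRate` and `setFrameRate(_:on:)` are hypothetical helpers, with the clamping logic kept pure so it can be unit-tested without a device:

```swift
import AVFoundation

/// Clamps a desired frame rate into one of the supported ranges.
/// Pure helper so the logic is testable without camera hardware.
func clampedFrameRate(
    _ desired: Double,
    supportedRanges: [(min: Double, max: Double)]
) -> Double? {
    guard !supportedRanges.isEmpty else { return nil }
    // If any range already contains the desired rate, use it as-is.
    if supportedRanges.contains(where: { desired >= $0.min && desired <= $0.max }) {
        return desired
    }
    // Otherwise snap to the nearest range boundary.
    let bounds = supportedRanges.flatMap { [$0.min, $0.max] }
    return bounds.min { abs($0 - desired) < abs($1 - desired) }
}

extension CaptureService {
    /// Hypothetical helper: applies a frame rate only after clamping it
    /// into the active format's supported ranges.
    func setFrameRate(_ fps: Double, on device: AVCaptureDevice) throws {
        let ranges = device.activeFormat.videoSupportedFrameRateRanges
            .map { (min: $0.minFrameRate, max: $0.maxFrameRate) }
        guard let safeFPS = clampedFrameRate(fps, supportedRanges: ranges) else {
            return
        }
        try device.lockForConfiguration()
        defer { device.unlockForConfiguration() }
        let duration = CMTime(value: 1, timescale: CMTimeScale(safeFPS))
        device.activeVideoMinFrameDuration = duration
        device.activeVideoMaxFrameDuration = duration
    }
}
```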

Camera Permissions

Before any of this works, you need the user’s permission. Declare NSCameraUsageDescription in your Info.plist and request authorization:

extension CaptureService {
    static func requestCameraAccess() async -> Bool {
        let status = AVCaptureDevice.authorizationStatus(for: .video)
        switch status {
        case .authorized:
            return true
        case .notDetermined:
            return await AVCaptureDevice.requestAccess(for: .video)
        case .denied, .restricted:
            return false
        @unknown default:
            return false
        }
    }
}

Building a Custom Camera View

SwiftUI does not have a native camera preview, so you bridge AVCaptureVideoPreviewLayer through UIViewRepresentable:

import SwiftUI

struct CameraPreviewView: UIViewRepresentable {
    let session: AVCaptureSession

    func makeUIView(context: Context) -> CameraPreviewUIView {
        let view = CameraPreviewUIView()
        view.previewLayer.session = session
        view.previewLayer.videoGravity = .resizeAspectFill
        return view
    }

    func updateUIView(_ uiView: CameraPreviewUIView, context: Context) {
        // Session changes are handled externally
    }
}

final class CameraPreviewUIView: UIView {
    override class var layerClass: AnyClass {
        AVCaptureVideoPreviewLayer.self
    }

    var previewLayer: AVCaptureVideoPreviewLayer {
        layer as! AVCaptureVideoPreviewLayer
    }
}

Overriding layerClass is cleaner than adding the preview layer as a sublayer: the preview layer becomes the view's backing layer, so it tracks the view's bounds automatically and never drifts out of sync during layout or rotation.
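The backing layer also makes coordinate conversion easy for features like tap-to-focus. The `focus(at:device:)` helper below is a hypothetical sketch showing how a touch point in view coordinates maps to the device's normalized (0-1) space:

```swift
import AVFoundation
import UIKit

extension CameraPreviewUIView {
    /// Hypothetical tap-to-focus helper: converts a touch point in view
    /// coordinates to the capture device's normalized coordinate space.
    func focus(at viewPoint: CGPoint, device: AVCaptureDevice) {
        let devicePoint = previewLayer.captureDevicePointConverted(
            fromLayerPoint: viewPoint
        )
        do {
            try device.lockForConfiguration()
            defer { device.unlockForConfiguration() }
            if device.isFocusPointOfInterestSupported {
                device.focusPointOfInterest = devicePoint
                device.focusMode = .autoFocus
            }
        } catch {
            // Could not acquire the configuration lock; leave focus unchanged.
        }
    }
}
```

`captureDevicePointConverted(fromLayerPoint:)` accounts for the layer's videoGravity and orientation, which is why converting manually is error-prone.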

Now compose it with a custom overlay for the storyboard alignment guides:

struct StoryboardScannerView: View {
    @State private var captureService = CaptureService()
    @State private var cameraAuthorized = false

    var body: some View {
        ZStack {
            if cameraAuthorized {
                CameraPreviewView(session: captureService.session)
                    .ignoresSafeArea()

                // Storyboard alignment overlay
                StoryboardGuideOverlay()
            } else {
                ContentUnavailableView(
                    "Camera Access Required",
                    systemImage: "camera.fill",
                    description: Text(
                        "Allow camera access to scan storyboard panels."
                    )
                )
            }
        }
        .task {
            cameraAuthorized = await CaptureService.requestCameraAccess()
            if cameraAuthorized {
                try? captureService.configure()
                captureService.startSession()
            }
        }
    }
}

struct StoryboardGuideOverlay: View {
    var body: some View {
        RoundedRectangle(cornerRadius: 12)
            .stroke(Color.yellow, lineWidth: 2)
            .padding(40)
            .overlay {
                Text("Align storyboard panel")
                    .font(.caption)
                    .foregroundStyle(.yellow)
                    .padding(.top, 48)
            }
    }
}

Capturing Photos

Add an AVCapturePhotoOutput to the session and implement the delegate to receive captured images:

final class PhotoCaptureHandler: NSObject, AVCapturePhotoCaptureDelegate {
    let photoOutput = AVCapturePhotoOutput()
    var onPhotoCaptured: ((Data) -> Void)?

    func addToSession(_ session: AVCaptureSession) throws {
        guard session.canAddOutput(photoOutput) else {
            throw CaptureError.outputNotSupported
        }
        session.addOutput(photoOutput)
    }

    /// Captures a high-quality photo of the current storyboard panel.
    func captureStoryboardPanel() {
        // Request HEIF (HEVC codec) for smaller files when available
        let settings: AVCapturePhotoSettings
        if photoOutput.availablePhotoCodecTypes.contains(.hevc) {
            settings = AVCapturePhotoSettings(
                format: [AVVideoCodecKey: AVVideoCodecType.hevc]
            )
        } else {
            settings = AVCapturePhotoSettings()
        }
        settings.flashMode = .auto
        settings.photoQualityPrioritization = .balanced
        photoOutput.capturePhoto(with: settings, delegate: self)
    }

    func photoOutput(
        _ output: AVCapturePhotoOutput,
        didFinishProcessingPhoto photo: AVCapturePhoto,
        error: Error?
    ) {
        guard error == nil,
              let data = photo.fileDataRepresentation() else {
            return
        }
        onPhotoCaptured?(data)
    }
}

Note: AVCapturePhotoOutput replaces the deprecated AVCaptureStillImageOutput. Always use AVCapturePhotoOutput for new code — it supports Live Photos, depth data, and multi-image bracketing.

Real-Time Frame Processing

For real-time overlays, barcode detection, or feeding frames to Core ML, use AVCaptureVideoDataOutput:

final class FrameAnalyzer: NSObject,
    AVCaptureVideoDataOutputSampleBufferDelegate
{
    private let processingQueue = DispatchQueue(
        label: "com.pixar.storyboard.frameAnalysis",
        qos: .userInitiated
    )

    func configureOutput(for session: AVCaptureSession) throws {
        let videoOutput = AVCaptureVideoDataOutput()
        videoOutput.setSampleBufferDelegate(self, queue: processingQueue)
        videoOutput.alwaysDiscardsLateVideoFrames = true

        guard session.canAddOutput(videoOutput) else {
            throw CaptureError.outputNotSupported
        }
        session.addOutput(videoOutput)
    }

    func captureOutput(
        _ output: AVCaptureOutput,
        didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(
            sampleBuffer
        ) else { return }
        // Feed to Vision, Core ML, or Core Image
        analyzeFrame(pixelBuffer)
    }

    private func analyzeFrame(_ buffer: CVPixelBuffer) {
        // Run storyboard panel detection, alignment analysis, etc.
    }
}

Set alwaysDiscardsLateVideoFrames = true so that if your processing takes longer than one frame interval, the pipeline drops frames rather than building up a backlog that would cause memory pressure and increasing latency.
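To make `analyzeFrame` concrete, here is one way panel detection might work using Vision's rectangle detector. This is an illustrative sketch — the tuning values are assumptions, not calibrated settings:

```swift
import Vision

extension FrameAnalyzer {
    /// Illustrative sketch: detects a rectangular storyboard panel in a frame.
    func detectPanel(in buffer: CVPixelBuffer) {
        let request = VNDetectRectanglesRequest { request, error in
            guard error == nil,
                  let rect = request.results?.first as? VNRectangleObservation
            else { return }
            // boundingBox is in normalized (0-1) image coordinates;
            // feed it back to the overlay for alignment feedback.
            _ = rect.boundingBox
        }
        // Assumed tuning for roughly rectangular paper panels.
        request.minimumAspectRatio = 0.5
        request.minimumConfidence = 0.8

        let handler = VNImageRequestHandler(
            cvPixelBuffer: buffer, orientation: .right
        )
        try? handler.perform([request])
    }
}
```

Because this runs on the processing queue, any UI update driven by the result must hop back to the main actor.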

Recording Video

For video recording, AVCaptureMovieFileOutput is the simplest path — it writes directly to a file:

final class VideoRecorder: NSObject,
    AVCaptureFileOutputRecordingDelegate
{
    let movieOutput = AVCaptureMovieFileOutput()

    func addToSession(_ session: AVCaptureSession) throws {
        guard session.canAddOutput(movieOutput) else {
            throw CaptureError.outputNotSupported
        }
        session.addOutput(movieOutput)

        // Limit recording to 5 minutes for storyboard walkthroughs
        movieOutput.maxRecordedDuration = CMTime(
            seconds: 300, preferredTimescale: 600
        )
    }

    /// Starts recording a storyboard walkthrough video.
    func startRecording() {
        let tempURL = FileManager.default.temporaryDirectory
            .appendingPathComponent(
                "storyboard_\(UUID().uuidString).mov"
            )
        movieOutput.startRecording(to: tempURL, recordingDelegate: self)
    }

    func stopRecording() {
        movieOutput.stopRecording()
    }

    func fileOutput(
        _ output: AVCaptureFileOutput,
        didFinishRecordingTo outputFileURL: URL,
        from connections: [AVCaptureConnection],
        error: Error?
    ) {
        if let error {
            print("Recording failed: \(error.localizedDescription)")
            return
        }
        handleRecordedVideo(at: outputFileURL)
    }

    private func handleRecordedVideo(at url: URL) {
        // Save to Photos, upload, or pass to AVAssetWriter for post-processing
    }
}
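A common follow-up inside `handleRecordedVideo(at:)` is saving the file to the Photos library. A sketch — this assumes NSPhotoLibraryAddUsageDescription is declared in Info.plist:

```swift
import Photos

extension VideoRecorder {
    /// Sketch: saves a recorded video file to the user's Photos library.
    /// Requires NSPhotoLibraryAddUsageDescription in Info.plist.
    func saveToPhotos(videoURL: URL) async throws {
        try await PHPhotoLibrary.shared().performChanges {
            PHAssetChangeRequest.creationRequestForAssetFromVideo(
                atFileURL: videoURL
            )
        }
    }
}
```

The file in the temporary directory can be deleted once the change block succeeds, since Photos copies the asset into its own store.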

Adding Audio to Video

To record video with audio, add a microphone input to the session. Request microphone permission first (NSMicrophoneUsageDescription in Info.plist):

extension CaptureService {
    func addMicrophoneInput() throws {
        // Reusing cameraUnavailable here; a dedicated mic case would be clearer
        guard let mic = AVCaptureDevice.default(for: .audio) else {
            throw CaptureError.cameraUnavailable
        }
        let audioInput = try AVCaptureDeviceInput(device: mic)
        guard session.canAddInput(audioInput) else {
            throw CaptureError.inputNotSupported
        }
        session.addInput(audioInput)
    }
}

Once both video and audio inputs are on the session, AVCaptureMovieFileOutput automatically muxes them into the output file.
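Microphone authorization mirrors the camera flow. A sketch, modeled on the `requestCameraAccess()` helper above:

```swift
import AVFoundation

extension CaptureService {
    /// Requests microphone access, mirroring the camera authorization flow.
    static func requestMicrophoneAccess() async -> Bool {
        switch AVCaptureDevice.authorizationStatus(for: .audio) {
        case .authorized:
            return true
        case .notDetermined:
            // Suspends until the user responds to the system prompt
            return await AVCaptureDevice.requestAccess(for: .audio)
        case .denied, .restricted:
            return false
        @unknown default:
            return false
        }
    }
}
```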

Audio Capture with AVAudioEngine

For standalone audio recording — voice-over annotations, sound effects, or audio-only capture — AVAudioEngine gives you a real-time processing graph with nodes for mixing, effects, and metering.

import AVFAudio

final class VoiceOverRecorder {
    private let engine = AVAudioEngine()
    private var audioFile: AVAudioFile?

    /// Configures the audio session for recording.
    func configureAudioSession() throws {
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(
            .playAndRecord,
            mode: .default,
            options: [.defaultToSpeaker, .allowBluetooth]
        )
        try audioSession.setActive(true)
    }

    /// Starts recording voice-over to a file.
    func startRecording(to url: URL) throws {
        let inputNode = engine.inputNode
        let format = inputNode.outputFormat(forBus: 0)

        audioFile = try AVAudioFile(
            forWriting: url,
            settings: format.settings,
            commonFormat: .pcmFormatFloat32,
            interleaved: false
        )

        inputNode.installTap(
            onBus: 0,
            bufferSize: 4096,
            format: format
        ) { [weak self] buffer, _ in
            try? self?.audioFile?.write(from: buffer)
        }

        try engine.start()
    }

    /// Stops recording and cleans up.
    func stopRecording() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
        audioFile = nil
    }
}

Warning: Always call removeTap(onBus:) before stopping the engine. Stopping the engine with an active tap can cause a crash on older iOS versions.

Use .playAndRecord if your app plays audio while recording (monitoring), or .record if you only capture. The defaultToSpeaker option routes playback to the speaker instead of the earpiece — essential for monitoring voice-overs during recording.
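Because the tap hands you raw PCM buffers, level metering falls out almost for free. The helpers below are hypothetical additions: `rmsDecibels` averages squared samples and converts to dBFS with a -160 dB floor for silence, and `meterLevel(of:)` applies it to channel 0 of a tap buffer:

```swift
import Foundation
import AVFAudio

/// Computes RMS power in dBFS for one channel of float samples.
/// Pure function so the math is testable without an audio device.
func rmsDecibels(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return -160 }
    let meanSquare = samples.reduce(0) { $0 + $1 * $1 } / Float(samples.count)
    let rms = meanSquare.squareRoot()
    // Floor at -160 dB so log10(0) never produces -infinity.
    return max(20 * log10(rms), -160)
}

extension VoiceOverRecorder {
    /// Hypothetical metering hook: reads channel 0 of a tap buffer.
    func meterLevel(of buffer: AVAudioPCMBuffer) -> Float {
        guard let channelData = buffer.floatChannelData else { return -160 }
        let samples = Array(UnsafeBufferPointer(
            start: channelData[0], count: Int(buffer.frameLength)
        ))
        return rmsDecibels(samples)
    }
}
```

Calling `meterLevel(of:)` inside the tap closure and publishing the value to the UI gives a live input meter; remember the tap runs on a realtime-adjacent thread, so dispatch UI updates to the main actor.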

iOS 26: Spatial Audio Capture

iOS 26 introduces spatial audio recording through AVAssetWriter, enabling apps to capture immersive 3D audio that responds to head tracking when played back on AirPods Pro or AirPods Max.

Note: Spatial Audio capture requires iOS 26, a device with multiple microphones (iPhone 12 or later), and the com.apple.developer.spatial-audio entitlement.

@available(iOS 26, *)
final class SpatialAudioRecorder {
    private var assetWriter: AVAssetWriter?
    private var audioWriterInput: AVAssetWriterInput?
    private let engine = AVAudioEngine()

    /// Configures AVAssetWriter for spatial audio recording.
    func configureForSpatialAudio(outputURL: URL) throws {
        let writer = try AVAssetWriter(
            outputURL: outputURL, fileType: .wav
        )

        // Configure spatial audio output settings
        let audioSettings: [String: Any] = [
            AVFormatIDKey: kAudioFormatLinearPCM,
            AVSampleRateKey: 48000,
            AVNumberOfChannelsKey: 4, // Ambisonics (W, X, Y, Z)
            AVLinearPCMBitDepthKey: 32,
            AVLinearPCMIsFloatKey: true,
            AVLinearPCMIsBigEndianKey: false,
            AVLinearPCMIsNonInterleaved: false
        ]

        let audioInput = AVAssetWriterInput(
            mediaType: .audio,
            outputSettings: audioSettings
        )
        audioInput.expectsMediaDataInRealTime = true

        guard writer.canAdd(audioInput) else {
            throw CaptureError.outputNotSupported
        }
        writer.add(audioInput)

        self.assetWriter = writer
        self.audioWriterInput = audioInput
    }

    /// Starts spatial audio recording.
    func startRecording() throws {
        guard let assetWriter, assetWriter.startWriting() else { return }
        assetWriter.startSession(atSourceTime: .zero)

        let inputNode = engine.inputNode
        let format = inputNode.outputFormat(forBus: 0)

        inputNode.installTap(
            onBus: 0,
            bufferSize: 4096,
            format: format
        ) { [weak self] buffer, time in
            guard let self,
                  let audioInput = self.audioWriterInput,
                  audioInput.isReadyForMoreMediaData,
                  let sampleBuffer =
                      buffer.asCMSampleBuffer(time: time)
            else { return }

            audioInput.append(sampleBuffer)
        }

        try engine.start()
    }

    /// Stops recording and finalizes the asset.
    func stopRecording() async {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()

        audioWriterInput?.markAsFinished()
        await assetWriter?.finishWriting()
    }
}

The asCMSampleBuffer(time:) helper converts the AVAudioPCMBuffer into a CMSampleBuffer that AVAssetWriterInput accepts:

extension AVAudioPCMBuffer {
    /// Converts a PCM buffer to a CMSampleBuffer for AVAssetWriter.
    func asCMSampleBuffer(
        time: AVAudioTime
    ) -> CMSampleBuffer? {
        let audioBufferList = mutableAudioBufferList
        let sampleRate = format.sampleRate
        var timing = CMSampleTimingInfo(
            duration: CMTime(
                value: CMTimeValue(frameLength),
                timescale: CMTimeScale(sampleRate)
            ),
            presentationTimeStamp: CMTime(
                value: CMTimeValue(time.sampleTime),
                timescale: CMTimeScale(sampleRate)
            ),
            decodeTimeStamp: .invalid
        )
        var formatDescription: CMAudioFormatDescription?
        CMAudioFormatDescriptionCreate(
            allocator: kCFAllocatorDefault,
            asbd: format.streamDescription,
            layoutSize: 0,
            layout: nil,
            magicCookieSize: 0,
            magicCookie: nil,
            extensions: nil,
            formatDescriptionOut: &formatDescription
        )
        guard let desc = formatDescription else { return nil }

        var sampleBuffer: CMSampleBuffer?
        CMSampleBufferCreate(
            allocator: kCFAllocatorDefault,
            dataBuffer: nil,
            dataReady: false,
            makeDataReadyCallback: nil,
            refcon: nil,
            formatDescription: desc,
            sampleCount: CMItemCount(frameLength),
            sampleTimingEntryCount: 1,
            sampleTimingArray: &timing,
            sampleSizeEntryCount: 0,
            sampleSizeArray: nil,
            sampleBufferOut: &sampleBuffer
        )
        if let buffer = sampleBuffer {
            CMSampleBufferSetDataBufferFromAudioBufferList(
                buffer,
                blockBufferAllocator: kCFAllocatorDefault,
                blockBufferMemoryAllocator: kCFAllocatorDefault,
                flags: 0,
                bufferList: audioBufferList
            )
        }
        return sampleBuffer
    }
}

The key difference from standard audio recording is the 4-channel Ambisonic format (first-order: W, X, Y, Z channels). This captures the spatial characteristics of the sound field, which the system uses alongside head-tracking data during playback to position sounds in 3D space around the listener.

Apple Docs: AVAssetWriter — AVFoundation

Performance Considerations

Camera and audio capture are among the most resource-intensive operations on iOS. Here are the key bottlenecks and mitigations:

| Concern | Impact | Mitigation |
| --- | --- | --- |
| Session startup | 100-500 ms blocking | Start on a background queue, show a loading state |
| Frame processing | Called 30-60 times/sec | Dedicated queue, set alwaysDiscardsLateVideoFrames |
| Memory from buffers | ~33 MB per 4K frame | Process and release buffers promptly |
| Thermal throttling | Sustained capture heats device | Monitor ProcessInfo.thermalState |
| Battery drain | Camera + GPS + processing | Use the lowest acceptable resolution and frame rate |

Monitor thermal state to degrade gracefully:

extension CaptureService {
    func observeThermalState() {
        NotificationCenter.default.addObserver(
            forName: ProcessInfo.thermalStateDidChangeNotification,
            object: nil,
            queue: .main
        ) { [weak self] _ in
            let state = ProcessInfo.processInfo.thermalState
            switch state {
            case .nominal, .fair:
                self?.session.sessionPreset = .photo
            case .serious:
                self?.session.sessionPreset = .medium
            case .critical:
                self?.session.sessionPreset = .low
            @unknown default:
                break
            }
        }
    }
}

Tip: Use Instruments’ Activity Monitor and Metal System Trace templates to profile capture pipelines. The Camera template shows frame delivery timing and dropped frame counts, helping you identify whether your processing callback is the bottleneck or the device simply cannot keep up at the requested frame rate.

When to Use (and When Not To)

| Scenario | Recommendation |
| --- | --- |
| Custom camera UI with overlays | AVFoundation capture pipeline is the right tool |
| Simple photo capture | Use PhotosPicker — far less code |
| Real-time frame processing | AVCaptureVideoDataOutput with a delegate on a dedicated queue |
| Standard video recording | AVCaptureMovieFileOutput handles muxing for you |
| Custom video encoding | AVAssetWriter with AVCaptureVideoDataOutput for full control |
| Audio-only recording | AVAudioEngine for processing; AVAudioRecorder for simple files |
| Spatial audio (iOS 26) | AVAssetWriter with Ambisonic format on compatible hardware |
| Screen recording | Use ReplayKit, not AVFoundation capture |

Summary

  • AVCaptureSession is the pipeline coordinator connecting inputs to outputs. Always configure atomically with beginConfiguration() / commitConfiguration() and run the session on a background queue.
  • Device discovery via AVCaptureDevice.DiscoverySession lets you enumerate cameras by type and position. Lock devices for configuration changes and unlock immediately.
  • Custom camera views in SwiftUI use UIViewRepresentable wrapping AVCaptureVideoPreviewLayer. Override layerClass for the cleanest integration.
  • Photo capture uses AVCapturePhotoOutput with HEIF support. Video recording uses AVCaptureMovieFileOutput for simplicity or AVAssetWriter for full encoding control.
  • AVAudioEngine provides a real-time audio processing graph with taps for recording and metering. Always remove taps before stopping the engine.
  • Spatial Audio on iOS 26 captures 4-channel Ambisonic audio through AVAssetWriter, enabling head-tracked playback on compatible AirPods.

To process captured frames with filters and effects in real time, explore Core Image Filters. For extracting text from captured storyboard panels, check out Vision OCR Scanning.