Vision Framework: OCR, Document Scanning, and Text Recognition
Your app has a camera. Your users have receipts, business cards, movie posters, and handwritten notes. Bridging the gap between pixels and structured text used to require third-party SDKs and server round-trips. The Vision framework makes it a local, offline, single-API-call operation — and with iOS 26, it got even better.
This post covers the three tiers of text recognition Apple ships today: VNRecognizeTextRequest for static image OCR,
DataScannerViewController for real-time camera scanning, and the new Swift 6-native RecognizeDocumentsRequest for
full document understanding. We will not cover barcode detection or face recognition — those deserve their own deep
dives.
Contents
- The Problem
- Tier 1: VNRecognizeTextRequest for Static Image OCR
- Tier 2: DataScannerViewController for Live Camera Scanning
- Tier 3: RecognizeDocumentsRequest in iOS 26
- Advanced Usage
- Performance Considerations
- When to Use (and When Not To)
- Summary
The Problem
Imagine you are building a Pixar movie archive app. Users photograph their physical Blu-ray collection, and the app needs to extract the movie title, studio name, and rating from the cover art. A naive approach might look like this:
```swift
import UIKit

func extractMovieTitle(from image: UIImage) -> String? {
    // Option A: Ship the image to a server for OCR
    //   - Requires network connectivity
    //   - Adds latency (200-800ms round trip)
    //   - Raises privacy concerns (user images leave the device)
    // Option B: Bundle a third-party OCR library
    //   - Adds 20-50 MB to your binary
    //   - Licensing headaches
    //   - Often lags behind Apple's hardware-tuned models
    return nil // Neither option is great
}
```
Both paths carry real costs. Vision eliminates them. Apple ships a neural-network-based text recognizer that runs on the Neural Engine, requires zero network access, supports 18+ languages out of the box, and weighs nothing in your app bundle because it lives in the OS.
Tier 1: VNRecognizeTextRequest for Static Image OCR
VNRecognizeTextRequest is the workhorse API
for extracting text from a still image or a single video frame. It has been available since iOS 13 and has improved
significantly with each release.
Setting Up the Request
The core pattern is straightforward: create a request, configure its recognition level, feed it an image through a request handler, and read the results.
```swift
import Vision
import UIKit

func recognizeText(in image: UIImage) async throws -> [String] {
    guard let cgImage = image.cgImage else {
        throw OCRError.invalidImage
    }

    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate // .fast for real-time scenarios
    request.recognitionLanguages = ["en-US", "es-ES"]
    request.usesLanguageCorrection = true // Post-processing with NLP

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    let observations = request.results ?? []
    return observations.compactMap { observation in
        observation.topCandidates(1).first?.string
    }
}

enum OCRError: Error {
    case invalidImage
}
```
A few things to note here. The .accurate recognition level uses a heavier neural network that delivers better results
for complex layouts, handwriting, and small fonts. The .fast level trades accuracy for speed — useful when processing
video frames. The recognitionLanguages array is ordered by priority; Vision uses the first language as its primary
hypothesis. And usesLanguageCorrection applies a language model pass that fixes common misreads (turning “Wnody” back
into “Woody”).
Extracting Bounding Boxes
Each VNRecognizedTextObservation carries geometry data. This is essential when you need to overlay recognized text on
the source image — for instance, highlighting the title on a Pixar movie poster.
```swift
func recognizedTextWithLocations(
    in image: UIImage
) async throws -> [(String, CGRect)] {
    guard let cgImage = image.cgImage else {
        throw OCRError.invalidImage
    }

    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    return (request.results ?? []).compactMap { observation in
        guard let candidate = observation.topCandidates(1).first else {
            return nil
        }
        // Vision uses normalized coordinates (0...1, bottom-left origin)
        let boundingBox = observation.boundingBox
        return (candidate.string, boundingBox)
    }
}
```
Warning: Vision’s coordinate system uses a bottom-left origin with normalized values (0.0 to 1.0). If you are drawing overlays in UIKit or SwiftUI, you need to flip the y-axis and scale to the image’s pixel dimensions. Forgetting this is the single most common Vision integration bug.
Converting Vision Coordinates to UIKit
Here is a helper that transforms Vision’s normalized rect into UIKit’s top-left-origin coordinate space:
```swift
extension CGRect {
    /// Converts a Vision normalized rect to UIKit coordinates
    /// for a given image size.
    func toUIKitRect(imageSize: CGSize) -> CGRect {
        let x = self.origin.x * imageSize.width
        let y = (1 - self.origin.y - self.height) * imageSize.height
        let width = self.width * imageSize.width
        let height = self.height * imageSize.height
        return CGRect(x: x, y: y, width: width, height: height)
    }
}
```
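To make the flip concrete, here is the same arithmetic worked through with sample numbers (the values are illustrative, not from a real observation):

```swift
import CoreGraphics

// A normalized Vision box: origin (0.1, 0.2), size 0.5 x 0.1,
// measured from the bottom-left corner of the image.
let vision = CGRect(x: 0.1, y: 0.2, width: 0.5, height: 0.1)
let imageSize = CGSize(width: 1000, height: 2000)

// Flip the y-axis and scale to pixels (the same math as the helper above):
let uikit = CGRect(
    x: vision.origin.x * imageSize.width,                        // 100
    y: (1 - vision.origin.y - vision.height) * imageSize.height, // 1400
    width: vision.width * imageSize.width,                       // 500
    height: vision.height * imageSize.height                     // 200
)
```

Note that the box's bottom edge at normalized y = 0.2 becomes a top edge at pixel y = 1400 once the origin moves to the top-left corner.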
Tier 2: DataScannerViewController for Live Camera Scanning
Introduced in iOS 16,
DataScannerViewController is a
turnkey UIKit view controller that handles camera setup, live preview, text highlighting, and user interaction in one
package. Think of it as the difference between building Remy’s kitchen from scratch versus walking into Gusteau’s
fully-equipped restaurant.
Checking Availability
DataScannerViewController requires specific hardware capabilities. Always check before presenting it.
```swift
import VisionKit

func isScanningSupported() -> Bool {
    DataScannerViewController.isSupported
        && DataScannerViewController.isAvailable
}
```
isSupported checks for compatible hardware (devices with a Neural Engine — A12 Bionic or later). isAvailable
additionally checks that the user has not restricted camera access. Both must be true.
Presenting the Scanner
Here is how to configure and present a live text scanner in a SwiftUI context using UIViewControllerRepresentable:
```swift
import SwiftUI
import VisionKit

struct MoviePosterScanner: UIViewControllerRepresentable {
    @Binding var recognizedText: String

    func makeUIViewController(context: Context) -> DataScannerViewController {
        let scanner = DataScannerViewController(
            recognizedDataTypes: [.text()],
            qualityLevel: .balanced, // .fast, .balanced, or .accurate
            recognizesMultipleItems: true,
            isHighFrameRateTrackingEnabled: false,
            isHighlightingEnabled: true // Draws bounding boxes automatically
        )
        scanner.delegate = context.coordinator
        return scanner
    }

    func updateUIViewController(
        _ uiViewController: DataScannerViewController,
        context: Context
    ) {}

    func makeCoordinator() -> Coordinator {
        Coordinator(recognizedText: $recognizedText)
    }
}
```
Handling Delegate Callbacks
The coordinator bridges DataScannerViewControllerDelegate events back into SwiftUI:
```swift
extension MoviePosterScanner {
    class Coordinator: NSObject, DataScannerViewControllerDelegate {
        @Binding var recognizedText: String

        init(recognizedText: Binding<String>) {
            _recognizedText = recognizedText
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didTapOn item: RecognizedItem
        ) {
            switch item {
            case .text(let text):
                recognizedText = text.transcript
            default:
                break
            }
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didAdd addedItems: [RecognizedItem],
            allItems: [RecognizedItem]
        ) {
            // React to newly recognized items in the live feed
            let allText = allItems.compactMap { item -> String? in
                if case .text(let text) = item {
                    return text.transcript
                }
                return nil
            }
            recognizedText = allText.joined(separator: "\n")
        }
    }
}
```
Tip: Do not forget to call scanner.startScanning() after the view appears. A common pattern is to trigger it in onAppear from the parent SwiftUI view, or in the coordinator's initial setup. The scanner will not begin capturing until you explicitly start it.
Starting and Stopping the Scanner
Wrap the representable in a SwiftUI view that manages the scanner lifecycle:
```swift
struct ScannerView: View {
    @State private var recognizedText = ""
    @State private var isShowingScanner = false

    var body: some View {
        VStack {
            Text(recognizedText.isEmpty ? "No text scanned yet" : recognizedText)
                .padding()
            Button("Scan Movie Poster") {
                isShowingScanner = true
            }
        }
        .sheet(isPresented: $isShowingScanner) {
            MoviePosterScanner(recognizedText: $recognizedText)
                .onAppear {
                    // Scanner starts via delegate or representable lifecycle
                }
                .ignoresSafeArea()
        }
    }
}
```
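One way to wire up the actual start call — a sketch, not the only valid pattern — is to start the session from the representable's updateUIViewController, which SwiftUI invokes once the controller is installed:

```swift
import SwiftUI
import VisionKit

// Sketch: replace MoviePosterScanner's empty updateUIViewController
// with a version that starts the camera once the view is live.
// isScanning and startScanning() are real DataScannerViewController API (iOS 16+).
func updateUIViewController(
    _ uiViewController: DataScannerViewController,
    context: Context
) {
    guard !uiViewController.isScanning else { return }
    // startScanning() throws if the camera is unavailable or restricted,
    // so in production code you would surface this error instead of try?
    try? uiViewController.startScanning()
}
```

Checking isScanning first keeps the call idempotent, since SwiftUI may invoke updateUIViewController multiple times.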
Note: DataScannerViewController is UIKit-only. There is no pure SwiftUI equivalent as of iOS 18. The UIViewControllerRepresentable wrapper shown above is the standard integration path. See the WWDC22 session Capture machine-readable codes and text with VisionKit for the full walkthrough.
Tier 3: RecognizeDocumentsRequest in iOS 26
iOS 26 introduces RecognizeDocumentsRequest as part of the Vision framework's Swift-native concurrency overhaul, which began in iOS 18. Where VNRecognizeTextRequest gives you raw text lines, RecognizeDocumentsRequest understands document structure: paragraphs, columns, headers, tables, and reading order.
The Swift 6 Vision API Pattern
The Vision framework received a modern Swift API starting in iOS 18, and iOS 26 extends it further. Requests are now value types, handlers use async/await natively, and results are strongly typed. No more casting from [Any].
```swift
import Vision

@available(iOS 26, *)
func recognizeDocument(in image: CGImage) async throws -> [RecognizedDocument] {
    let request = RecognizeDocumentsRequest()
    let observations = try await request.perform(on: image)
    return observations // Strongly typed, no casting needed
}
```
Compare this with the pre-iOS 26 pattern where you created a handler, performed the request synchronously (or wrapped it
in withCheckedThrowingContinuation), and cast results from [Any]. The new API is a significant ergonomic
improvement.
Understanding Document Structure
RecognizeDocumentsRequest returns observations that preserve the logical structure of the document. This matters when
scanning a Pixar storyboard where you need to distinguish the scene heading from the dialogue and action lines.
```swift
@available(iOS 26, *)
func extractStructuredContent(
    from image: CGImage
) async throws -> [DocumentSection] {
    let request = RecognizeDocumentsRequest()
    let documents = try await request.perform(on: image)

    var sections: [DocumentSection] = []
    for document in documents {
        for page in document.pages {
            for body in page.bodies {
                for paragraph in body.paragraphs {
                    let text = paragraph.lines
                        .map(\.text)
                        .joined(separator: " ")
                    sections.append(
                        DocumentSection(
                            text: text,
                            boundingBox: paragraph.boundingBox
                        )
                    )
                }
            }
        }
    }
    return sections
}

struct DocumentSection {
    let text: String
    let boundingBox: CGRect
}
```
Apple Docs: RecognizeDocumentsRequest — Vision Framework (iOS 26+)
Migration from VNRecognizeTextRequest
If your minimum deployment target is iOS 18 or later, migrating is straightforward. The mental model shifts from "request + handler + perform" to "request + perform(on:)":
```swift
// Before (iOS 13+)
let request = VNRecognizeTextRequest()
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try handler.perform([request])
let results = request.results ?? []

// After (iOS 18+)
let request = RecognizeTextRequest()
let results = try await request.perform(on: image)
```
The new RecognizeTextRequest (note: no VN prefix) is the Swift-native equivalent of VNRecognizeTextRequest and has been available since iOS 18. Use RecognizeDocumentsRequest (iOS 26+) when you need structural understanding beyond raw text lines.
Tip: If you need to support iOS 16 through iOS 26, wrap both paths behind a protocol or use if #available(iOS 26, *) checks. The old VN-prefixed APIs are not deprecated — they continue to work and receive model updates.
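A minimal sketch of that dual-path approach, assuming the new API names shown above (adjust the availability floor to match the SDK you build against):

```swift
import Vision

/// Returns recognized text lines using the Swift-native API where
/// available, falling back to the VN-prefixed path everywhere else.
func recognizeLines(in image: CGImage) async throws -> [String] {
    if #available(iOS 26, *) {
        // New Swift-native path: value-type request, async perform
        let request = RecognizeTextRequest()
        let observations = try await request.perform(on: image)
        return observations.compactMap { $0.topCandidates(1).first?.string }
    } else {
        // Legacy path: handler-based, available back to iOS 13
        let request = VNRecognizeTextRequest()
        let handler = VNImageRequestHandler(cgImage: image, options: [:])
        try handler.perform([request])
        return (request.results ?? []).compactMap { $0.topCandidates(1).first?.string }
    }
}
```

Both branches return the same [String], so call sites stay unaware of which engine ran.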
Advanced Usage
Language Detection and Multi-Language OCR
Vision can automatically detect the language of recognized text. This is particularly useful for an app that catalogs international Pixar releases where the same movie might appear as “Up,” “Oben,” or “Là-haut.”
```swift
func recognizeWithLanguageDetection(
    in cgImage: CGImage
) async throws -> [(text: String, confidence: Float)] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.automaticallyDetectsLanguage = true // iOS 16+

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    return (request.results ?? []).compactMap { observation in
        guard let candidate = observation.topCandidates(1).first else {
            return nil
        }
        // Vision does not expose the detected language per candidate;
        // the confidence score is the closest proxy for recognition quality.
        return (candidate.string, candidate.confidence)
    }
}
```
Note: automaticallyDetectsLanguage was introduced in iOS 16. On earlier versions, you must set recognitionLanguages explicitly. The auto-detection adds a small processing overhead but significantly improves accuracy for mixed-language documents.
Revision System
Vision uses a revision system to let you pin your app to a specific model version. This is critical for apps where OCR consistency matters across OS updates.
```swift
let request = VNRecognizeTextRequest()

// Check available revisions on this device
let supportedRevisions = VNRecognizeTextRequest.supportedRevisions
print("Supported revisions: \(supportedRevisions)")

// Pin to a specific revision for consistent behavior
request.revision = VNRecognizeTextRequestRevision3 // iOS 16 model
```
Warning: Pinning revisions means you will not automatically benefit from model improvements in new OS releases. Only pin when reproducibility is more important than accuracy — for example, in medical or legal document processing where output consistency is a regulatory requirement.
Combining with Other Vision Requests
One of Vision’s best design decisions is that a single VNImageRequestHandler can execute multiple requests in one
pass. The framework optimizes shared image preprocessing across requests.
```swift
func analyzeMoviePoster(image: CGImage) throws {
    let textRequest = VNRecognizeTextRequest()
    textRequest.recognitionLevel = .accurate

    let rectangleRequest = VNDetectRectanglesRequest()
    rectangleRequest.maximumObservations = 10

    let faceRequest = VNDetectFaceRectanglesRequest()

    // Single handler, multiple requests -- shared image preprocessing
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([textRequest, rectangleRequest, faceRequest])

    let text = textRequest.results ?? []
    let rectangles = rectangleRequest.results ?? []
    let faces = faceRequest.results ?? []
    // Now you have text, rectangular regions, and face locations
    // from one efficient pass through the image pipeline
}
```
Performance Considerations
Text recognition is computationally expensive. Here are the numbers you should know and the knobs you can turn.
Recognition Level Impact
| Level | Latency (iPhone 15 Pro) | Accuracy | Use Case |
|---|---|---|---|
| .fast | ~50-100ms per frame | Good | Real-time video, live camera |
| .accurate | ~200-500ms per image | Excellent | Static images, documents |
The .accurate level runs a larger neural network and performs multiple passes. For batch processing (scanning an
entire Pixar Blu-ray collection), use .accurate but process images concurrently with a TaskGroup to leverage all
available cores.
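The table's figures are device-dependent, so it is worth measuring on your own hardware. A rough timing sketch (for stable numbers, average several runs and discard the first, which includes one-time model loading):

```swift
import Vision
import QuartzCore

/// Times a single text-recognition pass at the given level.
func timeRecognition(
    on cgImage: CGImage,
    level: VNRequestTextRecognitionLevel
) throws -> TimeInterval {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = level

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    let start = CACurrentMediaTime()
    try handler.perform([request])
    return CACurrentMediaTime() - start
}
```

Comparing timeRecognition(on: image, level: .fast) against .accurate on a representative image tells you whether the accurate model fits your latency budget.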
Batch Processing Pattern
```swift
func processMovieCollection(
    images: [CGImage]
) async throws -> [String: [String]] {
    try await withThrowingTaskGroup(
        of: (Int, [String]).self
    ) { group in
        for (index, image) in images.enumerated() {
            group.addTask {
                let request = VNRecognizeTextRequest()
                request.recognitionLevel = .accurate
                let handler = VNImageRequestHandler(
                    cgImage: image,
                    options: [:]
                )
                try handler.perform([request])
                let lines = (request.results ?? []).compactMap {
                    $0.topCandidates(1).first?.string
                }
                return (index, lines)
            }
        }

        var results: [String: [String]] = [:]
        for try await (index, lines) in group {
            results["image_\(index)"] = lines
        }
        return results
    }
}
```
Tip: Vision requests are not thread-safe, but VNImageRequestHandler instances are independent. Creating one handler per image and dispatching them across a task group is both safe and efficient. Avoid sharing a single handler across tasks.
Memory Considerations
Vision loads the text recognition model into memory on first use and keeps it cached for subsequent requests. On devices with limited RAM, be aware of the following:
- The .accurate model consumes approximately 30-50 MB of memory while loaded.
- Processing very high-resolution images (4K+) creates intermediate buffers. Consider downscaling to 2048px on the longest edge for documents — the model's effective resolution plateaus beyond that.
- DataScannerViewController maintains a continuous camera session. Dismiss it when not in use to reclaim memory.
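A sketch of that downscaling step using UIKit's image renderer (the 2048px figure is this article's heuristic, not a documented Vision limit):

```swift
import UIKit

/// Returns a copy of `image` whose longest edge is at most `maxEdge` pixels,
/// or the original image if it is already small enough.
func downscaledForOCR(_ image: UIImage, maxEdge: CGFloat = 2048) -> UIImage {
    let longest = max(image.size.width, image.size.height)
    guard longest > maxEdge else { return image }

    let scale = maxEdge / longest
    let newSize = CGSize(
        width: image.size.width * scale,
        height: image.size.height * scale
    )
    let format = UIGraphicsImageRendererFormat()
    format.scale = 1 // Render in pixels, not points
    return UIGraphicsImageRenderer(size: newSize, format: format).image { _ in
        image.draw(in: CGRect(origin: .zero, size: newSize))
    }
}
```

Run this before handing the image to VNImageRequestHandler; the smaller buffer shrinks both peak memory and recognition latency with little accuracy cost on documents.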
Apple Docs: VNImageRequestHandler — Vision Framework
When to Use (and When Not To)
| Scenario | Recommendation |
|---|---|
| Extract text from a photo or screenshot | Use VNRecognizeTextRequest with .accurate. This is its sweet spot. |
| Live camera text scanning with UI | Use DataScannerViewController. It handles camera, preview, and highlighting. |
| Custom camera pipeline with text overlay | Use VNRecognizeTextRequest with .fast on frames from your AVFoundation capture session. |
| Full document structure (paragraphs, columns) | Use RecognizeDocumentsRequest (iOS 26+). The only API that preserves reading order. |
| Barcode and QR code scanning | Use VNDetectBarcodesRequest or DataScannerViewController with .barcode(). |
| Server-side OCR for archival | Vision runs on-device only. For server workloads, look at server-side alternatives. |
| Handwriting recognition | VNRecognizeTextRequest supports handwriting since iOS 14. Use .accurate. |
The general rule: if you need text from an image and you are on Apple hardware, reach for Vision first. The only reasons to look elsewhere are server-side processing requirements or niche language support that Apple has not added yet.
Summary
- VNRecognizeTextRequest is your go-to for static image OCR. Available since iOS 13, it handles printed and handwritten text in 18+ languages with on-device processing.
- DataScannerViewController provides a complete live camera scanning experience with minimal code. Use it when you need a user-facing scanner and do not want to build camera UI from scratch.
- RecognizeDocumentsRequest in iOS 26 brings structural document understanding — paragraphs, columns, reading order — and uses Swift 6's native async/await pattern.
- Always check .accurate vs .fast recognition levels against your latency budget. Batch process with TaskGroup for collections.
- Vision's coordinate system uses bottom-left origin with normalized values. Convert to UIKit/SwiftUI coordinates before drawing overlays.
For a deeper look at feeding Vision results into a trained model for classification, see Integrating Core ML Models in SwiftUI. If you need to build a custom camera pipeline to feed frames into Vision, AVFoundation: Custom Camera and Audio Capture covers the capture session setup in detail.