Vision Framework: OCR, Document Scanning, and Text Recognition
Your app has a camera. Your users have receipts, business cards, movie posters, and handwritten notes. Bridging the gap between pixels and structured text used to require third-party SDKs and server round-trips. The Vision framework makes it a local, offline, single-API-call operation — and with iOS 26, it got even better.
This post covers the three tiers of text recognition Apple ships today: VNRecognizeTextRequest for static image OCR,
DataScannerViewController for real-time camera scanning, and the new Swift 6-native RecognizeDocumentsRequest for
full document understanding. We will not cover barcode detection or face recognition — those deserve their own deep
dives.
Contents
- The Problem
- Tier 1: VNRecognizeTextRequest for Static Image OCR
- Tier 2: DataScannerViewController for Live Camera Scanning
- Tier 3: RecognizeDocumentsRequest in iOS 26
- Advanced Usage
- Performance Considerations
- When to Use (and When Not To)
- Summary
The Problem
Imagine you are building a Pixar movie archive app. Users photograph their physical Blu-ray collection, and the app needs to extract the movie title, studio name, and rating from the cover art. A naive approach might look like this:
```swift
import UIKit

func extractMovieTitle(from image: UIImage) -> String? {
    // Option A: Ship the image to a server for OCR
    //   - Requires network connectivity
    //   - Adds latency (200-800ms round trip)
    //   - Raises privacy concerns (user images leave the device)
    // Option B: Bundle a third-party OCR library
    //   - Adds 20-50 MB to your binary
    //   - Licensing headaches
    //   - Often lags behind Apple's hardware-tuned models
    return nil // Neither option is great
}
```
Both paths carry real costs. Vision eliminates them. Apple ships a neural-network-based text recognizer that runs on the Neural Engine, requires zero network access, supports 18+ languages out of the box, and weighs nothing in your app bundle because it lives in the OS.
Tier 1: VNRecognizeTextRequest for Static Image OCR
VNRecognizeTextRequest is the workhorse API
for extracting text from a still image or a single video frame. It has been available since iOS 13 and has improved
significantly with each release.
Setting Up the Request
The core pattern is straightforward: create a request, configure its recognition level, feed it an image through a request handler, and read the results.
```swift
import Vision
import UIKit

func recognizeText(in image: UIImage) async throws -> [String] {
    guard let cgImage = image.cgImage else {
        throw OCRError.invalidImage
    }

    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate // .fast for real-time scenarios
    request.recognitionLanguages = ["en-US", "es-ES"]
    request.usesLanguageCorrection = true // Post-processing with NLP

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    let observations = request.results ?? []
    return observations.compactMap { observation in
        observation.topCandidates(1).first?.string
    }
}

enum OCRError: Error {
    case invalidImage
}
```
A few things to note here. The .accurate recognition level uses a heavier neural network that delivers better results
for complex layouts, handwriting, and small fonts. The .fast level trades accuracy for speed — useful when processing
video frames. The recognitionLanguages array is ordered by priority; Vision uses the first language as its primary
hypothesis. And usesLanguageCorrection applies a language model pass that fixes common misreads (turning “Wnody” back
into “Woody”).
Extracting Bounding Boxes
Each VNRecognizedTextObservation carries geometry data. This is essential when you need to overlay recognized text on
the source image — for instance, highlighting the title on a Pixar movie poster.
```swift
func recognizedTextWithLocations(
    in image: UIImage
) async throws -> [(String, CGRect)] {
    guard let cgImage = image.cgImage else {
        throw OCRError.invalidImage
    }

    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    return (request.results ?? []).compactMap { observation in
        guard let candidate = observation.topCandidates(1).first else {
            return nil
        }
        // Vision uses normalized coordinates (0...1, bottom-left origin)
        let boundingBox = observation.boundingBox
        return (candidate.string, boundingBox)
    }
}
```
Warning: Vision’s coordinate system uses a bottom-left origin with normalized values (0.0 to 1.0). If you are drawing overlays in UIKit or SwiftUI, you need to flip the y-axis and scale to the image’s pixel dimensions. Forgetting this is the single most common Vision integration bug.
Converting Vision Coordinates to UIKit
Here is a helper that transforms Vision’s normalized rect into UIKit’s top-left-origin coordinate space:
```swift
extension CGRect {
    /// Converts a Vision normalized rect to UIKit coordinates
    /// for a given image size.
    func toUIKitRect(imageSize: CGSize) -> CGRect {
        let x = self.origin.x * imageSize.width
        let y = (1 - self.origin.y - self.height) * imageSize.height
        let width = self.width * imageSize.width
        let height = self.height * imageSize.height
        return CGRect(x: x, y: y, width: width, height: height)
    }
}
```
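To make the flip concrete, here is the same arithmetic worked through with sample numbers (the values are illustrative, not from a real observation):

```swift
import CoreGraphics

// A normalized Vision box: origin (0.1, 0.2), size 0.5 x 0.1,
// measured from the bottom-left corner of the image.
let vision = CGRect(x: 0.1, y: 0.2, width: 0.5, height: 0.1)
let imageSize = CGSize(width: 1000, height: 2000)

// Flip the y-axis and scale to pixels (the same math as the helper above):
let uikit = CGRect(
    x: vision.origin.x * imageSize.width,                        // 100
    y: (1 - vision.origin.y - vision.height) * imageSize.height, // 1400
    width: vision.width * imageSize.width,                       // 500
    height: vision.height * imageSize.height                     // 200
)
```

Note that the box's bottom edge at normalized y = 0.2 becomes a top edge at pixel y = 1400 once the origin moves to the top-left corner.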
Tier 2: DataScannerViewController for Live Camera Scanning
Introduced in iOS 16,
DataScannerViewController is a
turnkey UIKit view controller that handles camera setup, live preview, text highlighting, and user interaction in one
package. Think of it as the difference between building Remy’s kitchen from scratch versus walking into Gusteau’s
fully-equipped restaurant.
Checking Availability
DataScannerViewController requires specific hardware capabilities. Always check before presenting it.
```swift
import VisionKit

func isScanningSupported() -> Bool {
    DataScannerViewController.isSupported
        && DataScannerViewController.isAvailable
}
```
isSupported checks for compatible hardware (devices with a Neural Engine — A12 Bionic or later). isAvailable
additionally checks that the user has not restricted camera access. Both must be true.
Presenting the Scanner
Here is how to configure and present a live text scanner in a SwiftUI context using UIViewControllerRepresentable:
```swift
import SwiftUI
import VisionKit

struct MoviePosterScanner: UIViewControllerRepresentable {
    @Binding var recognizedText: String

    func makeUIViewController(context: Context) -> DataScannerViewController {
        let scanner = DataScannerViewController(
            recognizedDataTypes: [.text()],
            qualityLevel: .balanced, // .fast, .balanced, or .accurate
            recognizesMultipleItems: true,
            isHighFrameRateTrackingEnabled: false,
            isHighlightingEnabled: true // Draws bounding boxes automatically
        )
        scanner.delegate = context.coordinator
        return scanner
    }

    func updateUIViewController(
        _ uiViewController: DataScannerViewController,
        context: Context
    ) {}

    func makeCoordinator() -> Coordinator {
        Coordinator(recognizedText: $recognizedText)
    }
}
```
Handling Delegate Callbacks
The coordinator bridges DataScannerViewControllerDelegate events back into SwiftUI:
```swift
extension MoviePosterScanner {
    class Coordinator: NSObject, DataScannerViewControllerDelegate {
        @Binding var recognizedText: String

        init(recognizedText: Binding<String>) {
            _recognizedText = recognizedText
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didTapOn item: RecognizedItem
        ) {
            switch item {
            case .text(let text):
                recognizedText = text.transcript
            default:
                break
            }
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didAdd addedItems: [RecognizedItem],
            allItems: [RecognizedItem]
        ) {
            // React to newly recognized items in the live feed
            let allText = allItems.compactMap { item -> String? in
                if case .text(let text) = item {
                    return text.transcript
                }
                return nil
            }
            recognizedText = allText.joined(separator: "\n")
        }
    }
}
```
Tip: Do not forget to call scanner.startScanning() after the view appears. A common pattern is to trigger it in onAppear from the parent SwiftUI view, or in the coordinator's initial setup. The scanner will not begin capturing until you explicitly start it.
Starting and Stopping the Scanner
Wrap the representable in a SwiftUI view that manages the scanner lifecycle:
```swift
struct ScannerView: View {
    @State private var recognizedText = ""
    @State private var isShowingScanner = false

    var body: some View {
        VStack {
            Text(recognizedText.isEmpty ? "No text scanned yet" : recognizedText)
                .padding()
            Button("Scan Movie Poster") {
                isShowingScanner = true
            }
        }
        .sheet(isPresented: $isShowingScanner) {
            MoviePosterScanner(recognizedText: $recognizedText)
                .onAppear {
                    // Scanner starts via delegate or representable lifecycle
                }
                .ignoresSafeArea()
        }
    }
}
```
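One way to wire up the actual start call — a sketch, not the only valid pattern — is to start the session from the representable's updateUIViewController, which SwiftUI invokes once the controller is installed:

```swift
import SwiftUI
import VisionKit

// Sketch: replace MoviePosterScanner's empty updateUIViewController
// with a version that starts the camera once the view is live.
// isScanning and startScanning() are real DataScannerViewController API (iOS 16+).
func updateUIViewController(
    _ uiViewController: DataScannerViewController,
    context: Context
) {
    guard !uiViewController.isScanning else { return }
    // startScanning() throws if the camera is unavailable or restricted,
    // so in production code you would surface this error instead of try?
    try? uiViewController.startScanning()
}
```

Checking isScanning first keeps the call idempotent, since SwiftUI may invoke updateUIViewController multiple times.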
Note: DataScannerViewController is UIKit-only. There is no pure SwiftUI equivalent as of iOS 18. The UIViewControllerRepresentable wrapper shown above is the standard integration path. See the WWDC22 session Capture machine-readable codes and text with VisionKit for the full walkthrough.
Tier 3: RecognizeDocumentsRequest in iOS 26
iOS 26 introduces RecognizeDocumentsRequest as part of the Vision framework's Swift-native concurrency overhaul, which began in iOS 18. Where VNRecognizeTextRequest gives you raw text lines, RecognizeDocumentsRequest understands document structure: paragraphs, columns, headers, tables, and reading order.
The Swift 6 Vision API Pattern
The Vision framework received a modern Swift API starting in iOS 18, and iOS 26 extends it further. Requests are now value types, handlers use async/await natively, and results are strongly typed. No more casting from [Any].
```swift
import Vision

@available(iOS 26, *)
func recognizeDocument(in image: CGImage) async throws -> [RecognizedDocument] {
    let request = RecognizeDocumentsRequest()
    let observations = try await request.perform(on: image)
    return observations // Strongly typed, no casting needed
}
```
Compare this with the pre-iOS 26 pattern where you created a handler, performed the request synchronously (or wrapped it
in withCheckedThrowingContinuation), and cast results from [Any]. The new API is a significant ergonomic
improvement.
Understanding Document Structure
RecognizeDocumentsRequest returns observations that preserve the logical structure of the document. This matters when
scanning a Pixar storyboard where you need to distinguish the scene heading from the dialogue and action lines.
```swift
@available(iOS 26, *)
func extractStructuredContent(
    from image: CGImage
) async throws -> [DocumentSection] {
    let request = RecognizeDocumentsRequest()
    let documents = try await request.perform(on: image)

    var sections: [DocumentSection] = []
    for document in documents {
        for page in document.pages {
            for body in page.bodies {
                for paragraph in body.paragraphs {
                    let text = paragraph.lines
                        .map(\.text)
                        .joined(separator: " ")
                    sections.append(
                        DocumentSection(
                            text: text,
                            boundingBox: paragraph.boundingBox
                        )
                    )
                }
            }
        }
    }
    return sections
}

struct DocumentSection {
    let text: String
    let boundingBox: CGRect
}
```
Apple Docs: RecognizeDocumentsRequest — Vision Framework (iOS 26+)
Migration from VNRecognizeTextRequest
If your minimum deployment target is iOS 18 or later, migrating is straightforward. The mental model shifts from "request + handler + perform" to "request + perform(on:)":
```swift
// Before (iOS 13+)
let request = VNRecognizeTextRequest()
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try handler.perform([request])
let results = request.results ?? []

// After (iOS 18+)
let request = RecognizeTextRequest()
let results = try await request.perform(on: image)
```
The new RecognizeTextRequest (note: no VN prefix) is the Swift-native equivalent of VNRecognizeTextRequest and has been available since iOS 18. Use RecognizeDocumentsRequest (iOS 26+) when you need structural understanding beyond raw text lines.
Tip: If you need to support iOS 16 through iOS 26, wrap both paths behind a protocol or use if #available(iOS 26, *) checks. The old VN-prefixed APIs are not deprecated — they continue to work and receive model updates.
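A minimal sketch of that dual-path approach, assuming the new API names shown above (adjust the availability floor to match the SDK you build against):

```swift
import Vision

/// Returns recognized text lines using the Swift-native API where
/// available, falling back to the VN-prefixed path everywhere else.
func recognizeLines(in image: CGImage) async throws -> [String] {
    if #available(iOS 26, *) {
        // New Swift-native path: value-type request, async perform
        let request = RecognizeTextRequest()
        let observations = try await request.perform(on: image)
        return observations.compactMap { $0.topCandidates(1).first?.string }
    } else {
        // Legacy path: handler-based, available back to iOS 13
        let request = VNRecognizeTextRequest()
        let handler = VNImageRequestHandler(cgImage: image, options: [:])
        try handler.perform([request])
        return (request.results ?? []).compactMap { $0.topCandidates(1).first?.string }
    }
}
```

Both branches return the same [String], so call sites stay unaware of which engine ran.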
Advanced Usage
Language Detection and Multi-Language OCR
Vision can automatically detect the language of recognized text. This is particularly useful for an app that catalogs international Pixar releases where the same movie might appear as “Up,” “Oben,” or “Là-haut.”
```swift
func recognizeWithLanguageDetection(
    in cgImage: CGImage
) async throws -> [(text: String, confidence: Float)] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.automaticallyDetectsLanguage = true // iOS 16+

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    return (request.results ?? []).compactMap { observation in
        guard let candidate = observation.topCandidates(1).first else {
            return nil
        }
        // Vision does not expose the detected language per candidate;
        // the confidence score is the closest proxy for recognition quality.
        return (candidate.string, candidate.confidence)
    }
}
```
Note: automaticallyDetectsLanguage was introduced in iOS 16. On earlier versions, you must set recognitionLanguages explicitly. The auto-detection adds a small processing overhead but significantly improves accuracy for mixed-language documents.
Revision System
Vision uses a revision system to let you pin your app to a specific model version. This is critical for apps where OCR consistency matters across OS updates.
```swift
let request = VNRecognizeTextRequest()

// Check available revisions on this device
let supportedRevisions = VNRecognizeTextRequest.supportedRevisions
print("Supported revisions: \(supportedRevisions)")

// Pin to a specific revision for consistent behavior
request.revision = VNRecognizeTextRequestRevision3 // iOS 16 model
```
Warning: Pinning revisions means you will not automatically benefit from model improvements in new OS releases. Only pin when reproducibility is more important than accuracy — for example, in medical or legal document processing where output consistency is a regulatory requirement.
Combining with Other Vision Requests
One of Vision’s best design decisions is that a single VNImageRequestHandler can execute multiple requests in one
pass. The framework optimizes shared image preprocessing across requests.
```swift
func analyzeMoviePoster(image: CGImage) throws {
    let textRequest = VNRecognizeTextRequest()
    textRequest.recognitionLevel = .accurate

    let rectangleRequest = VNDetectRectanglesRequest()
    rectangleRequest.maximumObservations = 10

    let faceRequest = VNDetectFaceRectanglesRequest()

    // Single handler, multiple requests -- shared image preprocessing
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([textRequest, rectangleRequest, faceRequest])

    let text = textRequest.results ?? []
    let rectangles = rectangleRequest.results ?? []
    let faces = faceRequest.results ?? []
    // Now you have text, rectangular regions, and face locations
    // from one efficient pass through the image pipeline
}
```
Performance Considerations
Text recognition is computationally expensive. Here are the numbers you should know and the knobs you can turn.
Recognition Level Impact
| Level | Latency (iPhone 15 Pro) | Accuracy | Use Case |
|---|---|---|---|
| .fast | ~50-100ms per frame | Good | Real-time video, live camera |
| .accurate | ~200-500ms per image | Excellent | Static images, documents |
The .accurate level runs a larger neural network and performs multiple passes. For batch processing (scanning an
entire Pixar Blu-ray collection), use .accurate but process images concurrently with a TaskGroup to leverage all
available cores.
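The table's figures are device-dependent, so it is worth measuring on your own hardware. A rough timing sketch (for stable numbers, average several runs and discard the first, which includes one-time model loading):

```swift
import Vision
import QuartzCore

/// Times a single text-recognition pass at the given level.
func timeRecognition(
    on cgImage: CGImage,
    level: VNRequestTextRecognitionLevel
) throws -> TimeInterval {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = level

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    let start = CACurrentMediaTime()
    try handler.perform([request])
    return CACurrentMediaTime() - start
}
```

Comparing timeRecognition(on: image, level: .fast) against .accurate on a representative image tells you whether the accurate model fits your latency budget.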
Batch Processing Pattern
```swift
func processMovieCollection(
    images: [CGImage]
) async throws -> [String: [String]] {
    try await withThrowingTaskGroup(
        of: (Int, [String]).self
    ) { group in
        for (index, image) in images.enumerated() {
            group.addTask {
                let request = VNRecognizeTextRequest()
                request.recognitionLevel = .accurate
                let handler = VNImageRequestHandler(
                    cgImage: image,
                    options: [:]
                )
                try handler.perform([request])
                let lines = (request.results ?? []).compactMap {
                    $0.topCandidates(1).first?.string
                }
                return (index, lines)
            }
        }

        var results: [String: [String]] = [:]
        for try await (index, lines) in group {
            results["image_\(index)"] = lines
        }
        return results
    }
}
```
Tip: Vision requests are not thread-safe, but VNImageRequestHandler instances are independent. Creating one handler per image and dispatching them across a task group is both safe and efficient. Avoid sharing a single handler across tasks.
Memory Considerations
Vision loads the text recognition model into memory on first use and keeps it cached for subsequent requests. On devices with limited RAM, be aware of the following:
- The .accurate model consumes approximately 30-50 MB of memory while loaded.
- Processing very high-resolution images (4K+) creates intermediate buffers. Consider downscaling to 2048px on the longest edge for documents — the model's effective resolution plateaus beyond that.
- DataScannerViewController maintains a continuous camera session. Dismiss it when not in use to reclaim memory.
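A sketch of that downscaling step using UIKit's image renderer (the 2048px figure is this article's heuristic, not a documented Vision limit):

```swift
import UIKit

/// Returns a copy of `image` whose longest edge is at most `maxEdge` pixels,
/// or the original image if it is already small enough.
func downscaledForOCR(_ image: UIImage, maxEdge: CGFloat = 2048) -> UIImage {
    let longest = max(image.size.width, image.size.height)
    guard longest > maxEdge else { return image }

    let scale = maxEdge / longest
    let newSize = CGSize(
        width: image.size.width * scale,
        height: image.size.height * scale
    )
    let format = UIGraphicsImageRendererFormat()
    format.scale = 1 // Render in pixels, not points
    return UIGraphicsImageRenderer(size: newSize, format: format).image { _ in
        image.draw(in: CGRect(origin: .zero, size: newSize))
    }
}
```

Run this before handing the image to VNImageRequestHandler; the smaller buffer shrinks both peak memory and recognition latency with little accuracy cost on documents.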
Apple Docs: VNImageRequestHandler — Vision Framework
When to Use (and When Not To)
| Scenario | Recommendation |
|---|---|
| Extract text from a photo or screenshot | Use VNRecognizeTextRequest with .accurate. This is its sweet spot. |
| Live camera text scanning with UI | Use DataScannerViewController. It handles camera, preview, and highlighting. |
| Custom camera pipeline with text overlay | Use VNRecognizeTextRequest with .fast on frames from your AVFoundation capture session. |
| Full document structure (paragraphs, columns) | Use RecognizeDocumentsRequest (iOS 26+). The only API that preserves reading order. |
| Barcode and QR code scanning | Use VNDetectBarcodesRequest or DataScannerViewController with .barcode(). |
| Server-side OCR for archival | Vision runs on-device only. For server workloads, look at server-side alternatives. |
| Handwriting recognition | VNRecognizeTextRequest supports handwriting since iOS 14. Use .accurate. |
The general rule: if you need text from an image and you are on Apple hardware, reach for Vision first. The only reasons to look elsewhere are server-side processing requirements or niche language support that Apple has not added yet.
Summary
- VNRecognizeTextRequest is your go-to for static image OCR. Available since iOS 13, it handles printed and handwritten text in 18+ languages with on-device processing.
- DataScannerViewController provides a complete live camera scanning experience with minimal code. Use it when you need a user-facing scanner and do not want to build camera UI from scratch.
- RecognizeDocumentsRequest in iOS 26 brings structural document understanding — paragraphs, columns, reading order — and uses Swift 6's native async/await pattern.
- Always check .accurate vs .fast recognition levels against your latency budget. Batch process with TaskGroup for collections.
- Vision's coordinate system uses bottom-left origin with normalized values. Convert to UIKit/SwiftUI coordinates before drawing overlays.
For a deeper look at feeding Vision results into a trained model for classification, see Integrating Core ML Models in SwiftUI. If you need to build a custom camera pipeline to feed frames into Vision, AVFoundation: Custom Camera and Audio Capture covers the capture session setup in detail.