Nico

VisionKit and document scanning

Apple introduced VisionKit, a new API for scanning documents, with iOS 13 last June during WWDC 2019, which may or may not have been the last physical WWDC ever.

VisionKit is what powers the document scanning capabilities of both Notes and Files, letting the user scan their paper documents and apply perspective correction and color enhancements to them.

Here’s some sample code:

import UIKit
import VisionKit

class ViewController: UIViewController, VNDocumentCameraViewControllerDelegate {
    // snip, obviously!

    @objc func startScanner() {
        guard VNDocumentCameraViewController.isSupported else {
            print("document scanning not supported")
            return
        }
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        present(scanner, animated: true, completion: nil)
    }

    func documentCameraViewController(_ controller: VNDocumentCameraViewController, didFinishWith scan: VNDocumentCameraScan) {
        print("scanned \(scan.pageCount) pages")
        for i in 0..<scan.pageCount {
            let pageImage = scan.imageOfPage(at: i)
            // JPEG-encode each page just to get an idea of the output size
            guard let bytes = pageImage.jpegData(compressionQuality: 0.8) else { continue }
            print("page dimensions \(pageImage.size.width) by \(pageImage.size.height) - JPEG size \(bytes.count)")
        }
    }
}

So we start by creating a VNDocumentCameraViewController, which behaves suspiciously like your good old image picker controller, then we set ourselves as its delegate and fire it away by presenting it.

The delegate gets called back after the user taps the Save button, with a VNDocumentCameraScan object that holds all the scanned pages. In the sample code, I’m just taking the JPEG bytes of each page and printing some information about it.
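Two quick notes on the delegate: it also tells you when the user cancels or when scanning fails, and in every case you’re responsible for dismissing the scanner yourself. Here’s a minimal sketch of those remaining callbacks (the bare print is just a placeholder for real error handling):

    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
        // the scanner doesn't dismiss itself, so do it here
        controller.dismiss(animated: true, completion: nil)
    }

    func documentCameraViewController(_ controller: VNDocumentCameraViewController, didFailWithError error: Error) {
        print("scanning failed: \(error)")
        controller.dismiss(animated: true, completion: nil)
    }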

The provided images are already perspective and color-corrected so you can send them directly into...

Vision

By piping the image into the Vision framework, we can leverage Apple’s, well, computer vision framework to detect rectangles, text, faces, or even animals in both pictures and video feeds.

Actually, this is how VisionKit’s document scanner knows when a document page has been laid under the camera, so it can perform the automatic capture.
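I have no idea how VisionKit wires this up internally, but the obvious building block for it is VNDetectRectanglesRequest, so here’s a rough sketch of spotting a paper-shaped rectangle in a frame (detectDocumentRectangle and the threshold values are just mine for illustration):

import Vision

func detectDocumentRectangle(in cgImage: CGImage) {
    let request = VNDetectRectanglesRequest { request, error in
        guard let rectangles = request.results as? [VNRectangleObservation],
            let best = rectangles.first else { return }
        // corners come back normalized (0...1) with the origin at the bottom-left
        print("rectangle with confidence \(best.confidence): \(best.topLeft) - \(best.bottomRight)")
    }
    request.minimumConfidence = 0.8   // ignore shaky detections
    request.minimumAspectRatio = 0.5  // roughly paper-shaped
    request.maximumObservations = 1   // only the most prominent rectangle

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}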

I’ve loaded the page image into a view model (just in case, you know!) so it can then be processed by Vision.

To do so, just create the request type you need, in this case a VNRecognizeTextRequest, set up its completion handler, which has a (VNRequest, Error?) -> () signature, and pass the request to a VNImageRequestHandler to actually perform the operation.

Bear in mind this may take some time, and that perform(_:) actually runs synchronously on whatever thread you call it from, so it’s best to dispatch the whole thing to a background queue.

Also bear in mind that, in this case, for OCR, the request is VNRecognizeTextRequest. There is also VNDetectTextRectanglesRequest, another kind of request that only reports back where in the picture the text blocks are. This is useful if you have your own OCR framework and just want to make its job easier by not having it process the whole image.
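For reference, that lighter-weight request would look roughly like this (just a sketch; the observations only carry bounding boxes, no recognized strings):

let rectanglesRequest = VNDetectTextRectanglesRequest { request, error in
    guard let observations = request.results as? [VNTextObservation] else { return }
    for observation in observations {
        // boundingBox is normalized (0...1) with the origin at the bottom-left
        print("text block at \(observation.boundingBox)")
    }
}
rectanglesRequest.reportCharacterBoxes = true // also get per-character boxes, if you want them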

Here’s some more code (these methods live in the same view controller, and also need an import Vision at the top of the file):

    @objc func startOCR() {
        guard let image = viewModel.image.cgImage else { return }
        // CGImagePropertyOrientation(_:) is a small custom initializer (it's not in the SDK)
        // that maps UIImage.Orientation to its Core Graphics counterpart
        let orientation = CGImagePropertyOrientation(viewModel.image.imageOrientation)
        let request = VNRecognizeTextRequest(completionHandler: self.recognizeTextCompletionHandler(request:error:))
        request.recognitionLevel = .accurate
        request.revision = VNRecognizeTextRequestRevision1
        // perform(_:) blocks until the requests finish, so keep it off the main thread
        DispatchQueue.global(qos: .userInitiated).async {
            let handler = VNImageRequestHandler(cgImage: image, orientation: orientation, options: [:])
            do {
                try handler.perform([request])
            } catch {
                print("error: \(error)")
            }
        }
    }

    func recognizeTextCompletionHandler(request: VNRequest, error: Error?) {
        guard
            let textRequest = request as? VNRecognizeTextRequest,
            let textResults = textRequest.results as? [VNRecognizedTextObservation]
            else { return }
        for result in textResults {
            guard let c = result.topCandidates(1).first else { continue }
            viewModel.add(recognizedText: c)
        }
        DispatchQueue.main.async { [unowned self] in
            // unfortunately out of scope!
            self.overlayOCRResults()
        }
    }

There are some things to consider when performing text recognition. Apple offers two recognition levels: fast and accurate. The former is geared toward live processing of video frames, while the latter is for still images where you don’t need the speed but do want some more accuracy.
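For completeness, here’s roughly what the fast flavor could look like if you were feeding it live camera frames instead (just a sketch; recognitionLevel and usesLanguageCorrection are the real VNRecognizeTextRequest properties):

let liveRequest = VNRecognizeTextRequest { request, error in
    // handle the observations just like in recognizeTextCompletionHandler above
}
liveRequest.recognitionLevel = .fast
liveRequest.usesLanguageCorrection = false // skip the language-correction pass to save some time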

Back to the sample above: the completion handler’s request parameter is the same request we passed in before, but now its results property holds, in this case, the recognized text observations.

In the prototype app that runs this code, I’m taking the results and overlaying them on the image, letting users tap the highlighted sections to copy them to the clipboard when they only care about a single piece of text, like when you snap a picture of a billboard just for the phone number. I’m also giving users the option to see all the text in the image in a text view so they can pick exactly what they want to copy.
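The copy part itself is tiny; assuming a tap handler that already knows which observation was hit (copyText(from:) is just a made-up name here), it boils down to something like:

func copyText(from observation: VNRecognizedTextObservation) {
    // take the best candidate string and put it on the clipboard
    guard let candidate = observation.topCandidates(1).first else { return }
    UIPasteboard.general.string = candidate.string
}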

Closing comments

This was a great little API to work with. You get a lot of awesome features right out of the box with just three lines of code. I’d like to see how customizable it is, like defaulting the color correction to grayscale, which I found works better for my use case, but I didn’t dive into that. The UI might be another interesting case: it has a lot of options that the user may never care about, and hiding them would let the interface be more minimalistic.

On the text recognition side, the worst part for me was trying to figure out what was going on with the coordinates of the text boxes, given the different coordinate systems that UIKit and Vision use. I know it’s just a case of "flip vertically and translate", but it took me a while to get my head around it and I kinda suck at basic math =P
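For anyone else scratching their head over the same thing, here’s a sketch of the conversion I’m describing (uiKitRect(for:imageSize:) is a made-up helper, and it assumes the overlay’s coordinate space matches the image size one-to-one, so any image view scaling is left out):

func uiKitRect(for observation: VNRecognizedTextObservation, imageSize: CGSize) -> CGRect {
    // scale the normalized bounding box up to image coordinates...
    let rect = VNImageRectForNormalizedRect(observation.boundingBox,
                                            Int(imageSize.width),
                                            Int(imageSize.height))
    // ...then flip vertically, since Vision's origin is bottom-left and UIKit's is top-left
    return CGRect(x: rect.origin.x,
                  y: imageSize.height - rect.origin.y - rect.height,
                  width: rect.width,
                  height: rect.height)
}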