Making Use of Unstructured Data with Form Recognizer
Every business has its "junk" drawer (or "the share," as Bill or Karen likes to call it). Hidden in this data wasteland are often nuggets of informational gold that could revolutionize the way you do business (we hope). It turns out the most important information is usually locked away in paper forms (and by paper I mean digital paper, like an image or a pdf file). Microsoft recently released a new service called Form Recognizer that is designed to make short work of the structured data hidden in these gems. I shall endeavor to take you through the simple process here!
As an aside, I used this same process as part of a talk I delivered on Azure Cognitive Search at Ignite 2019.
The Process
I will be using a bunch of invoices I generated for a fictional company called Tailwind Traders. They all basically look like this.
Create the Service
The first step is to create the actual Form Recognizer service:
The good news is that the F0 pricing tier already includes 500 free calls per month for analyzing forms. If you're just looking to see whether this service will work for your scenario, this is the perfect pricing tier to select.
Once the service is up and ready to go, you will need the service key and endpoint to do the rest.
All that's left is training and analysis.
Training the Model
This is honestly a super painless process, patterned after the quickstart documentation. I decided to use Postman instead of curl for this example, but you can use whatever you like (including .NET, or Python together with the requests library).
Every machine learning model needs data, and the Form Recognizer service needs to be able to access that data (both to read the files and to list the files in a directory). The best way to provide it is to create an Azure Storage account, create a container, and upload some training data.
How can a completely separate service (such as Form Recognizer) get access to that storage, you ask? With Azure Storage, the best way is to use something called a Shared Access Signature (SAS). You can even generate the signature directly in the portal:
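If you'd rather script it, here's a minimal sketch using the azure-storage-blob Python package. The account name, container name, and key below are placeholders for your own values:

```python
from datetime import datetime, timedelta
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Placeholders - substitute your own storage account, container, and key.
account_name = "tailwindstorage"
container_name = "training-invoices"
account_key = "<storage-account-key>"

# Form Recognizer needs to read files and list the container, so grant read + list.
sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=24),
)

# This is the "source" URL we'll hand to the training call below.
source_url = f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token}"
```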
You only need five files of the same type to train the model - that's it. It's pretty amazing that's all you need. The first part of training is a POST call with the location of the five pdf files.
This returns a Location header in the response that contains the model id (highlighted above). One thing I omitted in the previous screenshot was how I was sending the authentication key. Below you can see the Ocp-Apim-Subscription-Key header being passed with the request.
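Here's roughly what that request looks like in Python with requests - a sketch, assuming the v2.0-preview REST endpoint from the quickstart (the exact API version segment in the URL may differ for you):

```python
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"  # your Form Recognizer endpoint
api_key = "<your-form-recognizer-key>"                            # the service key from the portal

# Kick off training against the container of sample invoices (source_url is the SAS URL from above).
resp = requests.post(
    f"{endpoint}/formrecognizer/v2.0-preview/custom/models",
    headers={
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/json",
    },
    json={"source": source_url},
)
resp.raise_for_status()

# The Location response header points at the new model; the last path segment is the model id.
model_location = resp.headers["Location"]
model_id = model_location.rstrip("/").split("/")[-1]
print(model_id)
```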
Using this model id we can query the service to see how it did! I included the returned json below in its entirety:
{
"modelInfo": {
"modelId": "dcf82555-bbca-47cc-bdde-c9e50c067a48",
"status": "ready",
"createdDateTime": "2020-04-03T05:22:57Z",
"lastUpdatedDateTime": "2020-04-03T05:22:59Z"
},
"trainResult": {
"trainingDocuments": [
{
"documentName": "Invoice129235.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice21048.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice68349.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice89656.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice98026.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
}
],
"errors": []
}
}
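For reference, that status check is just a GET against the Location URL returned by the training call - something like this (same hedged sketch as above):

```python
# Poll the model until training finishes; status moves from "creating" to "ready".
status = requests.get(
    model_location,
    headers={"Ocp-Apim-Subscription-Key": api_key},
).json()
print(status["modelInfo"]["status"])
```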
That's it for training!
Trying the Model Out
The two-part process used for training (a POST to train and a GET to see how it did) is also used for analyzing a form. In this case we will POST an actual pdf to the service and issue a GET request to view the results.
I highlighted the model id below (it's included in the URI for analysis).
I also highlighted the result ID returned in the response headers. This tells us where we can GET the results (see what I did there?).
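In Python the round trip looks something like the sketch below - again assuming the v2.0-preview endpoints, where the URL for fetching results comes back in the Operation-Location response header:

```python
import time

# Submit a pdf for analysis against the trained model.
with open("Invoice22418.pdf", "rb") as f:
    resp = requests.post(
        f"{endpoint}/formrecognizer/v2.0-preview/custom/models/{model_id}/analyze",
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/pdf",
        },
        data=f.read(),
    )
resp.raise_for_status()

# The response headers tell us where to GET the results.
result_url = resp.headers["Operation-Location"]

# Poll until the analysis completes.
while True:
    result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": api_key}).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)
```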
That's it! In a few minutes you should be able to have something working!
Let's Compare!
Before we ride off into the sunset, we need to see how well the service did. For reference, we submitted Invoice22418.pdf for analysis (there are a bunch of invoices in the test set, but we only tried one). Instead of showing the full result (it is quite large), let's take a look at a couple of interesting parts. The first section has information about when the analysis was run, its status, and so on. The second and third sections are the interesting bits.
The first interesting section returns all of the key/value pairs the service thinks it found. The snippet below shows that it found the Name key and associated the value Raul Tang with it. It also gives bounding boxes for where it found these things. Why might that be useful, you ask? Consider a situation where you want to create an amazing search utility that not only returns the images (yes, this works with images as well) but also draws boxes around the important text in the form. Having bounding boxes totally enables that scenario. This key/value section is of course repeated for every pair the service finds.
"pageResults": [
{
"page": 1,
"keyValuePairs": [
{
"key": {
"text": "Name",
"boundingBox": [
1.1194,
1.3097,
1.5847,
1.3097,
1.5847,
1.5972,
1.1194,
1.5972
],
"elements": null
},
"value": {
"text": "Raul Tang",
"boundingBox": [
1.1194,
1.5347,
1.85,
1.5347,
1.85,
1.8194,
1.1194,
1.8194
],
"elements": null
},
"confidence": 0.36
},
... more key/value pairs
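If you want to pull those pairs out programmatically, here's a quick sketch using the result object from the analysis snippet above (it assumes the pageResults array sits under analyzeResult in the full response, as it does in the v2.0 schema):

```python
# Print every key/value pair the service found, along with its confidence.
for page in result["analyzeResult"]["pageResults"]:
    for pair in page.get("keyValuePairs", []):
        key = pair["key"]["text"]
        value = pair["value"]["text"]
        print(f"{key}: {value} (confidence {pair['confidence']})")
```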
The third section works with actual, literal, bona fide tables. The invoice we tested has this table in it:
This is what makes this so interesting! It returns the actual table in json form. In this case you can see it found the table and saw it had 4 rows and 8 columns. It also returns the bounding box for each cell.
"tables": [
{
"rows": 4,
"columns": 8,
"cells": [
{
"text": "Itm",
"rowIndex": 0,
"columnIndex": 0,
"boundingBox": [
1.1667,
4.8472,
1.3861,
4.8472,
1.3861,
5.0806,
1.1667,
5.0806
],
"confidence": 1.0,
"rowSpan": 1,
"columnSpan": 1,
"elements": null,
"isHeader": true,
"isFooter": false
},
... more columns
What about that one cell that spanned multiple lines? How does it do with that? In this case it handled it perfectly! It noticed that the cell sits at row index 2 and column index 2 (the count starts at 0), and it pulled out all of the appropriate text.
{
"text": "One Handle Stainless Steel Pull Out Kitchen Faucet",
"rowIndex": 2,
"columnIndex": 2,
"boundingBox": [
2.0236,
5.3903,
4.1236,
5.3903,
4.1236,
5.8014,
2.0236,
5.8014
],
"confidence": 1.0,
"rowSpan": 1,
"columnSpan": 1,
"elements": null,
"isHeader": false,
"isFooter": false
},
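Because every cell carries its rowIndex and columnIndex, rebuilding the table in code is straightforward - a minimal sketch using the same result object as above:

```python
# Rebuild each detected table as a 2-D grid so multi-line cells land in the right spot.
for page in result["analyzeResult"]["pageResults"]:
    for table in page.get("tables", []):
        grid = [["" for _ in range(table["columns"])] for _ in range(table["rows"])]
        for cell in table["cells"]:
            grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
        for row in grid:
            print(" | ".join(row))
```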
Review
Let's recap a bit here. We wanted to get structure out of our unstructured files, so we went with Form Recognizer. The process was simple:
- Upload training files to Azure Storage
- Train a model using the training files (a POST to initiate the training and a GET to see how it did)
- Analyze a new pdf form (a POST to initiate the analysis and a GET to view the results)
- Become a bajillionaire with all the new insights gathered from your corporate junk drawer!
Let me know if this works for you!
Your Thoughts?
- Does it make sense?
- Did it help you solve a problem?
- Were you looking for something else?