Making Use of Unstructured Data with Form Recognizer
Every business has its "junk" drawer (or "the share," as Bill or Karen likes to call it). Hidden in this data wasteland are often nuggets of informational gold that could revolutionize the way you do business (we hope). It turns out the most important information is usually locked away in paper forms (and by paper I mean digital paper, like an image or a pdf file). Microsoft recently released a new service called Form Recognizer that is designed to make short work of the structured data hidden in these gems. I shall endeavor to take you through the simple process here!
As an aside, I used this same process as part of a talk I delivered on Azure Cognitive Search at Ignite 2019.
The Process
I will be using a bunch of invoices I generated for a fictional company called Tailwind Traders. They all basically look like this.
Create the Service
The first step is to create the actual Form Recognizer service:
The good news is that the F0 pricing tier already includes 500 free calls per month for analyzing forms. If you're just looking to see whether this service will work for your scenario, this is the perfect pricing tier to select.
Once the service is up and ready to go, you will need the service key and endpoint to do the rest.
All that's left is training and analysis.
Training the Model
This is honestly a super painless process, patterned after the quickstart documentation. I decided to use Postman instead of curl for this example, but you can use whatever you like (including .NET, or Python together with the requests library).
Every machine learning model needs data, and the Form Recognizer service needs to be able to access that data (both to read the files and to list the files in a directory). The best way to provide it is to create an Azure Storage account, create a container, and upload some training data.
How can a completely separate service (such as Form Recognizer) get access to that storage, you ask? With Azure Storage, the best way is to use something called a Shared Access Signature (SAS). You can even generate the signature directly in the portal:
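If you'd rather script it, here's a minimal sketch using the azure-storage-blob Python package. The account name, container name, and key below are placeholders for your own values:

```python
from datetime import datetime, timedelta
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Placeholders - substitute your own storage account, container, and key.
account_name = "tailwindstorage"
container_name = "training-invoices"
account_key = "<storage-account-key>"

# Form Recognizer needs to read files and list the container, so grant read + list.
sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=24),
)

# This is the "source" URL we'll hand to the training call below.
source_url = f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token}"
```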
You only need five files of the same type to train the model - that's it. It's pretty amazing that's all you need. The first part of training is a POST call with the location of the five pdf files.
This returns a Location header in the response that contains the model id (highlighted above). One thing I omitted in the previous screenshot was how I was sending the authentication key. Below you can see the Ocp-Apim-Subscription-Key header being passed with the request.
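Here's roughly what that request looks like in Python with requests - a sketch, assuming the v2.0-preview REST endpoint from the quickstart (the exact API version segment in the URL may differ for you):

```python
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"  # your Form Recognizer endpoint
api_key = "<your-form-recognizer-key>"                            # the service key from the portal

# Kick off training against the container of sample invoices (source_url is the SAS URL from above).
resp = requests.post(
    f"{endpoint}/formrecognizer/v2.0-preview/custom/models",
    headers={
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/json",
    },
    json={"source": source_url},
)
resp.raise_for_status()

# The Location response header points at the new model; the last path segment is the model id.
model_location = resp.headers["Location"]
model_id = model_location.rstrip("/").split("/")[-1]
print(model_id)
```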
Using this model id we can query the service to see how it did! I included the returned json below in its entirety:
{
"modelInfo": {
"modelId": "dcf82555-bbca-47cc-bdde-c9e50c067a48",
"status": "ready",
"createdDateTime": "2020-04-03T05:22:57Z",
"lastUpdatedDateTime": "2020-04-03T05:22:59Z"
},
"trainResult": {
"trainingDocuments": [
{
"documentName": "Invoice129235.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice21048.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice68349.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice89656.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
},
{
"documentName": "Invoice98026.pdf",
"pages": 1,
"errors": [],
"status": "succeeded"
}
],
"errors": []
}
}
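For reference, that status check is just a GET against the Location URL returned by the training call - something like this (same hedged sketch as above):

```python
# Poll the model until training finishes; status moves from "creating" to "ready".
status = requests.get(
    model_location,
    headers={"Ocp-Apim-Subscription-Key": api_key},
).json()
print(status["modelInfo"]["status"])
```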
That's it for training!
Trying the Model Out
The two-part process used for training (a POST to train and a GET to see how it did) is also used for analyzing a form. In this case we will POST an actual pdf to the service and issue a GET request to view the results.
I highlighted the model id below (it's included in the URI for analysis).
I also highlighted the result ID returned in the response headers. This tells us where we can GET the results (see what I did there?).
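In Python the round trip looks something like the sketch below - again assuming the v2.0-preview endpoints, where the URL for fetching results comes back in the Operation-Location response header:

```python
import time

# Submit a pdf for analysis against the trained model.
with open("Invoice22418.pdf", "rb") as f:
    resp = requests.post(
        f"{endpoint}/formrecognizer/v2.0-preview/custom/models/{model_id}/analyze",
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/pdf",
        },
        data=f.read(),
    )
resp.raise_for_status()

# The response headers tell us where to GET the results.
result_url = resp.headers["Operation-Location"]

# Poll until the analysis completes.
while True:
    result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": api_key}).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)
```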
That's it! In a few minutes you should be able to have something working!
Let's Compare!
Before we ride off into the sunset, we need to see how well the service did. For reference, we submitted Invoice22418.pdf for analysis (there are a bunch of invoices in the test set, but we only tried one). Instead of showing the full result (it is quite large), let's take a look at a couple of interesting parts. The first section has information about when the analysis was run, its status, and so on. The second and third sections are the interesting bits.
The first interesting section returns all of the key/value pairs the service thinks it found. The snippet below shows that it found the Name key and associated the value Raul Tang with it. It also gives bounding boxes for where it found these things. Why might that be useful, you ask? Consider a situation where you want to create an amazing search utility that not only returns the images (yes, this works with images as well) but also draws boxes around the important text in the form. Having bounding boxes totally enables that scenario. This key/value section is of course repeated for every pair the service finds.
"pageResults": [
{
"page": 1,
"keyValuePairs": [
{
"key": {
"text": "Name",
"boundingBox": [
1.1194,
1.3097,
1.5847,
1.3097,
1.5847,
1.5972,
1.1194,
1.5972
],
"elements": null
},
"value": {
"text": "Raul Tang",
"boundingBox": [
1.1194,
1.5347,
1.85,
1.5347,
1.85,
1.8194,
1.1194,
1.8194
],
"elements": null
},
"confidence": 0.36
},
... more key/value pairs
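If you want to pull those pairs out programmatically, here's a quick sketch using the result object from the analysis snippet above (it assumes the pageResults array sits under analyzeResult in the full response, as it does in the v2.0 schema):

```python
# Print every key/value pair the service found, along with its confidence.
for page in result["analyzeResult"]["pageResults"]:
    for pair in page.get("keyValuePairs", []):
        key = pair["key"]["text"]
        value = pair["value"]["text"]
        print(f"{key}: {value} (confidence {pair['confidence']})")
```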
The third section works with actual, literal, bona fide tables. The invoice we tested has this table in it:
This is what makes this so interesting! It returns the actual table in json form. In this case you can see it found the table and saw it had 4 rows and 8 columns. It also returns the bounding box for each cell.
"tables": [
{
"rows": 4,
"columns": 8,
"cells": [
{
"text": "Itm",
"rowIndex": 0,
"columnIndex": 0,
"boundingBox": [
1.1667,
4.8472,
1.3861,
4.8472,
1.3861,
5.0806,
1.1667,
5.0806
],
"confidence": 1.0,
"rowSpan": 1,
"columnSpan": 1,
"elements": null,
"isHeader": true,
"isFooter": false
},
... more columns
What about that one cell that spanned multiple lines? How does it do with that? In this case it handled it perfectly! It noticed that the cell sits at row index 2 and column index 2 (the count starts at 0), and it pulled out all of the appropriate text.
{
"text": "One Handle Stainless Steel Pull Out Kitchen Faucet",
"rowIndex": 2,
"columnIndex": 2,
"boundingBox": [
2.0236,
5.3903,
4.1236,
5.3903,
4.1236,
5.8014,
2.0236,
5.8014
],
"confidence": 1.0,
"rowSpan": 1,
"columnSpan": 1,
"elements": null,
"isHeader": false,
"isFooter": false
},
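Because every cell carries its rowIndex and columnIndex, rebuilding the table in code is straightforward - a minimal sketch using the same result object as above:

```python
# Rebuild each detected table as a 2-D grid so multi-line cells land in the right spot.
for page in result["analyzeResult"]["pageResults"]:
    for table in page.get("tables", []):
        grid = [["" for _ in range(table["columns"])] for _ in range(table["rows"])]
        for cell in table["cells"]:
            grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
        for row in grid:
            print(" | ".join(row))
```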
Review
Let's recap a bit here. We wanted to get structure out of our unstructured files, so we went with Form Recognizer. The process was simple:
- Upload training files to Azure Storage
- Train a model using the training files (a POST to initiate the training and a GET to see how it did)
- Analyze a new pdf form (a POST to initiate the analysis and a GET to view the results)
- Become a bajillionaire with all the new insights gathered from your corporate junk drawer!
Let me know if this works for you!
Your Thoughts?
- Does it make sense?
- Did it help you solve a problem?
- Were you looking for something else?