This isn’t a post about which OCR engine is better. Sometimes FineReader will do a better job, sometimes RecoStar shines. In fact, sometimes you’d love to work with the results of both engines. Kofax Transformation, however, only supports one full-page OCR engine at a time. Sure, there’s the OCR re-read option, but that’s tied to a field. What if you want your format locators to run on FineReader, but fire your database locators against RecoStar’s results?
Well, I know how to modify and handle representations, which are basically the object model Kofax Transformation uses to store OCR results. Representations consist of pages, text lines, and individual words, which themselves have coordinates, text, and so on. However, while I’ve been working with Kofax products for more than 10 years now, it was only today that I learned how to fire OCR in script (thanks to Brendan’s post in the LinkedIn user group). So, I wanted to share the results. Today’s questions are:
- Can we perform OCR in script?
- Can we tell the locators which OCR results to use?
Firing more than one OCR engine
By default, this is what you get when you process a document in Transformations:
Our document was processed with the default settings, so we have Abbyy’s FineReader results. There’s one representation by default, and all locators will use the respective results. However, Kofax ships the components required to perform OCR in script with Transformations. Make sure to add the following references to the script:
The recognizer objects have a Recognize method. Just provide an xdoc, a page profile, and a page number, and you’ll end up with a new OCR representation. There’s a catch, however: this method will only update the very first representation in the xdoc, and it does not seem to clear that representation first. Long story short, here’s the script that allows you to perform OCR twice:
Private Sub Document_BeforeExtract(ByVal pXDoc As CASCADELib.CscXDocument)
    Dim repFR As CscXDocRepresentation
    Dim repRS As CscXDocRepresentation
    Dim recognizerFR As New MpsPageRecognizerFR
    Dim recognizerRS As New MpsPageRecognizerRecoStar
    Dim idxPg As Long
    ' fire up OCR engines, first finereader, then recostar
    XDocument_ClearAllRepresentations(pXDoc)
    For idxPg = 0 To pXDoc.CDoc.Pages.Count - 1
        recognizerFR.Recognize(pXDoc, Project.RecogProfiles.ItemByName("FineReader"), idxPg)
    Next
    Set repFR = pXDoc.Representations(0)
    XDocument_ClearAllRepresentations(pXDoc)
    For idxPg = 0 To pXDoc.CDoc.Pages.Count - 1
        recognizerRS.Recognize(pXDoc, Project.RecogProfiles.ItemByName("RecoStar"), idxPg)
    Next
    Set repRS = pXDoc.Representations(0)
    ' finally, remove all reps again and repopulate them, in any order to your liking
    ' (here: FineReader ends up at index 0, RecoStar at index 1)
    XDocument_ClearAllRepresentations(pXDoc)
    Representation_Copy(repFR, pXDoc.Representations.Create("FineReader"))
    Representation_Copy(repRS, pXDoc.Representations.Create("RecoStar"))
    pXDoc.Save
End Sub
Public Sub XDocument_ClearAllRepresentations(pXDoc As CscXDocument)
    ' helper to remove all existing representations from an xdocument
    ' (iterates backwards so the remaining indices stay valid)
    Dim i As Long
    For i = pXDoc.Representations.Count - 1 To 0 Step -1
        pXDoc.Representations.Remove(i)
    Next
End Sub
Public Sub Representation_Copy(fromRep As CscXDocRepresentation, toRep As CscXDocRepresentation)
    ' copies one representation to another
    Dim idxPg As Long
    Dim idxWord As Long
    For idxPg = 0 To fromRep.Pages.Count - 1
        For idxWord = 0 To fromRep.Pages(idxPg).Words.Count - 1
            toRep.Pages(idxPg).AddWord(Word_Create(fromRep.Pages(idxPg).Words(idxWord)))
        Next
    Next
    toRep.AnalyzeLines
End Sub
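Representation_Copy calls a Word_Create helper that is not part of the listing above. As a minimal, untested sketch, and assuming that CscXDocWord can be instantiated with New and exposes writable Text, PageIndex, Left, Top, Width and Height properties (please check the object browser for your version), it could look roughly like this:

```vb
Public Function Word_Create(src As CscXDocWord) As CscXDocWord
    ' Sketch only: the property names below are assumptions, verify them in the object browser.
    ' Builds a fresh word that carries over text and geometry, but none of the
    ' representation-specific indices of the source word.
    Dim w As New CscXDocWord
    w.Text = src.Text
    w.PageIndex = src.PageIndex
    w.Left = src.Left
    w.Top = src.Top
    w.Width = src.Width
    w.Height = src.Height
    Set Word_Create = w
End Function
```

The indices are deliberately left alone here; the AnalyzeLines call at the end of Representation_Copy takes care of them.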
Now, let’s have a look at this xDoc again: we ended up with the results of both engines. You’ll notice some differences. In our example, RecoStar found 89 words, while FineReader found 85. By default, all locators will use the results from the first representation at index 0, which is, in this case, FineReader.
Here’s an example: in the second representation, RecoStar interpreted a checkbox as a strange-looking word. When firing a format locator with exactly this string, we won’t be seeing any results:
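If you prefer to check this from script rather than in the xDoc browser, here is a small sketch that prints the name and word count of every representation. It only uses the collections already seen in Representation_Copy; the two helper names are made up for this example, and Debug.Print is assumed to write to the script editor’s Immediate window.

```vb
Public Function Representation_CountWords(rep As CscXDocRepresentation) As Long
    ' sums up the words of all pages of a single representation
    Dim idxPg As Long
    Dim total As Long
    total = 0
    For idxPg = 0 To rep.Pages.Count - 1
        total = total + rep.Pages(idxPg).Words.Count
    Next
    Representation_CountWords = total
End Function

Public Sub XDocument_PrintRepresentations(pXDoc As CscXDocument)
    ' lists index, name and word count of each representation (hypothetical debugging helper)
    Dim i As Long
    For i = 0 To pXDoc.Representations.Count - 1
        Debug.Print i & ": " & pXDoc.Representations(i).Name & " - " & _
            Representation_CountWords(pXDoc.Representations(i)) & " words"
    Next
End Sub
```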
As expected, when testing the locator we end up with nothing:
Again, this is just to illustrate that the second representation does not matter – yet. No worries, we’ll make it matter.
Swapping Representations
Good news is – we already have what we need. The script above already contains a helper to copy a representation – so, swapping them seems easy enough:
Public Sub XDocument_SwapRepresentations(pXDoc As CscXDocument)
    ' swaps the first two representations of an xdoc (will only work when there are exactly two reps)
    Dim rep0 As CscXDocRepresentation
    Dim rep1 As CscXDocRepresentation
    Dim n0 As String
    Dim n1 As String
    If pXDoc.Representations.Count = 2 Then
        Set rep0 = pXDoc.Representations(0)
        n0 = pXDoc.Representations(0).Name
        Set rep1 = pXDoc.Representations(1)
        n1 = pXDoc.Representations(1).Name
        ' rename the "old" representations
        rep0.Name += "_old"
        rep1.Name += "_old"
        ' copy and remove the old reps
        Representation_Copy(rep1, pXDoc.Representations.Create(n1))
        Representation_Copy(rep0, pXDoc.Representations.Create(n0))
        pXDoc.Representations.Remove(0)
        pXDoc.Representations.Remove(0)
    End If
End Sub
Note that this script only works when there are exactly two representations in the xdoc. But where to call it? Easy enough: as locators run in exactly the order in which they appear in Project Builder, just call the swap helper from a script locator placed right before the locator that needs the other engine’s results. In our simplified use case, that’s right before our format locator.
The script locator itself contains nothing more than a call to XDocument_SwapRepresentations, and you can call it as often as you like (or need) to. So, here’s the result:
And as expected, when we fire up the format locator we get a 100% hit:
That’s all you need to do if you want to use results from more than one OCR engine. Please note that firing two engines will likely deduct every page from your license’s page count twice; however, I did not verify that.
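For completeness, a sketch of what such a script locator could look like. The locator name SL_SwapOcr is made up for this example; the event stub itself is generated by Project Builder when you add a script locator, so use whatever name and signature it creates for you.

```vb
Private Sub SL_SwapOcr_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
    ' SL_SwapOcr is a hypothetical locator name.
    ' This locator returns no alternatives; it only swaps the two representations
    ' so that every locator running after it sees the other engine's results.
    XDocument_SwapRepresentations(pXDoc)
End Sub
```

If later locators should see the original engine again, place a second script locator with the same call further down the list.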
Zany Zone (aka side notes about the Script)
- Why the Word_Create helper? It seems that you cannot add one representation’s word to another representation. From what I’ve learned, that is related to some representation-specific properties, such as IndexInBlock, IndexInTextLine, IndexOnDocument and IndexOnPage. All these indices are recalculated when firing the respective method (in our case, AnalyzeLines is sufficient). Hence the helper: here, we create a new word with the defaults for all indices (i.e. -1).
- Why does swapping only work with 2 alternatives? Well, because I coded it that way. I wanted some quick results. Feel free to improve the method, and please send me the improved version 😉
- Do I have to pay twice for every page, once per engine? I really don’t know, and it’s hard to say with a developer’s license. Give it a shot and let me know!
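On the two-representation limit: here is an untested sketch of how the swap could be generalized to any number of representations. The helper name XDocument_PromoteRepresentation is made up; it moves the representation at a given index to index 0 and keeps the others in their relative order. It relies on the same assumptions as the swap helper, namely that Create appends the new representation at the end and that Name can be reassigned.

```vb
Public Sub XDocument_PromoteRepresentation(pXDoc As CscXDocument, ByVal idx As Long)
    ' Sketch only (hypothetical helper): moves representation idx to index 0.
    Dim n As Long
    Dim i As Long
    Dim names() As String
    n = pXDoc.Representations.Count
    If idx <= 0 Or idx >= n Then Exit Sub   ' already first, or invalid index
    ReDim names(n - 1)
    ' remember the original names, then rename the old reps to avoid clashes
    For i = 0 To n - 1
        names(i) = pXDoc.Representations(i).Name
        pXDoc.Representations(i).Name = names(i) & "_old"
    Next
    ' append copies in the new order: the promoted rep first, then the rest
    Representation_Copy(pXDoc.Representations(idx), pXDoc.Representations.Create(names(idx)))
    For i = 0 To n - 1
        If i <> idx Then
            Representation_Copy(pXDoc.Representations(i), pXDoc.Representations.Create(names(i)))
        End If
    Next
    ' the n old representations still sit at the front, so remove index 0 n times
    For i = 1 To n
        pXDoc.Representations.Remove(0)
    Next
End Sub
```

With exactly two representations, calling it with idx = 1 should behave like XDocument_SwapRepresentations.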