Fuzzy Dictionary Substitution

This is a powerful, script-free technique for finding and uniquely labeling parts of a document for use in later locators.
When a dictionary is added to a format locator, fuzzy matching retrieves all occurrences of the dictionary entries on a document, including duplicates.
In comparison, a Fuzzy Database Locator returns only ONE occurrence of an entry, not ALL.

Example 1 - Finding more information than is actually on the document

In the example below, Fuzzy Dictionary Substitution was used to find the name Patricia on the document.
image
But instead of returning Patricia as the value, the dictionary returned Patricia_1077 via auto replacement. 1077 out of every 100,000 American females are called Patricia (US Census Data 2010), so I retrieve 5 pieces of information to use in other locators:

  • A confidence of 95.06%, because "Patricia" in the database fuzzy matched "Patricia*" on the document.
  • Patricia as the person's name.
  • 1077 as the frequency of the person's name.
  • The exact coordinates of Patricia on the document, which you can see in the green box.
  • The words from the OCR. Using Alternative.Words.Text I can retrieve Patricia*.

This is achieved using a fuzzy database with auto replacement values in it.
image
and inserting the dictionary into a format locator
image
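
The replaced value can then be unpacked by a later locator. The following is a minimal sketch, not a definitive implementation. It assumes a script locator named SL_NameFrequency that runs after a format locator named FL_Names (both names are hypothetical), and it assumes every replacement value follows the Name_Frequency pattern shown above.

```vb
' Sketch only: "FL_Names" and "SL_NameFrequency" are hypothetical locator names.
Private Sub SL_NameFrequency_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
   Dim Source As CscXDocFieldAlternatives, A As Long, Parts() As String
   Set Source = pXDoc.Locators.ItemByName("FL_Names").Alternatives
   For A = 0 To Source.Count - 1
      ' "Patricia_1077" -> Parts(0) = "Patricia", Parts(1) = "1077"
      Parts = Split(Source(A).Text, "_")
      If UBound(Parts) = 1 Then ' skip values that do not match the assumed pattern
         With pLocator.Alternatives.Create
            .Text = Parts(1)                    ' the frequency; use Parts(0) for the name
            .Confidence = Source(A).Confidence  ' keep the fuzzy-match confidence
            ' Copy the coordinates so the new alternative points at the same place on the page
            .PageIndex = Source(A).PageIndex
            .Left = Source(A).Left
            .Top = Source(A).Top
            .Width = Source(A).Width
            .Height = Source(A).Height
         End With
      End If
   Next
End Sub
```

Copying the confidence and coordinates means the derived alternative still highlights the original name on the document.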

Example 2 - Finding Text Anchors for values in large documents

This can be useful for parsing large tables with identifying labels or finding checkboxes and OCR fields scattered throughout a large document.
Consider this example.
image
And say you are interested in extracting these results:

| id | amount |
|----|--------|
| 1  | 10     |
| 3a | 9,4    |
| 12 | 0      |

Note that we want 3a as the id, even though that exact text is not on the document, which has 3-a) instead.

  • Make a dictionary that finds these important phrases and auto-replaces them with unique codes. Make sure the phrases are long and unique.
    image
  • Add this dictionary to the project with auto-replace turned on.
    image
  • Add the dictionary to a format locator.
    image
  • Add this script to customize the format locator to remove alternatives with a confidence less than 80%.
    ```vb
    Private Sub Document_AfterLocate(ByVal pXDoc As CASCADELib.CscXDocument, ByVal LocatorName As String)
       Dim A As Long, Alternatives As CscXDocFieldAlternatives
       Set Alternatives = pXDoc.Locators.ItemByName(LocatorName).Alternatives
       Select Case LocatorName
       Case "FL_Table"
          ' Always count backwards when deleting, so the indexes still to be visited do not shift
          For A = Alternatives.Count - 1 To 0 Step -1
             If Alternatives(A).Confidence < 0.8 Then Alternatives.Remove(A)
          Next
       End Select
    End Sub
    ```
  • Test! The results contain the precise locations and unique labels that a following locator needs. Note that there is an OCR error in the 3rd value (confidence = 98.06%), which didn't matter. Fuzzy matching with long phrases is very robust. In this example, the 3 text lines of the results can be used to retrieve the last word on each text line and insert it into a custom table locator.
    image
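
Retrieving the last word on each matched text line could look roughly like the sketch below. It is an illustration under assumptions, not the script used for the screenshots: SL_Amounts is a hypothetical script locator that runs after FL_Table, and the words of each format locator alternative are assumed to carry a LineIndex that refers to pXDoc.TextLines. Filling a table locator is left out.

```vb
' Sketch only: "SL_Amounts" is a hypothetical script locator placed after FL_Table.
Private Sub SL_Amounts_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
   Dim Anchors As CscXDocFieldAlternatives, A As Long
   Dim TextLine As CscXDocTextLine, LastWord As CscXDocWord
   Set Anchors = pXDoc.Locators.ItemByName("FL_Table").Alternatives
   For A = 0 To Anchors.Count - 1
      If Anchors(A).Words.Count > 0 Then
         ' Find the text line that holds the matched phrase (assumes LineIndex indexes pXDoc.TextLines)
         Set TextLine = pXDoc.TextLines(Anchors(A).Words(0).LineIndex)
         Set LastWord = TextLine.Words(TextLine.Words.Count - 1)
         With pLocator.Alternatives.Create
            .Text = LastWord.Text               ' the amount; the id is available as Anchors(A).Text
            .Confidence = Anchors(A).Confidence
            .PageIndex = LastWord.PageIndex
            .Left = LastWord.Left
            .Top = LastWord.Top
            .Width = LastWord.Width
            .Height = LastWord.Height
         End With
      End If
   Next
End Sub
```

Because the confidence filter above has already removed weak matches, every remaining anchor is treated as a valid row here.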