This framework shows how to build powerful table extraction algorithms and integrate them into Kofax Transformation such that:
- The automatic + learning mode is supported and indeed enhanced.
- The Learning mechanism in the Table Locator works as follows:
- Find the best classification matches of the document against the specific training set/knowledge bases.
- For the best N (5?) specific samples, check whether each has a manual Table Definition that was either trained in Project Builder with the Edit Document dialog or automatically generated from an online learning sample.
- The Test feature of the Table Locator works, including running on demand the previous locators that the scripts require.
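
The sample-selection step of the learning mechanism described above can be sketched as follows. This is only an illustration in plain Python, not Kofax Transformation script: the `Sample` class, its fields, and the function name are hypothetical, and the default of 5 simply mirrors the open "N (5?)" question in the notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical stand-in for a specific training sample."""
    name: str
    confidence: float                 # classification match score, 0..1 (assumed)
    table_definition: Optional[dict]  # manual or auto-generated definition, if any


def samples_with_table_definition(samples, n=5):
    """Take the n best classification matches, keep those with a table definition."""
    best = sorted(samples, key=lambda s: s.confidence, reverse=True)[:n]
    return [s for s in best if s.table_definition is not None]


samples = [
    Sample("invoice_a", 0.91, {"columns": ["Code", "Amount"]}),
    Sample("invoice_b", 0.88, None),
    Sample("invoice_c", 0.42, {"columns": ["Date", "Total"]}),
]
selected = samples_with_table_definition(samples, n=2)
print([s.name for s in selected])  # prints ['invoice_a']
```

Note that the filter runs after the top-N cut, so a sample with a table definition that ranks below N is ignored. Whether that is the desired behavior is an open design choice.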
There are 3 places where scripts can contribute to table locator extraction:
- Document_BeforeExtract
- General changes to the OCR layer of the document
- splitting "FirstName,LastName" into two words, where there is no space after the comma
- dollar, comma and period repairs in numbers?
- This script needs to detect whether it has already run, so that it does not process the same document repeatedly
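
As a rough sketch of the Document_BeforeExtract ideas above, the following plain Python splits comma-joined names and guards against running twice on the same document. The document is modeled as a simple dict, and the `ocr_repaired` flag and both function names are hypothetical. Keeping the comma on the first word is also an assumption.

```python
import re

# Two alphabetic words joined by a comma with no space after it, e.g. "Smith,John".
# Numbers such as "1,234.50" do not match because they start with a digit.
_NAME_PAIR = re.compile(r"^([A-Za-z][A-Za-z'\-]*),([A-Za-z][A-Za-z'\-]*)$")


def split_comma_joined_words(words):
    """Return a new word list with "First,Last" split into "First," and "Last"."""
    out = []
    for word in words:
        match = _NAME_PAIR.match(word)
        if match:
            out.extend([match.group(1) + ",", match.group(2)])
        else:
            out.append(word)
    return out


def repair_ocr_layer(doc):
    """Apply the repairs once; later calls on the same document are no-ops."""
    if doc.get("ocr_repaired"):  # hypothetical marker stored with the document
        return doc
    doc["words"] = split_comma_joined_words(doc["words"])
    doc["ocr_repaired"] = True
    return doc


doc = {"words": ["Smith,John", "1,234.50", "Total"]}
repair_ocr_layer(doc)
repair_ocr_layer(doc)  # second call changes nothing
print(doc["words"])  # prints ['Smith,', 'John', '1,234.50', 'Total']
```

A real script would also have to split the word's coordinates on the OCR layer and store the marker somewhere that persists with the document. Neither is shown here.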
- Document_BeforeLocate
- Correct errors in table header words, create table headers, or simplify table headers to make header detection easier for the table locator
- Document_AfterLocate
- Allow many different table algorithms to run. Each algorithm will be passed two parameters (pXDoc As CscXDocument, Table As CscXDocTable)
- Can we simply add more alternatives to the table locator and create a table on each alternative? (I have never done this)
- "Vote" on results, or perhaps merge results
- One algorithm should check the online-learning samples and use them for our own "manual table locator" algorithms
- Does someone know how to get the cell coordinates of the manual table locator sample?
- Repair table structure
- check for missing rows or whether extrapolation needs to go further down (or up)
- check for missing cells in rows, because rows above or below have a value here
- check for misaligned cells and correct them
- Repair table cells
- correct misspelled names?
- correct procedure codes??
- check (and correct?) running total rows in the tables
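
One possible reading of the total-row check above is sketched below in plain Python. It assumes that each total row holds the sum of the detail rows since the previous total row. The notes do not say this, and a cumulative total would need the reset removed. The row layout and function name are hypothetical.

```python
from decimal import Decimal


def check_total_rows(rows):
    """rows: list of (is_total, amount_text) tuples in table order.

    Returns a list of (row_index, expected, found) for total rows that do not
    equal the sum of the detail rows since the previous total row.
    """
    mismatches = []
    running = Decimal("0")
    for index, (is_total, amount_text) in enumerate(rows):
        value = Decimal(amount_text)
        if is_total:
            if value != running:
                mismatches.append((index, running, value))
            running = Decimal("0")  # assumption: totals restart after each total row
        else:
            running += value
    return mismatches


rows = [
    (False, "10.00"),
    (False, "5.50"),
    (True, "15.50"),   # correct
    (False, "7.00"),
    (True, "8.00"),    # wrong: should be 7.00
]
print(check_total_rows(rows))
# prints [(4, Decimal('7.00'), Decimal('8.00'))]
```

`Decimal` is used instead of `float` so that currency amounts compare exactly. Whether a mismatch should be corrected automatically or only flagged is left open, as in the notes.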
- Post Processing
- Remove from all tables the rows and columns that are of no interest to the customer. (We may have extracted more than they need)
- Generate table metadata for benchmarking tables in a locator, SL_TableMetadata:
- TableRowCount
- TableCellCount (ignore empty cells)
- TablePatientIds
- TablePatientNames
- TableProcedureCodes
- TableCopaySum
- TableTotalSum
- TablePayableSum
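
The metadata fields listed above could be computed along these lines. This is a plain Python sketch over a table modeled as a list of dicts. The column names (`PatientId`, `PatientName`, `ProcedureCode`, `Copay`, `Total`, `Payable`) and helper names are hypothetical, and how the values are stored in SL_TableMetadata is not specified in the notes.

```python
from decimal import Decimal, InvalidOperation


def _to_amount(text):
    """Parse "$1,150.00"-style text; unparsable or empty cells count as 0."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return Decimal("0")


def _column_sum(rows, column):
    return sum((_to_amount(row.get(column, "")) for row in rows), Decimal("0"))


def _distinct(rows, column):
    """Non-empty values of a column, de-duplicated, in first-seen order."""
    return list(dict.fromkeys(
        row[column].strip() for row in rows if row.get(column, "").strip()
    ))


def table_metadata(rows):
    return {
        "TableRowCount": len(rows),
        "TableCellCount": sum(1 for row in rows for v in row.values() if v.strip()),
        "TablePatientIds": _distinct(rows, "PatientId"),
        "TablePatientNames": _distinct(rows, "PatientName"),
        "TableProcedureCodes": _distinct(rows, "ProcedureCode"),
        "TableCopaySum": _column_sum(rows, "Copay"),
        "TableTotalSum": _column_sum(rows, "Total"),
        "TablePayableSum": _column_sum(rows, "Payable"),
    }


rows = [
    {"PatientId": "P1", "PatientName": "Ann Lee", "ProcedureCode": "99213",
     "Copay": "$20.00", "Total": "$1,150.00", "Payable": "$130.00"},
    {"PatientId": "P1", "PatientName": "Ann Lee", "ProcedureCode": "80050",
     "Copay": "", "Total": "$75.00", "Payable": "$75.00"},
]
meta = table_metadata(rows)
print(meta["TableRowCount"], meta["TableCellCount"], meta["TableTotalSum"])
# prints 2 11 1225.00
```

Treating unparsable amounts as zero keeps the sums computable but can hide OCR errors, so a benchmark might prefer to count such cells separately.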