This in-depth, step-by-step guide takes you through the following steps:
- Collecting sample files to build a KT project in KTA.
  - Naming them well to make them easy to sort and work with.
  - Creating subsets.
- Converting PDF to TIFF to dramatically speed up development time without losing the PDF text.
- Getting the truth into the sample documents.
- Importing the truth documents from KTA into Transformation Designer and restoring the original file names
- Building a table benchmark using metafields:
  - Row count.
  - Row alignment.
  - Column alignment.
  - Cell content matching.
  - Numeric column sums.
- Combining Advanced Table Locator (ATL) with Automatic Table Locator, Specific Online Learning and Manual Table Locator to optimize table extraction.
- "Voting" by script between ATL and TL.
- Enhancing TL by script to:
  - find missing rows.
  - switch columns.
  - check for missing words.
  - fix line-wrapped table cells.
- Migrating XDocs from the training project into your main KTA project without losing data.
- Training and benchmarking in your main KTA project.
Collect 1 day's worth of documents from your customer.
You need representative files for your project. The sample files need to represent what actually happens in a normal day - the good, bad and weird documents - all of them.
- Make sure they include all the "good" documents and all the "bad" documents and all the "weird" documents.
- Your project has to deliver excellent extraction performance on the "good" documents and great user productivity on the "bad" documents. You cannot simply consider the "bad" documents out of scope. You need to demonstrate that your solution is excellent for bad documents: make it easy and fast for the human to enter data in validation.
- Ask the customer, "Are there any other documents? Is that all? Got any weird ones? Different documents at different times of the year?" Tell them, "Any different documents that appear later are out of scope".
Name them well.
- The nicest format is CompanyName_idnumber. That makes them easy to find, sort and identify.
- Open Windows Explorer and turn on Preview Pane. This is one of the best tools for examining, renaming and sorting files.
- In Windows Explorer you can also enable Navigation pane > Expand to open folder to see the folder tree, which you can use to sort and name documents.
- You can now quickly look through the documents and rename them before you import them into Transformation Designer. Transformation Designer does have the clustering tool, but that is not really effective for browsing documents, and it forces you to name every document correctly the first time. It can be more frustrating than beneficial.
- Import the files into Transformation Designer with the following settings to create a subset for each folder in Windows Explorer.
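The folder layout itself can be prepared before the import. Because each Windows Explorer folder becomes a subset, the CompanyName_idnumber naming convention can drive the sorting. The following is a minimal Python sketch of that idea, not part of KTA or Transformation Designer. It assumes every sample file is already named CompanyName_idnumber with some extension, and the function name `sort_into_company_folders` is hypothetical.

```python
import shutil
from pathlib import Path


def sort_into_company_folders(root: Path) -> dict[str, list[str]]:
    """Move files named CompanyName_idnumber.* into root/CompanyName/.

    Assumption: the id number is the part after the LAST underscore,
    so company names may themselves contain underscores.
    Files without an underscore in their name are left untouched.
    """
    moved: dict[str, list[str]] = {}
    # sorted() materialises the listing before any folder is created
    for f in sorted(root.iterdir()):
        if not f.is_file() or "_" not in f.stem:
            continue
        company = f.stem.rsplit("_", 1)[0]
        target = root / company
        target.mkdir(exist_ok=True)
        shutil.move(str(f), str(target / f.name))
        moved.setdefault(company, []).append(f.name)
    return moved
```

Calling `sort_into_company_folders(Path("samples"))` on a folder of renamed samples leaves one subfolder per company, which can then be imported as one subset each.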
PDF documents are problematic for Transformation Designer because:
- Every time you run a locator or classify the document, the PDF renderer has to redraw each page.
- You cannot split or change pages in a PDF document.
The XDocs are loaded with the text from the PDF documents.
This script generates single-page TIFF images from the PDF document and replaces them in the XDoc. The PDF is then disconnected from the XDoc.
Now you have the perfect text, single pages you can re-order and images for fast locators and benchmarks. No downsides!
You want this project small and fast for rapid development and testing.
- Add a top-level document class and a subclass to contain the documents you want to train.
- Select Enable Table Detection in the class details.
- Add the table model and field that you need.
- Add a Field Group TableBenchmark and the fields TableRowCount, TableRowAlignment, TableColumnAligment, TableTP, TableFN, TableTN, TableFP, which will be used for the table benchmarking.
- Add the Advanced Table Locator (ATL) and the Table Locator (TL) and assign the table model to both.
- Assign your Table Field to the Table Locator. We will use a script to copy the ATL result to the TL if it is the better result. This way we can benefit from the online learning of the TL, while the ATL has no online learning.
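The TableRowCount, TableTP, TableFN, TableTN and TableFP fields in the TableBenchmark field group can be filled by comparing the extracted table with the truth table cell by cell. The following Python sketch is not KT script code; it only illustrates one possible counting scheme. It assumes that a table is a list of rows of cell strings, that a cell is a true positive when truth and extraction match exactly after trimming whitespace, and that any wrong or spurious value counts as a false positive. The names `TableBenchmark` and `benchmark_table` are hypothetical.

```python
from dataclasses import dataclass

Table = list[list[str]]  # rows of cell texts


@dataclass
class TableBenchmark:
    row_count_ok: bool  # extracted row count equals truth row count
    tp: int  # cell has truth and extraction matches it
    fn: int  # cell has truth but nothing was extracted
    tn: int  # cell is empty in both truth and extraction
    fp: int  # something was extracted that does not match truth


def _cell(table: Table, r: int, c: int) -> str:
    """Return the trimmed cell text, or '' if the cell does not exist."""
    if r < len(table) and c < len(table[r]):
        return table[r][c].strip()
    return ""


def benchmark_table(truth: Table, extracted: Table) -> TableBenchmark:
    tp = fn = tn = fp = 0
    rows = max(len(truth), len(extracted))
    cols = max((len(row) for row in truth + extracted), default=0)
    for r in range(rows):
        for c in range(cols):
            t = _cell(truth, r, c)
            e = _cell(extracted, r, c)
            if not t and not e:
                tn += 1
            elif t and not e:
                fn += 1
            elif t == e:
                tp += 1
            else:
                # assumption: wrong and spurious values both count as FP
                fp += 1
    return TableBenchmark(len(truth) == len(extracted), tp, fn, tn, fp)
```

Comparing by row and column index only works when rows and columns are already aligned, which is why row and column alignment are separate benchmark fields. A stricter scheme might count a wrong value as both a false positive and a false negative.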