Gibberish Detection Script

Pages that are processed in Kofax Transformation may contain unintelligible gibberish because of scanning problems, poor image quality or obfuscation occurring in a PDF document. This fast script detects whether the document contains "readable language". There are two steps:

Create a Dictionary.
Create locators to test a document with the Dictionary.

Create a Dictionary

Load a set of documents into Project Builder that contain the kind of language you need. Add as many documents as possible; hundreds or thousands are better.
Switch the Document Viewer to Hierarchy View.
Switch the Runtime Script Events to Event Batch_Close.
Copy the script below to the Project Class.
In the Script Debugger, use Edit/References... to add a reference to Microsoft Scripting Runtime so you can use the Dictionary object.
Execute the script by clicking the Lightning icon.

This script can process 100 documents in about 10 seconds. It removes punctuation, ignores words that contain digits and stores words that are 2 or more characters long.
Copy the dictionary file c:\temp\english.dict into your project's dictionary folder.
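Add the dictionary to your project. Set Delimiter=comma, and select replacement values.
In this example the dictionary contains the word "contra", which has a length of 6 and a frequency of 0.411, stored as contra,6-0.411. The script counts every occurrence of a word and divides by the number of documents, so the frequency is an average number of occurrences per document rather than a strict share of documents.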
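
The dictionary-building rules can be checked outside Project Builder with a small standalone sketch. The Python below is an illustration only and is not part of the Kofax script. The function name `build_dictionary`, the `min_freq` parameter and the sample documents are hypothetical. It mirrors the rules described above: lowercase each word, skip words that contain a digit, strip punctuation, keep words of 2 or more characters, and emit `word,length-frequency` lines for words that occur more than 0.1 times per document on average.

```python
from collections import Counter

DIGITS = "0123456789"
# Same punctuation set as the script's punc constant.
PUNC = "!£$%^&*()<>,.;:'@[]#{}\\|/\","


def build_dictionary(documents, min_freq=0.1):
    """Return 'word,length-frequency' lines for a list of document texts."""
    counts = Counter()
    for text in documents:
        for word in text.lower().split():
            # Skip any word that contains a digit.
            if any(d in word for d in DIGITS):
                continue
            # Strip punctuation characters.
            for ch in PUNC:
                word = word.replace(ch, "")
            # Keep words of 2 or more characters.
            if len(word) > 1:
                counts[word] += 1
    doc_count = len(documents)
    lines = []
    for word, count in counts.items():
        # Frequency is total occurrences divided by number of documents.
        if count > doc_count * min_freq:
            lines.append(f"{word},{len(word)}-{count / doc_count:.3f}")
    return lines


# Hypothetical sample documents.
docs = ["The contract, dated 2020, is void.", "The contract is valid; the end."]
lines = build_dictionary(docs)
common = build_dictionary(docs, min_freq=0.6)
```

With only two sample documents every surviving word passes the default threshold. Raising `min_freq` shows how rarer words drop out of the dictionary.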

Create locators to test a document with the dictionary.

At the project level add a Format Locator FL_English and a Script Locator SL_English. You cannot use a Database Locator because it only returns one result per document. A Format Locator returns all results.
Configure the Format Locator to use only the dictionary by selecting Regular Expression and adding a dictionary. This makes a fuzzy dictionary locator and does not use regular expressions.
Test the locator and you will see all meaningful words on the document highlighted and their text replaced with their length and frequency.
Note in the example below that the 4th alternative is 8 characters long and appears on average 1.667 times per document.
The Script Locator SL_English then sums the length of all words found (multiplied by their frequency) and then divides by the number of characters in the document. This should return a number above 100% for meaningful words and a number close to zero for gibberish.
Use a benchmark set of real language documents and gibberish documents to find the best confidence threshold for your project.
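
The scoring arithmetic can be sketched outside Kofax Transformation as follows. This Python is an illustration under assumptions, not the locator itself. The function name `gibberish_score` and the sample dictionary are hypothetical, and it uses exact word matching where the Format Locator matches fuzzily. It sums length times frequency for every dictionary word found and divides by the number of non-space characters.

```python
def gibberish_score(document_text, dictionary):
    """Sum length * frequency over dictionary words found,
    divided by the number of non-space characters in the text."""
    words = document_text.lower().split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return 0.0  # an empty document has no readable language
    score = 0.0
    for w in words:
        if w in dictionary:
            length, freq = dictionary[w]
            score += length * freq
    return score / total_chars


# Hypothetical dictionary: word -> (length, average occurrences per document).
sample_dict = {"the": (3, 1.5), "contract": (8, 1.0), "is": (2, 1.0)}

readable = gibberish_score("the contract is void", sample_dict)  # 14.5 / 17
gibberish = gibberish_score("xq zvv kkp", sample_dict)           # no matches
```

Running the same function over a benchmark set of known-good and known-gibberish texts gives two groups of scores, and the threshold goes between them.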

Ways to extend this further

Make multiple dictionaries for different document sets.
Test each page individually for gibberish.
Find areas on a page where particular language is, e.g. a doctor's diagnosis text inside a larger document.
Change the scoring method, e.g. ignore frequency and only use word length.
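
Two of these extensions, per-page testing and length-only scoring, can be combined in a short standalone sketch. The Python below is hypothetical and sits outside Kofax Transformation. The names `length_only_score` and `score_pages` are invented for illustration, and matching is exact rather than fuzzy. The score is the share of non-space characters that belong to known words, so it always lies between 0 and 1.

```python
def length_only_score(text, vocabulary):
    """Share of non-space characters that belong to known words (0.0 to 1.0)."""
    words = text.lower().split()
    total = sum(len(w) for w in words)
    if total == 0:
        return 0.0  # treat an empty page as having no readable language
    known = sum(len(w) for w in words if w in vocabulary)
    return known / total


def score_pages(pages, vocabulary):
    """Score each page text separately so a single bad page can be flagged."""
    return [length_only_score(p, vocabulary) for p in pages]


# Hypothetical vocabulary and page texts.
vocab = {"the", "contract", "is", "void"}
page_scores = score_pages(["the contract is void", "xq zvv kkp", ""], vocab)
```

Because this variant is bounded, a threshold chosen on one document set is easier to compare with another than the frequency-weighted score.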

'Builds c:\temp\english.dict from all documents loaded in Project Builder.
'Requires a reference to Microsoft Scripting Runtime for the Dictionary object.
Const digits="0123456789"
Const punc="!£$%^&*()<>,.;:'@[]#{}\|/"","
Private Sub Batch_Close(ByVal pXRootFolder As CASCADELib.CscXFolder, ByVal CloseMode As CASCADELib.CscBatchCloseMode)
   Dim X As Long, XDoc As CscXDocument, Dict As New Dictionary, W As Long, Word As String, key As String, Count As Long, DocCount As Long, c As Long
   Dim numeric As Boolean

   For X=0 To pXRootFolder.DocInfos.Count-1
      Set XDoc=pXRootFolder.DocInfos(X).XDocument
      For W=0 To XDoc.Words.Count-1
         Word=LCase(XDoc.Words(W).Text)
         'Skip any word that contains a digit
         numeric=False
         For c=1 To Len(digits)
            numeric=InStr(Word,Mid(digits,c,1))>0
            If numeric Then Exit For
         Next
         If Not numeric Then
            'Strip punctuation characters
            For c=1 To Len(punc)
               Word=Replace(Word,Mid(punc,c,1),"")
            Next
            'Count words that are 2 or more characters long
            If Len(Word)>1 Then
               If Dict.Exists(Word) Then
                  Dict(Word)=Dict(Word)+1
               Else
                  Dict.Add Word,1
               End If
            End If
         End If
      Next
   Next
   DocCount=pXRootFolder.DocInfos.Count
   Open "c:\temp\english.dict" For Output As 1
   Print #1, vbUTF8BOM;
   For Each key In Dict.Keys
      Count=Dict(key)
      If Count>DocCount/10 Then 'keep words that occur more than 0.1 times per document on average
         'Each line is word,length-frequency, e.g. contra,6-0.411
         Print #1, key & "," & Format(Len(key),"#") & "-" & Format(Count/DocCount,"0.000")
      End If
   Next
   Close #1
End Sub

Private Sub SL_English_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
   Dim Words As CscXDocFieldAlternatives, W As Long, Chars As Long
   'Each alternative of FL_English has the text "length-frequency" from the dictionary replacement value
   Set Words=pXDoc.Locators.ItemByName("FL_English").Alternatives
   With pLocator.Alternatives.Create
      .Text="English"
      For W=0 To Words.Count-1
         'Add the length of each word found multiplied by its average frequency on documents
         .Confidence=.Confidence+CInt(Split(Words(W).Text,"-")(0))*CDbl(Split(Words(W).Text,"-")(1))
      Next
      'Divide by the length of the text minus the spaces between words
      Chars=Len(pXDoc.Words.Text)-pXDoc.Words.Count
      If Chars>0 Then .Confidence=.Confidence/Chars 'avoid dividing by zero on an empty document
   End With
End Sub
