Text Extraction

A modular framework for extracting text from many different sources (websites, PDFs, images).

Text Extractors comparison

PDF

There are two types of PDFs:

"Image only" PDFs just embed (scanned) images and contain no selectable, and therefore no extractable, text. To get the text in the images, the images first have to be extracted from the PDF and then run through OCR. See section Images.
Searchable PDFs: If you open them in a PDF viewer, you can select their text or search for it. The following libraries help to extract text from this type of PDF:

Searchable PDFs

Extractor	Permissive license	Runs on Android	Advantages	Disadvantages
pdftotext	✔️	❌	Best PDF extraction result so far	User has to install Poppler Utils; does not run on Android
iText 2	✔️	✔️	Also works with PDFs with disordered layouts; best PDF extraction result of any Java library I found; works on older Android versions (at least on Android 4.1); almost the same text extraction quality as the newer (and non-free) iText 7
iText	❌	✔️	Also works with PDFs with disordered layouts; best PDF extraction result of any Java library I found; works on older Android versions (at least on Android 4.1)	Not free (AGPL / commercial license)
OpenPDF	✔️	(✔️)	Free; quite good and fast	Does not work on PDFs with disordered layouts; does not run on older Android versions (uses Java 8 features such as Optional; works on Android 6 but not on Android 4.1, others not tested)
PDFBox (not added yet)	✔️	❌
PdfBox-Android (not added yet)	✔️	✔️

iText 2 and iText 7

iText 2 is the older, permissively licensed version of iText, which later turned commercial. As the last free iText version, 2.1.7, has security flaws, I used version 2.1.7.js7 from JasperReports, which fixes these security issues. It's slower than iText 7, but in terms of text extraction quality I cannot see any difference between iText 7 and iText 2.

OpenPdf

OpenPdf took the last iText commit with a permissive license and developed it further. In my experience, however, its text extraction capability is worse than that of iText 7 and iText 2.

Do not add OpenPdfPdfTextExtractor and iText2PdfTextExtractor to the class path at the same time: both use the same package and class names but with different method and class signatures, so one of them will crash when used.
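
One way to notice this conflict early on a desktop JVM is to ask the class loader how many copies of a shared class it can see. The following is only a sketch and not part of this project: the names `ClasspathConflictSketch` and `findCopies` are made up for illustration, and it assumes that both libraries ship a class called `com.lowagie.text.pdf.PdfReader` (the package iText 2 uses). It will not work on Android, where classes are not stored as `.class` resources.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the TextExtraction project.
class ClasspathConflictSketch {

    // Returns the location of every copy of the given class that the class loader can find.
    static List<URL> findCopies(String className) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = ClasspathConflictSketch.class.getClassLoader();
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: both iText 2 and OpenPDF contain this class.
        List<URL> copies = findCopies("com.lowagie.text.pdf.PdfReader");
        if (copies.size() > 1) {
            System.err.println("Conflicting PDF libraries on the class path: " + copies);
        }
    }
}
```

If more than one location is reported, both libraries are most likely on the class path and one of the two extractors should be removed.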

(Very opinionated) Recommendation

If you can use pdftotext (Poppler), use pdftotext. It yields the best results both in terms of text extraction quality and speed.

Otherwise use the version of iText 2 with the security issues fixed. It's slower than the commercial (and really amazingly good) iText 7, but in terms of text extraction quality I cannot see any difference between iText 2 and iText 7.

I don't know why, but OpenPdf cannot extract any text at all from some PDFs.
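
To illustrate the pdftotext recommendation, here is a minimal sketch of calling the Poppler command line tool from Java. It is not how PopplerPdfTextExtractor is implemented; the class and method names are hypothetical, and it assumes that `pdftotext` is installed and on the PATH. It uses the `-enc UTF-8` option to set the output encoding and `-` as the output file so that the text is written to standard output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the TextExtraction project.
class PdftotextSketch {

    // "-enc UTF-8" selects the output encoding; the trailing "-" makes pdftotext write to stdout.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns everything it printed to stdout.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

For PDFs with multiple columns it may be worth trying pdftotext's `-layout` option, which tries to keep the original physical layout of the text.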

How to distinguish between Searchable and "Image only" PDFs?

Kurt Pfeifle gave a superb hint (https://stackoverflow.com/a/3108531): Check how many fonts a PDF uses. If it uses fonts, it contains searchable text. If it uses no fonts at all, it contains only images.

I added IPdfTypeDetector implementations for Poppler / pdffonts and ...
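
The font check can be sketched roughly as follows. This is not the IPdfTypeDetector implementation from this project; the class and method names are hypothetical. It assumes that Poppler's `pdffonts` is on the PATH and that its output consists of a header line, a line of dashes and then one line per font.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper, not part of the TextExtraction project.
class PdfTypeSketch {

    // Counts the fonts listed by pdffonts: skips the header and the dashed separator line,
    // then counts every remaining non-empty line.
    static int countFonts(String pdffontsOutput) {
        String[] lines = pdffontsOutput.split("\\R");
        int count = 0;
        for (int i = 2; i < lines.length; i++) {
            if (!lines[i].trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Heuristic from the hint above: at least one font means the PDF contains searchable text.
    static boolean isSearchable(String pdffontsOutput) {
        return countFonts(pdffontsOutput) > 0;
    }

    // Runs "pdffonts <file>" and returns its standard output.
    static String runPdffonts(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdf.toString())
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("pdffonts failed for " + pdf);
        }
        return output;
    }
}
```

Usage would be `PdfTypeSketch.isSearchable(PdfTypeSketch.runPdffonts(Path.of("file.pdf")))`. Note that this is only a heuristic: a scanned PDF that already went through OCR usually contains fonts as well, which in this case is the desired result because its text is extractable.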

Images

(All variants with Tesseract 4 have the same extraction quality, which is quite good but not the best.)

Extractor	Advantages	Disadvantages
tess4j	Uses Tesseract 4	User has to install Tesseract; extraction result depends a lot on image quality; does not run on Android
Tesseract 4 over JNI (e.g. from Bytedeco)	Uses Tesseract 4	If there's an exception in native code, the whole application crashes (JNI); user has to install Tesseract; extraction result depends a lot on image quality; does not run on Android
Tesseract4Android	Uses Tesseract 4	Very slow, took 2 minutes to recognize a single image (0.5 MB); extraction result depends a lot on image quality
Tess4Android	Uses Tesseract 4	Couldn't get it to compile
TextFairy (not added yet)		Uses Tesseract 3; quite slow; extraction result depends a lot on image quality
Microsoft Cloud Computer Vision API OCR (not implemented yet)	Best image extraction result I found so far	Requires registration (credit card required; every single user has to do this for themselves); costs $1.50 per 1000 images; data protection insanity, stores all your images and recognized text for years
Google Cloud Vision OCR (neither implemented nor tested yet)		Requires registration (credit card required; every single user has to do this for themselves); 1000 images per month are free, more have to be paid for; data protection insanity, stores all your images and recognized text for years
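
For "image only" PDFs, the two steps described in the PDF section (extract the images, then apply OCR) can be sketched with command line tools. This is only an illustration under assumptions, not the code of any extractor in this project: the class and method names are hypothetical, and Poppler's `pdfimages` and the `tesseract` executable are assumed to be installed and on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the TextExtraction project.
class ImageOnlyPdfOcrSketch {

    // "pdfimages -png <pdf> <image-root>" writes the embedded images as <image-root>-000.png, ...
    static List<String> pdfimagesCommand(Path pdf, Path imageRoot) {
        return List.of("pdfimages", "-png", pdf.toString(), imageRoot.toString());
    }

    // "tesseract <image> stdout -l <language>" prints the recognized text to stdout.
    static List<String> tesseractCommand(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }

    // Runs a command and returns its standard output; fails on a non-zero exit code.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("Command failed: " + command);
        }
        return output;
    }

    // Extracts all images into a temporary directory (not cleaned up here) and runs OCR on each.
    static String extractText(Path pdf, String language) throws IOException, InterruptedException {
        Path directory = Files.createTempDirectory("pdf-images");
        run(pdfimagesCommand(pdf, directory.resolve("image")));

        List<Path> images;
        try (Stream<Path> files = Files.list(directory)) {
            images = files.sorted().collect(Collectors.toList());
        }

        StringBuilder text = new StringBuilder();
        for (Path image : images) {
            text.append(run(tesseractCommand(image, language)));
        }
        return text.toString();
    }
}
```

A call could look like `ImageOnlyPdfOcrSketch.extractText(Path.of("scan.pdf"), "eng")`, where `eng` is the Tesseract language code of an installed language pack. As with all Tesseract variants above, the result depends a lot on image quality.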

License

If not stated otherwise, all code is licensed under the Apache License, Version 2.0.

Notice: Some libraries, like iText, have different, partially commercial licenses.

