A modular framework for extracting text from many different sources (websites, PDFs, images).
There are two types of PDF:
- "Image only" PDFs merely embed (scanned) images and therefore contain no selectable, extractable text. To get at the text, the images first have to be extracted from the PDF and then run through OCR. See section Images.
- Searchable PDFs: If you open them in a PDF viewer, you can select their text or search for it. The following libraries help to extract text from this type of PDF:
Extractor | Permissive License | Runs on Android | Advantages | Disadvantages |
---|---|---|---|---|
pdftotext | ✔️ | ❌ | | |
iText 2 | ✔️ | ✔️ | | |
iText | ❌ | ✔️ | | |
OpenPDF | ✔️ | (✔️) | | |
PDFBox (not added yet) | ✔️ | ❌ | | |
PdfBox-Android (not added yet) | ✔️ | ✔️ | | |
iText 2 is the older, permissively licensed version of iText, which has since turned commercial. As the last free iText version, 2.1.7, has security flaws, I used version 2.1.7.js7 from JasperReports, which fixes these security issues. It is slower than iText 7, but in terms of text extraction quality I cannot see any difference between iText 7 and iText 2.
OpenPDF took the last iText commit with a permissive license and developed it further. In my experience, however, its text extraction capability is worse than that of iText 7 and iText 2.
Do not add OpenPdfPdfTextExtractor and iText2PdfTextExtractor to the classpath at the same time: both libraries use the same package and class names but with different method and class signatures, so one of them will crash at runtime.
If you can use pdftotext (Poppler), use pdftotext. It yields the best results both in terms of text extraction quality and speed.
Otherwise, use the security-patched version of iText 2. It is slower than the commercial (and really excellent) iText 7, but in terms of text extraction quality I cannot see any difference between iText 2 and iText 7.
I don't know why, but OpenPDF cannot extract any text at all from some PDFs.
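To illustrate the pdftotext recommendation above, here is a minimal sketch of calling the Poppler command line tool from Java. It is not this framework's extractor API: the class `PdftotextSketch` and its methods are hypothetical names made up for illustration, and the sketch assumes that `pdftotext` is installed and on the `PATH`. The flags `-enc`, `-layout` and the output file `-` (stdout) are standard pdftotext options.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration only, not part of this framework's API.
final class PdftotextSketch {

    // Builds the command line: "-enc UTF-8" sets the output encoding,
    // "-layout" keeps the physical page layout, "-" writes the text to stdout.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", "-layout", pdf.toString(), "-");
    }

    // Runs pdftotext and returns the extracted text.
    // Assumption: the pdftotext binary is available on the PATH.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }
}
```

Calling `PdftotextSketch.extractText(Path.of("document.pdf"))` would then return the text of a searchable PDF. For an image-only PDF the result is expected to be empty or contain only whitespace.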
Kurt Pfeifle gave a superb hint (https://stackoverflow.com/a/3108531): check how many fonts a PDF uses. If it uses fonts, it contains searchable text. If it uses no fonts at all, it contains only images.
I added IPdfTypeDetector implementations for Poppler / pdffonts and ...
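The font-counting idea can be sketched as follows. This is not the actual IPdfTypeDetector implementation: `PdfFontsOutputSketch` is a hypothetical class that only parses text already produced by Poppler's `pdffonts`. It assumes the usual output layout, that is a header line, a separator line of dashes, and then one line per font.

```java
// Hypothetical helper for illustration only, not the framework's IPdfTypeDetector.
final class PdfFontsOutputSketch {

    // Counts the font rows: every non-blank line after the dashed separator line.
    // Assumption: pdffonts prints a header line, a "----" separator, then one line per font.
    static int countFonts(String pdffontsOutput) {
        boolean pastSeparator = false;
        int fonts = 0;
        for (String line : pdffontsOutput.split("\\R")) {
            if (!pastSeparator) {
                pastSeparator = line.startsWith("----");
            } else if (!line.trim().isEmpty()) {
                fonts++;
            }
        }
        return fonts;
    }

    // At least one font means the PDF contains searchable text; none means images only.
    static boolean isSearchable(String pdffontsOutput) {
        return countFonts(pdffontsOutput) > 0;
    }
}
```

The input string would typically come from running `pdffonts file.pdf` and capturing its standard output.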
(All variants with Tesseract 4 have the same extraction quality, which is quite good but not the best.)
Extractor | Advantages | Disadvantages |
---|---|---|
tess4j | | |
Tesseract 4 over JNI (e. g. from Bytedeco) | | |
Tesseract4Android | | |
Tess4Android | | |
TextFairy (not added yet) | | |
Microsoft Cloud Computer Vision API OCR (not implemented yet) | | |
Google Cloud Vision OCR (neither implemented nor tested yet) | | |
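For image-only PDFs, the two steps described in the PDF section (extract the images, then apply OCR) can be sketched with command line tools. This swaps in the `pdfimages` (Poppler) and `tesseract` executables instead of the library bindings listed in the table, so it does not show how the extractors in this framework work internally. `ImagePdfOcrSketch` and its method names are hypothetical.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration only; uses the CLI tools, not the listed libraries.
final class ImagePdfOcrSketch {

    // Step 1: "pdfimages -png <pdf> <prefix>" writes each embedded image as <prefix>-NNN.png.
    static List<String> extractImagesCommand(Path pdf, Path outputPrefix) {
        return List.of("pdfimages", "-png", pdf.toString(), outputPrefix.toString());
    }

    // Step 2: "tesseract <image> stdout -l <language>" prints the recognized text to stdout.
    static List<String> ocrCommand(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }
}
```

Each command list can be passed to `new ProcessBuilder(command)`. The OCR step is run once per extracted image, and the outputs are concatenated in page order.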
Unless stated otherwise, all code is licensed under the Apache License, Version 2.0.
Notice: Some libraries, like iText, have different, partially commercial licenses.