Add Option to Skip OCR #4

rawora-rg · 2024-12-06T10:34:31Z

Hi,

First of all, thank you for this amazing tool—it’s been incredibly helpful!

I have a feature request regarding the OCR functionality. It would be great to have a .env setting that allows users to skip OCR processing for PDFs that have already undergone OCR. For example, if I feed the software a folder containing PDFs with OCR already applied, I’d like the tool to be faster, and only rename and tag those files, without checking or reprocessing them with OCR a second time.
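A minimal sketch of what such a switch could look like, assuming the `.env` file has already been loaded into the process environment (for example with python-dotenv). The name `SKIP_OCR` and both helper functions are hypothetical and are not existing options of the tool.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as an on/off switch."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def should_run_ocr() -> bool:
    # SKIP_OCR is a hypothetical setting name used only for illustration.
    return not env_flag("SKIP_OCR")
```

With `SKIP_OCR=true` in the environment, `should_run_ocr()` returns `False`, and the caller could go straight to renaming and tagging.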

I hope this would save time and resources, especially for users handling large volumes of pre-OCRed documents. Is this something that could be implemented?

I noticed that your source code seems to check for OCR, but during execution, this check appears to be skipped.

Perhaps adding an option to completely disable OCR processing could help address this and provide more flexibility for users handling pre-OCRed documents.

Thanks for considering this request, and let me know if I can provide more details.

ptmrio · 2024-12-06T11:17:26Z

Hi, thank you.

The problem is that many documents contain only a small amount of embedded text, so "page already has text" is often a false positive. If we skip OCR, most of the recognizable text is sometimes never found. E.g. company names may appear only in logos or other images (common with invoices received via email).

In fact, I tried to do that but stumbled upon many edge cases. Unless your PDFs are very uniform, this option does more harm than good.
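To illustrate why "page already has text" is a weak signal, here is a rough sketch of a stricter check that only skips OCR when every page carries a substantial amount of embedded text. It is not the project's code: the function name and the 200-character threshold are arbitrary assumptions, and the per-page strings are assumed to come from whatever PDF library extracts the text layer.

```python
from typing import Iterable


def needs_ocr(page_texts: Iterable[str], min_chars_per_page: int = 200) -> bool:
    """Return True unless every page has at least min_chars_per_page visible characters.

    page_texts holds the embedded text of each page, as extracted by a PDF library.
    The threshold is an illustrative guess, not a tuned value.
    """
    pages = list(page_texts)
    if not pages:
        # No pages or no text layer at all: OCR is the only source of text.
        return True
    for text in pages:
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars_per_page:
            return True
    return False
```

Even a check like this cannot tell that a company name exists only inside a logo image, which is the kind of edge case described above.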

Think about it and let me know if it's still relevant.

PS: If you like this project, you might find PhraseVault very useful. Consider it, PhraseVault helps me co-finance open source projects like this :)

rawora-rg · 2024-12-06T12:37:23Z

Hi,

thank you for the explanation. I understand why skipping it might lead to inconsistencies. Allowing OCR to run again seems like the better approach for ensuring consistent results from the same engine. :)

As for PhraseVault, thank you for recommending it! At first glance, it looks like a great tool, and I’ll definitely take a closer look. For some specific cases, I’ve been using AHK for quite a bit now, but I really like the clean GUI and the simplicity of PhraseVault—it looks like a great product.

Thanks again for your response and for taking the time to create such fantastic tools!

ptmrio · 2024-12-07T08:58:42Z

Thank you.

If you still need this feature, just let me know. I might implement it for people whose documents are very uniform or have definitely been through OCR before. It's not that hard.

Just wanted to let you know that I'm trying to improve the processing speed in general.

As for AHK: Yes 💪 great software, I used it for years. However, I got tired of changing the script file for new entries, and AHK was WAY TOO complicated for end users in offices (my clients). That's the reason for this software: as easy and simple as possible.

Appreciate your feedback, thank you!

BenProe · 2025-01-08T21:15:49Z

Hi @ptmrio,

thanks for this amazing tool. Adding to this discussion:
I usually scan my documents via "Mobile Doc Scanner" (Android), which has good OCR built in. Unfortunately, Autorename removes the searchable text layer from my PDF documents. Is there a way to either:

Skip the OCR as mentioned above
Let Autorename do the OCR (OCRmyPDF / PyMuPDF?) and add the text layer
Keep the original text layer without rasterizing

Again thanks for your great work!

ptmrio · 2025-01-13T17:58:53Z

Hi @BenProe

Does it indeed? It SHOULD OCR on a temporary file.

So, just to be clear: you currently scan them with Mobile Doc Scanner, then run Autorename PDF, and it changes the original input PDF?

It would help me a lot if you'd confirm this by checking the exact filesize before and after scanning. You could also use MD5. I will look into it, but it may take some time. Your help could speed it up :)
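One way to run the size and MD5 comparison requested above is a small script like the following sketch. The helper name `fingerprint` is made up for illustration and is not part of Autorename PDF.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return (size_in_bytes, md5_hex_digest) for the file at path."""
    p = Path(path)
    digest = hashlib.md5()
    with p.open("rb") as fh:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return p.stat().st_size, digest.hexdigest()
```

Call `fingerprint` on the PDF before the run and again on the resulting file afterwards (the path will differ if the file was renamed). If the two tuples are equal, the file content was not modified; if they differ, the file was rewritten.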

ptmrio self-assigned this Dec 6, 2024

ptmrio added the enhancement New feature or request label Dec 6, 2024
