Add Option to Skip OCR #4

Open
rawora-rg opened this issue Dec 6, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@rawora-rg

Hi,

First of all, thank you for this amazing tool—it’s been incredibly helpful!

I have a feature request regarding the OCR functionality. It would be great to have a .env setting that allows users to skip OCR processing for PDFs that have already undergone OCR. For example, if I feed the software a folder containing PDFs with OCR already applied, I’d like the tool to be faster, and only rename and tag those files, without checking or reprocessing them with OCR a second time.

I hope this would save time and resources, especially for users handling large volumes of pre-OCRed documents. Is this something that could be implemented?

I noticed that your source code seems to check for OCR, but during execution, this check appears to be skipped.
[Image attachment: "grafik"]

Perhaps adding an option to completely disable OCR processing could help address this and provide more flexibility for users handling pre-OCRed documents.
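A minimal sketch of what such a switch could look like, assuming the values from `.env` have already been loaded into the process environment. The setting name `SKIP_OCR` and the helpers `read_text_layer` and `run_ocr` are hypothetical and are not part of the project today:

```python
import os

# Hypothetical setting name; the project does not define SKIP_OCR today.
TRUTHY = {"1", "true", "yes", "on"}


def skip_ocr_enabled(env=None):
    """Return True when the (assumed) SKIP_OCR setting is switched on."""
    env = os.environ if env is None else env
    return env.get("SKIP_OCR", "false").strip().lower() in TRUTHY


def extract_text(pdf_path, read_text_layer, run_ocr, env=None):
    """Use the existing text layer when SKIP_OCR is on, otherwise run OCR.

    read_text_layer and run_ocr are placeholder callables supplied by the
    caller; each takes a path and returns the extracted text.
    """
    if skip_ocr_enabled(env):
        return read_text_layer(pdf_path)
    return run_ocr(pdf_path)
```

With the flag unset, behaviour stays as it is now, so users who do not opt in would see no change.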

Thanks for considering this request, and let me know if I can provide more details.

@ptmrio ptmrio self-assigned this Dec 6, 2024
@ptmrio ptmrio added the enhancement New feature or request label Dec 6, 2024
@ptmrio
Owner

ptmrio commented Dec 6, 2024

Hi, thank you.

The problem is that many documents contain only a small amount of embedded text, so "page already has text" is often a false positive. If we skip OCR, most of the recognizable text is sometimes never found. For example, company names may appear only in logos or other images (e.g. invoices received via email).

In fact, I tried to do that but stumbled upon many edge cases. Unless your PDFs are very uniform, this option does more harm than good.
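To illustrate the false-positive problem, here is a rough sketch of a stricter check than "does the page have any text at all". It assumes the embedded text of each page has already been extracted into a list of strings; the function name and the 200-character threshold are arbitrary assumptions, not something the project uses:

```python
def needs_ocr(page_texts, min_chars_per_page=200):
    """Return True if any page has too little embedded text to trust.

    page_texts is a list with one string of embedded text per page.
    The threshold is an arbitrary illustration value.
    """
    if not page_texts:
        return True  # no pages or no text layer at all: OCR is needed
    for text in page_texts:
        visible = "".join(text.split())  # ignore whitespace when counting
        if len(visible) < min_chars_per_page:
            return True
    return False
```

Even with a threshold like this, a page full of body text can still have its company name only inside a logo, which is the kind of edge case described above.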

Think about it and let me know if it is still relevant.

PS: If you like this project, you might find PhraseVault very useful. Consider it, PhraseVault helps me co-finance open source projects like this :)

@rawora-rg
Author

Hi,

thank you for the explanation. I understand why skipping it might lead to inconsistencies. Allowing OCR to run again seems like the better approach for ensuring consistent results from the same engine. :)

As for PhraseVault, thank you for recommending it! At first glance, it looks like a great tool, and I’ll definitely take a closer look. For some specific cases, I’ve been using AHK for quite a bit now, but I really like the clean GUI and the simplicity of PhraseVault—it looks like a great product.

Thanks again for your response and for taking time and creating such fantastic tools!

@ptmrio
Owner

ptmrio commented Dec 7, 2024

Thank you.

If you still need this feature, just let me know. I might implement it for someone who has very uniform documents or whose documents have definitely been through OCR already. It's not that hard.

Just wanted to let you know that I'm trying to improve the processing speed in general.

As for AHK: Yes 💪 great software, I used it for years. However, I got tired of changing the script file for new entries, and AHK was WAY TOO complicated for end users in offices (my clients). That's the reason for this software: as easy and simple as possible.

Appreciate your feedback, thank you!

@BenProe

BenProe commented Jan 8, 2025

Hi @ptmrio,

thanks for this amazing tool. Adding to this discussion:
I usually scan my documents with "Mobile Doc Scanner" (Android), which has good OCR built in. Unfortunately, Autorename PDF removes the searchable text layer from my PDF documents. Is there a way to either:

  1. Skip the OCR as mentioned above
  2. Let Autorename do the OCR (OCRmyPDF / PyMuPDF?) and add the text layer
  3. Keep the original text layer without rasterizing

Again thanks for your great work!

@ptmrio
Owner

ptmrio commented Jan 13, 2025

Hi @BenProe

Does it indeed? It SHOULD OCR on a temporary file.
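For reference, the intended pattern is roughly the following. This is only a sketch of "work on a temporary copy", not the project's actual code; `process_on_copy` and the `process` callable are hypothetical names:

```python
import shutil
import tempfile
from pathlib import Path


def process_on_copy(pdf_path, process):
    """Run `process` on a temporary copy so the input file is never modified.

    `process` is a placeholder callable that takes the path of the copy
    (for example, to rasterize and OCR it) and returns its result.
    """
    source = Path(pdf_path)
    with tempfile.TemporaryDirectory() as tmp_dir:
        working_copy = Path(tmp_dir) / source.name
        shutil.copy2(source, working_copy)  # copy contents and metadata
        return process(working_copy)
    # The temporary directory, including the copy, is deleted on exit.
```

If the tool follows this pattern, any OCR or rasterization would only ever touch the copy, and the original text layer should survive.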

So, just to be clear: you currently scan them with Mobile Doc Scanner, run Autorename PDF, and it changes the original input PDF?

It would help me a lot if you could confirm this by checking the exact file size before and after scanning. You could also use MD5. I will look into it, but it may take some time. Your help could speed it up :)
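One possible way to do that check, as a small standard-library sketch (the helper name `fingerprint` is made up for this example):

```python
import hashlib
from pathlib import Path


def fingerprint(pdf_path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex) for a file, reading it in chunks."""
    digest = hashlib.md5()
    path = Path(pdf_path)
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return path.stat().st_size, digest.hexdigest()
```

Call it on the same PDF once before and once after the run and compare the two tuples. If the original file was left untouched, both the size and the hash will be identical.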

3 participants