Add Option to Skip OCR #4
Comments
Hi, thank you. The problem is that many documents contain only a small amount of text, so "page already has text" is often a false positive. If we skip OCR in those cases, most of the recognizable text is sometimes never found. For example, company names often appear only in logos or other images (e.g. invoices received via email). In fact, I tried to implement this but stumbled upon many edge cases. Unless your PDFs are very uniform, this option does more harm than good. Think about it and let me know if it is still relevant. PS: If you like this project, you might find PhraseVault very useful. Consider it; PhraseVault helps me co-finance open source projects like this :)
Hi, thank you for the explanation. I understand why skipping it might lead to inconsistencies. Allowing OCR to run again seems like the better approach for ensuring consistent results from the same engine. :) As for PhraseVault, thank you for recommending it! At first glance, it looks like a great tool, and I’ll definitely take a closer look. For some specific cases, I’ve been using AHK for quite a while now, but I really like the clean GUI and the simplicity of PhraseVault; it looks like a great product. Thanks again for your response and for taking the time to create such fantastic tools!
Thank you. If you still need this feature, just let me know. I might implement it for people who have very uniform documents or documents that have definitely been through OCR before. It's not that hard. I also wanted to let you know that I'm trying to improve the processing speed in general. As for AHK: Yes 💪 great software, I used it for years. However, I got tired of editing the script file for every new entry, and AHK was WAY TOO complicated for end users in offices (my clients). That's the reason for this software: as easy and simple as possible. Appreciate your feedback, thank you!
Hi @ptmrio, thanks for this amazing tool. Adding to this discussion:
Again, thanks for your great work!
Hi @BenProe, does it indeed? It SHOULD run OCR on a temporary file. So, just to be clear: you currently scan them with Mobile Doc Scanner, run Autorename PDF, and it changes the original input PDF? It would help me a lot if you could confirm this by checking the exact file size before and after scanning. You could also use MD5. I will look into it, but it may take some time. Your help could speed it up :)
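The size and MD5 comparison suggested above could be done with a few lines of Python. This is only a minimal sketch: the helper name `file_fingerprint` and the example file name are assumptions for illustration and are not part of Autorename PDF.

```python
import hashlib
import os


def file_fingerprint(path):
    """Return (size_in_bytes, md5_hex) for the file at `path`.

    Hypothetical helper for illustration; not part of Autorename PDF.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large scans do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    return os.path.getsize(path), md5.hexdigest()


# Example usage (the file name is an assumption):
# before = file_fingerprint("scan.pdf")
# ... run the tool on the folder ...
# after = file_fingerprint("scan.pdf")
# print("unchanged" if before == after else "file was modified")
```

Record the fingerprint before the run and compare it afterwards. If the file is renamed during processing, take the second fingerprint from the file under its new name; identical size and digest would indicate the content was left untouched.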
Hi,
First of all, thank you for this amazing tool—it’s been incredibly helpful!
I have a feature request regarding the OCR functionality. It would be great to have a .env setting that allows users to skip OCR processing for PDFs that have already undergone OCR. For example, if I feed the software a folder containing PDFs with OCR already applied, I’d like the tool to be faster, and only rename and tag those files, without checking or reprocessing them with OCR a second time.
I hope this would save time and resources, especially for users handling large volumes of pre-OCRed documents. Is this something that could be implemented?
I noticed that your source code seems to check for OCR, but during execution, this check appears to be skipped.
Perhaps adding an option to completely disable OCR processing could help address this and provide more flexibility for users handling pre-OCRed documents.
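To make the request concrete, here is a rough sketch of how such a switch might behave. Everything in it is an assumption: the variable name `SKIP_OCR`, the `min_chars` threshold, and the function `should_run_ocr` are hypothetical and do not come from the project's source. It also assumes the `.env` values have already been loaded into the process environment, and that the text of the existing text layer has been extracted elsewhere.

```python
import os


def should_run_ocr(existing_text, env=None, min_chars=200):
    """Decide whether OCR should run for one document.

    SKIP_OCR and min_chars are hypothetical names used for illustration.
    `existing_text` is whatever text the PDF's text layer already contains.
    """
    env = os.environ if env is None else env
    flag = env.get("SKIP_OCR", "false").strip().lower()
    skip_requested = flag in {"1", "true", "yes"}
    if not skip_requested:
        # Default: always run OCR, regardless of any existing text layer.
        return True
    # Skip only when the existing text layer is substantial; a few stray
    # characters do not prove the document has been through OCR.
    has_enough_text = len(existing_text.strip()) >= min_chars
    return not has_enough_text
```

With the flag unset, nothing changes and OCR always runs. With `SKIP_OCR=true`, a document whose text layer has at least `min_chars` characters would be renamed and tagged without a second OCR pass, while a document with only a short snippet of text would still be processed.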
Thanks for considering this request, and let me know if I can provide more details.