-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ricecooker.utils.pdf.PDFParser#get_toc reimplemented to use pdftk #324
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One update to start with - my suggestion in the issue was just to bite the bullet and use pdftk for all of our PDF operations, so we could drop reliance on the unmaintained Python library.
ricecooker/utils/pdf.py
Outdated
def get_toc(self, subchapters=False): | ||
temp_file = NamedTemporaryFile() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to use mkstemp
here - on windows trying to access a NamedTemporaryFile
by referencing its name will give you a permissions error.
Also then need to explicitly close the file handle returned by mkstemp to ensure proper cleanup.
See here for an example:
ricecooker/ricecooker/classes/files.py
Line 776 in 8cbdf59
fdin, temp_in_file_name = tempfile.mkstemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One update to start with - my suggestion in the issue was just to bite the bullet and use pdftk for all of our PDF operations, so we could drop reliance on the unmaintained Python library.
There is one issue here where we'll break the PDFParser API which I mentioned in my comment on the linked issue: #312 - the PDFParser.pdf
field is (in code) a public property. It isn't mentioned in the docs though. I'm okay with moving forward this way but feel it just needs to be noted that some peoples' chefs might break if they've managed to rely on that property at all - which they might because it is a PyPDF2 instance.
Will continue on with fully wrapping pdftk - but as we are effectively not doing anything to improve the software, I also suggested we consider just catching the error and informing the user of the limitations of PDFParser - which is that we cannot help them if their PDF doesn't have the proper metadata. This would be equally robust a solution if it can be worked out and ostensibly simpler.
We have to do something like this anyway if we use pdftk.
Happy to go either way just wanted to hear your thoughts on the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also - its worth noting that the change I made is basically removing the meat of PyPDF2.
What remains is some clean-up and documentation updates to finalize the removal of PyPDF2. Also - I need to catch and raise a meaningful error message when we get a PDF without an outline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - I think catching errors and giving meaningful error messages is an improvement :)
I also recently came across this: https://github.com/jorisschellekens/borb - not sure if it would help or not. |
- Used subprocess to call pdftk - The output is written to and then read from a tempfile - this is done to avoid issues where CLI tools may at times output unexpected lines that could cause issues for parsing. - Used NamedTemporaryFile and only accessed the file with the `NamedTemporaryFile.name` property once - should be compatible across all OS implementations. Also the file is closed at the end of the process. - This is only done such that it passes the existing test suite - so it assumes that the suite is robust enough at this point to cover some edge cases
Trying I can give borb a look as an alternative for sure - as long as it breaks up the chapters similarly to PyPDF2 and pdftk it ought to be a relatively quick refactor |
re: load a file like: with open('path/to/file', 'rb') as f:
doc = borb.pdf.pdf.PDF.loads(f)
outlines = doc["XRef"]["Trailer"]["Root"]["Outlines"] # is where the outlines live
from borb.io.read.types import Name, String, Dictionary
# when you look into outlines, you'll see a dictionary of objects of thoes Name, String, Dictionary types
# which means you have to look at the source to know how to get the values from them like String you call
# __str__() on it... it's a mess. This should be enough of a head start upon returning. |
I played a bit more with |
cpdf is free for non-commercial use and does everything we want. It was a pretty quick shift - but now tox is mad about subprocesses. I wanted to capture stderr during dev so I used subprocess.run() and didn't write to a tmp file like before. Not sure if this is the cause, but the error is coming up in subprocess.Popen - so despite finding a tool that works for our needs very well, back at square one with failing tests on the binaries. Will come back to this again at some point. |
Stale - closing for now |
Fixes #312
WIP - Need to make a test that fails the old
get_toc
implementation (temporarily renamed toget_old_toc
) but passes the new one to be sure that I'm addressing the issue noted about PDFs created withwkhtmltopdf
.Currently:
get_toc
passes existing test suites.get_toc
usesNamedTemporaryFile
and hopefully in a cross-OS compatible way (I only accessed it using the.name
property once and close the file at the end anyway).get_toc
for their data.