Assertion error with pdfsizeopt #110

Closed
galaxy001 opened this issue Mar 18, 2019 · 4 comments

galaxy001 commented Mar 18, 2019

Using pdfsizeopt_libexec_darwin-v1.tar.gz.

 ./pdfsizeopt.single nnGm.pdf nnGmox.pdf
info: This is pdfsizeopt ZIP rUNKNOWN size=69649.
info: prepending to PATH: /Users/galaxy/git/etc/pdfsizeopt/pdfsizeopt_libexec
info: loading PDF from: nnGm.pdf
info: loaded PDF of 6470058 bytes
info: using Ghostscript TMPDIR=/var/folders/7c/5nl6z4jx4zq08fm0qqmbl1200000gn/T TEMP=/var/folders/7c/5nl6z4jx4zq08fm0qqmbl1200000gn/T /Users/galaxy/git/etc/pdfsizeopt/pdfsizeopt_libexec/pdfsizeopt_gs/gs: GPL Ghostscript 9.05 (2012-02-08)
info: decompressing 1356 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 5/Predictor 12>>
info: found 3088 obj offsets and 11 obj streams in xref stream
Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "./pdfsizeopt.single/__main__.py", line 1, in <module>
  File "./pdfsizeopt.single/m.py", line 6, in <module>
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 5610, in main
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2587, in Load
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2832, in ParseUsingXref
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2806, in ParseUsingXrefStream
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 629, in __init__
pdfsizeopt.main.PdfTokenParseError: In obj data between ofs 0 and 523: Found empty name token.

nnGm.pdf

pts (Owner) commented Mar 18, 2019

Thank you for reporting this problem! Object 523 in the input file nnGm.pdf looks like this (after some transformations):

523 0 obj
<<
/codeMantra,#20LLC(http://www.codemantra.com)
/Universal#20PDF(The process that creates this PDF constitutes a trade secret of codeMantra, LLC and is protected by the copyright laws of the United States)
/Title(Population Genetics of Bacteria : A Tribute to Thomas S. Whittam)
/Producer(Acrobat Distiller 7.0 \(Windows\))
/ModDate(D:20190318160559+08'00')
/EBX_PUBLISHER/ASM#20Press
/Creator(PScript5.dll Version 5.2)
/CreationDate(D:20110813074357+05'30')
/Author(Walk, Seth T.\(Editor\))
//www.codemantra.com  
>>
endobj

The line //www.codemantra.com near the bottom has a syntax error: // is not valid there, so nnGm.pdf is broken.
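The `//` ties in with the error message above: the first `/` starts a name that has no characters before the next `/` delimiter, which is what "Found empty name token" refers to. A minimal Python 3 sketch of how such tokens could be located in the text of a non-stream object follows. It is not pdfsizeopt's parser; the function name `find_empty_names` and the `SAMPLE` bytes are hypothetical, and the scanner only skips literal strings and comments, so that the `//` inside `(http://www.codemantra.com)` is not reported.

```python
from typing import List

# PDF whitespace and delimiter bytes.
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def find_empty_names(data: bytes) -> List[int]:
    """Return byte offsets of '/' characters that start an empty name."""
    offsets = []
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c == b"(":
            # Skip a literal string, honoring backslash escapes and nesting.
            depth = 1
            i += 1
            while i < n and depth:
                ch = data[i:i + 1]
                if ch == b"\\":
                    i += 1  # also skip the escaped byte
                elif ch == b"(":
                    depth += 1
                elif ch == b")":
                    depth -= 1
                i += 1
        elif c == b"%":
            # Skip a comment up to the end of the line.
            while i < n and data[i:i + 1] not in (b"\n", b"\r"):
                i += 1
        elif c == b"/":
            nxt = data[i + 1:i + 2]
            # A name is empty if '/' is followed by nothing, whitespace,
            # or another delimiter.
            if nxt == b"" or nxt in WHITESPACE or nxt in DELIMITERS:
                offsets.append(i)
            i += 1
        else:
            i += 1
    return offsets


# Hypothetical excerpt modeled on object 523 above.
SAMPLE = (
    b"<<\n"
    b"/codeMantra,#20LLC(http://www.codemantra.com)\n"
    b"/Author(Walk, Seth T.\\(Editor\\))\n"
    b"//www.codemantra.com\n"
    b">>"
)

if __name__ == "__main__":
    for ofs in find_empty_names(SAMPLE):
        print("empty name token at offset", ofs)
```

On `SAMPLE`, only the `//` on the last dictionary line is reported; the URL inside the parenthesized string is skipped.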

It would be awesome if pdfsizeopt were able to repair broken PDF files such as nnGm.pdf. However, adding and maintaining such repair code is not feasible until it gets funding.

As of now, to get nnGm.pdf processed by pdfsizeopt successfully, you need to regenerate nnGm.pdf with non-broken software first. Or you may want to preprocess it with pdftk or qpdf (and feed the output of those tools to pdfsizeopt), which may be more lenient on these kinds of syntax errors.
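The preprocessing route could be scripted roughly as below. This is only a sketch under assumptions: it uses the basic `qpdf infile outfile` invocation, assumes both tools are on `PATH` under the names `qpdf` and `pdfsizeopt` (the report above runs `./pdfsizeopt.single` instead), and the helper names `build_steps` and `run_steps` are made up for illustration. qpdf exits with status 3 when it finishes with warnings, which is likely for a damaged input, so that status is accepted for the first step.

```python
import subprocess
from typing import List, Set, Tuple

Step = Tuple[List[str], Set[int]]


def build_steps(src: str, repaired: str, dst: str) -> List[Step]:
    """Return (argv, acceptable exit codes) for each stage."""
    return [
        # qpdf rewrites the file; exit status 3 means "succeeded with warnings".
        (["qpdf", src, repaired], {0, 3}),
        # Assumed executable name; adjust to how pdfsizeopt is installed.
        (["pdfsizeopt", repaired, dst], {0}),
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each stage in order and stop at the first unexpected exit code."""
    for argv, ok_codes in steps:
        code = subprocess.run(argv).returncode
        if code not in ok_codes:
            raise RuntimeError("%s exited with status %d" % (argv[0], code))


if __name__ == "__main__":
    run_steps(build_steps("nnGm.pdf", "nnGm.qpdf.pdf", "nnGmox.pdf"))
```

Whether qpdf actually drops or repairs the offending token depends on the file, so the intermediate output is worth inspecting before trusting the final result.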

galaxy001 (Author) commented Mar 19, 2019

I'm not sure whether the raw XML is wrong. It looks properly structured.

The element name pdfx:Universalↂ0020PDF contains ↂ, which might not be allowed in an XML element name?

2863 0 obj
<</Type/Metadata/Subtype/XML/Length 4228>>
stream
<?xpacket begin="<U+FEFF>" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 84.159810, 2016/09/10-02:41:30        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stRef="http://ns.adobe.com/xap/1.0/sType/ResourceRef#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <pdf:Producer>Acrobat Distiller 7.0 (Windows)</pdf:Producer>
         <pdfx:codeMantraↂ002Cↂ0020LLC>http://www.codemantra.com</pdfx:codeMantraↂ002Cↂ0020LLC>
         <pdfx:Universalↂ0020PDF>The process that creates this PDF constitutes a trade secret of codeMantra, LLC and is protected by the copyright laws of the United States</pdfx:Universalↂ0020PDF>
         <xmp:CreateDate>2011-08-13T07:43:57+05:30</xmp:CreateDate>
         <xmp:ModifyDate>2019-03-18T16:05:59+08:00</xmp:ModifyDate>
         <xmp:MetadataDate>2019-03-18T16:05:59+08:00</xmp:MetadataDate>
         <xmp:CreatorTool>PScript5.dll Version 5.2</xmp:CreatorTool>
         <xmpMM:DocumentID>uuid:DF57C9D151C5E0119B6BD1C4AAD1A2F9</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:66b746de-c760-f544-a64a-127d172cc809</xmpMM:InstanceID>
         <xmpMM:DerivedFrom rdf:parseType="Resource">
            <stRef:documentName>uuid:131bfa5e-206c-4a25-aa69-1a9c002a577a</stRef:documentName>
            <stRef:documentID>uuid:ff0ad5d3-c572-4519-8102-3197dccd28d4</stRef:documentID>
         </xmpMM:DerivedFrom>
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Population Genetics of Bacteria : A Tribute to Thomas S. Whittam</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>Walk, Seth T.(Editor)</rdf:li>
            </rdf:Seq>
         </dc:creator>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

[DOZENS OF SPACE CHARS HERE]

<?xpacket end="w"?>
endstream
endobj
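The `pdfx:` element names in this dump line up with the custom keys of object 523 once both escapes are undone. A small sketch of that comparison follows. The `#xx` form is the PDF name escape (two hex digits per byte); reading `ↂ` followed by four hex digits as an escaped code point is an assumption inferred from the dump itself, and both function names are hypothetical.

```python
import re

XMP_ESCAPE = "\u2182"  # the 'ↂ' character seen in the element names


def decode_pdf_name(name: str) -> str:
    """Undo '#xx' escapes in a PDF name (given without the leading '/')."""
    return re.sub(r"#([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), name)


def decode_xmp_name(name: str) -> str:
    """Undo 'ↂxxxx' escapes; assumes four hex digits give a code point."""
    return re.sub(XMP_ESCAPE + r"([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), name)


if __name__ == "__main__":
    pairs = [
        ("codeMantra,#20LLC",
         "codeMantra" + XMP_ESCAPE + "002C" + XMP_ESCAPE + "0020LLC"),
        ("Universal#20PDF", "Universal" + XMP_ESCAPE + "0020PDF"),
    ]
    for pdf_key, xmp_key in pairs:
        print(decode_pdf_name(pdf_key) == decode_xmp_name(xmp_key))
```

Under that reading, both spellings decode to the same key text (for example "Universal PDF"), which suggests the XML names are a deliberate encoding rather than corruption.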

pts (Owner) commented Mar 19, 2019

pdfsizeopt doesn't have a problem with object 2863 (in fact, pdfsizeopt keeps such XML objects intact); it is complaining about the syntax error in object 523.

Unfortunately I'm not able to advise you how to fix the input PDF beyond the advice I've already given (i.e. try pdftk or qpdf). If you manage to fix it, please update this issue!

galaxy001 (Author) commented

I managed to fix it by converting the file with qpdf --qdf and removing the "www.codemantra.com" items.

After that, it works. I even found that going through QDF leads to a smaller file.

   7269273 Mar 20 11:36 s.pdf
   6919777 Mar 20 11:39 so.pdf
  17234287 Mar 20 11:37 s.qdf
   6807064 Mar 20 11:39 so.qdf.pdf
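To put numbers on the listing, here is a short calculation. It assumes that so.pdf and so.qdf.pdf are the pdfsizeopt outputs produced from s.pdf and s.qdf respectively, which the listing suggests but does not state; the function name `saving` is made up for illustration.

```python
# File sizes in bytes, copied from the listing above.
SIZES = {
    "s.pdf": 7269273,
    "so.pdf": 6919777,
    "s.qdf": 17234287,
    "so.qdf.pdf": 6807064,
}


def saving(before: int, after: int):
    """Return (bytes saved, percent saved) going from before to after."""
    diff = before - after
    return diff, 100.0 * diff / before


if __name__ == "__main__":
    # Extra gain from routing through QDF, relative to the direct output.
    diff, pct = saving(SIZES["so.pdf"], SIZES["so.qdf.pdf"])
    print("QDF route vs direct: %d bytes (%.2f%%)" % (diff, pct))
    # Total gain relative to the unoptimized file.
    diff, pct = saving(SIZES["s.pdf"], SIZES["so.qdf.pdf"])
    print("QDF route vs s.pdf: %d bytes (%.2f%%)" % (diff, pct))
```

Under that assumption, the QDF route saves a further 112,713 bytes, about 1.6% of so.pdf.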

I am facing #111 now.
