Assertion error with pdfsizeopt #110

galaxy001 · 2019-03-18T09:33:14Z

Using pdfsizeopt_libexec_darwin-v1.tar.gz.

 ./pdfsizeopt.single nnGm.pdf nnGmox.pdf
info: This is pdfsizeopt ZIP rUNKNOWN size=69649.
info: prepending to PATH: /Users/galaxy/git/etc/pdfsizeopt/pdfsizeopt_libexec
info: loading PDF from: nnGm.pdf
info: loaded PDF of 6470058 bytes
info: using Ghostscript TMPDIR=/var/folders/7c/5nl6z4jx4zq08fm0qqmbl1200000gn/T TEMP=/var/folders/7c/5nl6z4jx4zq08fm0qqmbl1200000gn/T /Users/galaxy/git/etc/pdfsizeopt/pdfsizeopt_libexec/pdfsizeopt_gs/gs: GPL Ghostscript 9.05 (2012-02-08)
info: decompressing 1356 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 5/Predictor 12>>
info: found 3088 obj offsets and 11 obj streams in xref stream
Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "./pdfsizeopt.single/__main__.py", line 1, in <module>
  File "./pdfsizeopt.single/m.py", line 6, in <module>
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 5610, in main
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2587, in Load
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2832, in ParseUsingXref
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2806, in ParseUsingXrefStream
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 629, in __init__
pdfsizeopt.main.PdfTokenParseError: In obj data between ofs 0 and 523: Found empty name token.

nnGm.pdf


pts · 2019-03-18T12:21:15Z

Thank you for reporting this problem! Object 523 in the input file nnGm.pdf looks like this (after some transformations):

523 0 obj
<<
/codeMantra,#20LLC(http://www.codemantra.com)
/Universal#20PDF(The process that creates this PDF constitutes a trade secret of codeMantra, LLC and is protected by the copyright laws of the United States)
/Title(Population Genetics of Bacteria : A Tribute to Thomas S. Whittam)
/Producer(Acrobat Distiller 7.0 \(Windows\))
/ModDate(D:20190318160559+08'00')
/EBX_PUBLISHER/ASM#20Press
/Creator(PScript5.dll Version 5.2)
/CreationDate(D:20110813074357+05'30')
/Author(Walk, Seth T.\(Editor\))
//www.codemantra.com  
>>
endobj

The line //www.codemantra.com near the bottom has a syntax error: // is not valid there, so nnGm.pdf is broken.
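To make the complaint concrete, here is a minimal sketch of a scanner that reports name tokens with nothing after the slash, which is what the "Found empty name token" message points at. It is not pdfsizeopt's actual parser: `find_empty_names` is a hypothetical helper, and it only understands parenthesized strings (so the `//` inside `(http://www.codemantra.com)` is ignored), not comments or stream data.

```python
# Bytes that end a PDF name token: whitespace and delimiter characters.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
PDF_DELIMITERS = b"()<>[]{}/%"


def find_empty_names(data):
    """Return byte offsets of '/' tokens followed by no name characters.

    Hypothetical helper for illustration; assumes `data` is the body of a
    single non-stream object and contains no '%' comments.
    """
    offsets = []
    i, n = 0, len(data)
    depth = 0  # nesting depth inside a (...) literal string
    while i < n:
        c = data[i:i + 1]
        if depth:
            if c == b"\\":
                i += 2  # skip the backslash and the byte it escapes
                continue
            if c == b"(":
                depth += 1
            elif c == b")":
                depth -= 1
        elif c == b"(":
            depth = 1
        elif c == b"/":
            nxt = data[i + 1:i + 2]
            # An empty slice means end of data right after the slash.
            if nxt == b"" or nxt in PDF_WHITESPACE or nxt in PDF_DELIMITERS:
                offsets.append(i)
        i += 1
    return offsets
```

Run on the dictionary above, this should report exactly one offset: the first slash of `//www.codemantra.com`.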

It would be awesome if pdfsizeopt was able to repair broken PDF files such as nnGm.pdf. However, adding and maintaining such repair code is not feasible until it gets funding.

As of now, to get nnGm.pdf processed by pdfsizeopt successfully, you need to regenerate nnGm.pdf with non-broken software first. Or you may want to preprocess it with pdftk or qpdf (and feed the output of those tools to pdfsizeopt), which may be more lenient on these kinds of syntax errors.

galaxy001 · 2019-03-19T08:08:06Z

I'm not sure whether the raw XML is wrong. It seems to be well structured.

The pdfx:Universalↂ0020PDF element contains ↂ, which might not be allowed in an XML element name?

2863 0 obj
<</Type/Metadata/Subtype/XML/Length 4228>>
stream
<?xpacket begin="<U+FEFF>" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 84.159810, 2016/09/10-02:41:30        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stRef="http://ns.adobe.com/xap/1.0/sType/ResourceRef#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <pdf:Producer>Acrobat Distiller 7.0 (Windows)</pdf:Producer>
         <pdfx:codeMantraↂ002Cↂ0020LLC>http://www.codemantra.com</pdfx:codeMantraↂ002Cↂ0020LLC>
         <pdfx:Universalↂ0020PDF>The process that creates this PDF constitutes a trade secret of codeMantra, LLC and is protected by the copyright laws of the United States</pdfx:Universalↂ0020PDF>
         <xmp:CreateDate>2011-08-13T07:43:57+05:30</xmp:CreateDate>
         <xmp:ModifyDate>2019-03-18T16:05:59+08:00</xmp:ModifyDate>
         <xmp:MetadataDate>2019-03-18T16:05:59+08:00</xmp:MetadataDate>
         <xmp:CreatorTool>PScript5.dll Version 5.2</xmp:CreatorTool>
         <xmpMM:DocumentID>uuid:DF57C9D151C5E0119B6BD1C4AAD1A2F9</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:66b746de-c760-f544-a64a-127d172cc809</xmpMM:InstanceID>
         <xmpMM:DerivedFrom rdf:parseType="Resource">
            <stRef:documentName>uuid:131bfa5e-206c-4a25-aa69-1a9c002a577a</stRef:documentName>
            <stRef:documentID>uuid:ff0ad5d3-c572-4519-8102-3197dccd28d4</stRef:documentID>
         </xmpMM:DerivedFrom>
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Population Genetics of Bacteria : A Tribute to Thomas S. Whittam</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>Walk, Seth T.(Editor)</rdf:li>
            </rdf:Seq>
         </dc:creator>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

[DOZENS OF SPACE CHARS HERE]

<?xpacket end="w"?>
endstream
endobj
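On the element-name question, a rough check against the `Name` production of XML 1.0 (Fifth Edition) can be sketched as below. The assumptions: ↂ here is U+2182, the ranges are the ones from that edition's grammar (older editions and individual parsers may differ), and `is_xml_name` is a hypothetical helper, not part of any library.

```python
# NameStartChar ranges from the XML 1.0 (Fifth Edition) grammar.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed after the first position (NameChar):
# '-', '.', digits, middle dot, combining marks, undertie characters.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_xml_name(name):
    """Return True if `name` matches the Fifth Edition Name production."""
    if not name:
        return False
    if not _in_ranges(name[0], NAME_START_RANGES):
        return False
    return all(
        _in_ranges(ch, NAME_START_RANGES) or _in_ranges(ch, NAME_EXTRA_RANGES)
        for ch in name[1:]
    )
```

Under these assumptions `is_xml_name("pdfx:Universal\u21820020PDF")` is true, because U+2182 falls in the `0x2070-0x218F` range, so the character by itself would not make the element name invalid.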

pts · 2019-03-19T12:14:05Z

pdfsizeopt doesn't have a problem with object 2863 (in fact, pdfsizeopt keeps such XML objects intact), it is complaining about the syntax error in object 523.

Unfortunately I'm not able to advise you how to fix the input PDF beyond the advice I've already given (i.e. try pdftk or qpdf). If you manage to fix it, please update this issue!

galaxy001 · 2019-03-20T04:28:27Z

I managed to fix it by running qpdf --qdf and then removing the "www.codemantra.com" items.

Then it works. I even found that going through QDF leads to a smaller file.

   7269273 Mar 20 11:36 s.pdf
   6919777 Mar 20 11:39 so.pdf
  17234287 Mar 20 11:37 s.qdf
   6807064 Mar 20 11:39 so.qdf.pdf
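For anyone who wants to script the workaround, a rough sketch follows. It assumes `qpdf` and its companion `fix-qdf` are on PATH, that each offending dictionary entry sits on its own line in the QDF output, and that dropping every line containing the needle is acceptable; `drop_lines_containing` and `repair` are hypothetical helper names, and the exact text qpdf emits for the broken entry was not checked.

```python
import subprocess


def drop_lines_containing(qdf_bytes, needle):
    """Return qdf_bytes without any line that contains needle.

    Line endings are kept, so the remaining bytes are unchanged.
    """
    kept = [
        line
        for line in qdf_bytes.splitlines(keepends=True)
        if needle not in line
    ]
    return b"".join(kept)


def repair(src, dst, needle=b"www.codemantra.com"):
    """Sketch of the manual fix: qpdf --qdf, edit, then fix-qdf."""
    qdf_path = dst + ".qdf"
    # qpdf exits with 3 when it only emitted warnings, so accept 0 and 3.
    result = subprocess.run(["qpdf", "--qdf", src, qdf_path])
    if result.returncode not in (0, 3):
        raise RuntimeError("qpdf failed with exit code %d" % result.returncode)

    with open(qdf_path, "rb") as f:
        edited = drop_lines_containing(f.read(), needle)
    with open(qdf_path, "wb") as f:
        f.write(edited)

    # fix-qdf rebuilds offsets and lengths after a hand edit and writes
    # the repaired file to standard output.
    with open(dst, "wb") as out:
        subprocess.run(["fix-qdf", qdf_path], stdout=out, check=True)
```

The output of `repair` can then be passed to pdfsizeopt as usual.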

I am facing #111 now.

pts closed this as completed Mar 18, 2019

pts added enhancement wontfix labels Mar 18, 2019
