
Extend PDF-based tests to dvips #278

Closed
wants to merge 10 commits

Conversation

muzimuzhi
Contributor

This PR adds the dvips chain to the list of check engines used for PDF-based tests, and of course the new test output.

With help from @u-fischer in #274, timestamp metadata in PDFs generated by ps2pdf is normalized by adding \DocumentMetadata{} to the test .tex file, which defines \__pdf_backend_set_regression_data: to be used by regression-test.tex.

Two by-products (perhaps they should live in separate PRs, but they are closely related to this one...):

  • Add option -a to the diff call, to make it treat all files as text files (see the sketch after this list).
    • This is helpful when there are remaining byte streams in the files being compared.
    • Right now I don't have direct access to Windows. From the docs, it seems the corresponding option for fc on Windows is /b.
  • Add a new pattern of PDF stream to normalize PDFs.
    • In an old version of the test output 00-test-2.latexdvips.tpf, two streams were not normalized. Their beginning marks were /Subtype/Type1C/Length 938>>stream and /Type/Metadata/Length 1546>>stream, so I used the strict pattern /Length %d+>>stream to try to match them. (a sample: 00-test-2.latexdvips.tpf.zip)
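
A minimal sketch of the diff idea; the variable and function names below are hypothetical and not l3build's actual call site:

```lua
-- Sketch: force GNU diff to treat all inputs as text ("-a"), so leftover
-- binary bytes in a .tpf do not reduce the comparison to "Binary files differ".
local diffexe = "diff -a -c --strip-trailing-cr"  -- "-a" is the new part

local function compare_tpf(reffile, genfile, difffile)
  local cmd = diffexe .. " " .. reffile .. " " .. genfile .. " > " .. difffile
  return os.execute(cmd)
end
```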

My original purpose was to test against #273, but it ended with the finding that the dvips chain (more specifically, the Ghostscript behind ps2pdf) doesn't need the epoch settings in environment variables to generate normalized metadata in the PDF. Evidence: checks pass for commit af4950c, which reverts #273.

The settings are kept, as they do no harm and are used by dvips.
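
For context, a minimal sketch of the check-engine side of the setup, using the standard l3build variable checkengines; the exact contents of this PR's config files may differ (the specialformats part is quoted later in the review):

```lua
-- Sketch: config-pdf.lua, adding the dvips chain as an extra "engine" for
-- the PDF-based tests; "latexdvips" is resolved through specialformats.
checkengines = {"pdftex", "xetex", "luatex", "latexdvips"}
```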

@@ -7,6 +7,11 @@ this project uses date-based 'snapshot' version identifiers.

## [Unreleased]

### Added
- Force non-Windows diff program to consider all files to be text files
- Extend PDF-based tests to `dvips`
Contributor Author


Do l3build test changes need to be listed in the CHANGELOG?

Member


I'm not sure what you mean - that the change may require .tlg updates?

Contributor Author


The changes only involve "using l3build to test l3build itself".

Comment on lines +10 to +19
--[[ FIXME:
- setting specialformats in config-pdf.lua results in "binary" set to "latexdvips" (should be "latex")
- setting specialformats in build.lua disables specifying engine
l3build save -c config-pdf -e latexdvips 00-test-2
]]
specialformats = specialformats or {}
specialformats["latex"] = specialformats["latex"] or
  {
    latexdvips = {binary = "latex", format = ""}
  }
Contributor Author


I guess the weird limitations are caused by some l3build problems, but for now let's just use a way that "works".

@@ -551,6 +551,11 @@ local function normalize_pdf(content)
binary = false
stream = true
stream_content = "stream" .. os_newline
elseif match(line, "/Length %d+>>stream$") then
Contributor Author


Maybe ">>stream$" would be enough, but I chose a stricter pattern, mainly to avoid matching a non-stream line.
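
A small standalone Lua demo (not l3build code) showing that the stricter pattern still matches both stream headers seen in the old .tpf:

```lua
-- Demo: both observed stream headers end in "/Length <digits>>>stream",
-- so the anchored pattern matches them.
local headers = {
  "/Subtype/Type1C/Length 938>>stream",
  "/Type/Metadata/Length 1546>>stream",
}
for _, line in ipairs(headers) do
  assert(string.match(line, "/Length %d+>>stream$"))
end
print("both headers matched")
```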

Comment on lines +17 to +19
l3experimental
latex-lab
pdfmanagement-testphase
Contributor Author

@muzimuzhi muzimuzhi Feb 22, 2023


Are these three packages too experimental, making the test output files less stable?

Comment on lines +1 to +4
% needed by dvips (ps2pdf), to normalize pdf metadata
\ifdefined\pdfoutput\ifnum\pdfoutput=0
\DocumentMetadata{}
\fi\fi
Contributor Author


The conditionals are used to limit \DocumentMetadata{} to dvips. Otherwise the pdftex and xetex engines will generate PDFs containing a long, decompressed stream with XML metadata.

(Feels like I should move some of these comments to commit messages.)

@muzimuzhi
Contributor Author

muzimuzhi commented Feb 23, 2023

Add a new pattern of PDF stream to normalize PDFs.

  • In an old version of the test output 00-test-2.latexdvips.tpf, two streams were not normalized. Their beginning marks were /Subtype/Type1C/Length 938>>stream and /Type/Metadata/Length 1546>>stream, so I used the strict pattern /Length %d+>>stream to try to match them. (a sample: 00-test-2.latexdvips.tpf.zip)

Setting ps2pdfopts = " -dCompressStreams=false " will decompress the second stream starting with /Type/Metadata/Length 1546>>stream. But adding the option -dCompressFonts=false has no obvious effect on the embedded font stream starting with /Subtype/Type1C/Length 938>>stream. This remaining binary stream for the embedded font means GitHub still treats the .tpf as a binary file, hence the diff for the decompressed metadata stream cannot be seen on the GitHub webpage. (Ghostscript docs)

At the moment I can't find a Ghostscript CLI option that acts like the -a option of mutool clean, which ASCII-hex encodes binary streams.

Just for more real cases, here are all the different beginning patterns found in *.tpf files in the pdfresources repo, testfiles-dvips directory:

/N 3/Length 8>>stream
/ProcSet [/PDF/ImageB/Text]>>/Length 170>>stream
/Resources<</ProcSet [/PDF/ImageB]>>/Length 107>>stream
/Resources<</ProcSet [/PDF/ImageB]>>/Length 108>>stream
/Resources<</ProcSet [/PDF]>>/Length 8>>stream
/Size[256]/Length 12>>stream
/Subtype /text#2fplain/Length 13>>stream
/Subtype/Type1C/Length 1006>>stream
/Subtype/XML/Length 1144>>stream
/Type/EmbeddedFile/Length 21>>stream
/Type/Metadata/Length 1547>>stream
/yyy(bla)/Length 89>>stream
<</Filter/FlateDecode/Length 172>>stream
>>>>/Length 71>>stream

Update: I'm inclined to add ps2pdfopts = " -dCompressStreams=false " and make the pattern more strict to only match streams for embedded fonts and images. (First half is done.)
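
For the first half, the change boils down to one line in the test configuration (ps2pdfopts is the l3build variable mentioned above; where exactly it is set is a detail of this PR):

```lua
-- Keep Ghostscript from compressing streams when ps2pdf builds the PDF,
-- so e.g. the XMP metadata stream stays readable in the .tpf file.
ps2pdfopts = " -dCompressStreams=false "
```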

@muzimuzhi
Contributor Author

muzimuzhi commented Feb 24, 2023

Update: I'm inclined to add ps2pdfopts = " -dCompressStreams=false " and make the pattern more strict to only match streams for embedded fonts and images. (First half is done.)

I have a feeling that l3build should set -dCompressStreams=false by default when calling ps2pdf in dvitopdf(), because the expectation is that the PDF-based test outputs are decompressed.

This can be done either by adding the option to the variable ps2pdfopts, which is passed to ps2pdf, or by setting the environment variable GS_OPTIONS.

On macOS, ps2pdf options are ultimately passed to the gs executable twice, hence I lean a bit towards the GS_OPTIONS way.

$ grep -B 2 -- '-dBATCH' `which ps2pdfwr`
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec "$GS_EXECUTABLE" $OPTIONS -q -P- -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr "-sOutputFile=$outfile" $OPTIONS "$infile"
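
A hypothetical sketch of the GS_OPTIONS route inside dvitopdf(); the helper name and command assembly below are my assumptions, not l3build's actual implementation:

```lua
-- Hypothetical: pass the option once via GS_OPTIONS instead of repeating it
-- through ps2pdf's $OPTIONS expansion (Unix-style env prefix shown; Windows
-- would need "set GS_OPTIONS=... &" instead).
local function dvitopdf_sketch(name, dir)
  local cmd = "GS_OPTIONS=-dCompressStreams=false "
    .. "ps2pdf " .. name .. ".ps " .. name .. ".pdf"
  return os.execute("cd " .. dir .. " && " .. cmd)
end
```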

@zauguin
Member

zauguin commented Feb 24, 2023

This pattern would match pretty generic streams and rely on implementation details regarding spacing in order not to remove something important. That sounds rather fragile. IMO normalization should only happen for data which is known not to be important.

@muzimuzhi
Contributor Author

muzimuzhi commented Feb 25, 2023

This pattern would match pretty generic streams and rely on implementation details regarding spacing in order not to remove something important. That sounds rather fragile. IMO normalization should only happen for data which is known not to be important.

Yes, I too realized the fragility of the new pattern in #278 (comment). But with the following three pieces of info, I tend to think the new pattern is not as fragile as it may seem.

  • The use of gs option -dCompressStreams=false.
  • The fact that normalize_pdf() actually checks whether the stream content contains "binary bytes" (roughly speaking), and will not mistakenly eat a human-readable stream (see the demo after this list). normalize_pdf() was first introduced in 64825da.

    l3build/l3build-check.lua

    Lines 531 to 549 in 26a593f

    if match(line,"endstream") then
      stream = false
      if binary then
        new_content = new_content .. "[BINARY STREAM]" .. os_newline
      else
        new_content = new_content .. stream_content .. line .. os_newline
      end
      binary = false
    else
      for i = 0, 31 do
        if match(line,char(i)) then
          binary = true
          break
        end
      end
      if not binary and not match(line, "^ *$") then
        stream_content = stream_content .. line .. os_newline
      end
    end
  • The fact (from the PDF reference v1.7, sec. 3.1.1 Character Set) that whether the string "stream" is on its own line is just a matter of taste of PDF writers, hence it gives no indication of whether the stream is compressed or not. (Not quite sure.)
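
For the second point, a small standalone Lua demo of the same control-byte scan used in normalize_pdf(): a readable line passes through untouched, while a line containing bytes 0-31 gets flagged as binary:

```lua
-- Demo of the 0..31 control-byte check (same idea as in normalize_pdf()).
local char, match = string.char, string.match

local function looks_binary(line)
  for i = 0, 31 do
    if match(line, char(i)) then return true end
  end
  return false
end

print(looks_binary("/Producer(GPL Ghostscript 10.0)"))  -- false: kept verbatim
print(looks_binary("compressed\001data"))               -- true: -> [BINARY STREAM]
```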

@josephwright
Member

Where did we get to here?

@muzimuzhi
Contributor Author

muzimuzhi commented Jul 16, 2023

I think this PR can be closed for now. Maybe I'll give it another try in the future.

The controversial part is whether the new stream match-and-drop logic added to l3build-check.lua is safe.

Update: Part of the changes this PR wanted to merge has been split into separate PRs, which have already been merged.

@muzimuzhi muzimuzhi closed this Jul 16, 2023