
Extend PDF-based tests to dvips #278

Closed
wants to merge 10 commits

Conversation

muzimuzhi
Contributor

This PR adds the dvips chain to the list of check engines used for PDF-based tests, and of course the new test output.

With help from @u-fischer in #274, timestamp metadata in PDFs generated by ps2pdf is normalized by adding \DocumentMetadata{} to the test .tex file, which defines \__pdf_backend_set_regression_data: to be used by regression-test.tex.

Two by-products (perhaps they should live in separate PRs, but they are closely related to this one...):

  • Add option -a to the diff call, to make it treat all files as text files (see the sketch after this list).
    • This is helpful when there are remaining byte streams in the files being compared.
    • Right now I don't have direct access to Windows. From the docs, it seems the corresponding option for fc on Windows is /b.
  • Add a new pattern of PDF stream to normalize PDFs.
    • In an old version of the test output 00-test-2.latexdvips.tpf, two streams were not normalized. Their beginning marks were /Subtype/Type1C/Length 938>>stream and /Type/Metadata/Length 1546>>stream, so I used the strict pattern /Length %d+>>stream to try to match them. (a sample: 00-test-2.latexdvips.tpf.zip)
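
A minimal sketch of the diff idea; the variable and function names below are hypothetical and not l3build's actual call site:

```lua
-- Sketch: force GNU diff to treat all inputs as text ("-a"), so leftover
-- binary bytes in a .tpf do not reduce the comparison to "Binary files differ".
local diffexe = "diff -a -c --strip-trailing-cr"  -- "-a" is the new part

local function compare_tpf(reffile, genfile, difffile)
  local cmd = diffexe .. " " .. reffile .. " " .. genfile .. " > " .. difffile
  return os.execute(cmd)
end
```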

My original purpose was to test against #273, but it ended with the finding that the dvips chain (more specifically, the Ghostscript behind ps2pdf) doesn't need the epoch settings in environment variables to generate normalized metadata in the PDF. Evidence: checks pass for commit af4950c, which reverts #273.

The settings are kept, as they do no harm and are used by dvips.
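
For context, a minimal sketch of the check-engine side of the setup, using the standard l3build variable checkengines; the exact contents of this PR's config files may differ (the specialformats part is quoted later in the review):

```lua
-- Sketch: config-pdf.lua, adding the dvips chain as an extra "engine" for
-- the PDF-based tests; "latexdvips" is resolved through specialformats.
checkengines = {"pdftex", "xetex", "luatex", "latexdvips"}
```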

@@ -7,6 +7,11 @@ this project uses date-based 'snapshot' version identifiers.

## [Unreleased]

### Added
- Force non-Windows diff program to consider all files to be text files
- Extend PDF-based tests to `dvips`
Contributor Author


Do l3build test changes need to be listed in the CHANGELOG?

Member


I'm not sure what you mean - that the change may require .tlg updates?

Contributor Author


The changes only involve "using l3build to test l3build itself".

Comment on lines +10 to +19
--[[ FIXME:
- setting specialformats in config-pdf.lua results in "binary" set to "latexdvips" (should be "latex")
- setting specialformats in build.lua disables specifying engine
l3build save -c config-pdf -e latexdvips 00-test-2
]]
specialformats = specialformats or {}
specialformats["latex"] = specialformats["latex"] or
  {
    latexdvips = {binary = "latex", format = ""}
  }
Contributor Author


I guess the weird limitations are caused by some l3build problems, but for now let's just use a way that "works".

@@ -551,6 +551,11 @@ local function normalize_pdf(content)
binary = false
stream = true
stream_content = "stream" .. os_newline
elseif match(line, "/Length %d+>>stream$") then
Contributor Author


Maybe ">>stream$" would be enough, but I chose a stricter pattern, mainly to avoid matching a non-stream line.
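
A small standalone Lua demo (not l3build code) showing that the stricter pattern still matches both stream headers seen in the old .tpf:

```lua
-- Demo: both observed stream headers end in "/Length <digits>>>stream",
-- so the anchored pattern matches them.
local headers = {
  "/Subtype/Type1C/Length 938>>stream",
  "/Type/Metadata/Length 1546>>stream",
}
for _, line in ipairs(headers) do
  assert(string.match(line, "/Length %d+>>stream$"))
end
print("both headers matched")
```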

Comment on lines +17 to +19
l3experimental
latex-lab
pdfmanagement-testphase
Contributor Author

@muzimuzhi muzimuzhi Feb 22, 2023


Are these three packages too experimental, making the test output files less stable?

Comment on lines +1 to +4
% needed by dvips (ps2pdf), to normalize pdf metadata
\ifdefined\pdfoutput\ifnum\pdfoutput=0
\DocumentMetadata{}
\fi\fi
Contributor Author


The conditionals are used to limit \DocumentMetadata{} to dvips. Otherwise the pdftex and xetex engines will generate PDFs containing a long, decompressed stream with XML metadata.

(Feels like I should move some of these comments to commit messages.)

@muzimuzhi
Contributor Author

muzimuzhi commented Feb 23, 2023

Add a new pattern of PDF stream to normalize PDFs.

  • In an old version of the test output 00-test-2.latexdvips.tpf, two streams were not normalized. Their beginning marks were /Subtype/Type1C/Length 938>>stream and /Type/Metadata/Length 1546>>stream, so I used the strict pattern /Length %d+>>stream to try to match them. (a sample: 00-test-2.latexdvips.tpf.zip)

Setting ps2pdfopts = " -dCompressStreams=false " will decompress the second stream starting with /Type/Metadata/Length 1546>>stream. But adding the option -dCompressFonts=false has no obvious effect on the embedded font stream starting with /Subtype/Type1C/Length 938>>stream. This remaining binary stream for the embedded font means GitHub still treats the .tpf as a binary file, hence the diff for the decompressed metadata stream cannot be seen on the GitHub webpage. (Ghostscript docs)

At the moment I can't find a Ghostscript CLI option that acts like the -a option of mutool clean, which ASCII-hex encodes binary streams.

Just for more real cases, here are all the different beginning patterns found in *.tpf files in the pdfresources repo, testfiles-dvips directory:

/N 3/Length 8>>stream
/ProcSet [/PDF/ImageB/Text]>>/Length 170>>stream
/Resources<</ProcSet [/PDF/ImageB]>>/Length 107>>stream
/Resources<</ProcSet [/PDF/ImageB]>>/Length 108>>stream
/Resources<</ProcSet [/PDF]>>/Length 8>>stream
/Size[256]/Length 12>>stream
/Subtype /text#2fplain/Length 13>>stream
/Subtype/Type1C/Length 1006>>stream
/Subtype/XML/Length 1144>>stream
/Type/EmbeddedFile/Length 21>>stream
/Type/Metadata/Length 1547>>stream
/yyy(bla)/Length 89>>stream
<</Filter/FlateDecode/Length 172>>stream
>>>>/Length 71>>stream

Update: I'm inclined to add ps2pdfopts = " -dCompressStreams=false " and make the pattern more strict to only match streams for embedded fonts and images. (First half is done.)
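
For the first half, the change boils down to one line in the test configuration (ps2pdfopts is the l3build variable mentioned above; where exactly it is set is a detail of this PR):

```lua
-- Keep Ghostscript from compressing streams when ps2pdf builds the PDF,
-- so e.g. the XMP metadata stream stays readable in the .tpf file.
ps2pdfopts = " -dCompressStreams=false "
```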

@muzimuzhi
Contributor Author

muzimuzhi commented Feb 24, 2023

Update: I'm inclined to add ps2pdfopts = " -dCompressStreams=false " and make the pattern more strict to only match streams for embedded fonts and images. (First half is done.)

I have a feeling that l3build should set -dCompressStreams=false by default when calling ps2pdf in dvitopdf(), because the expectation is that the PDF-based test outputs are decompressed.

This can be done either by adding the option to the variable ps2pdfopts, which is passed to ps2pdf, or by setting the environment variable GS_OPTIONS.

On macOS, ps2pdf options are ultimately passed to the gs executable twice, hence I lean a bit towards the GS_OPTIONS way.

$ grep -B 2 -- '-dBATCH' `which ps2pdfwr`
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec "$GS_EXECUTABLE" $OPTIONS -q -P- -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr "-sOutputFile=$outfile" $OPTIONS "$infile"
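
A hypothetical sketch of the GS_OPTIONS route inside dvitopdf(); the helper name and command assembly below are my assumptions, not l3build's actual implementation:

```lua
-- Hypothetical: pass the option once via GS_OPTIONS instead of repeating it
-- through ps2pdf's $OPTIONS expansion (Unix-style env prefix shown; Windows
-- would need "set GS_OPTIONS=... &" instead).
local function dvitopdf_sketch(name, dir)
  local cmd = "GS_OPTIONS=-dCompressStreams=false "
    .. "ps2pdf " .. name .. ".ps " .. name .. ".pdf"
  return os.execute("cd " .. dir .. " && " .. cmd)
end
```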

@zauguin
Member

zauguin commented Feb 24, 2023

This pattern would match pretty generic streams and rely on implementation details regarding spacing in order not to remove something important. That sounds rather fragile. IMO normalization should only happen for data which is known not to be important.

@muzimuzhi
Contributor Author

muzimuzhi commented Feb 25, 2023

This pattern would match pretty generic streams and rely on implementation details regarding spacing in order not to remove something important. That sounds rather fragile. IMO normalization should only happen for data which is known not to be important.

Yes, I too realized the fragility of the new pattern in #278 (comment). But with the following three pieces of info, I tend to think the new pattern is not as fragile as it may seem.

  • The use of gs option -dCompressStreams=false.
  • The fact that normalize_pdf() actually checks whether the stream content contains "binary bytes" (roughly speaking), and will not mistakenly eat a human-readable stream (see the demo after this list). normalize_pdf() was first introduced in 64825da.

    l3build/l3build-check.lua

    Lines 531 to 549 in 26a593f

    if match(line,"endstream") then
      stream = false
      if binary then
        new_content = new_content .. "[BINARY STREAM]" .. os_newline
      else
        new_content = new_content .. stream_content .. line .. os_newline
      end
      binary = false
    else
      for i = 0, 31 do
        if match(line,char(i)) then
          binary = true
          break
        end
      end
      if not binary and not match(line, "^ *$") then
        stream_content = stream_content .. line .. os_newline
      end
    end
  • The fact (from the PDF reference v1.7, sec. 3.1.1 Character Set) that whether the string "stream" is on its own line is just a matter of taste of PDF writers, hence it gives no indication of whether the stream is compressed or not. (Not quite sure.)
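
For the second point, a small standalone Lua demo of the same control-byte scan used in normalize_pdf(): a readable line passes through untouched, while a line containing bytes 0-31 gets flagged as binary:

```lua
-- Demo of the 0..31 control-byte check (same idea as in normalize_pdf()).
local char, match = string.char, string.match

local function looks_binary(line)
  for i = 0, 31 do
    if match(line, char(i)) then return true end
  end
  return false
end

print(looks_binary("/Producer(GPL Ghostscript 10.0)"))  -- false: kept verbatim
print(looks_binary("compressed\001data"))               -- true: -> [BINARY STREAM]
```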

@josephwright
Member

Where did we get to here?

@muzimuzhi
Contributor Author

muzimuzhi commented Jul 16, 2023

I think this PR can be closed for now. Maybe I'll give it another try in the future.

The controversial part is whether the new stream match-and-drop logic added to l3build-check.lua is safe.

Update: Part of the changes this PR wanted to merge has been split into separate PRs, which have already been merged.

@muzimuzhi muzimuzhi closed this Jul 16, 2023