📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs

jfilter/pdf-scripts

PDF Scripts

Scripts (mostly Bash) to repair, verify, OCR, compress (etc.) PDFs.

Currently in beta status, so expect backward-incompatible changes.

Install

You need to have Bash installed.

The scripts use several software libraries. setup.sh installs them for macOS (via brew) or Ubuntu/Debian.

Usage

  1. Go to root of this repository: cd pdf-scripts
  2. Execute a script: ./pipeline.sh -l deu /path/to/document-in-german.pdf

Please refer to the scripts themselves for their command-line arguments and options. NB: Options cannot be combined, so use -x -y instead of -xy.
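To illustrate why combined options can fail, here is a minimal sketch of a hand-written option loop. It is an assumption about how such parsing might look, not the code of any script in this repository; parse_opts and the flags -x and -y are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical parser: -x and -y are placeholder flags, not real options
# of the scripts in this repository.
parse_opts() {
  local x=0 y=0
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -x) x=1 ;;
      -y) y=1 ;;
      -*)
        # "-xy" ends up here because each argument is matched as a whole.
        echo "unknown option: $1" >&2
        return 1
        ;;
      *) break ;; # first non-option argument: stop parsing
    esac
    shift
  done
  echo "x=$x y=$y rest=$*"
}
```

With this kind of loop, parse_opts -x -y file.pdf succeeds, while parse_opts -xy file.pdf is rejected as an unknown option.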

Most scripts work on individual PDFs as well as on folders full of PDFs.
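Handling both a single file and a folder could be sketched roughly as follows. This is an illustration only: for_each_pdf is a hypothetical helper, and the sketch assumes a non-recursive scan for files ending in .pdf, which may differ from what the scripts actually do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command on one PDF, or on every *.pdf
# directly inside a folder (non-recursive, by assumption).
for_each_pdf() {
  local cmd=$1 target=$2 f
  if [ -d "$target" ]; then
    for f in "$target"/*.pdf; do
      [ -e "$f" ] || continue # the glob matched nothing
      "$cmd" "$f"
    done
  elif [ -f "$target" ]; then
    "$cmd" "$target"
  else
    echo "not a file or folder: $target" >&2
    return 1
  fi
}
```

For example, for_each_pdf echo /path/to/folder prints the path of every PDF in that folder, and any other command or shell function can take the place of echo.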

Overview

OCR PDFs with OCRmyPDF.

Using: pdftocairo from poppler, mutool clean from MuPDF, qpdf

Caveat: This may remove the text of already OCRed PDFs. Use --check to check for OCRed text first in order to preserve it.

Checks whether text can be extracted (i.e., whether the PDF already contains text).
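One way such a check could work is to extract the text and count the non-whitespace bytes. The sketch below assumes pdftotext from poppler is installed; count_nonspace_bytes and has_text are hypothetical names, and the script in this repository may use a different approach.

```shell
#!/usr/bin/env bash
# Count the bytes on stdin that are not whitespace
# (form feeds between pages count as whitespace).
count_nonspace_bytes() {
  tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed if pdftotext (poppler, assumed to be installed) extracts any text.
has_text() {
  local n
  n=$(pdftotext "$1" - 2>/dev/null | count_nonspace_bytes)
  [ "${n:-0}" -gt 0 ]
}
```

Usage: has_text document.pdf && echo "text found". A scanned PDF without a text layer would typically yield no extractable text and fail the check.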

Uses Ghostscript to compress images in PDFs.
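A commonly used Ghostscript invocation for this purpose goes through the pdfwrite device with a PDFSETTINGS preset. The sketch below only builds the argument list; gs_compress_args is a hypothetical helper, and the preset and compatibility level are assumptions, not necessarily the values the script in this repository uses.

```shell
#!/usr/bin/env bash
# Print one Ghostscript argument per line.
# quality is one of the pdfwrite presets: screen, ebook, printer, prepress.
gs_compress_args() {
  local in=$1 out=$2 quality=${3:-ebook}
  printf '%s\n' -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
    "-dPDFSETTINGS=/$quality" -dNOPAUSE -dQUIET -dBATCH \
    "-sOutputFile=$out" "$in"
}

# Example (requires Bash v4+ for mapfile and an installed gs):
#   mapfile -t args < <(gs_compress_args in.pdf out.pdf screen)
#   gs "${args[@]}"
```

Lower-quality presets such as screen downsample images more aggressively, so the right choice depends on how the PDFs will be used.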

Uses compress_pdf.sh and additionally pdfsizeopt to reduce the file size of PDFs.

Removes metadata with exiftool.

Detects OCRed PDFs. See also sort_ocrd_pdfs.sh to sort PDFs.

Combines several of the above scripts.

FAQ

Why Bash?

Bash is still the most widely used shell, and the scripts consist mostly of simple conditionals and sequences of CLI commands. This could also be done with Python's psutil, but that would add yet another layer. At some point, however, I will most probably port the scripts to plain POSIX shell.

Related Work

Development

  • focus on Bash v4+
  • write Python 3.6+ scripts if Bash gets too complicated
  • use Docker images if available
  • should run on the major Unix-like OSs (Linux (e.g. Ubuntu), macOS)
  • format code with shfmt, e.g., extension for VS Code
  • lint scripts with shellcheck, e.g., extension for VS Code

Common Commands

Concat PDFs into one PDF

qpdf --empty --pages *.pdf -- out.pdf

Images to PDF

convert *.jpg pictures.pdf

Rotate PDFs

qpdf in.pdf out.pdf --rotate=+90

License

GPLv3.