This repository has been archived by the owner on May 7, 2024. It is now read-only.

Commit

Merge pull request #6 from KonnexionsGmbH/wwe_0.8.0
Version 0.8.0
walter-weinmann authored Mar 18, 2022
2 parents 540438b + a5b9d21 commit 2daa171
Showing 56 changed files with 2,309 additions and 342 deletions.
13 changes: 3 additions & 10 deletions .pylintrc
@@ -3,7 +3,7 @@
# A comma-separated list of package or module names from where C extensions may
# be loaded. Extensions are loading into the active Python interpreter and may
# run arbitrary code.
-extension-pkg-allow-list=
+extension-pkg-allow-list=tetlib_py

# A comma-separated list of package or module names from where C extensions may
# be loaded. Extensions are loading into the active Python interpreter and may
@@ -20,7 +20,7 @@ fail-on=
fail-under=10.0

# Files or directories to be skipped. They should be base names, not paths.
-ignore=CVS
+ignore=TET.py

# Add files or directories matching the regex patterns to the ignore-list. The
# regex matches against paths and can be in Posix or Windows format.
@@ -78,14 +78,7 @@ confidence=
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use "--disable=all --enable=classes
# --disable=W".
-disable=raw-checker-failed,
-bad-inline-option,
-locally-disabled,
-file-ignored,
-suppressed-message,
-useless-suppression,
-deprecated-pragma,
-use-symbolic-message-instead
+disable=

# Enable the message, report, category or checker with the given id(s). You can
# either give multiple identifier separated by comma (,) or put this option
12 changes: 6 additions & 6 deletions Makefile
@@ -31,12 +31,12 @@ export DCR_ENVIRONMENT_TYPE=test

ifeq ($(OS),Windows_NT)
DCR_DOCKER_CONTAINER=scripts\\run_setup_postgresql.bat test
-export MYPYPATH=src\\dcr
-export PYTHONPATH=src\\dcr
+export MYPYPATH=src\\dcr;src\\dcr\\libs
+export PYTHONPATH=src\\dcr;src\\dcr\\libs
else
DCR_DOCKER_CONTAINER=./scripts/run_setup_postgresql.sh test
-export MYPYPATH=src/dcr
-export PYTHONPATH=src/dcr:src/dcr
+export MYPYPATH=src/dcr:src/dcr/libs
+export PYTHONPATH=src/dcr:src/dcr:src/dcr/libs
endif
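
The hunk above appends `src/dcr/libs` to both search paths, using `;` as the separator on Windows and `:` elsewhere. As a minimal sketch of the same idea in Python, `os.pathsep` and `os.path.join` pick the platform's separators automatically. The helper name `build_search_path` is hypothetical and not part of the repository; it also drops duplicate entries, such as the repeated `src/dcr` in the non-Windows `PYTHONPATH`.

```python
import os


def build_search_path(*directories: str) -> str:
    """Join directories into a search-path string for the current platform.

    Hypothetical helper for illustration only. Duplicate entries are
    dropped while the first-seen order is preserved.
    """
    unique = list(dict.fromkeys(directories))
    return os.pathsep.join(unique)


# Directories taken from the Makefile hunk; src/dcr is passed twice on purpose.
dcr = os.path.join("src", "dcr")
libs = os.path.join("src", "dcr", "libs")
search_path = build_search_path(dcr, dcr, libs)
```

The resulting string could be assigned to `PYTHONPATH` or `MYPYPATH` in a launcher script instead of maintaining two hand-written variants.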

# Bandit is a tool designed to find common security issues in Python code.
@@ -94,7 +94,7 @@ docformatter: ## Format the docstrings with docformatter.
flake8: ## Enforce the Python Style Guides with Flake8.
@echo "Info ********** Start: Flake8 **************************************"
pipenv run flake8 --version
-pipenv run flake8 src tests
+pipenv run flake8 --exclude TET.py src tests
@echo "Info ********** End: Flake8 **************************************"

# isort your imports, so you don't have to.
@@ -123,7 +123,7 @@ mypy: ## Find typing issues with Mypy.
@echo MYPYPATH=${MYPYPATH}
pipenv run pip freeze | grep mypy
pipenv run mypy --version
-pipenv run mypy src
+pipenv run mypy --exclude TET.py src
@echo "Info ********** End: Mypy ****************************************"

# pip is the package installer for Python.
47 changes: 23 additions & 24 deletions Pipfile.lock

Some generated files are not rendered by default.

22 changes: 11 additions & 11 deletions README.md
@@ -3,21 +3,21 @@
![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.7.0)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.8.0)

-Based on the paper "Unfolding the Structure of a Document using Deep Learning" (**[Rahman and Finin, 2019](https://konnexionsgmbh.github.io/dcr/research/#rahman-m-finin-t-2019)**), this software project attempts to automatically recognize the structure in arbitrary PDF documents and thus make them more searchable in a more qualified manner.
-Documents not in PDF format are converted to PDF format using **[Pandoc](https://pandoc.org)** and **[TeX Live](https://www.tug.org/texlive)** .
-Documents based on scanning which, therefore, do not contain text elements, are scanned and converted to PDF format using the **[Tesseract OCR](https://github.com/tesseract-ocr/tesseract)** software.
-This process applies to all image format files e.g. jpeg, tiff etc., as well as scanned images in PDF format.
+Based on the paper "Unfolding the Structure of a Document using Deep Learning" (**[Rahman and Finin, 2019](https://konnexionsgmbh.github.io/dcr/research/#rahman-m-finin-t-2019)**), this software project attempts to automatically recognize the structure in arbitrary **`pdf`** documents and thus make them more searchable in a more qualified manner.
+Documents not in **`pdf`** format are converted in advance to **`pdf`** format using **[Pandoc](https://pandoc.org)** and **[TeX Live](https://www.tug.org/texlive)** .
+Documents based on scanning which, therefore, do not contain text elements, are scanned and converted in advance to **`pdf`** format using the **[Tesseract OCR](https://github.com/tesseract-ocr/tesseract)** software.

Please see the **[Documentation](https://konnexionsgmbh.github.io/dcr)** for more detailed information.

## Features

-- Identifying scanned image 'pdf' documents using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/module.html).
-- Converting scanned image 'pdf' documents to a series of 'jpeg' or 'png' files using [pdf2image](https://pypi.org/project/pdf2image).
-- Convert 'csv', 'docx', 'epub', 'html', 'odt', 'rst' or 'rtf' type documents to 'pdf' format using [Pandoc](https://pandoc.org) and [TeX Live](https://www.tug.org/texlive).
-- Convert 'bmp', 'gif', 'jp2', 'jpeg', 'png', 'pnm', 'tiff' or 'webp' type documents to 'pdf' format using [Tesseract OCR](https://github.com/tesseract-ocr/tesseract).
+- Identifying scanned image **`pdf`** documents using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/module.html).
+- Converting scanned image **`pdf`** documents to a series of **`jpeg`** or **`png`** files using [pdf2image](https://pypi.org/project/pdf2image).
+- Converting **`csv`**, **`docx`**, **`epub`**, **`html`**, **`odt`**, **`rst`** or **`rtf`** type documents to **`pdf`** format using [Pandoc](https://pandoc.org) and [TeX Live](https://www.tug.org/texlive).
+- Converting **`bmp`**, **`gif`**, **`jp2`**, **`jpeg`**, **`png`**, **`pnm`**, **`tiff`** or **`webp`** type documents to **`pdf`** format using [Tesseract OCR](https://github.com/tesseract-ocr/tesseract).
+- Extracting text and metadata from **`pdf`** documents using [PDFlib TET](https://www.pdflib.com/products/tet/).
- Much more!
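
The feature list implies a dispatch on document type: some formats go through Pandoc and TeX Live, image formats go through Tesseract OCR, and `pdf` files need no format conversion. A minimal sketch of that dispatch follows, assuming the decision is made from the file extension alone. The function name `choose_converter` and the returned labels are hypothetical and not taken from the DCR code base, and the sketch ignores the separate handling of scanned-image `pdf` files.

```python
from pathlib import Path

# Extension sets copied from the README feature list.
PANDOC_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}
TESSERACT_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tiff", "webp"}


def choose_converter(file_name: str) -> str:
    """Return a label for the tool that would convert the file to pdf.

    Hypothetical helper: the labels "none", "pandoc", "tesseract" and
    "unsupported" are illustrative names only.
    """
    suffix = Path(file_name).suffix.lower().lstrip(".")
    if suffix == "pdf":
        return "none"  # already in the target format
    if suffix in PANDOC_TYPES:
        return "pandoc"
    if suffix in TESSERACT_TYPES:
        return "tesseract"
    return "unsupported"
```

Keeping the extension sets as module-level constants means a newly supported format only needs one entry added, with no change to the branching logic.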

## Support
@@ -27,9 +27,9 @@ If you need help with **DCR**, do not hesitate to get in contact with us!
- For questions and high-level discussions, use **[Discussions](https://github.com/KonnexionsGmbH/dcr/discussions)** on GitHub.
- To report a bug or make a feature request, open an **[Issue](https://github.com/KonnexionsGmbH/dcr/issues)** on GitHub.

-Please note that we may only provide support for problems/questions regarding core features of **DCR**.
+Please note that we may only provide support for problems / questions regarding core features of **DCR**.
Any questions or bug reports about features of third-party themes, plugins, extensions or similar should be made to their respective projects.
-But, such questions are *not* banned from the **[Discussions](https://github.com/KonnexionsGmbH/dcr/discussions)**.
+But, such questions are **not** banned from the **[Discussions](https://github.com/KonnexionsGmbH/dcr/discussions)**.

Make sure to stick around to answer some questions as well!

2 changes: 1 addition & 1 deletion docs/code_of_conduct.md
@@ -3,7 +3,7 @@
![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.7.0)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.8.0)

----

2 changes: 1 addition & 1 deletion docs/contributing.md
@@ -3,7 +3,7 @@
![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.7.0)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.8.0)

----
