This repository has been archived by the owner on May 7, 2024. It is now read-only.

Merge pull request #3 from KonnexionsGmbH/wwe_0.6.5
Version 0.6.5
walter-weinmann authored Mar 10, 2022
2 parents be5309e + b012589 commit 72b2965
Showing 51 changed files with 1,728 additions and 1,258 deletions.
25 changes: 19 additions & 6 deletions .github/workflows/ci.yml
@@ -8,6 +8,7 @@ on:

env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
VERSION_PANDOC: 2.17.1.1

jobs:
standards:
@@ -43,11 +44,15 @@ jobs:
run: |
chmod +x ./scripts/run_setup_postgresql.sh
./scripts/run_setup_postgresql.sh test
- name: Install Poppler
- name: Install Pandoc & Poppler & TeX Live
run: |
sudo apt-get update -qy
sudo apt-get upgrade -qy
sudo apt-get install -qy poppler-utils
sudo apt-get install -qy poppler-utils \
texlive-full \
wget
wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
- name: Publish the code coverage to coveralls.io
run: make coveralls

@@ -79,11 +84,15 @@ jobs:
run: |
chmod +x ./scripts/run_setup_postgresql.sh
./scripts/run_setup_postgresql.sh test
- name: Install Poppler
- name: Install Pandoc & Poppler & TeX Live
run: |
sudo apt-get update -qy
sudo apt-get upgrade -qy
sudo apt-get install -qy poppler-utils
sudo apt-get install -qy poppler-utils \
texlive-full \
wget
wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
- name: Run pytest for writing better program
run: make pytest

@@ -115,10 +124,14 @@ jobs:
run: |
chmod +x ./scripts/run_setup_postgresql.sh
./scripts/run_setup_postgresql.sh test
- name: Install Poppler
- name: Install Pandoc & Poppler & TeX Live
run: |
sudo apt-get update -qy
sudo apt-get upgrade -qy
sudo apt-get install -qy poppler-utils
sudo apt-get install -qy poppler-utils \
texlive-full \
wget
wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
- name: Run pytest for writing better program
run: make pytest-ci
4 changes: 3 additions & 1 deletion .gitignore
@@ -13,18 +13,20 @@
/src/dcr/*/__pycache__/
/src/dcr/__pycache__/
/tests/__pycache__/
/tests/inbox/*.csv
/tests/inbox/*.doc
/tests/inbox/*.docx
/tests/inbox/*.epub
/tests/inbox/*.htm
/tests/inbox/*.html
/tests/inbox/*.jpeg
/tests/inbox/*.jpg
/tests/inbox/*.odt
/tests/inbox/*.pdf
/tests/inbox/*.png
/tests/inbox/*.rst
/tests/inbox/*.rtf
/tests/inbox/*.tiff
/tests/inbox/*.txt
/tests/inbox/*.xxx
/tests/inbox/htm_ok_files/
/tests/inbox/html_ok_files/
553 changes: 276 additions & 277 deletions Pipfile.lock

Large diffs are not rendered by default.

13 changes: 7 additions & 6 deletions README.md
@@ -1,21 +1,22 @@
# DCR Document Content Recognition
# DCR - Document Content Recognition - README

![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)

Based on the paper "Unfolding the Structure of a Document using Deep Learning" (**[Rahman and Finin, 2019](https://konnexionsgmbh.github.io/dcr/research/#rahman-m-finin-t-2019)**), this software project attempts to automatically recognize the structure in arbitrary PDF documents and thus make them more searchable in a more qualified manner.
Documents not in PDF format are converted to PDF format using **[Pandoc](https://pandoc.org)**.
Documents not in PDF format are converted to PDF format using **[Pandoc](https://pandoc.org)** and **[TeX Live](https://www.tug.org/texlive/)**.
Documents based on scanning which, therefore, do not contain text elements, are scanned and converted to PDF format using the **[Tesseract OCR](https://github.com/tesseract-ocr/tesseract)** software.
This process applies to all image format files e.g. jpeg, tiff etc., as well as scanned images in PDF format.

Please see the **[Documentation](https://konnexionsgmbh.github.io/dcr/)** for more detailed information.

## Features

- Identifying scanned image pdf documents using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/module.html).
- Converting scanned image pdf documents to a series of jpeg files using [pdf2image](https://pypi.org/project/pdf2image/).
- Identifying scanned image 'pdf' documents using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/module.html).
- Converting scanned image 'pdf' documents to a series of 'jpeg' files using [pdf2image](https://pypi.org/project/pdf2image/).
- Converting 'csv', 'docx', 'epub', 'html', 'odt', 'rst' or 'rtf' type documents to 'pdf' format using [Pandoc](https://pandoc.org) and [TeX Live](https://www.tug.org/texlive/).
- Much more!
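
The Pandoc and TeX Live conversion named in the feature list could look roughly like the following. This is a minimal sketch, not code from this commit: the helper names `build_pandoc_command` and `convert_to_pdf` and the constant `PANDOC_SUFFIXES` are hypothetical, and the sketch assumes that `pandoc` and a LaTeX engine from TeX Live are on the `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List

# File types the README lists as input for the Pandoc conversion.
PANDOC_SUFFIXES = {".csv", ".docx", ".epub", ".html", ".odt", ".rst", ".rtf"}


def build_pandoc_command(source: Path, target_dir: Path) -> List[str]:
    """Hypothetical helper: build the Pandoc call for one document."""
    if source.suffix.lower() not in PANDOC_SUFFIXES:
        raise ValueError(f"unsupported file type: {source.suffix}")
    target = target_dir / f"{source.stem}.pdf"
    # Pandoc derives the output format from the .pdf extension and hands
    # typesetting to a LaTeX engine, which is why TeX Live is needed.
    return ["pandoc", str(source), "-o", str(target)]


def convert_to_pdf(source: Path, target_dir: Path) -> None:
    """Hypothetical helper: run Pandoc and raise if the conversion fails."""
    subprocess.run(build_pandoc_command(source, target_dir), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes the mapping from input file to output file testable without Pandoc installed.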

## Support
@@ -25,7 +26,7 @@ If you need help with **DCR**, do not hesitate to get in contact with us!
- For questions and high-level discussions, use **[Discussions](https://github.com/KonnexionsGmbH/dcr/discussions)** on GitHub.
- To report a bug or make a feature request, open an **[Issue](https://github.com/KonnexionsGmbH/dcr/issues)** on GitHub.

Please note that we may only provide support for problems/questions regarding core features of **DCR**
Please note that we may only provide support for problems/questions regarding core features of **DCR**.
Any questions or bug reports about features of third-party themes, plugins, extensions or similar should be made to their respective projects.
But, such questions are *not* banned from the **[Discussions](https://github.com/KonnexionsGmbH/dcr/discussions)**.

2 changes: 1 addition & 1 deletion docs/code_of_conduct.md
@@ -3,7 +3,7 @@
![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)

----

2 changes: 1 addition & 1 deletion docs/contributing.md
@@ -3,7 +3,7 @@
![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)

----

16 changes: 9 additions & 7 deletions docs/development_notes.md
@@ -3,7 +3,7 @@
![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)

----

@@ -96,18 +96,19 @@ In this format, the API documentation can then be integrated into the user docum
rejected file directories depending on the result of the check.
Depending on the file format, the accepted documents are then
converted into the pdf file format either with the help of Pandoc
or with the help of Tesseract OCR.
and TeX Live or with the help of Tesseract OCR.

**Function Documentation**:

Load the command line arguments into memory.

The command line arguments define the process steps to be executed.
The valid arguments are:

all - Run the complete processing of all new documents.
db_c - Create the database.
db_u - Upgrade the database.
n_2_p - Convert non-pdf documents to pdf files.
p_i - Process the inbox directory.
p_2_i - Convert pdf documents to image files.

@@ -116,6 +117,7 @@ In this format, the API documentation can then be integrated into the user docum

1. p_i
2. p_2_i
3. n_2_p

Args:
argv (List[str]): Command line arguments.
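
The argument handling described in the function documentation above could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the name `resolve_steps` is hypothetical, and the sketch assumes that the numbered list (`p_i`, `p_2_i`, `n_2_p`) is the order in which `all` runs the steps.

```python
from typing import List

# Arguments listed as valid in the function documentation.
VALID_ARGS = ("all", "db_c", "db_u", "n_2_p", "p_i", "p_2_i")

# Assumption: the numbered list gives the step order used by `all`.
ALL_STEPS = ["p_i", "p_2_i", "n_2_p"]


def resolve_steps(argv: List[str]) -> List[str]:
    """Hypothetical helper: turn command line arguments into process steps."""
    requested = argv[1:]  # argv[0] is the program name
    if not requested:
        raise ValueError("at least one argument is required")
    unknown = [arg for arg in requested if arg not in VALID_ARGS]
    if unknown:
        raise ValueError(f"unknown argument(s): {', '.join(unknown)}")
    if "all" in requested:
        return list(ALL_STEPS)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(requested))
```

For example, `resolve_steps(["dcr.py", "all"])` expands to the three steps in the listed order, while an unrecognised argument raises `ValueError` before any step runs.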
@@ -210,8 +212,8 @@ When selecting the Docker image, care must be taken to select the appropriate ve

Alternatively, for an **`Ubuntu 20.04 LTS`** environment that is as unspoiled as possible, the following two scripts are available in the **`scripts`** file directory:

- **`scripts/0.6.0/run_install_4-vm_wsl2_1.sh`**
- **`scripts/0.6.0/run_install_4-vm_wsl2_2.sh`**
- **`scripts/0.6.5/run_install_4-vm_wsl2_1.sh`**
- **`scripts/0.6.5/run_install_4-vm_wsl2_2.sh`**

After a **`cd scripts`** command in a terminal window, the script **`run_install_4-vm_wsl2_1.sh`** must first be executed.
Administration rights (**`sudo`**) are required for this.
@@ -223,7 +225,7 @@ Afterwards, the second script **`run_install_4-vm_wsl2_2.sh`** must be executed
|-----------|--------------------------------------|
| ~~0.5.0~~ | ~~Inbox processing~~ |
| ~~0.6.0~~ | ~~pdf for Tesseract OCR processing~~ |
| 0.6.5 | Pandoc processing |
| ~~0.6.5~~ | ~~Pandoc processing~~ |
| 0.7.0 | Tesseract OCR processing |
| 0.8.0 | PDFlib TET processing |
| 0.9.0 | Parser |
@@ -232,7 +234,7 @@ Afterwards, the second script **`run_install_4-vm_wsl2_2.sh`** must be executed

**1<sup>st</sup> Priority:**

- convert the appropriate documents into the `pdf` format with Pandoc.
- ~~convert the appropriate documents into the `pdf` format with Pandoc and TeX Live~~
- test cases for file duplicate
- tools.py - verify the content of the inbox directories
- ~~API Documentation~~