Merge pull request #3 from KonnexionsGmbH/wwe_0.6.5

Version 0.6.5
KonnexionsGmbH · Mar 10, 2022 · 72b2965
2 parents be5309e + b012589
commit 72b2965

Showing 51 changed files with 1,728 additions and 1,258 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -8,6 +8,7 @@ on:
 
 env:
   GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+  VERSION_PANDOC: 2.17.1.1
 
 jobs:
   standards:
@@ -43,11 +44,15 @@ jobs:
         run: |
             chmod +x ./scripts/run_setup_postgresql.sh
             ./scripts/run_setup_postgresql.sh test
-      - name: Install Poppler
+      - name: Install Pandoc & Poppler & TeX Live
         run: |
             sudo apt-get update -qy
             sudo apt-get upgrade -qy
-            sudo apt-get install -qy poppler-utils
+            sudo apt-get install -qy poppler-utils \
+                                     texlive-full \
+                                     wget
+            wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
+            sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
       - name: Publish the code coverage to coveralls.io
         run: make coveralls
 
@@ -79,11 +84,15 @@ jobs:
         run: |
             chmod +x ./scripts/run_setup_postgresql.sh
             ./scripts/run_setup_postgresql.sh test
-      - name: Install Poppler
+      - name: Install Pandoc & Poppler & TeX Live
         run: |
             sudo apt-get update -qy
             sudo apt-get upgrade -qy
-            sudo apt-get install -qy poppler-utils
+            sudo apt-get install -qy poppler-utils \
+                                     texlive-full \
+                                     wget
+            wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
+            sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
       - name: Run pytest for writing better program
         run: make pytest
 
@@ -115,10 +124,14 @@ jobs:
         run: |
             chmod +x ./scripts/run_setup_postgresql.sh
             ./scripts/run_setup_postgresql.sh test
-      - name: Install Poppler
+      - name: Install Pandoc & Poppler & TeX Live
         run: |
             sudo apt-get update -qy
             sudo apt-get upgrade -qy
-            sudo apt-get install -qy poppler-utils
+            sudo apt-get install -qy poppler-utils \
+                                     texlive-full \
+                                     wget
+            wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
+            sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
       - name: Run pytest for writing better program
         run: make pytest-ci
diff --git a/.gitignore b/.gitignore
@@ -13,18 +13,20 @@
 /src/dcr/*/__pycache__/
 /src/dcr/__pycache__/
 /tests/__pycache__/
+/tests/inbox/*.csv
 /tests/inbox/*.doc
 /tests/inbox/*.docx
+/tests/inbox/*.epub
 /tests/inbox/*.htm
 /tests/inbox/*.html
 /tests/inbox/*.jpeg
 /tests/inbox/*.jpg
 /tests/inbox/*.odt
 /tests/inbox/*.pdf
 /tests/inbox/*.png
+/tests/inbox/*.rst
 /tests/inbox/*.rtf
 /tests/inbox/*.tiff
 /tests/inbox/*.txt
 /tests/inbox/*.xxx
-/tests/inbox/htm_ok_files/
 /tests/inbox/html_ok_files/
diff --git a/Pipfile.lock b/Pipfile.lock
diff --git a/README.md b/README.md
@@ -1,21 +1,22 @@
-# DCR Document Content Recognition
+# DCR - Document Content Recognition - README
 
 ![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
 ![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
 ![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)
 
 Based on the paper "Unfolding the Structure of a Document using Deep Learning" (**[Rahman and Finin, 2019](https://konnexionsgmbh.github.io/dcr/research/#rahman-m-finin-t-2019)**), this software project attempts to automatically recognize the structure in arbitrary PDF documents and thus make them more searchable in a more qualified manner.
-Documents not in PDF format are converted to PDF format using **[Pandoc](https://pandoc.org)**.
+Documents not in PDF format are converted to PDF format using **[Pandoc](https://pandoc.org)** and **[TeX Live](https://www.tug.org/texlive/)**.
 Documents based on scanning which, therefore, do not contain text elements, are scanned and converted to PDF format using the **[Tesseract OCR](https://github.com/tesseract-ocr/tesseract)** software.
 This process applies to all image format files e.g. jpeg, tiff etc., as well as scanned images in PDF format.
 
 Please see the **[Documentation](https://konnexionsgmbh.github.io/dcr/)** for more detailed information.
 
 ## Features
 
-- Identifying scanned image pdf documents using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/module.html).
-- Converting scanned image pdf documents to a series of jpeg files using [pdf2image](https://pypi.org/project/pdf2image/).
+- Identifying scanned image 'pdf' documents using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/module.html).
+- Converting scanned image 'pdf' documents to a series of 'jpeg' files using [pdf2image](https://pypi.org/project/pdf2image/).
+- Converting 'csv', 'docx', 'epub', 'html', 'odt', 'rst' or 'rtf' type documents to 'pdf' format using [Pandoc](https://pandoc.org) and [TeX Live](https://www.tug.org/texlive/).
 - Much more!
 
 ## Support
@@ -25,7 +26,7 @@ If you need help with **DCR**, do not hesitate to get in contact with us!
 - For questions and high-level discussions, use **[Discussions](https://github.com/KonnexionsGmbH/dcr/discussions)** on GitHub.
 - To report a bug or make a feature request, open an **[Issue](https://github.com/KonnexionsGmbH/dcr/issues)** on GitHub.
 
-Please note that we may only provide support for problems/questions regarding core features of **DCR** 
+Please note that we may only provide support for problems/questions regarding core features of **DCR**.
 Any questions or bug reports about features of third-party themes, plugins, extensions or similar should be made to their respective projects. 
 But, such questions are *not* banned from the **[Discussions](https://github.com/KonnexionsGmbH/dcr/discussions)**.
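
Note on the README change above: the new feature line describes converting 'csv', 'docx', 'epub', 'html', 'odt', 'rst' and 'rtf' documents to 'pdf' with Pandoc and TeX Live. The following is a minimal sketch of how such a conversion could be invoked from Python. The function names `build_pandoc_command` and `convert_to_pdf` are hypothetical, and the choice of `xelatex` as the TeX Live engine is an assumption; this diff does not show how DCR actually calls Pandoc.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# File types named in the README feature list of this commit.
PANDOC_SUFFIXES = {".csv", ".docx", ".epub", ".html", ".odt", ".rst", ".rtf"}


def build_pandoc_command(source: Path, target_dir: Path) -> List[str]:
    """Return the pandoc argument list that converts `source` to a pdf file.

    Hypothetical helper: the name and the engine choice are assumptions.
    """
    if source.suffix.lower() not in PANDOC_SUFFIXES:
        raise ValueError(f"unsupported file type: {source.suffix}")
    target = target_dir / (source.stem + ".pdf")
    # --pdf-engine selects the LaTeX engine, which TeX Live provides.
    return ["pandoc", str(source), "-o", str(target), "--pdf-engine=xelatex"]


def convert_to_pdf(source: Path, target_dir: Path) -> Path:
    """Run pandoc and return the path of the generated pdf file."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on the PATH")
    command = build_pandoc_command(source, target_dir)
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(command, check=True)
    return Path(command[3])
```

Keeping the command construction separate from the `subprocess.run` call makes the argument list testable on machines where Pandoc and TeX Live are not installed.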
 

diff --git a/docs/code_of_conduct.md b/docs/code_of_conduct.md
@@ -3,7 +3,7 @@
 ![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
 ![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
 ![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)
 
 ----
 

diff --git a/docs/contributing.md b/docs/contributing.md
@@ -3,7 +3,7 @@
 ![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
 ![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
 ![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)
 
 ----
 

diff --git a/docs/development_notes.md b/docs/development_notes.md
@@ -3,7 +3,7 @@
 ![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
 ![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
 ![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.0)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.6.5)
 
 ----
 
@@ -96,18 +96,19 @@ In this format, the API documentation can then be integrated into the user docum
     rejected file directories depending on the result of the check.
     Depending on the file format, the accepted documents are then
     converted into the pdf file format either with the help of Pandoc
-    or with the help of Tesseract OCR.
+    and TeX Live or with the help of Tesseract OCR.
 
 **Function  Documentation**:
 
-    Load the command line arguments into memory.
+    Load the command line arguments into memory.Pandoc and TeX Live
 
     The command line arguments define the process steps to be executed.
     The valid arguments are:
 
         all   - Run the complete processing of all new documents.
         db_c  - Create the database.
         db_u  - Upgrade the database.
+        n_2_p - Convert non-pdf documents to pdf files.
         p_i   - Process the inbox directory.
         p_2_i - Convert pdf documents to image files.
 
@@ -116,6 +117,7 @@ In this format, the API documentation can then be integrated into the user docum
 
         1. p_i
         2. p_2_i
+        3. n_2_p
 
     Args:
         argv (List[str]): Command line arguments.
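
Note on the docstring change above: a minimal sketch of how the documented arguments could be validated and how `all` could be expanded into individual steps. It assumes the numbered list (`p_i`, `p_2_i`, `n_2_p`) gives the order of the steps that `all` runs; the heading of that list is not visible in this hunk. The name `expand_args` is hypothetical and is not taken from the DCR source.

```python
from typing import List

# Assumed order of the steps behind 'all', taken from the numbered list.
ALL_STEPS = ["p_i", "p_2_i", "n_2_p"]
VALID_ARGS = {"all", "db_c", "db_u", "n_2_p", "p_i", "p_2_i"}


def expand_args(argv: List[str]) -> List[str]:
    """Validate the command line arguments and expand 'all'.

    Hypothetical helper: argv[0] is the program name, as in sys.argv.
    """
    requested = [arg.lower() for arg in argv[1:]]
    if not requested:
        raise ValueError("at least one argument is required")
    unknown = [arg for arg in requested if arg not in VALID_ARGS]
    if unknown:
        raise ValueError(f"unknown arguments: {unknown}")

    steps: List[str] = []
    for arg in requested:
        # 'all' stands for the complete list of processing steps.
        candidates = ALL_STEPS if arg == "all" else [arg]
        for step in candidates:
            if step not in steps:  # drop duplicates, keep first occurrence
                steps.append(step)
    return steps
```

For example, `expand_args(["dcr.py", "all"])` yields the three processing steps in the listed order, while an unknown argument raises a `ValueError` before any step runs.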
@@ -210,8 +212,8 @@ When selecting the Docker image, care must be taken to select the appropriate ve
 
 Alternatively, for a **`Ubuntu 20.04 LTS`** environment that is as unspoiled as possible, the following two scripts are available in the **`scripts`** file directory:
 
-- **`scripts/0.6.0/run_install_4-vm_wsl2_1.sh`**
-- **`scripts/0.6.0/run_install_4-vm_wsl2_2.sh`**
+- **`scripts/0.6.5/run_install_4-vm_wsl2_1.sh`**
+- **`scripts/0.6.5/run_install_4-vm_wsl2_2.sh`**
 
 After a **`cd scripts`** command in a terminal window, the script **`run_install_4-vm_wsl2_1.sh`** must first be executed. 
 Administration rights (**`sudo`**) are required for this. 
@@ -223,7 +225,7 @@ Afterwards, the second script **`run_install_4-vm_wsl2_2.sh`** must be executed
 |-----------|--------------------------------------|
 | ~~0.5.0~~ | ~~Inbox processing~~                 |
 | ~~0.6.0~~ | ~~pdf for Tesseract OCR processing~~ |
-| 0.6.5     | Pandoc processing                    |
+| ~~0.6.5~~ | ~~Pandoc processing~~                |
 | 0.7.0     | Tesseract OCR processing             |
 | 0.8.0     | PDFlib TET processing                |
 | 0.9.0     | Parser                               |
@@ -232,7 +234,7 @@ Afterwards, the second script **`run_install_4-vm_wsl2_2.sh`** must be executed
 
 **1<sup>st</sup> Priority:**
 
-- convert the appropriate documents into the `pdf` format with Pandoc.
+- ~~convert the appropriate documents into the `pdf` format with Pandoc and TeX Live~~
 - test cases for file duplicate
 - tools.py - verify the content of the inbox directories
 - ~~API Documentation~~