This repository has been archived by the owner on May 7, 2024. It is now read-only.

Merge pull request #31 from KonnexionsGmbH/wwe_0.9.7
Version 0.9.7
walter-weinmann authored Sep 8, 2022
2 parents 8f7819a + 686df69 commit b96c8e6
Showing 95 changed files with 96 additions and 512 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/standards.yml
@@ -94,14 +94,14 @@ jobs:
echo "Current version of OpenSSL: $(openssl version -a)"
- name: Install Step 3 - Pandoc
run: |
-wget --quiet https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
+wget --quiet --no-check-certificate https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
echo "::echo::on"
echo "Current version of Pandoc: $(pandoc -v)"
echo "Current version of TeX Live: $(pdflatex --version)"
- name: Install Step 4 - Poppler
run: |
-wget --quiet https://poppler.freedesktop.org/poppler-${VERSION_POPPLER}.tar.xz --quiet
+wget --quiet --no-check-certificate https://poppler.freedesktop.org/poppler-${VERSION_POPPLER}.tar.xz
sudo tar -xf poppler-${VERSION_POPPLER}.tar.xz
cd poppler-${VERSION_POPPLER}/
sudo mkdir build
@@ -118,6 +118,7 @@ jobs:
sudo apt-get update -qy
sudo apt-get install -qy tesseract-ocr
sudo apt-get install -qy tesseract-ocr-eng
+echo "::echo::on"
echo "Current version of Tesseract OCR: $(tesseract --version)"
- name: Publish the code coverage to coveralls.io
run: |
10 changes: 5 additions & 5 deletions .github/workflows/test_development.yml
@@ -20,7 +20,7 @@ jobs:
strategy:
max-parallel: 1
matrix:
-os: ["ubuntu-20.04", "ubuntu-22.04"]
+os: ["ubuntu-22.04"]
python-version: ["3.10"]
steps:
- uses: actions/checkout@v3
@@ -71,7 +71,7 @@ jobs:
wget
- name: Install Step 2 - OpenSSL
run: |
-wget --no-check-certificate -nv https://github.com/openssl/openssl/archive/OpenSSL_${VERSION_OPENSSL}.tar.gz
+wget --quiet --no-check-certificate -nv https://github.com/openssl/openssl/archive/OpenSSL_${VERSION_OPENSSL}.tar.gz
sudo tar -xf OpenSSL_${VERSION_OPENSSL}.tar.gz
sudo rm -rf openssl
sudo mv openssl-OpenSSL_${VERSION_OPENSSL} openssl
@@ -87,15 +87,15 @@ jobs:
echo "Current version of OpenSSL: $(openssl version -a)"
- name: Install Step 3 - Pandoc
run: |
-wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
+wget --quiet --no-check-certificate https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
echo "::echo::on"
echo "Current version of Pandoc: $(pandoc -v)"
echo "Current version of TeX Live: $(pdflatex --version)"
- name: Install Step 4 - Poppler
run: |
-wget https://poppler.freedesktop.org/poppler-${VERSION_POPPLER}.tar.xz
-sudo tar -xvf poppler-${VERSION_POPPLER}.tar.xz
+wget --quiet --no-check-certificate https://poppler.freedesktop.org/poppler-${VERSION_POPPLER}.tar.xz
+sudo tar -xf poppler-${VERSION_POPPLER}.tar.xz
cd poppler-${VERSION_POPPLER}/
sudo mkdir build
cd build
10 changes: 5 additions & 5 deletions .github/workflows/test_production.yml
@@ -20,7 +20,7 @@ jobs:
strategy:
max-parallel: 1
matrix:
-os: ["ubuntu-20.04", "ubuntu-22.04"]
+os: ["ubuntu-22.04"]
python-version: ["3.10"]
steps:
- uses: actions/checkout@v3
@@ -71,7 +71,7 @@ jobs:
wget
- name: Install Step 2 - OpenSSL
run: |
-wget --no-check-certificate -nv https://github.com/openssl/openssl/archive/OpenSSL_${VERSION_OPENSSL}.tar.gz
+wget --quiet --no-check-certificate -nv https://github.com/openssl/openssl/archive/OpenSSL_${VERSION_OPENSSL}.tar.gz
sudo tar -xf OpenSSL_${VERSION_OPENSSL}.tar.gz
sudo rm -rf openssl
sudo mv openssl-OpenSSL_${VERSION_OPENSSL} openssl
@@ -87,15 +87,15 @@ jobs:
echo "Current version of OpenSSL: $(openssl version -a)"
- name: Install Step 3 - Pandoc
run: |
-wget https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
+wget --quiet --no-check-certificate https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb
sudo dpkg -i pandoc-${VERSION_PANDOC}-1-amd64.deb
echo "::echo::on"
echo "Current version of Pandoc: $(pandoc -v)"
echo "Current version of TeX Live: $(pdflatex --version)"
- name: Install Step 4 - Poppler
run: |
-wget https://poppler.freedesktop.org/poppler-${VERSION_POPPLER}.tar.xz
-sudo tar -xvf poppler-${VERSION_POPPLER}.tar.xz
+wget --quiet --no-check-certificate https://poppler.freedesktop.org/poppler-${VERSION_POPPLER}.tar.xz
+sudo tar -xf poppler-${VERSION_POPPLER}.tar.xz
cd poppler-${VERSION_POPPLER}/
sudo mkdir build
cd build
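Review note on the download steps touched in the three workflow files: each step expands a version variable into a fixed URL pattern and now calls `wget` with `--quiet --no-check-certificate`. The sketch below only illustrates that expansion as a dry run and downloads nothing. The function names `pandoc_url` and `poppler_url` and the `0.0.0` default versions are hypothetical placeholders, not values taken from this commit. Note that `--no-check-certificate` switches off TLS certificate verification in `wget`.

```shell
#!/bin/sh
# Sketch only: rebuilds the download URLs used by the workflow steps.
# The 0.0.0 defaults are placeholders (assumption), not the project's pinned versions.
set -eu

VERSION_PANDOC="${VERSION_PANDOC:-0.0.0}"
VERSION_POPPLER="${VERSION_POPPLER:-0.0.0}"

# Hypothetical helper: URL of the Pandoc .deb package for VERSION_PANDOC.
pandoc_url() {
  echo "https://github.com/jgm/pandoc/releases/download/${VERSION_PANDOC}/pandoc-${VERSION_PANDOC}-1-amd64.deb"
}

# Hypothetical helper: URL of the Poppler source archive for VERSION_POPPLER.
poppler_url() {
  echo "https://poppler.freedesktop.org/poppler-${VERSION_POPPLER}.tar.xz"
}

# Dry run: print the commands the workflow would execute instead of running them.
echo "wget --quiet --no-check-certificate $(pandoc_url)"
echo "wget --quiet --no-check-certificate $(poppler_url)"
```

Printing the commands first makes it easy to confirm that a version bump produces the intended URLs before a workflow run depends on them.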
13 changes: 6 additions & 7 deletions README.md
@@ -3,16 +3,14 @@
![Coveralls GitHub](https://img.shields.io/coveralls/github/KonnexionsGmbH/dcr.svg)
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)
-![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.9.6)
+![GitHub commits since latest release](https://img.shields.io/github/commits-since/KonnexionsGmbH/dcr/0.9.7)

Based on the paper "Unfolding the Structure of a Document using Deep Learning" (**[Rahman and Finin, 2019](https://arxiv.org/abs/1910.03678)**), this software project attempts to use various software techniques to automatically recognise the structure in any **`pdf`** documents and thus make them more searchable.

-The computer linguistic methods used here assume that the documents to be processed are in **`pdf`** format.
-However, in order to be flexible in the selection of documents with regard to the file format, **DCR** contains a sophisticated preprocessor that can convert many of the non **`pdf`** formats into the **`pdf`** format.
-
-From the documents in **`pdf`** format, the next steps are to extract the text with relevant metadata word by word, line by line or page by page. In the case of line-by-line extraction, the identified headers and footers are marked accordingly so that they can be neglected later in the token creation process.
-
-In what is currently the last step, qualified tokens can be created, which on the one hand contain information about the localisation of the token in the document and on the other hand token classification features such as lemma, shape, normalisation, etc.
+**DCR** enables batch processing of documents with the **DCR-CORE** library.
+Details of the **DCR-CORE** library can be found [here](https://konnexionsgmbh.github.io/dcr-core/).
+The documents to be processed are expected in a defined file directory.
+The processing result is made available either in a JSON file or in a PostgreSQL database.

Please see the **[Documentation](https://konnexionsgmbh.github.io/dcr)** for more detailed information.

@@ -67,6 +65,7 @@ Please see the **[Documentation](https://konnexionsgmbh.github.io/dcr)** for mor
| run_dcr_dev | Running the **DCR** functionality for development purposes. |
| run_dcr_prod | Running the **DCR** functionality for productive operation. |
| setup.cfg | Configuration file for [coverage](https://github.com/nedbat/coveragepy/blob/6.3.2/doc/index.rst), **DCR**, [flake8](https://github.com/pycqa/flake8), and [radon](https://github.com/rubik/radon). |
+| setup.cfg.reference | Original setup configuration file. |

## 3. Support

2 changes: 1 addition & 1 deletion docs/developing_continouos_delivery.md
@@ -11,7 +11,7 @@ The GitHub Actions are used to enforce the following good practices of the softw
- creation of up-to-date user documentation.

The action **`standards`** in the GitHub Actions guarantees compliance with the required standards, the action **`test_production`** ensures error-free compilation for production use and the action **`test_development`** runs the tests against various operating system and **`Python`** versions.
-The actions **`test_development`** and **`test_production`** must be able to run error-free on operating systems **`Ubuntu 20.04`** and **`Ubuntu 22.04`** and with **`Python`** version **`3.10`**, the action **`standards`** is only required error-free for the latest versions of **`Ubuntu`** and **`Python`**.
+The actions **`test_development`** and **`test_production`** must be able to run error-free on operating system **`Ubuntu 22.04`** and with **`Python`** version **`3.10`**, the action **`standards`** is only required error-free for the latest versions of **`Ubuntu`** and **`Python`**.

The individual steps to be carried out

8 changes: 4 additions & 4 deletions docs/developing_development_environment.md
@@ -3,7 +3,7 @@
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)

-To set up a suitable development environment under **`Ubuntu 20.04 LTS`**, on the one hand a suitable ready-made Docker image is provided and on the other hand two scripts to create the development system in a standalone system, a virtual environment or the **`Windows Subsystem for Linux (WSL2)`** are available.
+To set up a suitable development environment under **`Ubuntu 22.04 LTS`**, on the one hand a suitable ready-made Docker image is provided and on the other hand two scripts to create the development system in a standalone system, a virtual environment or the **`Windows Subsystem for Linux (WSL2)`** are available.

### 1. Docker Image

@@ -15,10 +15,10 @@ When selecting the Docker image, care must be taken to select the appropriate ve

### 2. Script-based Solution

-Alternatively, for a **`Ubuntu 20.04 LTS`** environment that is as unspoiled as possible, the following two scripts are available in the **`scripts`** file directory:
+Alternatively, for a **`Ubuntu 22.04 LTS`** environment that is as unspoiled as possible, the following two scripts are available in the **`scripts`** file directory:

-- **`scripts/0.9.6/run_install_4-vm_wsl2_1.sh`**
-- **`scripts/0.9.6/run_install_4-vm_wsl2_2.sh`**
+- **`scripts/0.9.7/run_install_4-vm_wsl2_1.sh`**
+- **`scripts/0.9.7/run_install_4-vm_wsl2_2.sh`**

After a **`cd scripts`** command in a terminal window, the script **`run_install_4-vm_wsl2_1.sh`** must first be executed.
Administration rights (**`sudo`**) are required for this.
75 changes: 0 additions & 75 deletions docs/developing_research_notes.md

This file was deleted.

4 changes: 2 additions & 2 deletions docs/developing_system_environment.md
@@ -3,10 +3,10 @@
![GitHub (Pre-)Release](https://img.shields.io/github/v/release/KonnexionsGmbH/dcr?include_prereleases)
![GitHub (Pre-)Release Date](https://img.shields.io/github/release-date-pre/KonnexionsGmbh/dcr)

-**DCR** is developed on the operating systems **`Ubuntu 20.04 LTS`** and **`Microsoft Windows 10`**.
+**DCR** is developed on the operating systems **`Ubuntu 22.04 LTS`** and **`Microsoft Windows 10`**.
Ubuntu is used here via the **`VM Workstation Player 16`**.
**`Ubuntu`** can also be used in conjunction with the **`Windows Subsystem for Linux (WSL2)`**.

-The GitHub actions for continuous integration run on **`Ubuntu 20.04`** and **`Ubuntu 22.04`**.
+The GitHub actions for continuous integration run on **`Ubuntu 22.04`**.

Version **`3.10`** is used for the **`Python`** programming language.
3 changes: 2 additions & 1 deletion docs/developing_version_planning.md
@@ -9,12 +9,13 @@

| Version | Feature(s) |
|---------|------------|
-| 0.9.7 | TBD |
+| 0.9.8 | TBD |

### 1.2 Already implemented

| Version | Feature(s) |
|---------|----------------------------------------|
+| 0.9.7 | Documentation and test improvements |
| 0.9.6 | Extracting an API |
| 0.9.3 | Extending NLP capabilities |
| 0.9.2 | Refactoring database and code |