DCR - Document Content Recognition - README

Based on the paper "Unfolding the Structure of a Document using Deep Learning" (Rahman and Finin, 2019), this software project attempts to automatically recognise the structure of any pdf document using various software techniques, and thus make the document more searchable.

DCR enables batch processing of documents with the DCR-CORE library. Details of the DCR-CORE library can be found [here](https://konnexionsgmbh.github.io/dcr-core/). The documents to be processed are expected in a defined file directory. The processing result is made available either in a JSON file or in a PostgreSQL database.

Please see the Documentation for more detailed information.

1. Features

1.1 General

Support for documents in different languages - English, French, German and Italian as standard.

1.2 Preprocessor

Identifying scanned image pdf documents using PyMuPDF.
Converting scanned image pdf documents to a series of jpeg or png files using pdf2image and Poppler.
Converting bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp type documents to pdf format using Tesseract OCR.
Converting csv, docx, epub, html, odt, rst or rtf type documents to pdf format using Pandoc and TeX Live.
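
The preprocessing steps listed above amount to a dispatch on file type. The following is a minimal sketch of that idea, not code from DCR or DCR-CORE: the function name `choose_preprocessing_step`, the `is_scanned_pdf` flag and the returned labels are hypothetical, and only the extension lists are taken from this README. It uses the standard library only and does not call any of the tools named above.

```python
from pathlib import Path

# Extension groups copied from the feature list in this README.
# The labels returned below are hypothetical, not DCR identifiers.
IMAGE_EXTENSIONS = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_EXTENSIONS = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def choose_preprocessing_step(file_name: str, is_scanned_pdf: bool = False) -> str:
    """Return a label for the preprocessing step a file would need."""
    extension = Path(file_name).suffix.lower().lstrip(".")
    if extension == "pdf":
        # Scanned pdf documents are first split into image files
        # (pdf2image and Poppler); text-based pdf documents need no conversion.
        return "pdf2image" if is_scanned_pdf else "none"
    if extension in IMAGE_EXTENSIONS:
        # Image documents are converted to pdf with Tesseract OCR.
        return "tesseract"
    if extension in PANDOC_EXTENSIONS:
        # Text-like documents are converted to pdf with Pandoc and TeX Live.
        return "pandoc"
    return "unsupported"
```

In this sketch the caller decides whether a pdf document is scanned; in DCR that decision is made with PyMuPDF, which is left out here to keep the example free of dependencies.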

1.3 Natural Language Processing (NLP)

Extracting text and metadata from pdf documents using PDFlib TET.
Categorisation of the lines in the document, e.g. body, footer, header lines etc.
Determination of the token structure sentence by sentence with the help of spaCy.
Optional storage of the analysis result in a PostgreSQL database or in a JSON flat file.
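
To illustrate what line categorisation can look like, here is a deliberately simple sketch. It is an assumption for illustration only and does not reproduce the method used by DCR: a line counts as a header (or footer) when the same text is the first (or last) line on more than one page, and everything else is treated as body text. The function name `categorise_lines` and the output keys `text` and `type` are hypothetical.

```python
from __future__ import annotations

import json
from collections import Counter


def categorise_lines(pages: list[list[str]]) -> list[list[dict]]:
    """Label each line as 'header', 'footer' or 'body' (toy heuristic)."""
    # Count how often each text appears as the first or last line of a page.
    first_lines = Counter(page[0] for page in pages if page)
    last_lines = Counter(page[-1] for page in pages if page)

    result = []
    for page in pages:
        labelled = []
        for index, text in enumerate(page):
            if index == 0 and first_lines[text] > 1:
                line_type = "header"
            elif index == len(page) - 1 and last_lines[text] > 1:
                line_type = "footer"
            else:
                line_type = "body"
            labelled.append({"text": text, "type": line_type})
        result.append(labelled)
    return result


if __name__ == "__main__":
    sample = [
        ["Annual Report", "Introduction text", "Confidential"],
        ["Annual Report", "More text", "Confidential"],
    ]
    # The labelled lines serialise directly to JSON.
    print(json.dumps(categorise_lines(sample), indent=2))
```

This heuristic misses footers whose text changes from page to page, such as page numbers, so it should be read as a starting point rather than a working classifier.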

2. Directory and File Structure of this Repository

2.1 Directories

Directory	Content
.github/workflows	GitHub Action workflows
data	Inbox directories and database setup data
docs	DCR documentation files
resources	DBeaver configuration, Gammadyne utility and various external documentation
scripts	Ubuntu and Windows scripts for running the application
src	Python scripts and PDFlib TET files
tests	Scripts and data for pytest

2.2 Files

File	Functionality
.gitignore	Configuration of files and folders to be ignored.
.pylintrc	Configuration file for pylint.
LICENSE	Text of the licence terms.
logging_cfg.yaml	Configuration of the Logger functionality.
Makefile	Definition of tasks to be executed with the `make` command.
mkdocs.yml	Configuration file for MkDocs.
Pipfile	Definition of the Python package requirements.
Pipfile.lock	Definition of the specific versions of the Python packages.
pyproject.toml	Configuration file for bandit, black, isort, mypy, pydoc-markdown, pydocstyle, and pytest.
README.md	This file.
run_dcr_dev	Running the DCR functionality for development purposes.
run_dcr_prod	Running the DCR functionality for production operation.
setup.cfg	Configuration file for coverage, DCR, flake8, and radon.
setup.cfg.reference	Original setup configuration file.

3. Support

If you need help with DCR, do not hesitate to get in contact with us!

For questions and high-level discussions, use Discussions on GitHub.
To report a bug or make a feature request, open an Issue on GitHub.

Please note that we can only provide support for problems and questions regarding core features of DCR. Any questions or bug reports about features of third-party themes, plugins, extensions or similar should be directed to their respective projects. However, such questions are not banned from the Discussions.

Make sure to stick around to answer some questions as well!

4. Links

Official Documentation
Release Notes
Discussions (Third-party themes, recipes, plugins and more)

5. Contributing to DCR

The DCR project welcomes, and depends on, contributions from developers and users in the open source community. Please see the Contributing Guide for information on how you can help.

6. Code of Conduct

Everyone who interacts in the DCR project's codebase, issue trackers, and discussion forums is expected to follow the Code of Conduct.

7. License

Konnexions Public License (KX-PL)
