Based on the paper "Unfolding the Structure of a Document using Deep Learning" (Rahman and Finin, 2019), this software project attempts to use various software techniques to automatically recognise the structure of any `pdf` document and thus make it more searchable.
DCR enables batch processing of documents with the DCR-CORE library. Details of the DCR-CORE library can be found [here](https://konnexionsgmbh.github.io/dcr-core/). The documents to be processed are expected in a defined file directory. The processing result is made available either in a JSON file or in a PostgreSQL database.
Please see the Documentation for more detailed information.
- Support for documents in different languages - English, French, German and Italian as standard.
- Identifying scanned image `pdf` documents using PyMuPDF.
- Converting scanned image `pdf` documents to a series of `jpeg` or `png` files using pdf2image and Poppler.
- Converting `bmp`, `gif`, `jp2`, `jpeg`, `png`, `pnm`, `tif`, `tiff` or `webp` type documents to `pdf` format using Tesseract OCR.
- Converting `csv`, `docx`, `epub`, `html`, `odt`, `rst` or `rtf` type documents to `pdf` format using Pandoc and TeX Live.
- Extracting text and metadata from `pdf` documents using PDFlib TET.
- Categorisation of the lines in the document, e.g. body, footer, header lines etc.
- Determination of the token structure sentence by sentence with the help of spaCy.
- Storage of the analysis result, optionally in a PostgreSQL database or in a JSON flat file.
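
The conversion steps listed above depend on the file type of the incoming document. As a rough illustration only, the following sketch maps a file name to the tool family that would handle it first. The function name `next_step`, the returned labels and the handling of unsupported types are assumptions made for this example and are not part of the DCR or DCR-CORE API; only the extension groups are taken from the list above.

```python
from pathlib import PurePath

# Extension groups as named in the feature list above.
IMAGE_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
TEXT_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def next_step(file_name: str) -> str:
    """Return a hypothetical label for the first processing step of a file.

    The labels are illustrative only and do not correspond to DCR internals.
    """
    # Normalise the extension: drop the leading dot and ignore case.
    suffix = PurePath(file_name).suffix.lower().lstrip(".")
    if suffix == "pdf":
        # pdf documents are inspected directly (scanned or text-based).
        return "pdf"
    if suffix in IMAGE_TYPES:
        # Image documents are converted to pdf via OCR.
        return "tesseract"
    if suffix in TEXT_TYPES:
        # Text-like documents are converted to pdf via Pandoc and TeX Live.
        return "pandoc"
    # Assumption: anything else is not processed.
    return "rejected"
```

For example, `next_step("scan.TIFF")` yields `"tesseract"` and `next_step("notes.odt")` yields `"pandoc"`. How DCR itself treats unsupported file types is described in the Documentation.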

Directory | Content |
---|---|
.github/workflows | GitHub Action workflows |
data | Inbox directories and database setup data |
docs | DCR documentation files |
resources | DBeaver configuration, Gammadyne utility and various external documentation |
scripts | Ubuntu and Windows scripts for running the application |
src | Python scripts and PDFlib TET files |
tests | Scripts and data for pytest |

File | Functionality |
---|---|
.gitignore | Configuration of files and folders to be ignored. |
.pylintrc | Configuration file for pylint. |
LICENSE | Text of the licence terms. |
logging_cfg.yaml | Configuration of the Logger functionality. |
Makefile | Definition of tasks to be executed with the make command. |
mkdocs.yml | Configuration file for MkDocs. |
Pipfile | Definition of the Python package requirements. |
Pipfile.lock | Definition of the specific versions of the Python packages. |
pyproject.toml | Configuration file for bandit, black, isort, mypy, pydoc-markdown, pydocstyle, and pytest. |
README.md | This file. |
run_dcr_dev | Running the DCR functionality for development purposes. |
run_dcr_prod | Running the DCR functionality for production operation. |
setup.cfg | Configuration file for coverage, DCR, flake8, and radon. |
setup.cfg.reference | Original setup configuration file. |

If you need help with DCR, do not hesitate to get in contact with us!
- For questions and high-level discussions, use Discussions on GitHub.
- To report a bug or make a feature request, open an Issue on GitHub.
Please note that we can only provide support for problems and questions regarding core features of DCR. Questions or bug reports about features of third-party themes, plugins, extensions or similar should be directed to their respective projects. Such questions may nevertheless be raised in the Discussions.
Make sure to stick around to answer some questions as well!
- Official Documentation
- Release Notes
- Discussions (Third-party themes, recipes, plugins and more)
The DCR project welcomes, and depends on, contributions from developers and users in the open source community. Please see the Contributing Guide for information on how you can help.
Everyone who interacts in the DCR project's codebase, issue trackers, and discussion forums is expected to follow the Code of Conduct.