Initial commit of code
Signed-off-by: VEIY82L <[email protected]>
DaBeIDS committed Jan 29, 2024
1 parent acb89a6 commit 45b26f3
Showing 29 changed files with 913 additions and 225 deletions.
31 changes: 21 additions & 10 deletions README.rst
@@ -9,15 +9,15 @@

.. image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
:alt: Source code on GitHub
:target: https://github.com/ModeSevenIndustrialSolutions/osc-transformer-presteps
:target: https://github.com/ModeSevenIndustrialSolutions/osc-data-extractor

.. image:: https://img.shields.io/pypi/v/osc-transformer-presteps.svg
.. image:: https://img.shields.io/pypi/v/osc-data-extractor.svg
:alt: PyPI package
:target: https://pypi.org/project/osc-transformer-presteps/
:target: https://pypi.org/project/osc-data-extractor/

.. image:: https://api.cirrus-ci.com/github/os-climate/osc-transformer-presteps.svg?branch=main
.. image:: https://api.cirrus-ci.com/github/os-climate/osc-data-extractor.svg?branch=main
:alt: Built Status
:target: https://cirrus-ci.com/github/os-climate/osc-transformer-presteps
:target: https://cirrus-ci.com/github/os-climate/osc-data-extractor

.. image:: https://img.shields.io/badge/PDM-Project-purple
:alt: Built using PDM
@@ -28,10 +28,9 @@
:target: https://pyscaffold.org/



========================
osc-transformer-presteps
========================
==================
osc-data-extractor
==================

OS-Climate Data Extraction Tool

@@ -40,4 +39,16 @@ OS-Climate Data Extraction Tool
Notes
=====

Placeholder notes content
To add new dependencies, use pdm. First install it via pip, then add the package:

.. code-block:: shell

    pip install pdm
    pdm add <new_package>==<version>

To run the linting tools, do the following:

.. code-block:: shell

    pip install tox
    tox -e lint
19 changes: 7 additions & 12 deletions docs/conf.py
@@ -8,8 +8,8 @@
# serve to show the default.

import os
import sys
import shutil
import sys

# -- Path setup --------------------------------------------------------------

@@ -34,7 +34,7 @@
from sphinx import apidoc

output_dir = os.path.join(__location__, "api")
module_dir = os.path.join(__location__, "../src/osc_transformer_presteps")
module_dir = os.path.join(__location__, "../src/osc_data_extractor")
try:
    shutil.rmtree(output_dir)
except FileNotFoundError:
@@ -87,7 +87,7 @@
master_doc = "index"

# General information about the project.
project = "osc-transformer-presteps"
project = "osc-data-extractor"
copyright = "2023, Matthew Watkins"

# The version info for the project you're documenting, acts as replacement for
@@ -99,7 +99,7 @@
# If you don’t need the separation provided between version and release,
# just set them both to the same value.
try:
    from osc_transformer_presteps import __version__ as version
    from osc_data_extractor import __version__ as version
except ImportError:
    version = ""

@@ -158,10 +158,7 @@
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
html_theme_options = {
"sidebar_width": "300px",
"page_width": "1200px"
}
html_theme_options = {"sidebar_width": "300px", "page_width": "1200px"}

# Add any paths that contain custom themes here, relative to this directory.
# html_theme_path = []
@@ -229,7 +226,7 @@
# html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = "osc-transformer-presteps-doc"
htmlhelp_basename = "osc-data-extractor-doc"


# -- Options for LaTeX output ------------------------------------------------
@@ -245,9 +242,7 @@

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass [howto/manual]).
latex_documents = [
("index", "user_guide.tex", "osc-transformer-presteps Documentation", "Matthew Watkins", "manual")
]
latex_documents = [("index", "user_guide.tex", "osc-data-extractor Documentation", "Matthew Watkins", "manual")]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
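
Reviewer note on the `docs/conf.py` hunks: the `try`/`except ImportError` around the `__version__` import lets the docs build even when the renamed package is not installed. A minimal sketch of the same fallback as a reusable helper follows. The helper name `read_version` is hypothetical and not part of this commit; it uses only the standard-library `importlib`.

```python
import importlib


def read_version(package_name: str, default: str = "") -> str:
    """Return ``package.__version__``, or ``default`` if it is unavailable.

    Hypothetical helper for illustration; the commit itself uses an inline
    try/except in docs/conf.py.
    """
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        # Package not installed (ModuleNotFoundError is a subclass of ImportError).
        return default
    # Some packages do not define __version__; fall back in that case as well.
    return getattr(module, "__version__", default)
```

In `conf.py` this would be called as `version = read_version("osc_data_extractor")`, assuming the package exposes `__version__` as the import in the diff implies.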
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -1,8 +1,8 @@
==================
osc-transformer-presteps
osc-data-extractor
==================

This is the documentation of **osc-transformer-presteps**.
This is the documentation of **osc-data-extractor**.

.. note::

285 changes: 284 additions & 1 deletion pdm.lock

Large diffs are not rendered by default.

51 changes: 24 additions & 27 deletions pyproject.toml
@@ -25,14 +25,20 @@ classifiers = [
"Topic :: Scientific/Engineering",
"Topic :: Software Development",
]
dependencies = [
"pdfminer>=20191125",
"pydantic-settings>=2.1.0",
"pytest>=7.4.4",
"pypdf>=4.0.0",
]

[project.urls]
Homepage = "https://github.com/os-climate/osc-transformer-presteps"
Repository = "https://github.com/os-climate/osc-transformer-presteps"
Downloads = "https://github.com/os-climate/osc-transformer-presteps/releases"
"Bug Tracker" = "https://github.com/os-climate/osc-transformer-presteps/issues"
Documentation = "https://github.com/os-climate/osc-transformer-presteps/tree/main/docs"
"Source Code" = "https://github.com/os-climate/osc-transformer-presteps"
Homepage = "https://github.com/os-climate/osc-data-extraction"
Repository = "https://github.com/os-climate/osc-data-extraction"
Downloads = "https://github.com/os-climate/osc-data-extraction/releases"
"Bug Tracker" = "https://github.com/os-climate/osc-data-extraction/issues"
Documentation = "https://github.com/os-climate/osc-data-extraction/tree/main/docs"
"Source Code" = "https://github.com/os-climate/osc-data-extraction"

[build-system]
requires = ["pdm-backend"]
@@ -42,35 +48,19 @@ build-backend = "pdm.backend"
license-files = ["LICENSES.txt"]

[project.scripts]
osc-transformer-presteps = "osc_transformer_presteps.skeleton:run"
osc-transformer-presteps = "osc_data_extractor.skeleton:run"

[project.optional-dependencies]
dev = [
dev = [
"pylint",
"toml",
"yapf",
"pdm",
"tox",
"tox-pdm"
"pdm"
]
test = [
"pytest",
"pytest-cov",
]
tox = [
"tox",
"tox-pdm>=0.5",
]
docs = [
"sphinx>=7.2.6",
"sphinx-copybutton>=0.5.2"
# "pytest-cov",
]
lint = [
"pre-commit",
"pyproject-flake8"
]

[tool.setuptools_scm]

[tool.pdm.scripts]
pre_release = "scripts/dev-versioning.sh"
@@ -81,11 +71,18 @@ docs = { shell = "cd docs && mkdocs serve", help = "Start the dev server for doc
lint = "pre-commit run --all-files"
complete = { call = "tasks.complete:main", help = "Create autocomplete files for bash and fish" }

[tool.pdm.dev-dependencies]
test = ["pdm[pytest]", "pytest", "pytest-cov"]
tox = ["tox", "tox-pdm>=0.5"]
docs = ["sphinx>=7.2.6", "sphinx-copybutton>=0.5.2"]
dev = ["tox>=4.11.3", "tox-pdm>=0.7.0"]
lint = ["pre-commit", "pyproject-flake8"]

[tool.pytest.ini_options]
testpaths = [
"tests/",
]
addopts = "--cov --cov-report html --cov-report term-missing --cov-fail-under 95"
# addopts = "--cov --cov-report html --cov-report term-missing --cov-fail-under 95"

[tool.coverage.run]
source = ["src"]
Empty file added src/__init__.py
Empty file.
109 changes: 109 additions & 0 deletions src/demo/extraction_demo.py
@@ -0,0 +1,109 @@
import logging
from pathlib import Path
from typing import Dict, Union, Optional
from pprint import pformat
import traceback

from src.osc_transformer_presteps.content_extraction.extraction_factory import get_extractor

__author__ = "David Besslich"
__copyright__ = "David Besslich"
__license__ = "MIT"

_logger = logging.getLogger(__name__)

log_levels = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
    "NOTSET": logging.NOTSET,
}


def specify_root_logger(log_level: str):
    """
    Configures the root logger with a specific formatting and log level.

    This function sets up the root logger, which is the top-level logger in the logging hierarchy, with a specific
    configuration. It creates a StreamHandler (which writes to stderr by default), sets both the handler and the
    root logger to the requested log level, and applies a specific formatter to format the log messages.

    Args:
        log_level (str): Name of the log level to use, case-insensitive (a key of ``log_levels``).

    Usage:
        Call this function at the beginning of your code to configure the root logger
        with the desired formatting and log level.
    """
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    log_level = log_levels[log_level.upper()]

    handler = logging.StreamHandler()
    handler.setLevel(log_level)
    handler.setFormatter(formatter)

    logging.root.handlers = [handler]
    logging.root.setLevel(log_level)


def extract_main(
    input_file_path: Path,
    output_file_path: Optional[Path] = None,
    settings: Optional[Dict[str, Union[str, bool]]] = None,
) -> Dict[int, Dict[str, str]]:
    """
    Extract information from an input file using a specified extractor and, if the "store_to_file" setting is
    enabled, save the extraction results to a file.

    Args:
        input_file_path (Path): The path of the input file.
        output_file_path (Optional[Path]): The path of the output file to save the extraction results.
        settings (Optional[Dict[str, Union[str, bool]]]): A dictionary containing the settings for the extractor.

    Returns:
        Dict[int, Dict[str, str]]: The extractions held by the extractor.

    Example:
        input_file_path = Path("input.pdf")
        output_file_path = Path("output.json")
        settings = {
            "option1": True,
            "option2": "value2",
            "store_to_file": True
        }
        output = extract_main(input_file_path, output_file_path, settings)
    """
    extractor = get_extractor(input_file_path.suffix, settings)
    if not extractor.check_for_skip_files(input_file_path, output_file_path):
        extractor.extract(input_file_path=input_file_path)

        # Store results only when an extraction actually ran.
        if extractor.get_settings()["store_to_file"]:
            extractor.save_extraction_to_file(output_file_path=output_file_path)

    return extractor.get_extractions()


if __name__ == "__main__":
    specify_root_logger("info")
    try:
        input_folder = Path(__file__).resolve().parent / "input"
        output_folder = Path(__file__).resolve().parent / "output"

        file_name = "test.pdf"

        input_file_path_main = input_folder / file_name
        output_file_path_main = output_folder / input_file_path_main.with_suffix(".json").name
        settings_main = {"skip_extracted_files": False, "store_to_file": False}

        _logger.info(f"Input file path is :\n {input_file_path_main}.")
        extraction_dict = extract_main(
            input_file_path=input_file_path_main, output_file_path=output_file_path_main, settings=settings_main
        )
        _logger.debug(pformat(extraction_dict))
    except Exception as e:
        _logger.error("---ERROR---" * 10)
        _logger.error(repr(e))
        _logger.error(traceback.format_exc())
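
Reviewer note on `src/demo/extraction_demo.py`: `get_extractor` and the extractor class are not among the rendered hunks, so the control flow of `extract_main` can only be illustrated against a stand-in. The block below is a minimal sketch, assuming a hypothetical `StubExtractor` that exposes the five methods the demo calls (`check_for_skip_files`, `extract`, `get_settings`, `get_extractions`, `save_extraction_to_file`). The stub's behaviour, its default settings, and the nesting of the save step under the skip check are assumptions made for illustration, not the project's implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


class StubExtractor:
    """Hypothetical stand-in for the object returned by get_extractor()."""

    def __init__(self, settings: Optional[dict] = None):
        # Default values are assumptions; the real defaults are not shown in the diff.
        self._settings = {"skip_extracted_files": False, "store_to_file": True}
        self._settings.update(settings or {})
        self._extractions: Dict[int, Dict[str, str]] = {}

    def get_settings(self) -> dict:
        return self._settings

    def check_for_skip_files(self, input_file_path: Path, output_file_path: Optional[Path]) -> bool:
        # Skip only when skipping is enabled and an output file already exists.
        return bool(
            self._settings["skip_extracted_files"]
            and output_file_path is not None
            and output_file_path.exists()
        )

    def extract(self, input_file_path: Path) -> None:
        # Toy extraction: treat every line of a text file as one "page".
        lines = input_file_path.read_text().splitlines()
        self._extractions = {i: {"text": line} for i, line in enumerate(lines)}

    def get_extractions(self) -> Dict[int, Dict[str, str]]:
        return self._extractions

    def save_extraction_to_file(self, output_file_path: Path) -> None:
        output_file_path.write_text(json.dumps(self._extractions))


def extract_with_stub(
    input_file_path: Path,
    output_file_path: Optional[Path] = None,
    settings: Optional[dict] = None,
) -> Dict[int, Dict[str, str]]:
    """Mirror the call sequence of extract_main, using the stub extractor."""
    extractor = StubExtractor(settings)
    if not extractor.check_for_skip_files(input_file_path, output_file_path):
        extractor.extract(input_file_path=input_file_path)
        if extractor.get_settings()["store_to_file"]:
            extractor.save_extraction_to_file(output_file_path=output_file_path)
    return extractor.get_extractions()


def run_demo() -> Dict[int, Dict[str, str]]:
    """Run the sketch on a temporary two-line text file."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "input.txt"
        source.write_text("first page\nsecond page")
        target = source.with_suffix(".json")
        result = extract_with_stub(source, target, {"store_to_file": True})
        assert target.exists()  # the stub wrote its JSON output
        return result
```

Calling `run_demo()` exercises the extract-then-store path. Passing `{"skip_extracted_files": True}` with an existing output file takes the skip path instead and returns an empty dictionary from the stub.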
Binary file added src/demo/input/shell_2019.pdf
Binary file not shown.
Binary file added src/demo/input/test.pdf
Binary file not shown.
Binary file added src/demo/input/test_error.pdf
Binary file not shown.
1 change: 1 addition & 0 deletions src/demo/output/shell_2019.pdf.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions src/demo/output/test.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions src/demo/output/test.pdf.json

Large diffs are not rendered by default.
