Refactored demos to demo folder and made linting pass (#110)
Signed-off-by: „veiy82l“ <[email protected]>
DaBeIDS authored Jul 10, 2024
1 parent e7d5ec3 commit 194c57b
Showing 26 changed files with 218 additions and 88 deletions.
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -3,3 +3,4 @@ Contributors
============

* Matthew Watkins <[email protected]>
* David Besslich <[email protected]>
66 changes: 33 additions & 33 deletions README.rst
@@ -4,21 +4,19 @@ On June 26 2024, Linux Foundation announced the merger of its financial services


=====================================================================
OSC Data Extractor Pre-Steps
OSC Transformer Pre-Steps
=====================================================================

|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

OS-Climate Data Extraction Tool
OS-Climate Transformer Pre-Steps Tool
======================================

.. _notes:

This code provides you with an api and a streamlit app to which you
can provide a pdf document and the output will be the text content in a json format.
In the backend it is using a python module for extracting text from pdfs, which
might be extended in the future to other file types.
The json file is needed for later usage in the context of transformer models
This code provides a CLI tool to extract data from a PDF
into a JSON document and to create a training data set for later usage in the
context of transformer models
to extract relevant information, but it can also be used independently.
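The JSON schema itself is not shown in this diff; as a rough, hypothetical sketch (the function name and page-keyed layout are our assumptions, not the tool's actual format), serializing extracted page texts could look like:

```python
import json

def pages_to_json(pages: dict[int, str]) -> str:
    """Serialize extracted page texts into a JSON string.

    NOTE: purely illustrative; the real tool's JSON schema may differ.
    """
    payload = {str(page_no): text for page_no, text in pages.items()}
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

A call such as ``pages_to_json({1: "Revenue grew by 5%."})`` then yields a page-keyed JSON document that downstream training-data curation could consume.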

Quick start
@@ -39,53 +37,55 @@ We are using typer to have a nice CLI tool here. All details and help will be shown in the
tool itself and are not described here in more detail.


Install via Github Repository
Developer space
===============

Use code directly without CLI via Github Repository
---------------------------------------------------

For a quick start with the tool install python and clone the repository to your local environment::
First clone the repository to your local environment::

$ git clone https://github.com/os-climate/osc-transformer-presteps

Afterwards update your python to the requirements (possible for example
via pdm update) and start a local api server via::
We are using pdm to manage the packages and tox for a stable test framework.
Hence, first install pdm (possibly in a virtual environment) via::

$ pip install pdm

$ python ./src/run_server.py
Afterwards sync your system via::

**Note**:
* We assume that you are located in the cloned repository.
* To check if it is running open "http://localhost:8000/liveness" and you should see the
message {"message": "OSC Transformer Pre-Steps Server is running."}.
$ pdm sync

Finally, run the following code to start a streamlit app which opens up the possibility
to "upload" a file and extract data from pdf to json via this UI. Note that the UI needs
the running server so you have to open the streamlit and the server in two different
terminals.::
Now you have multiple demos on how to proceed. See the ``demo`` folder.

$ streamlit run ./src/osc_transformer_presteps/streamlit/app.py
pdm
-----------------------------

**Note**: Check also docs/demo. There you can
find local_extraction_demo.py which will start an extraction
without any API call and then there is post_request_demo.py
which will send a file to the API (of course you have to start
server as above first).
For adding new dependencies use pdm. You can add new packages via ``pdm add``.
For example numpy via::

Developer Notes
===============
$ pdm add numpy

For adding new dependencies use pdm. First install via pip::
For a very detailed description check the homepage of the pdm project:

$ pip install pdm
https://pdm-project.org/en/latest/

And then you could add new packages via pdm add. For example numpy via::

$ pdm add numpy
tox
-----------------------------

For running linting tools just to the following::
For running linting tools and tests we use tox, which you run outside of your virtual environment::

$ pip install tox
$ tox -e lint
$ tox -e test

This will automatically apply some checks on your code and run the provided pytest tests. See
more details on the homepage of the tox project:

https://tox.wiki/en/4.16.0/


.. |osc-climate-project| image:: https://img.shields.io/badge/OS-Climate-blue
:alt: An OS-Climate Project
62 changes: 62 additions & 0 deletions demo/README.rst
@@ -0,0 +1,62 @@
=====================================================================
DEMO Scripts Overview
=====================================================================

.. _notes:

In this folder you can find multiple demo scripts on how to use the python scripts in
different ways besides the *normal* CLI tool.

**Note**:

* We assume that you are located in an environment where you have
already installed the necessary requirements (see initial readme).

* The demos are not part of the tox setup and the tests. Hence, some
  packages or code parts may be outdated. They are just ideas on how to use the code and are not
  production ready. Feel free to inform us nevertheless if you encounter issues with the demos.


extraction_api
....................

This demo is an implementation of the code via FastAPI. In api.py the API is created and the
extraction route is built up in extract.py. To start the server run::

$ python demo/extraction_api/api.py

Then the server will run and you can test in your browser that it worked at:

http://localhost:8000/liveness

You should see the message {"message": "OSC Transformer Pre-Steps Server is running."}.
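Programmatically, the same liveness check can be sketched with only the standard library (the helper names here are ours, not part of the demo code):

```python
import json
import urllib.request

def liveness_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the demo server's liveness endpoint URL."""
    return f"http://{host}:{port}/liveness"

def check_liveness(host: str = "localhost", port: int = 8000) -> dict:
    """GET the liveness endpoint; requires the demo server to be running."""
    with urllib.request.urlopen(liveness_url(host, port)) as resp:
        return json.load(resp)
```

Calling ``check_liveness()`` while the server is up should return the message dict shown above.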

extraction
....................

This demo has two parts to extract data from the input folder to the output folder.

a) The post_request_extract.py is using the api endpoint from extraction_api to send a
file to the api via a post request and receives the output via an api response. The file
you want to extract can be entered in the command line::

$ python demo/extraction/post_request_extract.py

b) The local_extraction_demo.py runs the extraction code directly for the Test.pdf file.
If you want to use another file, you have to change that in the code.
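The shape of that post request (the URL and multipart field mirror the streamlit demo elsewhere in this commit; the helper itself is hypothetical) can be sketched as:

```python
def build_extract_request(filename: str, data: bytes,
                          host: str = "localhost", port: int = 8000):
    """Return the (url, files) pair for a POST to the demo /extract endpoint.

    Illustrative helper only; the real post_request_extract.py may differ.
    """
    url = f"http://{host}:{port}/extract"
    files = {"file": (filename, data)}  # same multipart field as the streamlit demo
    return url, files
```

The resulting pair can then be handed to an HTTP client, e.g. ``requests.post(url, files=files)``.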

extraction_streamlit
....................

This is an example implementation of a streamlit app which opens up the possibility
to "upload" a file and extract data from pdf to json. Note that the UI needs
the running server from extraction_api, so you have to open the streamlit app
and the server in two different terminals. An example file to upload can be found in
"/demo/extraction/input". You can start the streamlit via::

$ streamlit run ./src/osc_transformer_presteps/extraction_streamlit/app.py

curation
....................

T.B.D.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,4 +1,8 @@
"""Python Script for locally running extraction on FastAPI."""
"""Python Script for locally running extraction on FastAPI.
Note: To make the following demo work you first have to start the server in the folder demo/extraction_api!
"""

import json
from pathlib import Path
File renamed without changes.
@@ -5,9 +5,9 @@
import uvicorn
from fastapi import APIRouter, FastAPI
from starlette.responses import RedirectResponse
from server_settings import ExtractionServerSettings

from osc_transformer_presteps.api.extract import router as extraction_router
from osc_transformer_presteps.settings import ExtractionServerSettings
from extract import router as extraction_router

_logger = logging.getLogger(__name__)

File renamed without changes.
48 changes: 48 additions & 0 deletions demo/extraction_api/server_settings.py
@@ -0,0 +1,48 @@
from pydantic import BaseModel
from enum import Enum
import logging


class LogLevel(str, Enum):
"""Class for different log levels."""

critical = "critical"
error = "error"
warning = "warning"
info = "info"
debug = "debug"
notset = "notset"


_log_dict = {
"critical": logging.CRITICAL,
"error": logging.ERROR,
"warning": logging.WARNING,
"info": logging.INFO,
"debug": logging.DEBUG,
"notset": logging.NOTSET,
}


class ExtractionServerSettingsBase(BaseModel):
"""Class for Extraction server settings."""

port: int = 8000
host: str = "localhost"
log_type: int = 20
log_level: LogLevel = LogLevel("info")


class ExtractionServerSettings(ExtractionServerSettingsBase):
"""Settings for configuring the extraction server.
This class extends `ExtractionServerSettingsBase` and adds additional
logging configuration.
"""

def __init__(self, **data) -> None:
"""Initialize the ExtractionServerSettings."""
if "log_level" in data:
data["log_level"] = LogLevel(data["log_level"])
super().__init__(**data)
self.log_type: int = _log_dict[self.log_level.value]
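The log-level resolution performed in ``__init__`` above can be exercised without pydantic; a minimal stand-alone sketch of the same mapping (not the class itself) is:

```python
import logging

_LOG_DICT = {
    "critical": logging.CRITICAL,
    "error": logging.ERROR,
    "warning": logging.WARNING,
    "info": logging.INFO,
    "debug": logging.DEBUG,
    "notset": logging.NOTSET,
}

def resolve_log_type(log_level: str = "info") -> int:
    """Map a textual log level to its numeric logging constant,
    as ExtractionServerSettings does in __init__."""
    return _LOG_DICT[log_level]
```

With the default ``"info"`` this yields ``logging.INFO`` (20), matching the ``log_type: int = 20`` default in the settings class.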
Empty file.
@@ -27,10 +27,6 @@
if st.button("Extract data"):
st.info("Extraction started")
file_bytes = input_file.getvalue()
liveness = requests.get(
url="http://localhost:8000/liveness", proxies={"http": "", "https": ""}
)
st.info(f"Liveness Check: {liveness.status_code}")
file_upload = requests.post(
url="http://localhost:8000/extract",
files={"file": (input_file.name, file_bytes)},
@@ -38,12 +38,12 @@ def get_extractor(
Args:
----
- extractor_type (str): Type of extractor to be retrieved
- settings: Settings specific to the extractor
extractor_type (str): Type of extractor to be retrieved
settings: Settings specific to the extractor
Returns:
-------
- BaseExtractor: Instance of the specified extractor type
BaseExtractor: Instance of the specified extractor type
"""
_logger.info("The extractor type is: " + extractor_type)
@@ -21,9 +21,9 @@ class _BaseSettings(BaseModel):
min_paragraph_length (int)(Optional): Minimum alphabetic characters for paragraph,
any paragraph shorter than that will be disregarded.
annotation_folder (str)(Optional): path to the folder containing all annotated
excel files. If provided, just the pdfs mentioned in annotation excels are
Excel files. If provided, just the pdfs mentioned in annotation excels are
extracted. Otherwise, all the pdfs in the pdf folder will be extracted.
skip_extracted_files (bool)(Optional): whether to skip extracting a file if it exist in the extraction folder.
skip_extracted_files (bool)(Optional): whether to skip extracting a file if it exists in the extraction folder.
"""

annotation_folder: Optional[str] = None
@@ -59,7 +59,7 @@ def __init__(self, settings: Optional[dict] = None):
self._settings: dict = settings_base

def __init_subclass__(cls, **kwargs):
"""Intialize the subclass."""
"""Initialize the subclass."""
super().__init_subclass__(**kwargs)
if cls.extractor_name == "base":
raise ValueError(
@@ -142,7 +142,7 @@ def extract(
raise ExtractionError(
f"While doing the extraction we faced the following error:\n "
f"{repr(e)}.\n Trace to the error is given by:\n {traceback_str}"
)
) from e

@abstractmethod
def _generate_extractions(
24 changes: 0 additions & 24 deletions src/osc_transformer_presteps/settings.py
@@ -27,30 +27,6 @@ class LogLevel(str, Enum):
}


class ExtractionServerSettingsBase(BaseModel):
"""Class for Extraction server settings."""

port: int = 8000
host: str = "localhost"
log_type: int = 20
log_level: LogLevel = LogLevel("info")


class ExtractionServerSettings(ExtractionServerSettingsBase):
"""Settings for configuring the extraction server.
This class extends `ExtractionServerSettingsBase` and adds additional
logging configuration.
"""

def __init__(self, **data) -> None:
"""Initialize the ExtractionServerSettings."""
if "log_level" in data:
data["log_level"] = LogLevel(data["log_level"])
super().__init__(**data)
self.log_type: int = _log_dict[self.log_level.value]


class ExtractionSettings(BaseModel):
"""Settings for controlling extraction behavior.
1 change: 0 additions & 1 deletion src/osc_transformer_presteps/streamlit/__init__.py

This file was deleted.

