Refactored demos to demo folder and made linting pass (#110)
Signed-off-by: „veiy82l“ <[email protected]>
DaBeIDS authored Jul 10, 2024
1 parent e7d5ec3 commit 194c57b
Showing 26 changed files with 218 additions and 88 deletions.
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -3,3 +3,4 @@ Contributors
============

* Matthew Watkins <[email protected]>
* David Besslich <[email protected]>
66 changes: 33 additions & 33 deletions README.rst
@@ -4,21 +4,19 @@ On June 26 2024, Linux Foundation announced the merger of its financial services


=====================================================================
OSC Data Extractor Pre-Steps
OSC Transformer Pre-Steps
=====================================================================

|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

OS-Climate Data Extraction Tool
OS-Climate Transformer Pre-Steps Tool
======================================

.. _notes:

This code provides you with an api and a streamlit app to which you
can provide a pdf document and the output will be the text content in a json format.
In the backend it is using a python module for extracting text from pdfs, which
might be extended in the future to other file types.
The json file is needed for later usage in the context of transformer models
This code provides a CLI tool to extract data from a PDF
into a JSON document and to create a training data set for later usage in the
context of transformer models
to extract relevant information, but it can also be used independently.
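The JSON schema itself is not shown in this diff; as a rough, hypothetical sketch (the function name and page-keyed layout are our assumptions, not the tool's actual format), serializing extracted page texts could look like:

```python
import json

def pages_to_json(pages: dict[int, str]) -> str:
    """Serialize extracted page texts into a JSON string.

    NOTE: purely illustrative; the real tool's JSON schema may differ.
    """
    payload = {str(page_no): text for page_no, text in pages.items()}
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

A call such as ``pages_to_json({1: "Revenue grew by 5%."})`` then yields a page-keyed JSON document that downstream training-data curation could consume.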

Quick start
@@ -39,53 +37,55 @@ We are using typer to have a nice CLI tool here. All details and help will be shown in the
tool itself and are not described here in more detail.


Install via Github Repository
Developer space
===============

Use code directly without CLI via Github Repository
---------------------------------------------------

For a quick start with the tool install python and clone the repository to your local environment::
First clone the repository to your local environment::

$ git clone https://github.com/os-climate/osc-transformer-presteps

Afterwards update your python to the requirements (possible for example
via pdm update) and start a local api server via::
We are using pdm to manage the packages and tox for a stable test framework.
Hence, first install pdm (possibly in a virtual environment) via::

$ pip install pdm

$ python ./src/run_server.py
Afterwards sync your system via::

**Note**:
* We assume that you are located in the cloned repository.
* To check if it is running open "http://localhost:8000/liveness" and you should see the
message {"message": "OSC Transformer Pre-Steps Server is running."}.
$ pdm sync

Finally, run the following code to start a streamlit app which opens up the possibility
to "upload" a file and extract data from pdf to json via this UI. Note that the UI needs
the running server so you have to open the streamlit and the server in two different
terminals.::
Now you have multiple demos on how to proceed. See the ``demo`` folder.

$ streamlit run ./src/osc_transformer_presteps/streamlit/app.py
pdm
-----------------------------

**Note**: Check also docs/demo. There you can
find local_extraction_demo.py which will start an extraction
without any API call and then there is post_request_demo.py
which will send a file to the API (of course you have to start
server as above first).
For adding new dependencies use pdm. You can add new packages via ``pdm add``.
For example numpy via::

Developer Notes
===============
$ pdm add numpy

For adding new dependencies use pdm. First install via pip::
For a very detailed description check the homepage of the pdm project:

$ pip install pdm
https://pdm-project.org/en/latest/

And then you could add new packages via pdm add. For example numpy via::

$ pdm add numpy
tox
-----------------------------

For running linting tools just to the following::
For running linting tools and tests we use tox, which you run outside of your virtual environment::

$ pip install tox
$ tox -e lint
$ tox -e test

This will automatically apply some checks on your code and run the provided pytest tests. See
more details on the homepage of the tox project:

https://tox.wiki/en/4.16.0/


.. |osc-climate-project| image:: https://img.shields.io/badge/OS-Climate-blue
:alt: An OS-Climate Project
62 changes: 62 additions & 0 deletions demo/README.rst
@@ -0,0 +1,62 @@
=====================================================================
DEMO Scripts Overview
=====================================================================

.. _notes:

In this folder you can find multiple demo scripts on how to use the python scripts in
different ways besides the *normal* CLI tool.

**Note**:

* We assume that you are located in an environment where you have
already installed the necessary requirements (see initial readme).

* The demos are not part of the tox setup and the tests. Hence, some
  packages or code parts may be outdated. They are just ideas on how to use the code and are not
  production ready. Feel free to inform us nevertheless if you encounter issues with the demos.


extraction_api
....................

This demo is an implementation of the code via FastAPI. In api.py the API is created and the
extraction route is built up in extract.py. To start the server run::

$ python demo/extraction_api/api.py

Then the server will run and you can test in your browser that it worked at:

http://localhost:8000/liveness

You should see the message {"message": "OSC Transformer Pre-Steps Server is running."}.
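Programmatically, the same liveness check can be sketched with only the standard library (the helper names here are ours, not part of the demo code):

```python
import json
import urllib.request

def liveness_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the demo server's liveness endpoint URL."""
    return f"http://{host}:{port}/liveness"

def check_liveness(host: str = "localhost", port: int = 8000) -> dict:
    """GET the liveness endpoint; requires the demo server to be running."""
    with urllib.request.urlopen(liveness_url(host, port)) as resp:
        return json.load(resp)
```

Calling ``check_liveness()`` while the server is up should return the message dict shown above.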

extraction
....................

This demo has two parts to extract data from the input folder to the output folder.

a) The post_request_extract.py is using the api endpoint from extraction_api to send a
file to the api via a post request and receives the output via an api response. The file
you want to extract can be entered in the command line::

$ python demo/extraction/post_request_extract.py

b) The local_extraction_demo.py runs the extraction code directly for the Test.pdf file.
If you want to use another file, you have to change that in the code.
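The shape of that post request (the URL and multipart field mirror the streamlit demo elsewhere in this commit; the helper itself is hypothetical) can be sketched as:

```python
def build_extract_request(filename: str, data: bytes,
                          host: str = "localhost", port: int = 8000):
    """Return the (url, files) pair for a POST to the demo /extract endpoint.

    Illustrative helper only; the real post_request_extract.py may differ.
    """
    url = f"http://{host}:{port}/extract"
    files = {"file": (filename, data)}  # same multipart field as the streamlit demo
    return url, files
```

The resulting pair can then be handed to an HTTP client, e.g. ``requests.post(url, files=files)``.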

extraction_streamlit
....................

This is an example implementation of a streamlit app which opens up the possibility
to "upload" a file and extract data from pdf to json. Note that the UI needs
the running server from extraction_api, so you have to open the streamlit app
and the server in two different terminals. An example file to upload can be found in
"/demo/extraction/input". You can start the streamlit via::

$ streamlit run ./src/osc_transformer_presteps/extraction_streamlit/app.py

curation
....................

T.B.D.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,4 +1,8 @@
"""Python Script for locally running extraction on FastAPI."""
"""Python Script for locally running extraction on FastAPI.
Note: To make the following demo work you first have to start the server in the folder demo/extraction_api!
"""

import json
from pathlib import Path
File renamed without changes.
@@ -5,9 +5,9 @@
import uvicorn
from fastapi import APIRouter, FastAPI
from starlette.responses import RedirectResponse
from server_settings import ExtractionServerSettings

from osc_transformer_presteps.api.extract import router as extraction_router
from osc_transformer_presteps.settings import ExtractionServerSettings
from extract import router as extraction_router

_logger = logging.getLogger(__name__)

File renamed without changes.
48 changes: 48 additions & 0 deletions demo/extraction_api/server_settings.py
@@ -0,0 +1,48 @@
from pydantic import BaseModel
from enum import Enum
import logging


class LogLevel(str, Enum):
"""Class for different log levels."""

critical = "critical"
error = "error"
warning = "warning"
info = "info"
debug = "debug"
notset = "notset"


_log_dict = {
"critical": logging.CRITICAL,
"error": logging.ERROR,
"warning": logging.WARNING,
"info": logging.INFO,
"debug": logging.DEBUG,
"notset": logging.NOTSET,
}


class ExtractionServerSettingsBase(BaseModel):
"""Class for Extraction server settings."""

port: int = 8000
host: str = "localhost"
log_type: int = 20
log_level: LogLevel = LogLevel("info")


class ExtractionServerSettings(ExtractionServerSettingsBase):
"""Settings for configuring the extraction server.
This class extends `ExtractionServerSettingsBase` and adds additional
logging configuration.
"""

def __init__(self, **data) -> None:
"""Initialize the ExtractionServerSettings."""
if "log_level" in data:
data["log_level"] = LogLevel(data["log_level"])
super().__init__(**data)
self.log_type: int = _log_dict[self.log_level.value]
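The log-level resolution performed in ``__init__`` above can be exercised without pydantic; a minimal stand-alone sketch of the same mapping (not the class itself) is:

```python
import logging

_LOG_DICT = {
    "critical": logging.CRITICAL,
    "error": logging.ERROR,
    "warning": logging.WARNING,
    "info": logging.INFO,
    "debug": logging.DEBUG,
    "notset": logging.NOTSET,
}

def resolve_log_type(log_level: str = "info") -> int:
    """Map a textual log level to its numeric logging constant,
    as ExtractionServerSettings does in __init__."""
    return _LOG_DICT[log_level]
```

With the default ``"info"`` this yields ``logging.INFO`` (20), matching the ``log_type: int = 20`` default in the settings class.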
Empty file.
@@ -27,10 +27,6 @@
if st.button("Extract data"):
st.info("Extraction started")
file_bytes = input_file.getvalue()
liveness = requests.get(
url="http://localhost:8000/liveness", proxies={"http": "", "https": ""}
)
st.info(f"Liveness Check: {liveness.status_code}")
file_upload = requests.post(
url="http://localhost:8000/extract",
files={"file": (input_file.name, file_bytes)},
@@ -38,12 +38,12 @@ def get_extractor(
Args:
----
- extractor_type (str): Type of extractor to be retrieved
- settings: Settings specific to the extractor
extractor_type (str): Type of extractor to be retrieved
settings: Settings specific to the extractor
Returns:
-------
- BaseExtractor: Instance of the specified extractor type
BaseExtractor: Instance of the specified extractor type
"""
_logger.info("The extractor type is: " + extractor_type)
@@ -21,9 +21,9 @@ class _BaseSettings(BaseModel):
min_paragraph_length (int)(Optional): Minimum alphabetic characters for paragraph,
any paragraph shorter than that will be disregarded.
annotation_folder (str)(Optional): path to the folder containing all annotated
excel files. If provided, just the pdfs mentioned in annotation excels are
Excel files. If provided, just the pdfs mentioned in annotation excels are
extracted. Otherwise, all the pdfs in the pdf folder will be extracted.
skip_extracted_files (bool)(Optional): whether to skip extracting a file if it exist in the extraction folder.
skip_extracted_files (bool)(Optional): whether to skip extracting a file if it exists in the extraction folder.
"""

annotation_folder: Optional[str] = None
@@ -59,7 +59,7 @@ def __init__(self, settings: Optional[dict] = None):
self._settings: dict = settings_base

def __init_subclass__(cls, **kwargs):
"""Intialize the subclass."""
"""Initialize the subclass."""
super().__init_subclass__(**kwargs)
if cls.extractor_name == "base":
raise ValueError(
@@ -142,7 +142,7 @@ def extract(
raise ExtractionError(
f"While doing the extraction we faced the following error:\n "
f"{repr(e)}.\n Trace to the error is given by:\n {traceback_str}"
)
) from e

@abstractmethod
def _generate_extractions(
24 changes: 0 additions & 24 deletions src/osc_transformer_presteps/settings.py
@@ -27,30 +27,6 @@ class LogLevel(str, Enum):
}


class ExtractionServerSettingsBase(BaseModel):
"""Class for Extraction server settings."""

port: int = 8000
host: str = "localhost"
log_type: int = 20
log_level: LogLevel = LogLevel("info")


class ExtractionServerSettings(ExtractionServerSettingsBase):
"""Settings for configuring the extraction server.
This class extends `ExtractionServerSettingsBase` and adds additional
logging configuration.
"""

def __init__(self, **data) -> None:
"""Initialize the ExtractionServerSettings."""
if "log_level" in data:
data["log_level"] = LogLevel(data["log_level"])
super().__init__(**data)
self.log_type: int = _log_dict[self.log_level.value]


class ExtractionSettings(BaseModel):
"""Settings for controlling extraction behavior.
1 change: 0 additions & 1 deletion src/osc_transformer_presteps/streamlit/__init__.py

This file was deleted.

