diff --git a/AUTHORS.rst b/AUTHORS.rst
index 9517329..992df85 100644
--- a/AUTHORS.rst
+++ b/AUTHORS.rst
@@ -3,3 +3,4 @@ Contributors
 ============

 * Matthew Watkins
+* David Besslich
diff --git a/README.rst b/README.rst
index a6de362..0c829bb 100644
--- a/README.rst
+++ b/README.rst
@@ -4,21 +4,19 @@ On June 26 2024, Linux Foundation announced the merger of its financial services
 =====================================================================
-OSC Data Extractor Pre-Steps
+OSC Transformer Pre-Steps
 =====================================================================

 |osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

-OS-Climate Data Extraction Tool
-===============================
+OS-Climate Transformer Pre-Steps Tool
+=====================================

 .. _notes:

-This code provides you with an api and a streamlit app to which you
-can provide a pdf document and the output will be the text content in a json format.
-In the backend it is using a python module for extracting text from pdfs, which
-might be extended in the future to other file types.
-The json file is needed for later usage in the context of transformer models
+This code provides you with a CLI tool to extract data from
+a PDF into a JSON document and to create a training data set for later usage in the
+context of transformer models
 to extract relevant information, but it can also be used independently.

 Quick start
@@ -39,53 +37,55 @@
 We are using typer to have a nice CLI tool here. All details and help will be shown in the
 tool itself and are not described here in more detail.
-Install via Github Repository
+Developer space
+===============
+
+Use code directly without CLI via GitHub Repository
------------------------------
+---------------------------------------------------
-For a quick start with the tool install python and clone the repository to your local environment::
+First clone the repository to your local environment::

     $ git clone https://github.com/os-climate/osc-transformer-presteps

-Afterwards update your python to the requirements (possible for example
-via pdm update) and start a local api server via::
+We are using pdm to manage the packages and tox for a stable test framework.
+Hence, first install pdm (possibly in a virtual environment) via::
+
+    $ pip install pdm

-    $ python ./src/run_server.py
+Afterwards sync your system via::

-**Note**:
-  * We assume that you are located in the cloned repository.
-  * To check if it is running open "http://localhost:8000/liveness" and you should see the
-    message {"message": "OSC Transformer Pre-Steps Server is running."}.
+
+    $ pdm sync

-Finally, run the following code to start a streamlit app which opens up the possibility
-to "upload" a file and extract data from pdf to json via this UI. Note that the UI needs
-the running server so you have to open the streamlit and the server in two different
-terminals.::
+Now you have multiple demos on how to go on. See the
+`demo <demo>`_ folder.

-    $ streamlit run ./src/osc_transformer_presteps/streamlit/app.py
+pdm
+-----------------------------

-**Note**: Check also docs/demo. There you can
-find local_extraction_demo.py which will start an extraction
-without any API call and then there is post_request_demo.py
-which will send a file to the API (of course you have to start
-server as above first).
+For adding new dependencies use pdm. You could add new packages via pdm add.
+For example numpy via::
+
+    $ pdm add numpy

-Developer Notes
-===============
-For adding new dependencies use pdm. First install via pip::
+For a very detailed description check the homepage of the pdm project:

-    $ pip install pdm
+https://pdm-project.org/en/latest/

-And then you could add new packages via pdm add. For example numpy via::
+tox
+-----------------------------

-    $ pdm add numpy
-For running linting tools just to the following::
+For running linting tools we use tox, which you run outside of your virtual environment::

     $ pip install tox
     $ tox -e lint
     $ tox -e test

+This will automatically apply some checks on your code and run the provided pytests. See
+more details on tox on the homepage of the tox project:
+
+https://tox.wiki/en/4.16.0/
+
 .. |osc-climate-project| image:: https://img.shields.io/badge/OS-Climate-blue
     :alt: An OS-Climate Project
diff --git a/demo/README.rst b/demo/README.rst
new file mode 100644
index 0000000..e3f4c14
--- /dev/null
+++ b/demo/README.rst
@@ -0,0 +1,62 @@
+=====================================================================
+DEMO Scripts Overview
+=====================================================================
+
+.. _notes:
+
+In this folder you can find multiple demo scripts on how to use the python scripts in
+different ways besides the *normal* CLI tool.
+
+**Note**:
+
+* We assume that you are located in an environment where you have
+  already installed the necessary requirements (see initial readme).
+
+* The demos are not part of the tox setup and the tests. Hence, it might be that some
+  packages or code parts are outdated. These are just ideas on how to use the code and
+  not prod ready. Feel free to inform us nevertheless if you encounter issues with the demos.
+
+
+extraction_api
+....................
+
+This demo is an implementation of the code via FastAPI. In api.py the API is created and the
+extraction route is built in extract.py. To start the server run::
+
+    $ python demo/extraction_api/api.py
+
+Then the server will run and you can test in your browser that it worked at:
+
+http://localhost:8000/liveness
+
+You should see the message {"message": "OSC Transformer Pre-Steps Server is running."}.
+
+extraction
+....................
+
+This demo has two parts to extract data from the input folder to the output folder.
+
+a) The post_request_extract.py uses the API endpoint from extraction_api to send a
+file to the API via a post request and receives the output via an API response. The file
+you want to extract can be entered on the command line::
+
+    $ python demo/extraction/post_request_extract.py
+
+b) The local_extraction_demo.py runs the extraction code directly for the Test.pdf file.
+If you want to use another file you have to change that in the code.
+
+extraction_streamlit
+....................
+
+This is an example implementation of a streamlit app which opens up the possibility
+to "upload" a file and extract data from pdf to json. Note that the UI needs
+the running server from extraction_api and so you have to open the streamlit app
+and the server in two different terminals. An example file to upload can be found in
+"/demo/extraction/input". You can start the streamlit app via::
+
+    $ streamlit run ./src/osc_transformer_presteps/extraction_streamlit/app.py
+
+curation
+....................
+
+T.B.D.
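As a standard-library-only companion to the extraction_api demo above, the liveness check can be sketched as below. This is a hedged sketch, not part of the patch: `check_liveness` is a hypothetical helper name, and it assumes the demo server from demo/extraction_api is running on localhost:8000 (it degrades to a short note when the server is not reachable).

```python
# Hedged sketch: poll the /liveness route of the demo FastAPI server.
# "check_liveness" is a hypothetical helper, not part of the repository.
import json
from urllib.error import URLError
from urllib.request import urlopen


def check_liveness(url: str = "http://localhost:8000/liveness") -> str:
    """Return the server's liveness message, or a short note when unreachable."""
    try:
        with urlopen(url, timeout=2) as resp:
            # The demo server answers {"message": "OSC Transformer Pre-Steps Server is running."}
            return json.load(resp)["message"]
    except (URLError, OSError):
        return "server not reachable"


print(check_liveness())
```

The same probe works from a browser or curl; the helper above is only convenient when scripting against the demo API.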
diff --git a/docs/demo/curation/input/kpi_mapping.csv b/demo/curation/input/kpi_mapping.csv similarity index 100% rename from docs/demo/curation/input/kpi_mapping.csv rename to demo/curation/input/kpi_mapping.csv diff --git a/docs/demo/curation/input/test_annotations.xlsx b/demo/curation/input/test_annotations.xlsx similarity index 100% rename from docs/demo/curation/input/test_annotations.xlsx rename to demo/curation/input/test_annotations.xlsx diff --git a/docs/demo/curation/local_cuartion_demo.py b/demo/curation/local_cuartion_demo.py similarity index 100% rename from docs/demo/curation/local_cuartion_demo.py rename to demo/curation/local_cuartion_demo.py diff --git a/docs/demo/curation/output/Test.csv b/demo/curation/output/Test.csv similarity index 100% rename from docs/demo/curation/output/Test.csv rename to demo/curation/output/Test.csv diff --git a/docs/demo/extraction/input/Test.pdf b/demo/extraction/input/Test.pdf similarity index 100% rename from docs/demo/extraction/input/Test.pdf rename to demo/extraction/input/Test.pdf diff --git a/docs/demo/extraction/input/test-2.pdf b/demo/extraction/input/test-2.pdf similarity index 100% rename from docs/demo/extraction/input/test-2.pdf rename to demo/extraction/input/test-2.pdf diff --git a/docs/demo/extraction/local_extraction_demo.py b/demo/extraction/local_extraction_demo.py similarity index 100% rename from docs/demo/extraction/local_extraction_demo.py rename to demo/extraction/local_extraction_demo.py diff --git a/docs/demo/extraction/output/Test.json b/demo/extraction/output/Test.json similarity index 100% rename from docs/demo/extraction/output/Test.json rename to demo/extraction/output/Test.json diff --git a/docs/demo/extraction/post_request_extract.py b/demo/extraction/post_request_extract.py similarity index 90% rename from docs/demo/extraction/post_request_extract.py rename to demo/extraction/post_request_extract.py index 9faf166..5f70bde 100644 --- a/docs/demo/extraction/post_request_extract.py +++ 
b/demo/extraction/post_request_extract.py @@ -1,4 +1,8 @@ -"""Python Script for locally running extraction on FastAPI.""" +"""Python Script for locally running extraction on FastAPI. + +Note: To make the following demo work you first have to start the server in the folder demo/extraction_api! + +""" import json from pathlib import Path diff --git a/src/osc_transformer_presteps/api/__init__.py b/demo/extraction_api/__init__.py similarity index 100% rename from src/osc_transformer_presteps/api/__init__.py rename to demo/extraction_api/__init__.py diff --git a/src/osc_transformer_presteps/api/api.py b/demo/extraction_api/api.py similarity index 90% rename from src/osc_transformer_presteps/api/api.py rename to demo/extraction_api/api.py index a03c71a..5b1957f 100644 --- a/src/osc_transformer_presteps/api/api.py +++ b/demo/extraction_api/api.py @@ -5,9 +5,9 @@ import uvicorn from fastapi import APIRouter, FastAPI from starlette.responses import RedirectResponse +from server_settings import ExtractionServerSettings -from osc_transformer_presteps.api.extract import router as extraction_router -from osc_transformer_presteps.settings import ExtractionServerSettings +from extract import router as extraction_router _logger = logging.getLogger(__name__) diff --git a/src/osc_transformer_presteps/api/extract.py b/demo/extraction_api/extract.py similarity index 100% rename from src/osc_transformer_presteps/api/extract.py rename to demo/extraction_api/extract.py diff --git a/demo/extraction_api/server_settings.py b/demo/extraction_api/server_settings.py new file mode 100644 index 0000000..6a2d600 --- /dev/null +++ b/demo/extraction_api/server_settings.py @@ -0,0 +1,48 @@ +from pydantic import BaseModel +from enum import Enum +import logging + + +class LogLevel(str, Enum): + """Class for different log levels.""" + + critical = "critical" + error = "error" + warning = "warning" + info = "info" + debug = "debug" + notset = "notset" + + +_log_dict = { + "critical": logging.CRITICAL, + 
"error": logging.ERROR, + "warning": logging.WARNING, + "info": logging.INFO, + "debug": logging.DEBUG, + "notset": logging.NOTSET, +} + + +class ExtractionServerSettingsBase(BaseModel): + """Class for Extraction server settings.""" + + port: int = 8000 + host: str = "localhost" + log_type: int = 20 + log_level: LogLevel = LogLevel("info") + + +class ExtractionServerSettings(ExtractionServerSettingsBase): + """Settings for configuring the extraction server. + + This class extends `ExtractionServerSettingsBase` and adds additional + logging configuration. + """ + + def __init__(self, **data) -> None: + """Initialize the ExtractionServerSettings.""" + if "log_level" in data: + data["log_level"] = LogLevel(data["log_level"]) + super().__init__(**data) + self.log_type: int = _log_dict[self.log_level.value] diff --git a/demo/extraction_api/temp_storage/.gitkeep b/demo/extraction_api/temp_storage/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/src/osc_transformer_presteps/streamlit/app.py b/demo/extraction_streamlit/app.py similarity index 88% rename from src/osc_transformer_presteps/streamlit/app.py rename to demo/extraction_streamlit/app.py index 34dba9c..1214a89 100644 --- a/src/osc_transformer_presteps/streamlit/app.py +++ b/demo/extraction_streamlit/app.py @@ -27,10 +27,6 @@ if st.button("Extract data"): st.info("Extraction started") file_bytes = input_file.getvalue() - liveness = requests.get( - url="http://localhost:8000/liveness", proxies={"http": "", "https": ""} - ) - st.info(f"Liveness Check: {liveness.status_code}") file_upload = requests.post( url="http://localhost:8000/extract", files={"file": (input_file.name, file_bytes)}, diff --git a/src/osc_transformer_presteps/content_extraction/extraction_factory.py b/src/osc_transformer_presteps/content_extraction/extraction_factory.py index 7eaf363..c3b805b 100644 --- a/src/osc_transformer_presteps/content_extraction/extraction_factory.py +++ 
b/src/osc_transformer_presteps/content_extraction/extraction_factory.py @@ -38,12 +38,12 @@ def get_extractor( Args: ---- - - extractor_type (str): Type of extractor to be retrieved - - settings: Settings specific to the extractor + extractor_type (str): Type of extractor to be retrieved + settings: Settings specific to the extractor Returns: ------- - - BaseExtractor: Instance of the specified extractor type + BaseExtractor: Instance of the specified extractor type """ _logger.info("The extractor type is: " + extractor_type) diff --git a/src/osc_transformer_presteps/content_extraction/extractors/base_extractor.py b/src/osc_transformer_presteps/content_extraction/extractors/base_extractor.py index 9d0bcd2..501220d 100644 --- a/src/osc_transformer_presteps/content_extraction/extractors/base_extractor.py +++ b/src/osc_transformer_presteps/content_extraction/extractors/base_extractor.py @@ -21,9 +21,9 @@ class _BaseSettings(BaseModel): min_paragraph_length (int)(Optional): Minimum alphabetic characters for paragraph, any paragraph shorter than that will be disregarded. annotation_folder (str)(Optional): path to the folder containing all annotated - excel files. If provided, just the pdfs mentioned in annotation excels are + Excel files. If provided, just the pdfs mentioned in annotation excels are extracted. Otherwise, all the pdfs in the pdf folder will be extracted. - skip_extracted_files (bool)(Optional): whether to skip extracting a file if it exist in the extraction folder. + skip_extracted_files (bool)(Optional): whether to skip extracting a file if it exists in the extraction folder. 
""" annotation_folder: Optional[str] = None @@ -59,7 +59,7 @@ def __init__(self, settings: Optional[dict] = None): self._settings: dict = settings_base def __init_subclass__(cls, **kwargs): - """Intialize the subclass.""" + """Initialize the subclass.""" super().__init_subclass__(**kwargs) if cls.extractor_name == "base": raise ValueError( @@ -142,7 +142,7 @@ def extract( raise ExtractionError( f"While doing the extraction we faced the following error:\n " f"{repr(e)}.\n Trace to the error is given by:\n {traceback_str}" - ) + ) from e @abstractmethod def _generate_extractions( diff --git a/src/osc_transformer_presteps/settings.py b/src/osc_transformer_presteps/settings.py index c3c66e4..272dfe1 100644 --- a/src/osc_transformer_presteps/settings.py +++ b/src/osc_transformer_presteps/settings.py @@ -27,30 +27,6 @@ class LogLevel(str, Enum): } -class ExtractionServerSettingsBase(BaseModel): - """Class for Extraction server settings.""" - - port: int = 8000 - host: str = "localhost" - log_type: int = 20 - log_level: LogLevel = LogLevel("info") - - -class ExtractionServerSettings(ExtractionServerSettingsBase): - """Settings for configuring the extraction server. - - This class extends `ExtractionServerSettingsBase` and adds additional - logging configuration. - """ - - def __init__(self, **data) -> None: - """Initialize the ExtractionServerSettings.""" - if "log_level" in data: - data["log_level"] = LogLevel(data["log_level"]) - super().__init__(**data) - self.log_type: int = _log_dict[self.log_level.value] - - class ExtractionSettings(BaseModel): """Settings for controlling extraction behavior. 
diff --git a/src/osc_transformer_presteps/streamlit/__init__.py b/src/osc_transformer_presteps/streamlit/__init__.py deleted file mode 100644 index 9dd09fe..0000000 --- a/src/osc_transformer_presteps/streamlit/__init__.py +++ /dev/null @@ -1 +0,0 @@ -"""Module for Streamlit app.""" diff --git a/tests/osc_transformer_presteps/content_extraction/extractors/test_base_extractor.py b/tests/osc_transformer_presteps/content_extraction/extractors/test_base_extractor.py index 3e1ee06..3caf75f 100644 --- a/tests/osc_transformer_presteps/content_extraction/extractors/test_base_extractor.py +++ b/tests/osc_transformer_presteps/content_extraction/extractors/test_base_extractor.py @@ -1,3 +1,5 @@ +"""Module to test the base_extractor.py.""" + from pathlib import Path from typing import Optional @@ -10,7 +12,7 @@ def concrete_base_extractor(name: str): - """This function replaces all abstract methods by concrete ones.""" + """Replace all abstract methods by concrete ones.""" class ConcreteBaseExtractor(BaseExtractor): extractor_name = name @@ -25,14 +27,15 @@ def _generate_extractions( class TestBaseExtractor: + """Class to collect tests for the BaseExtractor.""" + @pytest.fixture() def base_extractor(self): + """Initialize a concrete BaseExtractor element to test it.""" return concrete_base_extractor("base_test") def test_extractor_name_is_base(self): - """This function tests if we get a ValueError in case a subclass has not changed extractor_name to - something different base. 
- """ + """Tests if we get a ValueError in case a subclass has not changed extractor_name.""" with pytest.raises( ValueError, match="Subclass must define an extractor_name not equal to 'base'.", @@ -40,6 +43,7 @@ def test_extractor_name_is_base(self): concrete_base_extractor("base") def test_get_settings(self, base_extractor): + """Test if retrieving the right settings.""" settings = base_extractor.get_settings() assert settings["annotation_folder"] is None assert settings["min_paragraph_length"] == 20 @@ -47,6 +51,7 @@ def test_get_settings(self, base_extractor): assert settings["store_to_file"] is True def test_get_extractions(self, base_extractor): + """Test if we can retrieve extraction response correctly.""" base_extractor._extraction_response = ExtractionResponse( **{"dictionary": {"a": "b"}, "success": True} ) @@ -54,6 +59,7 @@ def test_get_extractions(self, base_extractor): assert base_extractor.get_extractions().success is True def test_check_for_skip_files(self, base_extractor): + """Test if files are really skipped when defined as such.""" input_file_path = Path(__file__).resolve().parent / "test.pdf" output_folder_path = Path(__file__).resolve().parent assert not base_extractor.check_for_skip_files( @@ -77,6 +83,7 @@ def test_check_for_skip_files(self, base_extractor): json_file_path.unlink(missing_ok=True) def test_save_extraction_to_file(self, base_extractor): + """Test if we can save the output.""" output_file_path = Path(__file__).resolve().parent / "output.json" er = ExtractionResponse() er.dictionary = {"key": "value"} diff --git a/tests/osc_transformer_presteps/content_extraction/extractors/test_pdf_extractor.py b/tests/osc_transformer_presteps/content_extraction/extractors/test_pdf_extractor.py index d0a45fc..9bdad4d 100644 --- a/tests/osc_transformer_presteps/content_extraction/extractors/test_pdf_extractor.py +++ b/tests/osc_transformer_presteps/content_extraction/extractors/test_pdf_extractor.py @@ -1,3 +1,5 @@ +"""Module to test the 
pdf_extractor.py.""" + import json from pathlib import Path @@ -7,8 +9,12 @@ class TestPdfExtractor: + """Class to collect tests for the PDFExtractor class.""" + def test_pdf_with_extraction_issues(self): - """In this test we try to extract the data from a pdf, where one can not extract text as it was produced via + """Test with extraction issue. + + A test where we try to extract the data from a pdf, where one can not extract text as it was produced via a "print". Check the file test_issue.pdf. """ extractor = PDFExtractor() @@ -17,7 +23,9 @@ def test_pdf_with_extraction_issues(self): assert extraction_response.dictionary == {} def test_pdf_with_no_extraction_issues(self): - """In this test we try to extract the data from a pdf, where one can not extract text as it was produced via + """Test with no extraction issue. + + In this test we try to extract the data from a pdf, where one can not extract text as it was produced via a "print". Check the file test_issue.pdf. """ extractor = PDFExtractor() diff --git a/tests/osc_transformer_presteps/content_extraction/test_extraction_factory.py b/tests/osc_transformer_presteps/content_extraction/test_extraction_factory.py index 4be8daa..7c9c8b6 100644 --- a/tests/osc_transformer_presteps/content_extraction/test_extraction_factory.py +++ b/tests/osc_transformer_presteps/content_extraction/test_extraction_factory.py @@ -1,3 +1,5 @@ +"""Module to test the extraction_factory.py.""" + import pytest from osc_transformer_presteps.content_extraction.extraction_factory import get_extractor @@ -7,10 +9,14 @@ class TestGetExtractor: + """Class to collect tests for the get_extractor function.""" + def test_get_pdf_extractor(self): + """Test if we can retrieve the pdf extractor.""" extractor = get_extractor(".pdf") assert isinstance(extractor, PDFExtractor) def test_get_non_existing_extractor(self): + """Test for an error message for an invalid extractor type.""" with pytest.raises(KeyError, match="Invalid extractor type"): 
get_extractor(".thisdoesnotexist") diff --git a/tests/osc_transformer_presteps/dataset_creation_curation/test_curator.py b/tests/osc_transformer_presteps/dataset_creation_curation/test_curator.py index ea61a40..8895aa0 100644 --- a/tests/osc_transformer_presteps/dataset_creation_curation/test_curator.py +++ b/tests/osc_transformer_presteps/dataset_creation_curation/test_curator.py @@ -1,3 +1,5 @@ +"""Module to test the curator.py.""" + import os from pathlib import Path @@ -17,6 +19,7 @@ @pytest.fixture def mock_curator_data(): + """Mimics the curator settings data.""" return { "annotation_folder": cwd / "test_annotations_sliced.xlsx", "extract_json": cwd / "Test.json", @@ -28,16 +31,18 @@ def mock_curator_data(): @pytest.fixture def curator_object(mock_curator_data): + """Fixture to create a fixed Curator object with the given mocked settings data.""" return Curator( - annotation_folder=mock_curator_data["annotation_folder"], + annotation_folder=str(mock_curator_data["annotation_folder"]), extract_json=mock_curator_data["extract_json"], - kpi_mapping_path=mock_curator_data["kpi_mapping_path"], + kpi_mapping_path=str(mock_curator_data["kpi_mapping_path"]), neg_pos_ratio=1, create_neg_samples=True, ) def annotation_to_df(filepath: Path) -> pd.Series: + """Load curation data and return the first row.""" df = pd.read_excel(filepath, sheet_name="data_ex_in_xls") df["annotation_file"] = os.path.basename(filepath) @@ -50,7 +55,10 @@ def annotation_to_df(filepath: Path) -> pd.Series: class TestAnnotationData: + """Class to collect tests for the AnnotationData class.""" + def test_annotation_data_valid_paths(self, mock_curator_data): + """A test to validate that all mentioned paths are ok.""" data = AnnotationData( annotation_folder=mock_curator_data["annotation_folder"], extract_json=mock_curator_data["extract_json"], @@ -61,6 +69,7 @@ def test_annotation_data_valid_paths(self, mock_curator_data): assert data.kpi_mapping_path == cwd / "kpi_mapping_sliced.csv" def 
test_annotation_data_invalid_paths(self): + """A test to validate that wrong paths will raise an error.""" with pytest.raises(ValidationError): AnnotationData( annotation_folder="/invalid/path", @@ -70,6 +79,8 @@ def test_annotation_data_invalid_paths(self): class TestCurator: + """Class to collect tests for the curator module.""" + @pytest.mark.parametrize( "input_text, expected_output", [ @@ -84,29 +95,35 @@ class TestCurator: ], ) def test_clean_text(self, curator_object, input_text, expected_output): + """A test where we test multiple test sentences.""" cleaned_text = curator_object.clean_text(input_text) assert cleaned_text == expected_output def test_clean_text_basic(self, curator_object): + """A test where test sentence is already clean.""" cleaned_text = curator_object.clean_text("This is a test sentence.") assert cleaned_text == "This is a test sentence." def test_clean_text_with_fancy_quotes(self, curator_object): + """A test on cleaning text with special quotes.""" text_with_fancy_quotes = "“This is a test sentence.”" cleaned_text = curator_object.clean_text(text_with_fancy_quotes) assert cleaned_text == '"This is a test sentence."' def test_clean_text_with_newlines_and_tabs(self, curator_object): + """A test on removing new lines and tabs.""" text_with_newlines_tabs = "This\nis\ta\ttest\nsentence." cleaned_text = curator_object.clean_text(text_with_newlines_tabs) assert cleaned_text == "This is a test sentence." def test_clean_text_removing_specific_terms(self, curator_object): + """A test on removing specific terms.""" text_with_boe = "This sentence contains the term BOE." cleaned_text = curator_object.clean_text(text_with_boe) assert cleaned_text == "This sentence contains the term ." 
def test_clean_text_removing_invalid_escape_sequence(self, curator_object): + """A test on removing invalid escape sequence.""" text_with_invalid_escape_sequence = ( "This sentence has an invalid escape sequence: \x9d" ) @@ -114,12 +131,14 @@ def test_clean_text_removing_invalid_escape_sequence(self, curator_object): assert cleaned_text == "This sentence has an invalid escape sequence: " def test_clean_text_removing_extra_backslashes(self, curator_object): + """A test on removing extra backslashes.""" text_with_extra_backslashes = "This\\ sentence\\ has\\ extra\\ backslashes." cleaned_text = curator_object.clean_text(text_with_extra_backslashes) assert cleaned_text == "This sentence has extra backslashes." def test_create_pos_examples_correct_samples(self, curator_object): - row = annotation_to_df(curator_object.annotation_folder) + """A test where we create positive examples via curator.""" + row = annotation_to_df(Path(curator_object.annotation_folder)) pos_example = curator_object.create_pos_examples(row) expected_pos_example = [ "We continue to work towards delivering on our Net Carbon Footprint ambition to " @@ -133,35 +152,39 @@ def test_create_pos_examples_correct_samples(self, curator_object): assert pos_example == expected_pos_example def test_create_pos_examples_json_filename_mismatch(self, mock_curator_data): + """A test for positive examples where we have a json filename mismatch.""" curator = Curator( - annotation_folder=mock_curator_data["annotation_folder"], + annotation_folder=str(mock_curator_data["annotation_folder"]), extract_json=cwd / "Test_another.json", - kpi_mapping_path=mock_curator_data["kpi_mapping_path"], + kpi_mapping_path=str(mock_curator_data["kpi_mapping_path"]), neg_pos_ratio=1, create_neg_samples=True, ) - row = annotation_to_df(curator.annotation_folder) + row = annotation_to_df(Path(curator.annotation_folder)) pos_example = curator.create_pos_examples(row) assert pos_example == [""] def 
test_create_neg_examples_correct_samples(self, curator_object): - row = annotation_to_df(curator_object.annotation_folder) + """A test where we create negative examples via curator.""" + row = annotation_to_df(Path(curator_object.annotation_folder)) neg_example = curator_object.create_neg_examples(row) assert neg_example == ["Shell 2019 Sustainability Report"] def test_create_neg_examples_json_filename_mismatch(self, mock_curator_data): + """A test for negative examples where we have a json filename mismatch.""" curator = Curator( - annotation_folder=mock_curator_data["annotation_folder"], + annotation_folder=str(mock_curator_data["annotation_folder"]), extract_json=cwd / "Test_another.json", - kpi_mapping_path=mock_curator_data["kpi_mapping_path"], + kpi_mapping_path=str(mock_curator_data["kpi_mapping_path"]), neg_pos_ratio=1, create_neg_samples=True, ) - row = annotation_to_df(curator.annotation_folder) + row = annotation_to_df(Path(curator.annotation_folder)) neg_example = curator.create_neg_examples(row) assert neg_example == [""] def test_create_curator_df(self, curator_object): + """A test to create the final dataframe output.""" actual_df = pd.read_csv(cwd / "Actual.csv") output = curator_object.create_curator_df()