
[REFACTOR] Clean and organise data processing #405

Open · 3 tasks
HAEKADI opened this issue Oct 17, 2024 · 10 comments

@HAEKADI commented Oct 17, 2024

@HAEKADI commented Oct 25, 2024

The target organisation for the repository will look something like this:

dag_datalake_sirene/
│
├── config/
│   ├── __init__.py
│   ├── constants.py  # Global constants
│   └── settings.py   # Environment-specific settings
│
├── helpers/
│   ├── __init__.py
│   ├── data_processor.py
│   ├── config_models.py
│   ├── minio_helpers.py
│   ├── tchap.py
│   ├── utils.py
│   └── sqlite_client.py
│
├── workflows/
│   ├── __init__.py
│   │
│   ├── data_pipelines/
│   │   ├── __init__.py
│   │   │
│   │   ├── egapro/
│   │   │   ├── __init__.py
│   │   │   ├── dag_egapro.py
│   │   │   ├── egapro_processor.py
│   │   │   └── task_functions.py
│   │   │
│   │   ├── rge/
│   │   │   ├── __init__.py
│   │   │   ├── dag_rge.py
│   │   │   ├── rge_processor.py
│   │   │   └── task_functions.py
│   │   │
│   │   ├── colter/
│   │   │   ├── __init__.py
│   │   │   ├── dag_colter.py
│   │   │   ├── colter_processor.py
│   │   │   └── task_functions.py
│   │   │
│   │   └── ... (other data sources)
│   │
│   └── common/
│       ├── __init__.py
│       └── common_tasks.py  # Tasks that might be shared across DAGs
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   │
│   ├── helpers/
│   │   ├── __init__.py
│   │   ├── test_data_processor.py
│   │   └── ... (tests for other helpers)
│   │
│   └── workflows/
│       ├── __init__.py
│       │
│       └── data_pipelines/
│           ├── __init__.py
│           ├── test_egapro.py
│           ├── test_rge.py
│           └── ... (tests for other data pipelines)
│
├── requirements.txt
├── setup.py
└── README.md
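
To make the layout concrete, here is a hypothetical sketch of how `dag_egapro.py` could wire up functions from its sibling `task_functions.py` (module path, function names, and scheduling are assumptions, not the actual code):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed location under the proposed layout.
from dag_datalake_sirene.workflows.data_pipelines.egapro.task_functions import (
    download_data,
    process_data,
    save_to_minio,
)

with DAG(
    dag_id="data_processing_egapro",
    start_date=datetime(2024, 10, 1),
    schedule=None,
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_data", python_callable=download_data)
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    upload = PythonOperator(task_id="save_to_minio", python_callable=save_to_minio)

    # Linear pipeline: fetch, transform, then publish.
    download >> process >> upload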

@HAEKADI changed the title from "[REFACTOR]Clean data ressource processing" to "[REFACTOR]Clean and organise data processing" on Oct 25, 2024
@HAEKADI commented Oct 25, 2024

This refactoring would involve creating a universal client or framework that can handle different data sources with similar processing patterns. Something that would look like this:

from abc import ABC, abstractmethod

# Assuming these helpers live in the helpers/ package shown in the tree above.
from helpers.minio_helpers import minio_client
from helpers.tchap import send_message


class DataProcessor(ABC):
    """Base class shared by every data source pipeline."""

    def __init__(self, config):
        self.config = config
        self.minio_client = minio_client

    @abstractmethod
    def download_data(self):
        pass

    @abstractmethod
    def process_data(self):
        pass

    @abstractmethod
    def save_to_minio(self):
        pass

    @abstractmethod
    def compare_files_minio(self):
        pass

    def send_notification(self, message):
        send_message(message)
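
For illustration, a concrete subclass for a simple source could look like the sketch below (method bodies, config fields, and helper calls are assumptions, not the actual implementation):

import requests


class EgaproProcessor(DataProcessor):
    """Hypothetical processor for the EGAPRO source."""

    def download_data(self):
        # Fetch the raw file from the upstream URL defined in the config.
        response = requests.get(self.config.url, timeout=60)
        response.raise_for_status()
        with open(self.config.local_path, "wb") as f:
            f.write(response.content)

    def process_data(self):
        # Source-specific cleaning and transformations would go here.
        ...

    def save_to_minio(self):
        # Push the processed file to the bucket path from the config.
        ...

    def compare_files_minio(self):
        # Diff the new file against the latest version already in MinIO.
        ...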

This will be a huge undertaking given the size of this codebase.

Here's a step-by-step plan that focuses on gradually refactoring configuration management, then using a universal client for processing data sources, without drastically changing the existing structure:

  1. Introduce data classes for configuration management (see the sketch below)
  2. Create a universal data processing client (like shown above)
  3. Refactor one data source using data classes and DataProcessor
  4. Gradually refactor other data sources (each data source gets a PR)
  5. Enhance DataProcessor as needed
  6. Update tests
  7. Update documentation

To minimise bugs and simplify review, each step will preferably have its own PR.
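
For step 1, a minimal sketch of such a configuration data class, assuming purely illustrative field names:

from dataclasses import dataclass


@dataclass(frozen=True)
class DataSourceConfig:
    # Illustrative fields only; the real config will differ per source.
    name: str        # e.g. "egapro"
    url: str         # upstream download URL
    local_path: str  # local working file path
    minio_path: str  # destination path in the MinIO bucket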

@HAEKADI commented Oct 25, 2024

@XavierJp Any feedback so far?

@XavierJp

It is... beautiful 🥹

@XavierJp

Honestly, this sounds very relevant. Step by step is always a good choice. Will the migration start with the most complicated clients or the simpler ones?

@HAEKADI commented Oct 25, 2024

I’m thinking of implementing an initial version of the client focused on a straightforward data source, such as EGAPRO (see related draft).

Then each PR will add a new data source, potentially introducing additional layers of complexity.

This is very much a work in progress; many improvements are coming (so don't mind the naming conventions, etc.).

@HAEKADI commented Oct 25, 2024

@XavierJp What do you think?

@XavierJp

Totally agree. You could even do one or two basic sources, then the hard ones like INSEE and RNE, thus ensuring the model is both straightforward and flexible enough.

@HAEKADI commented Nov 22, 2024

Potential enhancements:

  • Add a MinIOFile and an AirflowFile to distinguish the types of files being processed.
  • Use dataclasses and Pydantic when possible instead of TypedDict.
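
A minimal sketch of that distinction, assuming plain dataclasses and illustrative fields:

from dataclasses import dataclass


@dataclass
class AirflowFile:
    # A file on the Airflow worker's local filesystem.
    local_path: str


@dataclass
class MinIOFile:
    # A file stored in a MinIO bucket.
    bucket: str
    key: str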

HAEKADI added a commit that referenced this issue Jan 24, 2025
hacherix added a commit that referenced this issue Jan 27, 2025
Related to #405

This PR creates the DatabaseTableProcessor class so it can be used as a generic tool to create the SQLite tables.

AgenceBio is the first data source to use this new class. We will refactor the other data sources in a second step if we are OK with the implementation.

It does not work for RNE and SIRENE yet. We will tackle those later.

Current implementation design:
1. Move any data transformations done in `etl` back into the relevant DAG.
2. Add a `table_ddl` entry to the relevant config.
3. Use the generic DatabaseTableProcessor methods to download the file from MinIO and upload it into the SQLite database.

The whole `data_fetch_clean` and `sqlite` folders should disappear as a
result.

Note about `dag.py`:
PythonOperator is still in use so we can easily identify the tasks that need to be refactored.

All task instances had to be renamed because, with `@dag`, the instance's name conflicts with the callable's name.
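
For context, the `table_ddl` entry from step 2 could plausibly be a plain DDL string attached to the source's config; a purely hypothetical example (the real AgenceBio schema may differ):

# Hypothetical value for a `table_ddl` config entry.
AGENCE_BIO_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS agence_bio (
    siren TEXT,
    id_bio TEXT,
    etat_certification TEXT
);
"""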