[REFACTOR] Clean and organise data processing #405
Comments
The target organisation for the repository will look something like this:
This refactoring would involve creating a universal client or framework that can handle different data sources with similar processing patterns. Something that would look like this:
This will be a huge undertaking given the size of this codebase. Here's a step-by-step plan that focuses on gradually refactoring configuration management, then using a universal client for processing data sources, without drastically changing the existing structure:
To keep bugs to a minimum and make reviews easier, each step should preferably have its own PR.
@XavierJp Any feedback so far?
It is... beautiful 🥹
Honestly, this sounds very relevant. Step by step is always a good choice. Will you start the migration with the most complicated clients or the simpler ones?
Data sources:
I’m thinking of implementing an initial version of the client focused on a straightforward data source, such as EGAPRO (see related draft). Each subsequent PR will then add a new data source, potentially introducing additional layers of complexity. This is very much a work in progress, and many improvements are coming (so don't mind the naming conventions, etc.).
@XavierJp What do you think?
Totally agree. You could even do one or two basic sources first, then the hard ones like INSEE and RNE. That would ensure the model is both straightforward and flexible enough.
Potential enhancement:
Related to #405

This PR creates the `DatabaseTableProcessor` class so it can be used as a generic tool to create the SQLite tables. AgenceBio is the first data source to use this new class. We will refactor the other data sources in a second step if we are OK with the implementation. It does not work for RNE and SIRENE yet; we will tackle those later.

Current implementation design:
1. Move any transformation of the data done in `etl` back to the relevant DAG.
2. Add a `table_ddl` to the relevant config.
3. Use the generic `DatabaseTableProcessor` methods to download the file from MinIO and upload it to the SQLite database.

The whole `data_fetch_clean` and `sqlite` folders should disappear as a result.

Note about `dag.py`: `PythonOperator` is still in use so we can easily identify the tasks that need to be refactored. All task instances had to be renamed because, with `@dag`, the instance's name conflicts with the callable's name.
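
To make the design described in this PR more concrete, here is a minimal sketch of how a config-driven table loader might look. It is not the actual `DatabaseTableProcessor`: the names `SourceConfig`, `TableProcessorSketch`, `table_name` and `csv_path` are hypothetical, only `table_ddl` comes from the PR description, and the MinIO download is replaced by a local CSV path so the example stays self-contained. It also assumes the CSV header matches the column names in the DDL.

```python
import csv
import sqlite3
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical per-source config; only `table_ddl` is named in the PR."""

    name: str
    table_name: str
    table_ddl: str  # CREATE TABLE statement for this source
    csv_path: Path  # assumption: stand-in for the file fetched from MinIO


class TableProcessorSketch:
    """Illustrative stand-in for a generic table processor (not the real class)."""

    def __init__(self, config: SourceConfig, db_path: str) -> None:
        self.config = config
        self.db_path = db_path

    def create_table(self, conn: sqlite3.Connection) -> None:
        # Recreate the table from the DDL held in the config.
        conn.execute(f"DROP TABLE IF EXISTS {self.config.table_name}")
        conn.execute(self.config.table_ddl)

    def load_csv(self, conn: sqlite3.Connection) -> int:
        # Insert every CSV row, using the header line as the column list.
        with open(self.config.csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)
            columns = ", ".join(header)
            placeholders = ", ".join("?" for _ in header)
            rows = list(reader)
        conn.executemany(
            f"INSERT INTO {self.config.table_name} ({columns}) "
            f"VALUES ({placeholders})",
            rows,
        )
        return len(rows)

    def run(self) -> int:
        """Create the table, load the file, and return the number of rows."""
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:  # commits on success, rolls back on error
                self.create_table(conn)
                return self.load_csv(conn)
        finally:
            conn.close()
```

With something along these lines, adding a data source would mostly mean declaring a new config entry and calling `TableProcessorSketch(config, db_path).run()` from the relevant DAG task, while source-specific transformations stay in the DAG as step 1 describes.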