
Isolate the data acquisition from the data publication #3

Open
jufemaiz opened this issue Aug 28, 2018 · 2 comments
@jufemaiz
Contributor

At the moment, nemweb mashes together data acquisition (e.g. download zip, extract, process) with data publication (persistence to nemweb_sqlite).

I'm proposing that we, at a minimum, separate these two processes and introduce a dedicated processor. That way, there is the potential to use other storage solutions (including publication to a queue for writing).

Thoughts?
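Roughly, what I have in mind is something like this (a sketch only - the names are hypothetical, not the current nemweb API): acquisition only downloads and parses, and an injected publisher decides where the parsed rows go.

```python
"""Minimal sketch of splitting acquisition from publication (hypothetical names)."""
from typing import Dict, Iterable, List, Protocol, Tuple
import sqlite3

Row = Dict[str, object]          # one parsed record
Table = Tuple[str, List[Row]]    # (table name, rows)


class Publisher(Protocol):
    """Anything that can receive parsed rows and persist or forward them."""
    def publish(self, table_name: str, rows: List[Row]) -> None: ...


class SqlitePublisher:
    """Roughly what nemweb_sqlite does today: append rows to a local sqlite file."""
    def __init__(self, path: str = "nemweb_live.db") -> None:
        self.path = path

    def publish(self, table_name: str, rows: List[Row]) -> None:
        if not rows:
            return
        columns = list(rows[0])
        placeholders = ", ".join("?" for _ in columns)
        with sqlite3.connect(self.path) as conn:
            conn.execute(
                f"CREATE TABLE IF NOT EXISTS {table_name} ({', '.join(columns)})"
            )
            conn.executemany(
                f"INSERT INTO {table_name} VALUES ({placeholders})",
                [tuple(row[c] for c in columns) for row in rows],
            )


class QueuePublisher:
    """Alternative backend: push rows onto a queue for a separate writer to persist."""
    def __init__(self, queue) -> None:
        self.queue = queue  # e.g. queue.Queue, or a thin SQS/Kafka wrapper

    def publish(self, table_name: str, rows: List[Row]) -> None:
        self.queue.put((table_name, rows))


def acquire(urls: Iterable[str]) -> Iterable[Table]:
    """Acquisition only: download zip, extract, parse. No knowledge of storage."""
    for url in urls:
        yield from parse_nemweb_zip(url)  # hypothetical parser: yields (table, rows)


def process(urls: Iterable[str], publisher: Publisher) -> None:
    """The 'processor': glue acquisition to whichever publisher was injected."""
    for table_name, rows in acquire(urls):
        publisher.publish(table_name, rows)
```

`process(urls, SqlitePublisher())` would reproduce the current behaviour, while `process(urls, QueuePublisher(q))` would hand the same rows to a queue without touching the downloader at all.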

@dylanjmcconnell
Member

Hey,

At the moment there is a data server at unimelb running the backend. The python scripts actually interact with a mysql database (rather than a sqlite db).

I made a very simple sqlite interface, basically because I thought it would be more useful (or user friendly) than requiring someone to set up a mysql server. The mysql server is pretty strict (i.e. normalised, foreign key constraints etc) - and quite large... Some series go back to the start of the NEM.

There is some degree of abstraction between the mysql interface (which uses sqlalchemy) and the downloading/processing - but there is also a fair bit of interaction between the download/processing and the data persistence, since there is a degree of mapping between primary key tables in the mysql db and the downloaded files (if that makes sense)... I think I did try separating it out completely once before (but I gave up).
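For illustration only (hypothetical table and column names, not the actual nemweb/mysql schema), that mapping looks roughly like this: building an output row needs a get-or-create lookup against a primary-key table, so processing can't run without the database.

```python
# Sketch only - hypothetical names, not the real nemweb schema.
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session

Base = declarative_base()


class Duid(Base):
    """Primary-key table: one row per dispatchable unit."""
    __tablename__ = "duid"
    id = Column(Integer, primary_key=True)
    name = Column(String(32), unique=True, nullable=False)


def resolve_duid_id(session: Session, duid_name: str) -> int:
    """Get-or-create lookup: processing a downloaded file means talking to the db."""
    duid = session.query(Duid).filter_by(name=duid_name).one_or_none()
    if duid is None:
        duid = Duid(name=duid_name)
        session.add(duid)
        session.flush()  # assigns duid.id
    return duid.id


def process_row(session: Session, raw_row: dict) -> dict:
    """Turning a parsed CSV row into an insertable row needs the foreign key above,
    which is why download/processing and persistence are hard to pull apart."""
    return {
        "duid_id": resolve_duid_id(session, raw_row["DUID"]),
        "scada_value": float(raw_row["SCADAVALUE"]),
    }
```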

Longer term, I was thinking of running the python scripts on an EC2 server and using Amazon RDS (rather than the unimelb data server)... The web front end is on S3, btw. Even better / longer term would be a docker container - but that's a looong way down the track I think.

Am open to suggestions on all of this - but that's where my thinking is at the moment. I have some local branches for interfacing with mysql etc. (which I'll eventually push to the repo when I'm not entirely embarrassed by them). But yeah, in the meantime, only the lightweight sqlite interface is in the master repo.

p.s. looking to add you to Slack (we already have one) but I'm not the workspace 'owner'

Cheers, Dylan

@jufemaiz
Contributor Author

Sweet! OK, thanks for the extra information on this. I've got some ideas that I'll try to write up and throw at you.

I was just going through the repo to try to bring in some pytest & pylint, and the above was my first impression. Funny you mention docker, because I've already got the container part working. As for the database & data interactions, I've got some ideas there too, to try to make this scalable + ultra cheap to run!
