Centralize osf data links #304
Comments
Seems like an ok idea, but we have a rule that the tutorials can't use pandas, so downloading the centralized csv file and parsing it is going to be a pain. I'm +0 on it. Good in principle, but work on the tutorials is fairly distributed right now, and it has the potential to cause conflicts. For our workflow, I don't think it saves much effort, though I appreciate how it might make your life easier.
why was pandas banned?
Is a |
I agree. I think this will save a lot of time for localization. There are only about 10-20 places in the tutorials that involve data sets, and changing them doesn't require much time or complex logic. In addition, some data sets are downloaded directly and used, while others are preferentially loaded locally. I think that inconsistency is not so good.
not my decision, but the scope of libraries introduced to the students was intentionally limited to numpy/scipy/matplotlib/scikit-learn to avoid overwhelming them. this is most easily enforced by not having pandas in the environment file that the tests run against.
We've discussed a few times implementing a centralized NMA helper function library, but those conversations never converged. With the colab workflow, you'd need to pip install it in each notebook. So either we need to (A) copy-paste a little helper function everywhere we want to use this pattern, or (B) stand up a centralized utility library while the course is live. I'm -1 on (B), and concerned about the potential of subtle bugs in (A).
got it. I would have voted for a central NMA helper module in pypi. ok, we will stick with our find/replace scripts.
Agreed, I was a proponent (although we'd be iterating on that just as fast as on the tutorials themselves, so pushing releases to pypi would have been a limiting factor, and versioning could have been a nightmare. You can point pip at github, but we'd have needed a whole separate CI pipeline, etc etc). Still strongly in favor of bringing that online for version 2.0, once we have a better sense of common patterns across the tutorials. Robust/abstract data loading is an obvious one.
There is a solution that doesn't (I think) have any of the problems you mentioned, @mwaskom. Instead of putting the code in a module that needs to run in each notebook, we just have a public service like it is a bad solution in that the service is a new dependency. But it is so simple, it could just be an nginx configuration.
It seems like poor design to have the links to datasets peppered throughout the tutorials. One solution would be a csv with columns "dataset", "url".
Then if we change the url for a dataset, it would only need to change in one spot. It also means that if the China team wants to store our data in China, we can just change one file.
Would you be ok with this PR? If so, I can work on the changes.
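For illustration, the csv idea above could be sketched with only the Python standard library, so it would not need pandas. This is a rough sketch, not a proposed implementation: the dataset names, the example.org URLs, and the helper names `load_registry` and `dataset_url` are all hypothetical, and the csv text is inlined here only so the snippet runs on its own.

```python
import csv
import io

# Hypothetical registry contents. In practice this would be the single
# shared csv file with columns "dataset" and "url".
REGISTRY_CSV = """dataset,url
example_dataset,https://example.org/data/example_dataset.npz
another_dataset,https://example.org/data/another_dataset.npz
"""


def load_registry(csv_text):
    """Parse 'dataset,url' csv text into a dict mapping name -> url."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["dataset"].strip(): row["url"].strip() for row in reader}


def dataset_url(name, registry):
    """Look up a dataset url, failing loudly on an unknown name."""
    try:
        return registry[name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise KeyError(f"Unknown dataset {name!r}; known: {known}") from None


registry = load_registry(REGISTRY_CSV)
print(dataset_url("example_dataset", registry))
```

With something like this, moving a dataset (or pointing at a mirror) would only mean editing the csv file; the tutorials would keep referring to datasets by name.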