initial commit, faq on dask and parallelization #340
base: master
Conversation
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
for more information, see https://pre-commit.ci
tlogan2000 commented on 2025-01-06T16:43:46Z: Suggestion (this might be a bit wordy?), but I think a bit more context would be good:
I begin by checking the on-disk chunk structure of my dataset by loading it with …
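A minimal sketch of what that check could look like (the `url` value and the variable name `tas` are placeholders, not from the notebook):

```python
import xarray as xr

url = "https://example.org/era5_daily.nc"  # placeholder dataset path

# Open lazily, without imposing an in-memory chunking yet
ds = xr.open_dataset(url)

# For netCDF4-backed data, xarray records the on-disk chunk shape in each
# variable's encoding; this returns None if the file is not chunked
print(ds["tas"].encoding.get("chunksizes"))
```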
tlogan2000 commented on 2025-01-06T16:43:46Z: Suggested addition at the end here (makes the connection with what I suggest in the code later):
Since we are calculating a single year, but for all of North America, a logical choice might be to chunk the data in memory for a few days over the …
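For illustration, that seemingly logical choice might look like the sketch below (a guess at the truncated suggestion: small time chunks, the whole domain in space; `url` is a placeholder and `-1` means a single chunk along that dimension):

```python
import xarray as xr

url = "https://example.org/era5_daily.nc"  # placeholder dataset path

# A few days per chunk along time, the entire domain in space
ds = xr.open_dataset(url, chunks={"time": 10, "lat": -1, "lon": -1})
```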
tlogan2000 commented on 2025-01-06T16:43:47Z: Line #2. `ds = xr.open_dataset(url, chunks={"time": 200, "lat": 30, "lon": 30})`
tlogan2000 commented on 2025-01-06T16:43:48Z: Suggestion: add a short markdown section above this code describing how this original chunking choice is likely a bad decision, since we know that the on-disk chunk size for the time dimension is much larger than 10. Instead we can set time to the smallest on-disk chunk size and chunk lat/lon in multiples of 50 (to avoid the large number of very small chunks we would get by using n=50 directly).
tlogan2000 commented on 2025-01-06T16:43:49Z: Line #2. `ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})`
Suggested: `ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50*5, "lon": 50*5})`
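For context, a rough back-of-the-envelope estimate of what the suggested numbers imply per chunk (assuming float32 data; this arithmetic is mine, not from the review):

```python
# One chunk of the suggested shape: 366 time steps x 250 lat x 250 lon
nbytes = 366 * 250 * 250 * 4               # 4 bytes per float32 value
print(f"{nbytes / 1e6:.0f} MB per chunk")  # ~92 MB, a comfortable dask chunk size
```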
tlogan2000 commented on 2025-01-06T16:43:50Z: Line #17. `ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})`
Suggestion: use `"time": (365*4)+1`, basically a chunk of 4 years (add one because ERA5 has leap years). You might have to explain in the markdown cell why you are changing it, but you can say it's because you are now calculating more than one year.
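Spelled out, the suggested chunking would be something like the following (the lat/lon values carry over the earlier 50*5 suggestion; `url` is a placeholder):

```python
import xarray as xr

url = "https://example.org/era5_daily.nc"  # placeholder dataset path

# 4 years + 1 day per time chunk, to account for ERA5's leap years
ds = xr.open_dataset(url, chunks={"time": (365 * 4) + 1, "lat": 250, "lon": 250})
```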
tlogan2000 commented on 2025-01-06T16:43:50Z: Line #15. `ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})` Suggest the same chunking as before (4 years + 1).
This is really great. I made a few suggestions for the initial sections.
tlogan2000 commented on 2025-01-06T17:48:12Z: Line #21. `lon=slice(-12.5, -20.0),`
This slice goes from biggest to smallest, resulting in an empty dataset. Suggest: `lon=slice(-20.0, -12.5),`
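A self-contained illustration of why the order matters (the coordinate values are made up for the example):

```python
import numpy as np
import xarray as xr

# Toy dataset whose lon coordinate increases from west to east
ds = xr.Dataset(coords={"lon": ("lon", np.arange(-25.0, -10.0, 0.25))})

# Label slices must follow the coordinate's ordering:
print(ds.sel(lon=slice(-12.5, -20.0)).sizes)  # {'lon': 0}: empty selection
print(ds.sel(lon=slice(-20.0, -12.5)).sizes)  # non-empty selection
```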
Implemented all the changes and added `'time': (365*4)+1` chunking to the last two examples as well.
huard commented on 2025-01-21T14:52:09Z: Move the first paragraph above the sub-title, so readers are not intimidated because they don't understand the sub-title.
The text uses both the first person ("I begin") and the third ("users can specify"). I suggest sticking to one or the other.
huard commented on 2025-01-21T14:52:10Z: Re "delaying execution until": saving to disk or creating a graphic will trigger the compute without "explicitly" calling it. I would just drop "explicitly".
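A short sketch of the point being made (names and `url` are placeholders):

```python
import xarray as xr

url = "https://example.org/era5_daily.nc"  # placeholder dataset path
ds = xr.open_dataset(url, chunks={"time": (365 * 4) + 1})  # lazy dask arrays

mean = ds["tas"].mean("time")  # still lazy: only the task graph is built

mean.to_netcdf("tas_mean.nc")  # saving to disk triggers the compute...
# mean.plot()                  # ...and creating a graphic would, too
```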
huard commented on 2025-01-21T14:52:11Z: I think a brief explanation of cores vs threads would be valuable here.
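One way to make that distinction concrete in code (the numbers are illustrative, not a recommendation):

```python
from dask.distributed import Client

# 4 worker *processes* x 2 *threads* each can use up to 8 CPU cores;
# threads within a worker share memory, separate worker processes do not
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")
print(client)
```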
huard commented on 2025-01-21T14:52:12Z: `resample_before_rl=False`: I think this introduces additional complexity that is not explained in the text. Could we pick another indicator that does not need this?
Also, the compute time is fairly long for a tutorial notebook. I think we want people to be able to run those and confirm they get the same things. This won't happen if it takes half an hour to run one cell.
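As a sketch of the kind of swap being suggested, an indicator with no run-length options, such as the annual mean of daily mean temperature (assuming `ds.tas` exists and carries valid units metadata; `url` is a placeholder):

```python
import xarray as xr
import xclim

url = "https://example.org/era5_daily.nc"  # placeholder dataset path
ds = xr.open_dataset(url, chunks={"time": (365 * 4) + 1, "lat": 250, "lon": 250})

# tg_mean takes no resample_before_rl argument, so the extra complexity
# (and much of the run-length compute cost) goes away
out = xclim.atmos.tg_mean(ds.tas, freq="YS")
```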
huard commented on 2025-01-21T14:52:13Z: Can we resolve the UserWarning? It adds noise to the results, which can be intimidating to new users.
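If the root cause can't be fixed, the warning can at least be filtered so it doesn't clutter the cell output (a fallback sketch; the message pattern below is a placeholder to be replaced by the real one):

```python
import warnings

# Prefer fixing the cause; failing that, silence only the specific warning
warnings.filterwarnings("ignore", category=UserWarning, message=".*<message pattern>.*")
```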
huard commented on 2025-01-21T14:52:14Z: Line #9. `from xscen.io import save_to_zarr`
I note that the header of each cell (imports, url) is the same. Is there a reason to repeat it? Also, I think we should reduce the cell outputs here.
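A single setup cell at the top of the notebook could replace the repeated headers, along these lines (the `url` value is a placeholder; the import is the one quoted above):

```python
# Shared setup cell: run once, reused by every example below
import xarray as xr
from xscen.io import save_to_zarr

url = "https://example.org/era5_daily.nc"  # placeholder dataset path
```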
huard commented on 2025-01-21T14:52:14Z: Reduce the cell output. Add a short conclusion, repeating the important "lessons".
I really like the notebook. It's clearly written and provides solutions to typical issues we face. My suggestions are to …