
initial commit, faq on dask and parallelization #340

Open
wants to merge 12 commits into base: master

Conversation

aslibese

No description provided.

@aslibese aslibese requested a review from tlogan2000 November 20, 2024 17:09
review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:46Z
----------------------------------------------------------------

Suggestion (this might be a bit wordy?), but I think a bit more context would be good:

Datasets on PAVICS are stored on disk in chunks (small blocks of data), which Dask processes one at a time. Efficient data processing on PAVICS, or with xarray/Dask in general, typically requires understanding two different notions of data chunking: 1) on-disk chunking, or the way the data is stored in the filesystem, and 2) in-memory chunking, or the way Dask breaks a large dataset into manageable portions and processes the data in parallel. In the case of in-memory chunks, it is important to note that users can specify any chunk size they feel is appropriate with respect to their system resources and the analysis at hand. However, this does not mean that the calculation will be efficient. Indeed, to efficiently process large datasets and prevent memory overloads, aligning in-memory chunks with the on-disk chunk structure is essential. This allows Dask to load and process data in a way that minimizes memory usage and I/O operations, speeding up computation.

I begin by checking the on-disk chunk structure of my dataset by loading it with decode_times=False. This skips time decoding and loads only metadata without reading the data into memory.
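A minimal sketch of that first step, assuming a netCDF-backed dataset whose backend reports chunk sizes; the URL and loop are placeholders, not the notebook's actual code:

```python
import xarray as xr

url = "https://example.org/thredds/dodsC/placeholder.nc"  # hypothetical URL, not the notebook's dataset

# Lazy open: skip time decoding and read only the metadata, not the data itself.
ds = xr.open_dataset(url, decode_times=False)

# For netCDF/HDF5-backed data, the on-disk chunk sizes (when the backend
# reports them) are exposed in each variable's encoding.
for name, var in ds.data_vars.items():
    print(name, var.encoding.get("chunksizes"))
```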


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:46Z
----------------------------------------------------------------

Suggested addition at the end here (it makes the connection with what I suggest in the code later):

Since we are calculating a single year but for all of North America, a logical choice might be to chunk the data in memory into a few days along the time dimension but over the entire spatial domain in the lat and lon dimensions:


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:47Z
----------------------------------------------------------------

Line #2.    ds = xr.open_dataset(url, chunks={"time": 200, "lat": 30, "lon": 30})

suggestion: ds = xr.open_dataset(url, chunks={"time": 10, "lat": -1, "lon": -1})
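For context, a hedged illustration reusing the placeholder url from the sketch above: -1 asks Dask for a single chunk spanning the whole dimension, and the resulting layout can be checked directly.

```python
# Small chunks along time, one chunk covering the full spatial domain.
ds = xr.open_dataset(url, chunks={"time": 10, "lat": -1, "lon": -1})

# Confirm how Dask actually split the data.
print(ds.chunks)
```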


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:48Z
----------------------------------------------------------------

Suggestion: add a short markdown section above this code describing how this original chunking choice is likely a bad decision, since we know the on-disk chunk size for the time dimension is much larger than 10. Instead, we can set time to the smallest on-disk chunk size and chunk lat and lon in multiples of 50 (to avoid the large number of very small chunks we would get by using n=50 directly).
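One hedged way to express that alignment in code, assuming the metadata-only dataset from the first sketch and a hypothetical variable name "tas"; the factor of 5 is only illustrative:

```python
# On-disk chunk sizes reported by the backend, e.g. (366, 50, 50).
disk_chunks = ds["tas"].encoding.get("chunksizes")

if disk_chunks is not None:
    t_chunk, lat_chunk, lon_chunk = disk_chunks
    # Keep the full on-disk time chunk; take lat/lon chunks as a multiple of
    # the on-disk size to avoid a very large number of very small chunks.
    ds = xr.open_dataset(
        url,
        chunks={"time": t_chunk, "lat": lat_chunk * 5, "lon": lon_chunk * 5},
    )
```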


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:49Z
----------------------------------------------------------------

Line #2.    ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})

ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50*5, "lon": 50*5})


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:50Z
----------------------------------------------------------------

Line #17.        ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})

Suggestion: use chunks={"time": (365*4)+1, "lat": 50, "lon": 50}

Basically a chunk of 4 years (add one day because ERA5 includes leap years). You might have to explain in the markdown cell why you are changing the chunking, but you can say it's because you are now calculating more than one year.
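In code, the suggested multi-year chunking might look like this (hedged sketch; 365*4 + 1 = 1461 days, the extra day covering one leap year in the block):

```python
# Four calendar years per chunk along time, keeping the on-disk spatial chunks.
ds = xr.open_dataset(url, chunks={"time": (365 * 4) + 1, "lat": 50, "lon": 50})
print(ds.chunks)
```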


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:50Z
----------------------------------------------------------------

Line #15.        ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})

Suggest the same chunking as before (4 years + 1 day).


Contributor

@tlogan2000 tlogan2000 left a comment

This is really great. I made a few suggestions for the initial sections.

review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T17:48:12Z
----------------------------------------------------------------

Line #21.        lon=slice(-12.5, -20.0),

This slice goes from biggest to smallest, resulting in an empty dataset.

suggest: lon=slice(-80.0, -73.5)
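To illustrate why the order matters (assuming the dataset's lon coordinate is stored in ascending order):

```python
# With an ascending lon coordinate, the slice bounds must also be ascending;
# reversing them selects nothing and returns an empty dataset.
empty = ds.sel(lon=slice(-12.5, -20.0))   # 0 points along lon
subset = ds.sel(lon=slice(-80.0, -73.5))  # the intended selection
print(empty.sizes.get("lon"), subset.sizes.get("lon"))
```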


@aslibese aslibese requested review from tlogan2000 and huard January 20, 2025 18:26
@aslibese aslibese dismissed tlogan2000’s stale review January 20, 2025 18:55

implemented all the changes + added 'time': (365*4)+1 chunking to the last two examples as well

review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:09Z
----------------------------------------------------------------

Move the first paragraph above the sub-title, so readers are not intimidated by a sub-title they don't yet understand.

The text uses both the first person ("I begin") and the third ("users can specify"). I suggest sticking to one or the other.


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:10Z
----------------------------------------------------------------

 delaying execution until compute() is explicitly called.

Saving to disk, or creating a graphic will trigger the compute without "explicitly" calling it. I would just drop "explicitly".
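As a small illustration of that point (hedged; "tas" is a hypothetical variable name and the dataset is assumed to be Dask-backed):

```python
# Lazy: this only builds a task graph, nothing is computed yet.
tmean = ds["tas"].mean(dim="time")

# Any of these trigger the computation, whether or not compute() is called:
tmean.compute()              # explicit
tmean.to_netcdf("tmean.nc")  # writing to disk computes the values
tmean.plot()                 # plotting also forces computation
```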


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:11Z
----------------------------------------------------------------

I think a brief explanation of cores vs threads would be valuable here.
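If a code snippet helps alongside that explanation, one hedged sketch using dask.distributed (the worker and thread counts are arbitrary examples):

```python
from dask.distributed import Client

# n_workers: separate worker processes, each typically occupying its own CPU cores.
# threads_per_worker: threads that share a worker's memory; they help for
# workloads that release the GIL, e.g. most NumPy/xarray operations.
client = Client(n_workers=2, threads_per_worker=2, memory_limit="4GB")
print(client)
```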


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:12Z
----------------------------------------------------------------

resample_before_rl=False

I think this introduces additional complexity that is not explained in the text. Could we pick another indicator that does not need this?

Also, the compute time is fairly long for a tutorial notebook. I think we want people to be able to run these cells and confirm they get the same results. This won't happen if it takes half an hour to run one cell.
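One possible direction (purely illustrative, not a prescription): a conceptually simpler indicator such as xclim's tg_mean on a reduced subset should keep the runtime short; the variable name and slice bounds are assumptions.

```python
import xclim

# Annual mean of daily mean temperature on a small spatial subset keeps the
# tutorial cell fast; slice bounds must follow the coordinate ordering.
tas = ds["tas"].sel(lat=slice(45, 50), lon=slice(-80.0, -73.5))
out = xclim.atmos.tg_mean(tas, freq="YS").compute()
```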


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:13Z
----------------------------------------------------------------

Can we resolve the UserWarning? It adds noise to the results, which can be intimidating to new users.


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:14Z
----------------------------------------------------------------

Line #9.    from xscen.io import save_to_zarr

I note that the header of each cell (imports, url) is the same. Is there a reason to repeat it? Also I think we should reduce the cell outputs here.


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:14Z
----------------------------------------------------------------

Reduce the cell output.

Add a short conclusion, repeating the important "lessons".


@huard
Contributor

huard commented Jan 21, 2025

I really like the notebook. It's clearly written and provides solutions to typical issues we face.

My suggestions are to

  • keep the runtime short for all cells, so the nb can be tested and edited quickly;
  • only display cell outputs relevant to the discussion;
  • keep the examples as conceptually simple as possible to focus on the main concepts;
  • address very technical optimization discussions in a separate section (e.g. Optimization tips)
