
initial commit, faq on dask and parallelization #340

Open
wants to merge 12 commits into base: master

Conversation

aslibese

No description provided.

@aslibese aslibese requested a review from tlogan2000 November 20, 2024 17:09
review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:46Z
----------------------------------------------------------------

Suggestion (this might be a bit wordy?), but I think a bit more context would be good:

Datasets on PAVICS are stored on disk in chunks (small blocks of data), which Dask processes one at a time. Efficient data processing on PAVICS, or with xarray/Dask in general, typically requires understanding two different notions of data chunking: 1) on-disk chunking, or the way the data is stored in the filesystem, and 2) in-memory chunking, or the way Dask breaks a large dataset into manageable portions and processes the data in parallel. In the case of in-memory chunks, it is important to note that users can specify any chunk size they feel is appropriate with respect to their system resources and the analysis at hand. However, this does not mean that the calculation will be efficient. Indeed, to efficiently process large datasets and prevent memory overloads, aligning in-memory chunks with the on-disk chunk structure is essential. This allows Dask to load and process data in a way that minimizes memory usage and I/O operations, speeding up computation.

I begin by checking the on-disk chunk structure of my dataset by loading it with decode_times=False. This skips time decoding and loads only metadata without reading the data into memory.
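A minimal sketch of that first step, assuming a netCDF-backed dataset whose backend reports chunk sizes; the URL and loop are placeholders, not the notebook's actual code:

```python
import xarray as xr

url = "https://example.org/thredds/dodsC/placeholder.nc"  # hypothetical URL, not the notebook's dataset

# Lazy open: skip time decoding and read only the metadata, not the data itself.
ds = xr.open_dataset(url, decode_times=False)

# For netCDF/HDF5-backed data, the on-disk chunk sizes (when the backend
# reports them) are exposed in each variable's encoding.
for name, var in ds.data_vars.items():
    print(name, var.encoding.get("chunksizes"))
```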


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:46Z
----------------------------------------------------------------

Suggested addition at the end here (it makes the connection with what I suggest in the code later):

Since we are calculating a single year but for all of North America, a logical choice might be to chunk the data in memory into a few days along the time dimension but over the entire spatial domain in the lat and lon dimensions:


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:47Z
----------------------------------------------------------------

Line #2.    ds = xr.open_dataset(url, chunks={"time": 200, "lat": 30, "lon": 30})

suggestion: ds = xr.open_dataset(url, chunks={"time": 10, "lat": -1, "lon": -1})
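For context, a hedged illustration reusing the placeholder url from the sketch above: -1 asks Dask for a single chunk spanning the whole dimension, and the resulting layout can be checked directly.

```python
# Small chunks along time, one chunk covering the full spatial domain.
ds = xr.open_dataset(url, chunks={"time": 10, "lat": -1, "lon": -1})

# Confirm how Dask actually split the data.
print(ds.chunks)
```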


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:48Z
----------------------------------------------------------------

Suggestion: add a short markdown section above this code describing how this original chunking choice is likely a bad decision, since we know the on-disk chunk size for the time dimension is much larger than 10. Instead, we can set time to the smallest on-disk chunk size and chunk lat and lon in multiples of 50 (to avoid the large number of very small chunks we would get by using n=50 directly).
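One hedged way to express that alignment in code, assuming the metadata-only dataset from the first sketch and a hypothetical variable name "tas"; the factor of 5 is only illustrative:

```python
# On-disk chunk sizes reported by the backend, e.g. (366, 50, 50).
disk_chunks = ds["tas"].encoding.get("chunksizes")

if disk_chunks is not None:
    t_chunk, lat_chunk, lon_chunk = disk_chunks
    # Keep the full on-disk time chunk; take lat/lon chunks as a multiple of
    # the on-disk size to avoid a very large number of very small chunks.
    ds = xr.open_dataset(
        url,
        chunks={"time": t_chunk, "lat": lat_chunk * 5, "lon": lon_chunk * 5},
    )
```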


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:49Z
----------------------------------------------------------------

Line #2.    ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})

ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50*5, "lon": 50*5})


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:50Z
----------------------------------------------------------------

Line #17.        ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})

Suggestion: use chunks={"time": (365*4)+1, "lat": 50, "lon": 50}

Basically a chunk of 4 years (add one day because ERA5 includes leap years). You might have to explain in the markdown cell why you are changing the chunking, but you can say it's because you are now calculating more than one year.
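In code, the suggested multi-year chunking might look like this (hedged sketch; 365*4 + 1 = 1461 days, the extra day covering one leap year in the block):

```python
# Four calendar years per chunk along time, keeping the on-disk spatial chunks.
ds = xr.open_dataset(url, chunks={"time": (365 * 4) + 1, "lat": 50, "lon": 50})
print(ds.chunks)
```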


review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T16:43:50Z
----------------------------------------------------------------

Line #15.        ds = xr.open_dataset(url, chunks={"time": 366, "lat": 50, "lon": 50})

Suggest the same chunking as before (4 years + 1 day).


Contributor

@tlogan2000 tlogan2000 left a comment

This is really great. I made a few suggestions for the initial sections.

review-notebook-app bot commented Jan 6, 2025

tlogan2000 commented on 2025-01-06T17:48:12Z
----------------------------------------------------------------

Line #21.        lon=slice(-12.5, -20.0),

This slice goes from biggest to smallest, resulting in an empty dataset.

suggest: lon=slice(-80.0, -73.5)
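To illustrate why the order matters (assuming the dataset's lon coordinate is stored in ascending order):

```python
# With an ascending lon coordinate, the slice bounds must also be ascending;
# reversing them selects nothing and returns an empty dataset.
empty = ds.sel(lon=slice(-12.5, -20.0))   # 0 points along lon
subset = ds.sel(lon=slice(-80.0, -73.5))  # the intended selection
print(empty.sizes.get("lon"), subset.sizes.get("lon"))
```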


@aslibese aslibese requested review from tlogan2000 and huard January 20, 2025 18:26
@aslibese aslibese dismissed tlogan2000’s stale review January 20, 2025 18:55

implemented all the changes + added 'time': (365*4)+1 chunking to the last two examples as well

review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:09Z
----------------------------------------------------------------

Move the first paragraph above the sub-title, so readers are not intimidated by a sub-title they don't yet understand.

The text uses both the first person ("I begin") and the third ("users can specify"). I suggest sticking to one or the other.


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:10Z
----------------------------------------------------------------

 delaying execution until compute() is explicitly called.

Saving to disk, or creating a graphic will trigger the compute without "explicitly" calling it. I would just drop "explicitly".
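As a small illustration of that point (hedged; "tas" is a hypothetical variable name and the dataset is assumed to be Dask-backed):

```python
# Lazy: this only builds a task graph, nothing is computed yet.
tmean = ds["tas"].mean(dim="time")

# Any of these trigger the computation, whether or not compute() is called:
tmean.compute()              # explicit
tmean.to_netcdf("tmean.nc")  # writing to disk computes the values
tmean.plot()                 # plotting also forces computation
```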


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:11Z
----------------------------------------------------------------

I think a brief explanation of cores vs threads would be valuable here.
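If a code snippet helps alongside that explanation, one hedged sketch using dask.distributed (the worker and thread counts are arbitrary examples):

```python
from dask.distributed import Client

# n_workers: separate worker processes, each typically occupying its own CPU cores.
# threads_per_worker: threads that share a worker's memory; they help for
# workloads that release the GIL, e.g. most NumPy/xarray operations.
client = Client(n_workers=2, threads_per_worker=2, memory_limit="4GB")
print(client)
```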


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:12Z
----------------------------------------------------------------

resample_before_rl=False

I think this introduces additional complexity that is not explained in the text. Could we pick another indicator that does not need this?

Also, the compute time is fairly long for a tutorial notebook. I think we want people to be able to run these cells and confirm they get the same results. This won't happen if it takes half an hour to run one cell.
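One possible direction (purely illustrative, not a prescription): a conceptually simpler indicator such as xclim's tg_mean on a reduced subset should keep the runtime short; the variable name and slice bounds are assumptions.

```python
import xclim

# Annual mean of daily mean temperature on a small spatial subset keeps the
# tutorial cell fast; slice bounds must follow the coordinate ordering.
tas = ds["tas"].sel(lat=slice(45, 50), lon=slice(-80.0, -73.5))
out = xclim.atmos.tg_mean(tas, freq="YS").compute()
```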


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:13Z
----------------------------------------------------------------

Can we resolve the UserWarning? It adds noise to the results, which can be intimidating to new users.


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:14Z
----------------------------------------------------------------

Line #9.    from xscen.io import save_to_zarr

I note that the header of each cell (imports, url) is the same. Is there a reason to repeat it? Also I think we should reduce the cell outputs here.


review-notebook-app bot commented Jan 21, 2025

huard commented on 2025-01-21T14:52:14Z
----------------------------------------------------------------

Reduce the cell output.

Add a short conclusion, repeating the important "lessons".


@huard
Contributor

huard commented Jan 21, 2025

I really like the notebook. It's clearly written and provides solutions to typical issues we face.

My suggestions are to

  • keep the runtime short for all cells, so the nb can be tested and edited quickly;
  • only display cell outputs relevant to the discussion;
  • keep the examples as conceptually simple as possible to focus on the main concepts;
  • address very technical optimization discussions in a separate section (e.g. Optimization tips)
