Info for chunking to fix #52 #58

Open, wants to merge 21 commits into base: main
@@ -3,6 +3,8 @@
## Learning Objectives
- Understand what cloud native data formats are
- Understand how the cloud does computing more efficiently
- Understand chunking
- Understand how chunking impacts performance


## Outline
@@ -13,6 +15,7 @@
- Examples
- Performance in the cloud
- Tiling
- Chunking
- Scaling
- Distributed Computing
- The Microsoft Planetary Computer Setup - State of the Art open source cloud native technology stack
@@ -30,7 +33,26 @@ Cloud native formats or cloud-optimized formats, are file formats specifically d
### Characteristics of cloud native data formats
Cloud-optimized mainly means optimized "read" access, with support for partial reads and parallel reads. The main characteristics common to cloud-optimized formats are:

- **Data Chunking**: When working with large data files or collections, it’s often impossible to load all the data into a single computer’s memory at once. In such cases, a data chunking approach can be highly effective. By dividing the dataset into smaller chunks, the data can be processed piece by piece without exceeding the computer's memory capacity. This approach is particularly useful for managing large datasets on a single machine and can also scale to distributed computing environments, such as cloud platforms or high-performance computing systems.

**Cloud native** formats employ a chunk-based organization, where the data is divided into smaller chunks or blocks. This enables parallel processing and efficient retrieval of specific portions of the data, reducing the need to access the entire dataset.

A **chunk** is the smallest atomic unit of a larger dataset that can be processed independently, enabling efficient data handling by dividing the dataset into manageable pieces without requiring the entire dataset to be loaded into memory.
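To make this definition concrete, here is a minimal sketch in plain Python of processing a dataset chunk by chunk so that only one chunk is held in memory at a time. The helper names `iter_chunks` and `chunked_mean` and the chunk size are illustrative assumptions, not part of any library used in this course.

```python
def iter_chunks(values, chunk_size):
    """Yield successive lists of at most `chunk_size` items from any iterable."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly smaller, chunk.
        yield chunk


def chunked_mean(values, chunk_size):
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0
    count = 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty dataset")
    return total / count


# Hypothetical "large" dataset: the integers 1..1,000,000, never fully materialized.
mean = chunked_mean(range(1, 1_000_001), chunk_size=10_000)
print(mean)
```

Each chunk contributes only a partial sum and a count, so the full dataset never has to fit in memory.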

The figure below visually explains the concept of chunking: on the left, a three-dimensional dataset (x, y, and time) is shown without chunks, while on the right, the same dataset is displayed with chunks highlighted.

| Dataset without chunking | Dataset with chunking |
| ---------------------------------------------------------------- | ------------------------------------------------------- |
| ![No Chunking](assets/notchunked.png "Dataset without chunking") | ![Chunking](assets/chunked.png "Dataset with chunking") |
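The chunk grid shown in the figure can be sketched as a set of index ranges per dimension. The snippet below is a minimal illustration, assuming a hypothetical dataset shape of (time, y, x) = (365, 1000, 1000) and a hypothetical chunk shape; `chunk_slices` is an illustrative helper, not a library function.

```python
from itertools import product


def chunk_slices(shape, chunk_shape):
    """Yield one tuple of slices per chunk covering an array of the given shape."""
    per_dimension = [
        [slice(start, min(start + size, dim)) for start in range(0, dim, size)]
        for dim, size in zip(shape, chunk_shape)
    ]
    # The Cartesian product of the per-dimension ranges enumerates every chunk.
    yield from product(*per_dimension)


# Hypothetical dataset dimensions (time, y, x) and chunk shape.
shape = (365, 1000, 1000)
chunk_shape = (30, 250, 250)

chunks = list(chunk_slices(shape, chunk_shape))
print(len(chunks))   # number of chunks in the grid
print(chunks[0])     # index ranges of the first chunk
print(chunks[-1])    # index ranges of the last chunk (shorter along time)
```

Chunks at the edge of a dimension are simply smaller when the dimension length is not a multiple of the chunk size, as with the last five time steps here.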


There are different ways to chunk data, depending on the nature of the dataset and the analysis requirements. Spatial chunking divides data based on geographical or spatial dimensions (e.g., longitude, latitude), which is ideal for geospatial datasets where the data is naturally distributed across space. Time-based chunking focuses on temporal dimensions (e.g., by day, month, or year), which is suitable for time-series data. Another approach is box chunking, where data is divided into fixed-size blocks (e.g., cubes or boxes), providing a balance between spatial and time-based chunking. The choice of chunking strategy can significantly impact the efficiency of data access—spatial chunking is optimal for spatial queries, while time-based chunking improves access to time-series data. Using the right chunking strategy can reduce the computational overhead and improve the overall performance of data processing tasks.

The table below illustrates the two most common chunking strategies:

| Spatial chunking strategy | Box chunking strategy |
| ------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| ![Spatial Chunking](assets/spatialchunking.png "Dataset with spatial chunking") | ![Box Chunking](assets/boxchunking.png "Dataset with box chunking") |
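One way to see why the chunking strategy matters is to count how many chunks a given query has to read. The sketch below does this for three hypothetical chunk shapes on a hypothetical (time, y, x) = (365, 1000, 1000) dataset. The layouts are described by their shape rather than by a strategy name, since how they map onto the names above depends on the dataset's conventions; `chunks_touched` is an illustrative helper.

```python
from math import prod


def chunks_touched(query, chunk_shape):
    """Count the chunks intersecting a query given as (start, stop) per dimension."""
    return prod(
        (stop - 1) // size - start // size + 1
        for (start, stop), size in zip(query, chunk_shape)
    )


# Hypothetical chunk shapes for a (time, y, x) = (365, 1000, 1000) dataset.
layouts = {
    "one full map per time step": (1, 1000, 1000),
    "full time series per 50x50 tile": (365, 50, 50),
    "box": (30, 250, 250),
}

# Two typical access patterns, expressed as index ranges per dimension.
queries = {
    "time series at one pixel": [(0, 365), (500, 501), (500, 501)],
    "map for one time step": [(100, 101), (0, 1000), (0, 1000)],
}

counts = {
    (layout_name, query_name): chunks_touched(query, chunk_shape)
    for layout_name, chunk_shape in layouts.items()
    for query_name, query in queries.items()
}

for (layout_name, query_name), n_chunks in counts.items():
    print(f"{layout_name:32s} | {query_name:25s} | {n_chunks} chunks")
```

Under these assumptions, each specialized layout needs a single chunk for the query it favors and hundreds for the other, while the box layout stays in between for both.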

- **Internal Indexing:** These formats incorporate internal indexing structures that facilitate fast spatial and attribute queries. This enables efficient data access and retrieval operations without the need for extensive scanning or processing of the entire dataset.

@@ -80,11 +102,14 @@ Both horizontal and vertical scaling have their advantages and considerations. H

In common workflows, a combination of both approaches is used to ensure optimal speed and resource utilization while being able to keep the simplicity of a workflow.

## How to scale

There are many approaches to handling scaling properly.
We will use two Pangeo exercises to understand __Vertical scaling__ and __Horizontal scaling__ using chunking and Dask.

[Exercise 2.4 chunking](./exercises/24_chunking.ipynb)

todo: parallel computing section
[Exercise 2.4 dask](./exercises/24_dask.ipynb)
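As a rough illustration of how chunking enables parallel work, the sketch below splits a computation into independent chunks, processes them with a pool of workers, and combines the partial results. It deliberately uses Python's standard-library `concurrent.futures` rather than Dask, so it only mirrors the general pattern and not Dask's API; the function names and sizes are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(n_items, chunk_size):
    """Describe each chunk as a range of item indices."""
    return [
        range(start, min(start + chunk_size, n_items))
        for start in range(0, n_items, chunk_size)
    ]


def partial_sum_of_squares(chunk):
    """Process a single chunk independently of all others."""
    return sum(value * value for value in chunk)


def chunked_sum_of_squares(n_items, chunk_size, max_workers=4):
    """Map the per-chunk function over all chunks, then reduce the partial results."""
    chunks = split_into_chunks(n_items, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


total = chunked_sum_of_squares(1_000_000, chunk_size=100_000)
print(total)
```

In standard CPython, threads do not speed up pure-Python arithmetic like this, so the example shows the split, map, and combine structure only. Adding more workers on one machine or spreading chunks across several machines follows the same structure, which is the part a scheduler such as Dask takes care of.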

### Subscription vs. On-Demand usage

@@ -0,0 +1,7 @@
{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "HYBAS_ID": 2090516090, "NEXT_DOWN": 2090516950, "NEXT_SINK": 2090012980, "MAIN_BAS": 2090012980, "DIST_SINK": 334.5, "DIST_MAIN": 334.5, "SUB_AREA": 419.1, "UP_AREA": 419.2, "PFAF_ID": 214040804, "ENDO": 0, "COAST": 0, "ORDER": 3, "SORT": 10988 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 11.075, 46.729166666666693 ], [ 11.072575547960094, 46.728813340928845 ], [ 11.069091118706622, 46.725353325737871 ], [ 11.048285590277802, 46.724646674262175 ], [ 11.042024739583358, 46.730876329210091 ], [ 11.041666666666691, 46.733333333333356 ], [ 11.038123575846377, 46.73424241807728 ], [ 11.0375, 46.745833333333358 ], [ 11.03664923773874, 46.749149237738742 ], [ 11.030017428927975, 46.750850762261308 ], [ 11.028599378797768, 46.756377156575546 ], [ 11.025567287868947, 46.76028951009117 ], [ 11.0244327121311, 46.764710489908879 ], [ 11.021400621202281, 46.768622843424502 ], [ 11.020833333333357, 46.770833333333357 ], [ 11.024291653103322, 46.771720716688392 ], [ 11.025567287868947, 46.781377156575545 ], [ 11.032766045464435, 46.789456176757838 ], [ 11.033900621202282, 46.793877156575547 ], [ 11.036932712131101, 46.797789510091171 ], [ 11.038067287868948, 46.80221048990888 ], [ 11.042586263020858, 46.807629055447073 ], [ 11.052210489908878, 46.808900621202284 ], [ 11.06028951009117, 46.816099378797766 ], [ 11.06471048990888, 46.817233954535617 ], [ 11.068622843424503, 46.820266045464436 ], [ 11.073043823242212, 46.82140062120228 ], [ 11.077974107530407, 46.825221761067731 ], [ 11.079733954535614, 46.830103895399333 ], [ 11.071295166015648, 46.839574517144122 ], [ 11.074432712131101, 46.843622843424505 ], [ 11.075567287868948, 46.846579996744815 ], [ 11.071400621202281, 46.851956176757838 ], [ 11.070266045464434, 46.859321424696205 ], [ 11.082766045464435, 46.872789510091174 ], [ 11.084041680230058, 46.882445949978326 ], [ 11.089710489908878, 46.88390062120228 ], [ 11.099959648980059, 46.893196953667562 ], [ 11.096400621202282, 
46.897789510091172 ], [ 11.095266045464435, 46.913437228732661 ], [ 11.103599378797767, 46.922789510091171 ], [ 11.104166666666691, 46.929166666666696 ], [ 11.114924452039954, 46.929519992404543 ], [ 11.11875, 46.933318752712701 ], [ 11.123286946614609, 46.928813340928848 ], [ 11.139924452039956, 46.929519992404543 ], [ 11.143408881293428, 46.932980007595511 ], [ 11.148288302951414, 46.933691067165825 ], [ 11.157980007595512, 46.943408881293429 ], [ 11.158691406250025, 46.948290337456626 ], [ 11.164242214626761, 46.953813340928846 ], [ 11.166666666666693, 46.954166666666694 ], [ 11.167024739583358, 46.951709662543429 ], [ 11.172575547960095, 46.946186659071209 ], [ 11.177424452039956, 46.945480007595513 ], [ 11.180908881293428, 46.942019992404539 ], [ 11.185757785373289, 46.941313340928843 ], [ 11.189242214626761, 46.937853325737876 ], [ 11.194123670789956, 46.93714192708336 ], [ 11.199646674262178, 46.931591118706621 ], [ 11.200353325737872, 46.910075547960098 ], [ 11.204519992404538, 46.905879720052113 ], [ 11.203813340928845, 46.901742214626765 ], [ 11.199646674262178, 46.897546386718773 ], [ 11.200353325737872, 46.893408881293432 ], [ 11.203813340928845, 46.889924452039956 ], [ 11.204519992404538, 46.885075547960099 ], [ 11.212486097547769, 46.877084011501765 ], [ 11.208686659071207, 46.873257785373291 ], [ 11.208333333333359, 46.858333333333363 ], [ 11.210543823242213, 46.857766045464437 ], [ 11.214456176757839, 46.854733954535618 ], [ 11.231377156575547, 46.853599378797767 ], [ 11.239456176757837, 46.846400621202285 ], [ 11.260543823242212, 46.845266045464435 ], [ 11.264583333333359, 46.842135281033009 ], [ 11.26875, 46.845364718967041 ], [ 11.272789510091172, 46.842233954535615 ], [ 11.323043823242214, 46.841099378797772 ], [ 11.331122843424506, 46.833900621202282 ], [ 11.343877156575548, 46.832766045464439 ], [ 11.347789510091172, 46.829733954535612 ], [ 11.353315904405409, 46.828315904405407 ], [ 11.354733954535616, 46.82278951009117 ], [ 
11.357766045464437, 46.818877156575546 ], [ 11.358900621202284, 46.814456176757837 ], [ 11.36609937879777, 46.806377156575543 ], [ 11.366666666666694, 46.804166666666688 ], [ 11.365779283311658, 46.800708346896727 ], [ 11.356122843424506, 46.7994327121311 ], [ 11.35208333333336, 46.796301947699675 ], [ 11.347916666666693, 46.799531385633706 ], [ 11.343877156575548, 46.796400621202281 ], [ 11.339456176757839, 46.795266045464437 ], [ 11.335543823242213, 46.792233954535618 ], [ 11.321806165907145, 46.791011895073808 ], [ 11.31306728786895, 46.781377156575545 ], [ 11.311932712131103, 46.776956176757835 ], [ 11.304733954535617, 46.768877156575549 ], [ 11.30359937879777, 46.76445617675784 ], [ 11.300567287868949, 46.760543823242216 ], [ 11.299149237738742, 46.755017428927978 ], [ 11.293622843424505, 46.753599378797766 ], [ 11.289710489908881, 46.750567287868947 ], [ 11.279166666666693, 46.75 ], [ 11.278813340928846, 46.739242214626763 ], [ 11.274646674262179, 46.735046386718778 ], [ 11.275353325737873, 46.730908881293431 ], [ 11.283319430881102, 46.722917344835096 ], [ 11.27951999240454, 46.719091118706622 ], [ 11.278813340928846, 46.705908881293425 ], [ 11.275353325737873, 46.702424452039956 ], [ 11.275, 46.7 ], [ 11.274032253689262, 46.696229383680581 ], [ 11.267413330078151, 46.691099378797766 ], [ 11.260289510091171, 46.692233954535617 ], [ 11.25491333007815, 46.69640062120228 ], [ 11.251956176757838, 46.695266045464436 ], [ 11.248043823242213, 46.692233954535617 ], [ 11.239456176757837, 46.691099378797766 ], [ 11.235543823242214, 46.688067287868947 ], [ 11.218622843424505, 46.686932712131103 ], [ 11.21471048990888, 46.683900621202284 ], [ 11.197789510091171, 46.682766045464433 ], [ 11.193877156575546, 46.679733954535614 ], [ 11.184220716688394, 46.678458319769987 ], [ 11.182413736979193, 46.671416219075546 ], [ 11.173043823242214, 46.663067287868948 ], [ 11.168622843424505, 46.661932712131097 ], [ 11.159270562065997, 46.653599378797765 ], [ 11.154985215928845, 
46.655143907335095 ], [ 11.154166666666692, 46.65833333333336 ], [ 11.15085076226131, 46.65918409559464 ], [ 11.15, 46.6625 ], [ 11.150353325737871, 46.669091118706618 ], [ 11.155876329210095, 46.674641927083357 ], [ 11.160757785373288, 46.675353325737873 ], [ 11.166668023003497, 46.68123406304256 ], [ 11.162853325737872, 46.685075547960096 ], [ 11.162146674262178, 46.689924452039953 ], [ 11.15451999240454, 46.697575547960092 ], [ 11.153813340928846, 46.702424452039956 ], [ 11.148290337456622, 46.707975260416688 ], [ 11.143408881293428, 46.708686659071205 ], [ 11.139924452039956, 46.712146674262179 ], [ 11.122575547960095, 46.712853325737875 ], [ 11.119091118706622, 46.716313340928842 ], [ 11.11424221462676, 46.717019992404538 ], [ 11.110757785373288, 46.720480007595512 ], [ 11.101742214626761, 46.721186659071208 ], [ 11.098257785373288, 46.724646674262175 ], [ 11.085075547960095, 46.725353325737871 ], [ 11.081591118706623, 46.728813340928845 ], [ 11.075, 46.729166666666693 ] ] ] } }
]
}