Incompatible image size with RandomGeoSampler #1773

hfangcat · 2023-12-14T20:40:38Z

hfangcat
Dec 14, 2023

The code example is here:

from torchgeo.datasets import stack_samples

from torch.utils.data import DataLoader

from torchgeo.samplers import RandomGeoSampler, RandomBatchGeoSampler
from torchgeo.datasets import ChesapeakeCVPR

root = '/scratch/local/cvpr_chesapeake_landcover/'
dataset_train = ChesapeakeCVPR(root, splits=['md-train'], layers=['naip-new', 'lc', 'nlcd'], download=False)

sampler_train = RandomGeoSampler(dataset_train, size=224, length=1)
loader_train = DataLoader(dataset_train, sampler=sampler_train, collate_fn=stack_samples)

for i, data in enumerate(loader_train):
    images = data["image"]
    targets = data["mask"]
    print(images.shape, targets.shape)
    break

Then it shows the shape of images and targets:

torch.Size([1, 4, 176, 176]) torch.Size([1, 2, 176, 176])

which does not match the size of 224 that I specified in sampler_train = RandomGeoSampler(dataset_train, size=224, length=1). Can the community help me figure out why this happens?

Answered by calebrob6


adamjstewart · 2023-12-15T15:22:40Z

adamjstewart
Dec 15, 2023
Maintainer

This is a result of the way ChesapeakeCVPR is defined. It uses GeoDataset instead of RasterDataset. This has the following advantage:

Skips reprojection, so loads data faster

but has the following disadvantages as a result:

Returned image dimensions are variable, so can't be converted to mini-batches easily
Dataset can't be combined with other datasets because it doesn't support reprojection

Personally, I vote we convert it to RasterDataset so it works the same as all other datasets. We can figure out how to make it faster later. @calebrob6 wrote the current implementation, so it's up to him.

As a workaround, you can look at the ChesapeakeCVPRDataModule. Basically, you can load images 3x larger than normal, then crop them to the size you want. Currently this is the only way to use the dataset. Honestly, this is probably slower than just reprojecting and loading the correct size.
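The oversample-then-crop workaround described above can be sketched in plain Python. This only illustrates the index arithmetic and is not the code used by ChesapeakeCVPRDataModule; the helper name center_crop_bounds and the example patch shapes are made up for the sketch.

```python
def center_crop_bounds(height: int, width: int, size: int) -> tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a centered size x size window."""
    if size > height or size > width:
        raise ValueError("patch is smaller than the requested crop")
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, top + size, left + size


# Hypothetical patch shapes returned for a request 3x larger than 224.
for h, w in [(528, 528), (530, 527), (672, 672)]:
    top, left, bottom, right = center_crop_bounds(h, w, 224)
    print((h, w), "->", (bottom - top, right - left))
```

Whatever the input shape, the cropped window is 224 x 224 as long as the oversampled patch is at least that large, which is what makes the samples stackable into a mini-batch.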

Hope this helps! Want to convert this to an issue so I can assign @calebrob6? There should be a button on the right.

1 reply

hfangcat Dec 15, 2023
Author

Thanks for your answer! Yes, it would be nice if the dataset worked like all other datasets. I have converted this to an issue. To quickly validate my method for now, I can start with ChesapeakeCVPRDataModule :)

calebrob6 · 2023-12-15T15:44:37Z

calebrob6
Dec 15, 2023
Maintainer

Hey @hfangcat, this issue is related to #278 and #409. The problem is that the ChesapeakeCVPR dataset is made up of large tiles in several different CRSs. The approach of RasterDataset is to choose a single CRS, reproject all data into that CRS, and have the samplers produce bboxes in that CRS, resampling on the fly to ensure that everything is pixel-aligned. This is good for cases where your data is not pixel-aligned beforehand, but produces unnecessary (and significant) slowdowns when your data is already pixel-aligned. As ChesapeakeCVPR data is already pixel-aligned, the compromise I take is to resample the bboxes to each tile's local CRS (which is fast), then mask from each layer (also fast). So, even if you ask for a 224x224 meter bbox in EPSG:3857, this translates into variably sized bboxes in the CRSs of the tiles in the dataset (and the size will vary with latitude).
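To get a feel for the latitude effect described above, here is a rough back-of-the-envelope sketch. It assumes tiles with about 1 m per pixel and approximates the Web Mercator (EPSG:3857) scale distortion by cos(latitude); it ignores rounding and the bounding-box inflation that reprojection can add, and the function name is made up for illustration.

```python
import math


def approx_ground_size(size_3857: float, lat_deg: float) -> float:
    """Approximate ground length (m) of a Web Mercator length at a given latitude."""
    return size_3857 * math.cos(math.radians(lat_deg))


# Assumption: 1 m per pixel, so ground metres roughly equal pixels.
for lat in (37.0, 38.0, 39.0, 40.0):
    print(f"lat {lat:.0f}: ~{approx_ground_size(224, lat):.0f} px for a 224 m request")
```

Under these assumptions a 224 m request comes back at roughly 170 to 180 pixels for latitudes of 37 to 40 degrees, which is in the same range as the 176 x 176 output reported in the question.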

4 replies

calebrob6 Dec 15, 2023
Maintainer

@adamjstewart have you put any more thought into the idea of a "TileDataset", something that lies in between RasterDataset and NonGeoDataset? This would apply to cases where you have large tiles of pixel-aligned geo or non-geo data that you index by (tile index, x, y, size).

calebrob6 Dec 15, 2023
Maintainer

This would also solve the problem of the datasets that need that weird NCrops augmentation.
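The (tile index, x, y, size) indexing idea floated above could look something like the following. This is purely hypothetical: no TileDataset exists in TorchGeo as far as this thread shows, and ChipIndex and read_chip are invented names operating on toy nested lists rather than real rasters.

```python
from typing import NamedTuple


class ChipIndex(NamedTuple):
    """Hypothetical index: which tile, and which pixel window inside it."""
    tile: int
    x: int  # column of the window's left edge
    y: int  # row of the window's top edge
    size: int


def read_chip(tiles: list[list[list[int]]], index: ChipIndex) -> list[list[int]]:
    """Slice a size x size window out of one tile, with no reprojection."""
    tile = tiles[index.tile]
    rows = tile[index.y : index.y + index.size]
    return [row[index.x : index.x + index.size] for row in rows]


# Two toy single-band "tiles" of different shapes, stored as nested lists.
tiles = [
    [[r * 10 + c for c in range(6)] for r in range(5)],
    [[100 + r * 10 + c for c in range(4)] for r in range(4)],
]
chip = read_chip(tiles, ChipIndex(tile=0, x=2, y=1, size=3))
print(chip)
```

Because the window is expressed in pixels of a single tile, every chip has exactly the requested size, at the cost of never mixing data across tiles or CRSs.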

hfangcat Dec 15, 2023
Author

Aha! Now I understand why the code produces outputs of different sizes. If I want to produce batches in which every image has the same size (such as 224), because this is easier for training, do you have any advice? I will try what @adamjstewart mentioned above, but it would be helpful if you have other thoughts :)

adamjstewart Dec 15, 2023
Maintainer

@calebrob6 I think TileDataset becomes obsolete if we just avoid unnecessary reprojection a la #409. That will work not only for pixel-aligned datasets but also for uncurated raster files.

Is your current approach faster than reprojecting since you need to sample an area 9x larger and then crop it down to size? Seems to me like that might actually be slower, but I haven't benchmarked it to confirm.

hfangcat · 2024-01-19T14:29:34Z

hfangcat
Jan 19, 2024
Author

Hi all @calebrob6 @adamjstewart, I solved the problem by looking at the source code of ChesapeakeCVPRDataModule. Here is my code snippet for reference:

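    import kornia.augmentation as K
    from torch.utils.data import DataLoader
    from torchgeo.datasets import ChesapeakeCVPR, stack_samples
    from torchgeo.samplers import RandomBatchGeoSampler

    # root is the dataset directory from my first post; _Transform is described below.
    dataset_train = ChesapeakeCVPR(root, splits=['md-train'], layers=['naip-new', 'lc'],
                                   transforms=_Transform(K.CenterCrop(224)))
    dataset_val = ChesapeakeCVPR(root, splits=['wv-val'], layers=['naip-new', 'lc'],
                                 transforms=_Transform(K.CenterCrop(224)))
    # Sample patches 3x larger than the target size; the transform crops them to 224 x 224.
    train_batch_sampler = RandomBatchGeoSampler(dataset_train, size=224*3, batch_size=64, length=1000)
    val_batch_sampler = RandomBatchGeoSampler(dataset_val, size=224*3, batch_size=64, length=100)
    loader_train = DataLoader(dataset_train, batch_sampler=train_batch_sampler, collate_fn=stack_samples, num_workers=0)
    loader_val = DataLoader(dataset_val, batch_sampler=val_batch_sampler, collate_fn=stack_samples, num_workers=0)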

The _Transform class is copied from the source code for torchgeo.datamodules.chesapeake: https://torchgeo.readthedocs.io/en/latest/api/samplers.html#torchgeo.samplers.RandomBatchGeoSampler.

Although the code is working now, I have two questions regarding the length and the crop function:

1. If I don't specify length, the sampler returns approximately the maximal number of non-overlapping chips of the given size that could be sampled from the dataset. Since we sample patches 3 times larger than needed and then center-crop them, leaving length at its default would ignore many areas of the original tiles. Is my understanding correct?
2. Why does the source code of ChesapeakeCVPRDataModule use CenterCrop instead of RandomCrop? I think CenterCrop would ignore areas close to the border, right? If not, please correct me :)

5 replies

adamjstewart Jan 19, 2024
Maintainer

1. Correct. Note that due to random sampling, you're already ignoring many areas and double counting many areas, even if the length perfectly matches the maximal number. This isn't an issue as you usually run for many epochs anyway. It just affects the definition of how many epochs you ran for. Running for 10 epochs with length 100 is literally identical to running for 100 epochs with length 10, so don't worry too much about having the perfect length specification.
2. Correct. We're already doing random sampling, so there's no need to further randomize the location of the patch. Instead, we want to avoid reprojection artifacts (nodata border around the image) as much as possible.
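The point about random sampling both skipping and double counting areas can be checked with a small simulation. This is a simplified model, not TorchGeo's sampler: it assumes the dataset is split into a grid of non-overlapping cells and that each random sample lands on exactly one cell, whereas the real sampler draws continuous locations.

```python
import random


def coverage_after_random_sampling(
    num_cells: int, num_samples: int, seed: int = 0
) -> tuple[float, float]:
    """Return (fraction of cells never sampled, fraction sampled more than once)."""
    rng = random.Random(seed)
    counts = [0] * num_cells
    for _ in range(num_samples):
        counts[rng.randrange(num_cells)] += 1
    missed = sum(c == 0 for c in counts) / num_cells
    repeated = sum(c > 1 for c in counts) / num_cells
    return missed, repeated


# length equal to the "maximal number" of non-overlapping chips
missed, repeated = coverage_after_random_sampling(num_cells=10_000, num_samples=10_000)
print(f"never sampled: {missed:.1%}, sampled more than once: {repeated:.1%}")
```

In this toy model roughly a third of the cells are never visited in one pass even when length equals the number of cells, which is why the exact value of length matters less than the total number of samples drawn over all epochs (10 x 100 and 100 x 10 give the same total).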

hfangcat Jan 19, 2024
Author

Thanks for the quick answer!
Right, it would not be a big problem for training, as random sampling in each epoch would indeed make the model robust. For the testing set, we may need to think about this if we want to cover every area...
It is also interesting to hear about reprojection artifacts. I am not very familiar with them, but I will do some research to understand them better!

adamjstewart Jan 19, 2024
Maintainer

For the testing set, you should really be using GridGeoSampler instead of RandomGeoSampler. That's what it's designed for, to ensure complete coverage of the entire dataset.

By reprojection artifacts, I'm actually referring to the nodata pixels around the rotated image, although there are also real reprojection artifacts involved that center cropping won't help with. I should probably have chosen a different terminology.
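The complete-coverage idea behind grid sampling mentioned above can be sketched in one dimension. This is not how GridGeoSampler is implemented (TorchGeo's sampler works on geospatial bounding boxes and handles edges in its own way); grid_offsets is a made-up helper that clamps the last window to the edge.

```python
def grid_offsets(extent: int, size: int, stride: int) -> list[int]:
    """Start offsets of windows of `size`, stepping by `stride`, that cover [0, extent)."""
    if extent <= size:
        return [0]
    offsets = list(range(0, extent - size + 1, stride))
    if offsets[-1] + size < extent:
        # Add a final window flush with the edge so no pixels are left out.
        offsets.append(extent - size)
    return offsets


offsets = grid_offsets(extent=1000, size=224, stride=224)
print(offsets)
```

Unlike random sampling, every position along the axis falls inside at least one window, at the cost of some overlap in the last window. Applying the same offsets along rows and columns gives a 2D grid.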

hfangcat Jan 19, 2024
Author

Thanks for your clarification! Now I understand.
I am also pretty sure that we need a TorchGeo tutorial somewhere for better and quicker usage!

adamjstewart Jan 19, 2024
Maintainer

Yes, that's one of our goals for 2024. Just need to convince conferences to allow us to give tutorials so we have a deadline to force us to write tutorials.
