SPIDER: Spine MRI Segmentation
License: cc-by-4.0
The SPIDER dataset contains (human) lumbar spine magnetic resonance images (MRI) and segmentation masks from the following paper:
van der Graaf, J.W., van Hooff, M.L., Buckens, C.F.M. et al. Lumbar spine segmentation in MR images: a dataset and a public benchmark. Sci Data 11, 264 (2024). https://doi.org/10.1038/s41597-024-03090-w
The format of the data has been modified slightly to support loading through the Hugging Face datasets library (see the Data Modifications section below). The original data are available on Zenodo. More information can be found at SPIDER Grand Challenge.
Figure: Example MRI scan (at three different depths)
Figure: Example MRI scan with segmentation masks
- Published Paper: Lumbar spine segmentation in MR images: a dataset and a public benchmark
- ArXiv Link: https://arxiv.org/abs/2306.12217
- Repository: Zenodo
- Grand Challenge: SPIDER Grand Challenge
In addition to the information in this README, two detailed tutorials for this dataset are provided in the tutorials folder:
- Loading the SPIDER Dataset from Hugging Face
- Building a U-Net CNN Model for Magnetic Resonance Imaging (MRI) Segmentation
First, you will need to install the following dependencies:
datasets >= 2.18.0
scikit-image >= 0.19.3
SimpleITK >= 2.3.1
Then you can load the SPIDER dataset as follows:
from datasets import load_dataset
dataset = load_dataset("cdoswald/SPIDER", name="default", trust_remote_code=True)
See the Loading the Dataset tutorial for more information.
The dataset includes 447 sagittal T1 and T2 MRI series collected from 218 patients across four hospitals. Segmentation masks indicating the vertebrae, intervertebral discs (IVDs), and spinal canal are also included. Segmentation masks were created manually by a medical trainee under the supervision of a medical imaging expert and an experienced musculoskeletal radiologist.
Beyond the MR images and segmentation masks, each example can be loaded with image metadata (e.g., scanner manufacturer, pixel bandwidth), limited patient characteristics (biological sex and age, when available), and radiological gradings indicating specific degenerative changes.
This version of the SPIDER dataset (i.e., available through the Hugging Face datasets library) differs from the original data available on Zenodo in two key ways:
- Image Rescaling/Resizing: The original 3D volumetric MRI data are stored as .mha files and do not have a standardized height, width, depth, and image resolution. To enable the data to be loaded through the Hugging Face datasets library, all 447 MRI series are standardized to have height and width of (512, 512) and (unsigned) 16-bit integer resolution. Segmentation masks have the same height and width dimension but are (unsigned) 8-bit integer resolution. The depth dimension has not been modified; rather, each scan is formatted as a sequence of (512, 512) grayscale images, where the index in the sequence indicates the depth value. N-dimensional interpolation is used to resize and/or rescale the images (via the skimage.transform.resize and skimage.img_as_uint functions). If you need a different standardization, you have two options:
  i. Pass your preferred height and width size as a Tuple[int, int] to the resize_shape argument in load_dataset (see the LoadData Tutorial); OR
  ii. After loading the dataset from Hugging Face, use the SimpleITK library to import each image using the file path of the locally cached .mha file. The local cache file path is provided for each example when iterating over the dataset (again, see the LoadData Tutorial).
- Train, Validation, and Test Set: The original dataset contained 257 unique studies (i.e., patients) that were partitioned into 218 (85%) studies for the public training/validation set and 39 (15%) studies for the SPIDER Grand Challenge hidden test set. To enable users to train, validate, and test their models prior to submitting their models to the SPIDER Grand Challenge, the original 218 studies that comprised the public training/validation set were further partitioned using a 60%/20%/20% split. The original split for each study (i.e., training or validation set) is recorded in the OrigSubset variable in the study's linked metadata.
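As a rough illustration of the rescaling and resizing step described above, the sketch below standardizes one synthetic 2D slice and one synthetic mask with skimage.transform.resize and skimage.img_as_uint. It is not the dataset's actual curation code (SPIDER.py): the helper names standardize_slice and standardize_mask_slice, the min-max normalization, and the interpolation orders are all assumptions made for illustration.

```python
import numpy as np
from skimage import img_as_uint
from skimage.transform import resize


def standardize_slice(slice_2d, shape=(512, 512)):
    """Resize one 2D slice to `shape` and rescale it to unsigned 16-bit integers.

    Hypothetical helper: min-max normalization and linear interpolation are
    assumptions, not necessarily what the dataset loading script does.
    """
    arr = np.asarray(slice_2d, dtype=np.float64)
    lo, hi = arr.min(), arr.max()
    # Map intensities to [0, 1]; a constant slice becomes all zeros.
    scaled = (arr - lo) / (hi - lo) if hi > lo else np.zeros_like(arr)
    resized = resize(scaled, shape, order=1, mode="reflect", anti_aliasing=False)
    # img_as_uint expects floats in [-1, 1]; clip to guard against rounding error.
    return img_as_uint(np.clip(resized, 0.0, 1.0))


def standardize_mask_slice(mask_2d, shape=(512, 512)):
    """Resize a label mask with nearest-neighbor interpolation (an assumption)
    so that no new label values are created, then store it as uint8."""
    resized = resize(
        np.asarray(mask_2d),
        shape,
        order=0,
        mode="edge",
        preserve_range=True,
        anti_aliasing=False,
    )
    return resized.astype(np.uint8)


# Synthetic inputs standing in for one slice of an original scan and its mask.
rng = np.random.default_rng(0)
synthetic_slice = rng.normal(size=(320, 448))
synthetic_mask = np.zeros((320, 448), dtype=np.uint8)
synthetic_mask[100:200, 150:300] = 201

image_out = standardize_slice(synthetic_slice)
mask_out = standardize_mask_slice(synthetic_mask)
print(image_out.shape, image_out.dtype)
print(mask_out.shape, mask_out.dtype, np.unique(mask_out))
```

Nearest-neighbor interpolation is used for the mask in this sketch because averaging neighboring label values would produce integers that do not correspond to any anatomical category.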
There are 447 images and corresponding segmentation masks for 218 unique patients.
The format for each generated data instance is as follows:
- patient_id: a unique ID number indicating the specific patient (note that many patients have more than one scan in the data)
- scan_type: an indicator for whether the image is a T1-weighted, T2-weighted, or T2-SPACE MRI
- image: a sequence of 2-dimensional grayscale images of the MRI scan
- mask: a sequence of 2-dimensional values indicating the following segmented anatomical feature(s):
  - 0 = background
  - 1-25 = vertebrae (numbered from the bottom, i.e., L5 = 1)
  - 100 = spinal canal
  - 101-125 = partially visible vertebrae
  - 201-225 = intervertebral discs (numbered from the bottom, i.e., L5/S1 = 201)
  See the SPIDER Grand Challenge documentation for more details.
- image_path: path to the local cache containing the original (non-rescaled and non-resized) MRI image
- mask_path: path to the local cache containing the original (non-rescaled and non-resized) segmentation mask
- metadata: a dictionary of image, patient, and scanner characteristics:
  - number of vertebrae
  - number of discs
  - biological sex
  - age
  - manufacturer
  - manufacturer model name
  - serial number
  - software version
  - echo numbers
  - echo time
  - echo train length
  - flip angle
  - imaged nucleus
  - imaging frequency
  - inplane phase encoding direction
  - MR acquisition type
  - magnetic field strength
  - number of phase encoding steps
  - percent phase field of view
  - percent sampling
  - photometric interpretation
  - pixel bandwidth
  - pixel spacing
  - repetition time
  - specific absorption rate (SAR)
  - samples per pixel
  - scanning sequence
  - sequence name
  - series description
  - slice thickness
  - spacing between slices
  - specific character set
  - transmit coil name
  - window center
  - window width
- rad_gradings: radiological gradings by an expert musculoskeletal radiologist indicating specific degenerative changes at all intervertebral disc (IVD) levels (see page 3 of the original paper for more details). The data are provided as a dictionary of lists; an element's position in the list indicates the IVD level. Some elements are ratings while others are binary indicators. For consistency, each list has 10 elements, but some IVD levels may not be applicable to every image (which is indicated with an empty string).
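To make the rad_gradings layout concrete, the sketch below pairs each non-empty list element with its position (the IVD level index). The key names and values in the synthetic example are hypothetical placeholders rather than the dataset's actual grading names, the entries are assumed to be strings, and the helper applicable_gradings is introduced only for illustration.

```python
from typing import Dict, List, Tuple


def applicable_gradings(
    rad_gradings: Dict[str, List[str]]
) -> Dict[str, List[Tuple[int, str]]]:
    """Return (position, value) pairs for every non-empty grading entry.

    The position is the element's index in its list, which indicates the
    IVD level; empty strings mark levels that do not apply to the image.
    """
    result = {}
    for name, values in rad_gradings.items():
        result[name] = [(i, v) for i, v in enumerate(values) if v != ""]
    return result


# Synthetic example: key names and values are placeholders (assumptions).
example_gradings = {
    "hypothetical_rating": ["2", "3", "1", "", "", "", "", "", "", ""],
    "hypothetical_indicator": ["0", "1", "0", "", "", "", "", "", "", ""],
}

for name, pairs in applicable_gradings(example_gradings).items():
    print(name, pairs)
```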
The dataset is split as follows:
- Training set:
  - 149 unique patients
  - 304 total images
    - Sagittal T1: 133 images
    - Sagittal T2: 145 images
    - Sagittal T2-SPACE: 26 images
- Validation set:
  - 37 unique patients
  - 75 total images
    - Sagittal T1: 34 images
    - Sagittal T2: 34 images
    - Sagittal T2-SPACE: 7 images
- Test set:
  - 32 unique patients
  - 68 total images
    - Sagittal T1: 29 images
    - Sagittal T2: 31 images
    - Sagittal T2-SPACE: 8 images
An additional hidden test set provided by the paper authors (i.e., not available via Hugging Face) is available on the SPIDER Grand Challenge.
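The split counts listed above can be checked with a few lines of arithmetic. The sketch below copies the counts into a plain dictionary (the keys "train", "validation", and "test" are illustrative labels, not confirmed split names in the loader), verifies that they sum to the 447 images and 218 patients reported for the dataset, and prints each subset's share of patients.

```python
# Counts copied from the split listing above.
splits = {
    "train": {"patients": 149, "T1": 133, "T2": 145, "T2-SPACE": 26},
    "validation": {"patients": 37, "T1": 34, "T2": 34, "T2-SPACE": 7},
    "test": {"patients": 32, "T1": 29, "T2": 31, "T2-SPACE": 8},
}

SCAN_TYPES = ("T1", "T2", "T2-SPACE")

# Total images per subset, summed over the three scan types.
images_per_split = {
    name: sum(counts[s] for s in SCAN_TYPES) for name, counts in splits.items()
}
total_images = sum(images_per_split.values())
total_patients = sum(counts["patients"] for counts in splits.values())

# The totals should match the 447 images and 218 patients in the dataset.
assert total_images == 447
assert total_patients == 218

for name, counts in splits.items():
    share = 100 * counts["patients"] / total_patients
    print(
        f"{name}: {images_per_split[name]} images, "
        f"{counts['patients']} patients ({share:.1f}% of patients)"
    )
```

Computed this way, the three subsets hold roughly 68%, 17%, and 15% of the 218 patients.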
Standard sagittal T1 and T2 image resolution ranges from 3.3 x 0.33 x 0.33 mm to 4.8 x 0.90 x 0.90 mm. Sagittal T2-SPACE sequence images have a near-isotropic spatial resolution with a voxel size of 0.90 x 0.47 x 0.47 mm (source: https://spider.grand-challenge.org/data/).
Note that all images are rescaled to have unsigned 16-bit integer resolution for compatibility with the Hugging Face datasets library. If you want to use the original resolution, you can load the original images from the local cache indicated in each example's image_path and mask_path features.
See the tutorial for more information.
The dataset is published under a CC-BY 4.0 license.
The tutorials are published under an MIT license.
The data curation code (SPIDER.py) is published under an Apache License, Version 2.0 (mandated by the Hugging Face dataset loading script template).
- van der Graaf, J.W., van Hooff, M.L., Buckens, C.F.M. et al. Lumbar spine segmentation in MR images: a dataset and a public benchmark. Sci Data 11, 264 (2024). https://doi.org/10.1038/s41597-024-03090-w.
I am not affiliated in any way with the aforementioned paper, researchers, or organizations. If you are using this Hugging Face dataset for research or analysis, please validate your findings against the original data provided by the researchers on Zenodo.
- Serializing data into Apache Arrow format is required to make the dataset available via Hugging Face's datasets library. However, it can introduce some segmentation mask integer values that do not map exactly to a defined anatomical feature category. See the data loading tutorial for more information and temporary work-arounds.
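One simple way to spot such values is to compare the integers in a mask against the label ranges documented in the data instance format. The sketch below does this for a synthetic list of values; it does not reproduce the tutorial's work-arounds, and the helper names describe_label and find_unmapped_values are hypothetical.

```python
def describe_label(value: int) -> str:
    """Map a mask integer to the anatomical category documented for SPIDER."""
    if value == 0:
        return "background"
    if 1 <= value <= 25:
        return "vertebra"
    if value == 100:
        return "spinal canal"
    if 101 <= value <= 125:
        return "partially visible vertebra"
    if 201 <= value <= 225:
        return "intervertebral disc"
    return "unmapped"


def find_unmapped_values(mask_values):
    """Return the sorted distinct values that match no documented category."""
    return sorted(
        {int(v) for v in mask_values if describe_label(int(v)) == "unmapped"}
    )


# Synthetic flattened mask values; 99 and 150 stand in for stray values.
synthetic_values = [0, 0, 1, 1, 2, 99, 100, 101, 150, 201, 201]
print(find_unmapped_values(synthetic_values))
print(describe_label(201))
```

For a real mask, the values of each 2D slice could be flattened and passed to find_unmapped_values before training, so that unexpected labels are handled explicitly rather than silently treated as a class.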