HASEL (High-Speed Video Annotation Tool for Structured Light Endoscopy in the Human Larynx) is a deep-learning-supported tool for generating ground-truth data for High-Speed Video Structured Light Laryngoscopy. It enables the robust and rapid generation of:
- Glottal segmentation with different segmentation architectures
- Vocal fold segmentation via frame-wise interpolation
- Semi-automatic and deep-learning-enhanced generation of laserpoint data.
Please follow these instructions to make sure that HASEL runs as intended. In general, we recommend a current NVIDIA graphics card on par with a Quadro RTX 4000. First, create the environment and install the necessary packages.
conda create --name VFLabel python=3.12
pip install torch torchvision torchaudio
conda install pyqt qtpy
pip install torchmetrics albumentations imageio kornia segmentation-models-pytorch matplotlib flow_vis tensorboard tqdm
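To verify that PyTorch sees your GPU and that the Qt bindings are available, you can run a quick sanity check (a minimal sketch; the reported CUDA availability depends on your driver and install):

```python
# Optional sanity check for the freshly created environment.
import torch
from qtpy import QtWidgets  # provided by the pyqt/qtpy packages installed above

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```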
Next, install HASEL for development:
git clone https://github.com/Henningson/VFLabel.git
cd VFLabel
python3 -m pip install -e .
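If the editable install succeeded, the package should be importable from the environment (assuming it is exposed under the name VFLabel, matching the repository name):

```python
# Verify the editable install; the import name is assumed to match the repository.
import VFLabel
print("VFLabel imported successfully")
```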
Next, we need to create a model folder and install Meta's CoTracker3. For this, you can also follow these instructions.
cd ..
git clone https://github.com/facebookresearch/co-tracker
cd co-tracker
pip install -e .
Next, we need to download the CoTracker3 offline checkpoint.
cd ../VFLabel
mkdir assets/models
cd assets/models
wget https://huggingface.co/facebook/cotracker3/resolve/main/scaled_offline.pth
cd ../..
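As a rough sketch of how the downloaded offline checkpoint can be used (following the co-tracker README; `CoTrackerPredictor`, its `checkpoint` argument, and the expected tensor layout are taken from that repository and may change between versions):

```python
# Minimal sketch: load the offline CoTracker3 checkpoint downloaded above.
import torch
from cotracker.predictor import CoTrackerPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CoTrackerPredictor(checkpoint="assets/models/scaled_offline.pth").to(device)

# video: float tensor of shape (batch, frames, channels, height, width) in [0, 255]
video = torch.zeros(1, 10, 3, 256, 256, device=device)
pred_tracks, pred_visibility = model(video, grid_size=10)
print(pred_tracks.shape, pred_visibility.shape)
```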
Finally, download the glottis segmentation networks from here and move them to assets/models/.
Glottal segmentations can also easily be generated from the command line via the supplied script in the examples:
python examples/scripts/segment_glottis.py --encoder mobilenet_v2 --image_folder PATH_TO_WHERE_THE_IMAGES_ARE --save_folder OUTPUT_FOLDER
We supply four(*) U-Nets with different backbones in this repository.
They can be downloaded here. Make sure to extract the files into assets/models.
The evaluation of the models is shown below.
The ResNet-based backbones generally perform best, while the lighter backbones are better suited for CPU-only systems.
You should test which one works best for your data.
You can find examples of how to use the supplied networks in examples/scripts.
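As an illustration only (the exact checkpoint format and preprocessing are defined by the scripts in examples/scripts, so treat the file name and loading step below as assumptions): the backbones listed in the table correspond to segmentation-models-pytorch U-Nets, which can be used for inference roughly like this:

```python
# Rough inference sketch with segmentation-models-pytorch.
# Checkpoint path and format (plain state_dict) are assumed; see examples/scripts.
import torch
import segmentation_models_pytorch as smp

device = "cuda" if torch.cuda.is_available() else "cpu"
model = smp.Unet(encoder_name="mobilenet_v2", encoder_weights=None, in_channels=3, classes=1)
model.load_state_dict(torch.load("assets/models/mobilenet_v2.pth", map_location=device))
model.to(device).eval()

# image: float tensor of shape (batch, channels, height, width), normalized as in training
image = torch.zeros(1, 3, 512, 256, device=device)
with torch.no_grad():
    glottis_mask = torch.sigmoid(model(image)) > 0.5
```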
*: There is also an EfficientNet backbone available, but it generally performs worse than the rest. However, I'd advise you to also test it on some of your data.
We evaluated the networks on a combined test set of the BAGLS and HLE datasets, as well as synthetically created vocal folds generated with Fireflies.
Backbone | Eval IoU | Eval Dice | Test Dice | Test IoU |
---|---|---|---|---|
mobilenet-v2 | 0.864 | 0.927 | 0.893 | 0.807 |
mobilenetv3_large_100 | 0.845 | 0.916 | 0.789 | 0.650 |
resnet18 | 0.856 | 0.922 | 0.882 | 0.789 |
resnet34 | 0.846 | 0.917 | 0.883 | 0.791 |
To train your own network on a set of vocal fold datasets, download the HLE and BAGLS datasets and put them into a common folder. Next, download the Fireflies dataset from here and extract it into the same folder. The final folder structure should look like this:
dataset/
├── BAGLS/
├── HLEDataset/
└── fireflies_dataset_v5/
For training, please follow the code in the example script examples/scripts/train_glottis_segmentation_network.py. There, you will fine-tune the decoder of common segmentation model architectures that were pretrained on ImageNet.
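A minimal sketch of the decoder fine-tuning idea (the actual training loop, data loading, and hyperparameters live in examples/scripts/train_glottis_segmentation_network.py; the names and values below are illustrative):

```python
# Sketch: freeze an ImageNet-pretrained encoder and fine-tune only decoder and head.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(encoder_name="resnet18", encoder_weights="imagenet", in_channels=3, classes=1)

# Freeze the encoder so only the decoder and segmentation head receive gradients.
for param in model.encoder.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
loss_fn = smp.losses.DiceLoss(mode="binary")

# Inside the training loop (images and masks come from the dataset folder above):
#   logits = model(images)
#   loss = loss_fn(logits, masks)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```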