🔍 Video Benchmark Suite: Rapid Evaluation of Video Foundation Models

🎯 Video Benchmark

  • Video Action Recognition (Action Classification)

  • Video Temporal Action Localization (Temporal Action Detection)

  • Video Object Tracking

  • Video Point Tracking

  • Video Depth Estimation

  • Video Camera Pose Estimation

  • Video Instance Segmentation

  • Video Retrieval Text-To-Video (T2V)

  • Video Retrieval Video-To-Text (V2T)

  • Video Temporal Grounding

  • Video Question-Answering

📑 SOTA Video Foundation Models

👏 MODEL_ZOO

We provide the SOTA foundation model weights in MODEL_ZOO.md.

📊 Video Evaluation Method

  • 🌸 Linear Probing Evaluation
      • Few-shot-data Linear Probing Evaluation
  • 🌻 Full Finetuning Evaluation
      • Few-shot-data Full Finetuning Evaluation
  • 🍁 Zero-shot Evaluation
      • Few-shot-data Zero-shot Evaluation
  • 🌵 Attentive Probing Evaluation
      • Few-shot-data Attentive Probing Evaluation

(Figure: overview of the evaluation methods.)
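For concreteness, the zero-shot protocol in the list above can be sketched CLIP-style: embed one text prompt per class, cosine-score each against the frozen video embedding, and predict the best-scoring class. The function and variable names below are our illustration, not the suite's API.

```python
import numpy as np

def zero_shot_classify(video_emb, class_text_embs):
    """Zero-shot protocol: no training on the target dataset.

    video_emb:        (D,) embedding from the frozen video tower.
    class_text_embs:  (C, D) one prompt embedding per class from the text tower.
    Returns the index of the class whose prompt has the highest cosine
    similarity with the video embedding.
    """
    v = video_emb / np.linalg.norm(video_emb)
    T = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(T @ v))
```

The few-shot variant differs only in how the class prompts (or a light head on top of them) are tuned; the backbone stays frozen either way.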

Our Video Evaluation Details

Video Action Recognition

We employ the Linear Probing and Attentive Probing methods for rapid evaluation of Action Recognition.

Abbreviations used: pt (pretrain), ppt (post-pretrain), ft (finetune).

💾 Evaluation Dataset

🌸 Linear Probing Evaluation

| Hyperparameter | Info |
| --- | --- |
| optimizer | LBFGS |
| regularization | L2 |
| max_iter | 1000 |
| scan learning rate | 1e-6, 1e-4, 1e-2, 1, 1e2, 1e4, 1e6 |
| spaced steps | binary search |
| num_step | 8 |
| batch size | 32 |
| train few-shot samples per class | 10 |
| val few-shot samples per class | half |
| input resolution | 224×224 |
| mean | 0.485, 0.456, 0.406 |
| std | 0.229, 0.224, 0.225 |
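The table above can be sketched as a multinomial logistic-regression probe trained with L-BFGS and L2 regularization on frozen features. The function names and the simplified one-level scan are our own illustration; the suite itself refines the sweep with a log-space binary search.

```python
import numpy as np
from scipy.optimize import minimize

def fit_linear_probe(X, y, n_classes, l2=1.0, max_iter=1000):
    """Multinomial logistic-regression probe on frozen features,
    trained with L-BFGS and L2 regularization (see table above)."""
    n, d = X.shape

    def loss_and_grad(w_flat):
        W = w_flat.reshape(d, n_classes)
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        loss = -np.log(p[np.arange(n), y] + 1e-12).mean() + 0.5 * l2 * (W ** 2).sum()
        p[np.arange(n), y] -= 1.0                     # softmax grad: p - one_hot(y)
        grad = X.T @ p / n + l2 * W
        return loss, grad.ravel()

    res = minimize(loss_and_grad, np.zeros(d * n_classes), jac=True,
                   method="L-BFGS-B", options={"maxiter": max_iter})
    return res.x.reshape(d, n_classes)

def probe_accuracy(W, X, y):
    return float((np.argmax(X @ W, axis=1) == y).mean())

def scan_l2(X_tr, y_tr, X_val, y_val, n_classes):
    """Coarse scan of the regularization strength over 1e-6 .. 1e6 in powers
    of 100, picking the best on the held-out few-shot validation split."""
    grid = [10.0 ** e for e in range(-6, 7, 2)]
    scores = {l2: probe_accuracy(fit_linear_probe(X_tr, y_tr, n_classes, l2),
                                 X_val, y_val) for l2 in grid}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Because only a linear head is fit, the entire sweep runs on cached features, which is what makes this protocol fast enough for benchmarking many backbones.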

🌵 Attentive Probing Evaluation

| Hyperparameter | Info |
| --- | --- |
| training epochs | 20 |
| scan learning rate | 1e-3, 3e-4, 1e-4 |
| min learning rate | 1e-7 |
| warmup epochs | 5 |
| start learning rate | 0.0 |
| batch size | 192 |
| short side size | 256 |
| input resolution | 224×224 |
| optimizer | AdamW |
| AdamW weight decay | 1e-4 |
| AdamW eps | 1e-8 |
| AdamW betas | 0.9, 0.999 |
| clip grad | 5.0 |
| label smoothing | 0.1 |
| attentive heads | 16 |
| attentive out_dim | 768 |
| mean | 0.485, 0.456, 0.406 |
| std | 0.229, 0.224, 0.225 |
| brightness | delta in [-0.125, 0.125], applied with probability 0.8 |
| saturation | factor in [0.6, 1.4], applied with probability 0.8 |
| contrast | factor in [0.6, 1.4], applied with probability 0.8 |
| hue | delta in [-0.2, 0.2], applied with probability 0.8 |
| color space conversion | RGB to BGR, applied with probability 0.1 |
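A shape-level sketch of the attentive probing head implied by the table (16 heads, out_dim 768): a single learnable query cross-attends over the frozen patch tokens and pools them into one vector, which then feeds a linear classifier. All parameter names below are illustrative, not the repository's; a real implementation would add the classifier and train with AdamW as tabulated.

```python
import numpy as np

def attentive_pool(tokens, q, Wq, Wk, Wv, Wo, n_heads=16):
    """Pool frozen tokens (N, D) into one (out_dim,) vector via multi-head
    cross-attention from a single learnable query.

    q:  (D,) learnable query.       Wq, Wk, Wv: (D, out_dim) projections.
    Wo: (out_dim, out_dim) output projection.
    """
    N, D = tokens.shape
    out_dim = Wo.shape[0]
    dh = out_dim // n_heads                        # per-head dimension
    Q = (q @ Wq).reshape(n_heads, dh)              # one query per head
    K = (tokens @ Wk).reshape(N, n_heads, dh)
    V = (tokens @ Wv).reshape(N, n_heads, dh)
    pooled = np.empty((n_heads, dh))
    for h in range(n_heads):
        scores = K[:, h] @ Q[h] / np.sqrt(dh)      # (N,) scaled dot-product
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                         # softmax over tokens
        pooled[h] = attn @ V[:, h]                 # attention-weighted sum
    return pooled.reshape(-1) @ Wo                 # (out_dim,)
```

Unlike linear probing, the query and projections here are trained (hence the 20-epoch schedule above), but the backbone still stays frozen.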

🔨 Installation

apt-get install -y ffmpeg libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev

conda create --name videomae python=3.10 -y
conda activate videomae
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install decord==0.6.0
pip install nvidia-dali-cuda120==1.44.0
pip install timm==0.4.12
pip install tensorboardX==2.6.2.2
pip install scipy==1.11.4
pip install matplotlib
pip install scikit-image==0.24.0
pip install flash-attn==2.7.2
pip install psutil==6.0.0
pip install opencv-python

🚗 Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{deepglint_videobenchmark2025,
      title={Video Benchmark Suite: Rapid Evaluation of Video Foundation Models},
      url={https://github.com/deepglint/Video_Benchmark_Suite},
      author={Yang, Ninghua and Feng, Ziyong},
      year={2025}
}

👀 Acknowledgement

This repository is built on the V-SWIFT, VideoMAE, VideoMAEv2, jepa, unmasked_teacher, CLIP_benchmark, vitookit, BlackVIP, open_clip and InternVideo repositories.

📚 License

MIT License.
