- Video Action Recognition (Action Classification)
- Video Temporal Action Localization (Temporal Action Detection)
- Video Object Tracking
- Video Point Tracking
- Video Depth Estimation
- Video Camera Pose Estimation
- Video Instance Segmentation
- Video Retrieval: Text-to-Video (T2V)
- Video Retrieval: Video-to-Text (V2T)
- Video Temporal Grounding
- Video Question-Answering
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
-
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
-
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
-
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
-
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
-
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
-
VideoPrism: A Foundational Visual Encoder for Video Understanding
-
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
We provide the state-of-the-art foundation model weights in MODEL_ZOO.md.
- 🌸 Linear Probing Evaluation
  - Few-shot-data Linear Probing Evaluation
- 🌻 Full Finetuning Evaluation
  - Few-shot-data Full Finetuning Evaluation
- 🍁 Zero-shot Evaluation
  - Few-shot-data Zero-shot Evaluation
- 🌵 Attentive Probing Evaluation
  - Few-shot-data Attentive Probing Evaluation
- This image is sourced from the paper Unifying Video Self-Supervised Learning across Families of Tasks: A Survey.
- The Attentive Probing Evaluation is sourced from the paper Context Autoencoder for Self-Supervised Representation Learning.

We employ the Linear Probing and Attentive Probing methods for rapid evaluation of Action Recognition.
Abbreviations used: pt (pretrain), ppt (post-pretrain), ft (finetune).
| Hyperparameter | Info |
|---|---|
| optimizer | LBFGS |
| regularization | L2 |
| max_iter | 1000 |
| scan learning rate | 1e-6, 1e-4, 1e-2, 1, 1e2, 1e4, 1e6 |
| spaced steps | binary search |
| num_step | 8 |
| batch size | 32 |
| train class few-shot nums | 10 |
| val class few-shot nums | half |
| input resolution | 224×224 |
| mean | 0.485, 0.456, 0.406 |
| std | 0.229, 0.224, 0.225 |
- This information is referenced from the paper Learning Transferable Visual Models From Natural Language Supervision.
- PyTorch LBFGS warning: "Right now all parameters have to be on a single device. This will be improved in the future."
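As a concrete reference, here is a minimal linear-probing sketch in PyTorch, assuming features from a frozen video encoder have already been extracted; the function name, the stand-in random tensors, and the fixed L2 weight are illustrative assumptions, not the repository's actual code. Per the table above, the coarse grid would then be refined around the best value with a binary-search-style sweep over 8 steps (omitted here).

```python
import torch
import torch.nn as nn

def train_linear_probe(features, labels, num_classes, lr, l2_lambda=1e-4, max_iter=1000):
    """Fit one linear layer on frozen features with LBFGS + L2 regularization."""
    probe = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.LBFGS(probe.parameters(), lr=lr, max_iter=max_iter)
    criterion = nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = criterion(probe(features), labels)
        # L2 regularization over the probe parameters (the table's "L2" row).
        loss = loss + l2_lambda * sum((p ** 2).sum() for p in probe.parameters())
        loss.backward()
        return loss

    optimizer.step(closure)  # LBFGS runs up to max_iter inner iterations
    return probe

# Stand-in data: 10 few-shot clips per class, 10 classes, 768-d features.
feats = torch.randn(100, 768)
labels = torch.arange(10).repeat_interleave(10)
for lr in [1e-6, 1e-4, 1e-2, 1.0, 1e2, 1e4, 1e6]:  # coarse grid from the table
    probe = train_linear_probe(feats, labels, num_classes=10, lr=lr)
    acc = (probe(feats).argmax(dim=1) == labels).float().mean().item()
    print(f"lr={lr:g}  train acc={acc:.3f}")
```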
| Hyperparameter | Info |
|---|---|
| training epochs | 20 |
| scan learning rate | 1e-3, 3e-4, 1e-4 |
| min learning rate | 1e-7 |
| warmup epochs | 5 |
| start learning rate | 0.0 |
| batch size | 192 |
| short side size | 256 |
| input resolution | 224×224 |
| optimizer | AdamW |
| AdamW weight decay | 1e-4 |
| AdamW eps | 1e-8 |
| AdamW betas | 0.9, 0.999 |
| clip grad | 5.0 |
| label smoothing | 0.1 |
| attentive heads | 16 |
| attentive out_dim | 768 |
| mean | 0.485, 0.456, 0.406 |
| std | 0.229, 0.224, 0.225 |
| brightness | probability 0.8: delta in [-0.125, 0.125] |
| saturation | probability 0.8: factor in [0.6, 1.4] |
| contrast | probability 0.8: factor in [0.6, 1.4] |
| hue | probability 0.8: delta in [-0.2, 0.2] |
| color space conversion | probability 0.1: RGB to BGR |
- This information is referenced from the paper Scaling 4D Representations.
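To make the attentive-probing rows above concrete, below is a minimal sketch of such a head in the spirit of the Context Autoencoder design: a single learned query cross-attends over frozen encoder tokens, and a linear classifier sits on top. The class name, the input projection, and the stand-in token shape are illustrative assumptions; only the 16 heads and 768-d output follow the table.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Attentive-probing head: a learned query pools frozen tokens via
    cross-attention, then a linear classifier produces class logits."""

    def __init__(self, in_dim, num_classes, num_heads=16, out_dim=768):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)   # map encoder dim -> out_dim
        self.query = nn.Parameter(torch.zeros(1, 1, out_dim))
        self.attn = nn.MultiheadAttention(out_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(out_dim)
        self.head = nn.Linear(out_dim, num_classes)

    def forward(self, tokens):                   # tokens: (B, N, in_dim)
        kv = self.proj(tokens)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, kv, kv)         # cross-attention pooling: (B, 1, out_dim)
        return self.head(self.norm(pooled.squeeze(1)))

# Usage with stand-in video tokens (e.g. 8x14x14 = 1568 tokens of width 1024):
probe = AttentiveProbe(in_dim=1024, num_classes=400)
print(probe(torch.randn(2, 1568, 1024)).shape)   # torch.Size([2, 400])
```

Unlike linear probing, the pooling itself is learned here, which is why this protocol trains for 20 epochs with AdamW rather than fitting a closed-form-style LBFGS probe.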
```bash
apt-get install -y ffmpeg libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev
conda create --name videomae python=3.10 -y
conda activate videomae
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install decord==0.6.0
pip install nvidia-dali-cuda120==1.44.0
pip install timm==0.4.12
pip install tensorboardX==2.6.2.2
pip install scipy==1.11.4
pip install matplotlib
pip install scikit-image==0.24.0
pip install flash-attn==2.7.2
pip install psutil==6.0.0
pip install opencv-python
```
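Once the environment is set up, a quick sanity check (hypothetical, not part of the repository) can confirm that the key packages import and that the GPU is visible:

```python
# Hypothetical post-install check; package names match the pip installs above.
import torch
import decord
import timm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("decord:", decord.__version__, "| timm:", timm.__version__)
```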
If you find this repository useful, please use the following BibTeX entry for citation.
```bibtex
@misc{deepglint_videobenchmark2025,
  title={Video Benchmark Suite: Rapid Evaluation of Video Foundation Models},
  url={https://github.com/deepglint/Video_Benchmark_Suite},
  author={Yang, Ninghua and Feng, Ziyong},
  year={2025}
}
```
This repository is built upon the V-SWIFT, VideoMAE, VideoMAEv2, jepa, unmasked_teacher, CLIP_benchmark, vitookit, BlackVIP, open_clip, and InternVideo repositories.