[ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling

gyxxyg/TRACE

Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Qingbin Liu, Xi Chen

If our project helps you, please give us a star ⭐ and cite our paper!

News

  • 23/01/2025, 🔥TRACE has been accepted to ICLR 2025.
  • 01/11/2024, 🔥 We are excited to announce the release of trace-uni, which is trained with additional general video understanding data from a subset of LLaVA-Video-178k. Our results indicate that (1) the TRACE architecture can still handle general video understanding tasks (53.8 on MVBench and 49.6 on VideoMME); (2) despite using no additional VTG data, trace-uni outperforms trace on both VTG tasks and general video understanding tasks.
  • 31/10/2024, 🔥 We evaluated the TRACE model on the VideoMME benchmark and updated the evaluation code.
  • 25/10/2024, 🔥 We evaluated the TRACE model on the MVBench benchmark and updated the evaluation code accordingly. Our findings indicate that, despite not being trained on extensive multi-task datasets, TRACE is still capable of effectively handling general QA tasks.
  • 19/10/2024, 🔥 We released trace-retrieval, which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
  • 10/10/2024, 🔥 Annotation files of training data are released!
  • 10/10/2024, 🔥 Our model checkpoints and code are released!

TODO

  • Release the model checkpoints
  • Release the inference and evaluation code
  • Release the training and fine-tuning code
  • Release the training data
  • Release TRACE-Retrieval, which outputs timestamps of input frames instead of predicting unseen timestamps.
  • Train TRACE models on more tasks.

Overview

In this work

  • We model videos as a series of events and propose a causal event modeling framework to capture the inherent structure of videos.
  • We present TRACE, a novel task-interleaved video LLM that implements the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions.
Overview of TRACE.
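
To make the idea of sequentially encoding timestamps, salient scores, and captions more concrete, here is a minimal sketch of how a list of events could be flattened into one interleaved sequence. The `Event` fields, the `interleave` helper, and the string formats are all assumptions made for illustration; they are not TRACE's actual tokenization or data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """Hypothetical event record: a time span, a salient score, and a caption."""
    start: float   # event start time in seconds
    end: float     # event end time in seconds
    score: float   # salient score of the event
    caption: str   # textual description of the event


def interleave(events: List[Event]) -> List[Tuple[str, str]]:
    """Flatten events into (task, value) pairs, ordered by start time.

    Each event contributes three consecutive entries: time, score, caption.
    """
    sequence: List[Tuple[str, str]] = []
    for ev in sorted(events, key=lambda e: e.start):
        sequence.append(("time", f"{ev.start:.1f}-{ev.end:.1f}"))
        sequence.append(("score", f"{ev.score:.1f}"))
        sequence.append(("caption", ev.caption))
    return sequence


events = [
    Event(12.0, 20.5, 3.0, "fry the onions"),
    Event(0.0, 10.0, 4.5, "chop the onions"),
]
for task, value in interleave(events):
    print(task, value)
```

In this toy layout every event is emitted only after all earlier events, which mirrors the causal ordering described above; how the real model tokenizes each field is defined in the paper and the training code.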

Environments

We use NPU environments for training and fine-tuning, and use V100 GPUs for evaluation. The environment we use can be found in npu-requirements and gpu-requirements.

Model Zoo

| Checkpoints | Description | URL |
|---|---|---|
| Initialization | Weights initialized from VideoLLaMA2 | trace-init |
| Stage-1 | Model checkpoints trained after stage-1 | trace-stage1 |
| Stage-2 | Model checkpoints trained after stage-2 | trace |
| FT-Charades | Fine-tuned on Charades-STA dataset | trace-ft-charades |
| FT-Youcook2 | Fine-tuned on Youcook2 dataset | trace-ft-youcook2 |
| FT-QVHighlights | Fine-tuned on QVHighlights dataset | trace-ft-qvhighlights |
| TRACE-retrieval | Forces the predicted timestamps to align with the input frame timestamps | trace-retrieval |
| TRACE-uni | Incorporates additional general video understanding data from a subset of LLaVA-Video-178k | trace-uni |
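
This README does not describe how trace-retrieval enforces the alignment between predicted and input timestamps. As a rough intuition only, the sketch below snaps each predicted timestamp to the nearest sampled frame timestamp after the fact. The function name `snap_to_frames` and the post-hoc snapping strategy are assumptions; the released model may enforce the constraint differently, for example during decoding.

```python
from bisect import bisect_left
from typing import List


def snap_to_frames(predicted: List[float], frame_times: List[float]) -> List[float]:
    """Map each predicted timestamp to the nearest sampled frame timestamp."""
    if not frame_times:
        raise ValueError("frame_times must not be empty")
    frames = sorted(frame_times)
    snapped = []
    for t in predicted:
        i = bisect_left(frames, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = frames[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda f: abs(f - t)))
    return snapped


# Hypothetical frames sampled every 2 seconds.
frame_times = [0.0, 2.0, 4.0, 6.0, 8.0]
print(snap_to_frames([1.2, 4.9, 9.5], frame_times))  # [2.0, 4.0, 8.0]
```

With this restriction the output can only contain timestamps that the model actually saw as input frames, which is the property the trace-retrieval description emphasizes.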

Inference and Evaluation

Please make sure the model and video paths are correct before running the code.
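
A quick pre-flight check can catch path mistakes before a long evaluation run starts. The sketch below is only a convenience and is not part of the repository; the helper `missing_paths` and both example paths are hypothetical placeholders to be replaced with your own locations.

```python
from pathlib import Path
from typing import Dict, List


def missing_paths(paths: Dict[str, str]) -> List[str]:
    """Return the labels whose paths do not exist on disk."""
    return [label for label, p in paths.items() if not Path(p).expanduser().exists()]


# Hypothetical locations; replace with your own checkpoint and video paths.
paths = {
    "model": "./checkpoints/trace",
    "video": "./videos/example.mp4",
}
problems = missing_paths(paths)
if problems:
    print("Missing:", ", ".join(problems))
else:
    print("All paths found.")
```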

Data

We have provided the annotation files, and the raw videos can be prepared by the following projects

Training

Stage 1 training

bash TRACE/scripts/train/pretrain-128.sh

Stage 2 training

bash TRACE/scripts/train/sft-128.sh

Fine-tune on downstream tasks

bash TRACE/scripts/train/sft-youcook2.sh

Please configure the data and model paths before running the scripts.

Results

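| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |
| TRACE-uni | 8.6 | 2.9 | 2.3 | 22.4 |

| Charades-STA (Zero-Shot) | 0.3 | 0.5 | 0.7 | mIOU |
|---|---|---|---|---|
| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |
| TRACE-uni | 63.7 | 43.7 | 21.0 | 41.5 |

| QVHighlights (Zero-Shot) | mAP | Hit@1 |
|---|---|---|
| TRACE | 26.8 | 42.7 |
| TRACE-retrieval | 27.9 | 44.3 |
| TRACE-uni | 27.5 | 43.9 |

| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |
| TRACE-uni | 29.2 | 6.9 | 6.4 | 40.4 |

| ActivityNet-MR | 0.3 | 0.5 | 0.7 | mIOU |
|---|---|---|---|---|
| TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |
| TRACE-uni | 53.2 | 38.2 | 24.7 | 39.4 |

| MVBench | Avg | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TRACE | 48.1 | 61.2 | 56.5 | 72.5 | 46.5 | 61.0 | 48.0 | 69.5 | 40.0 | 22.0 | 31.0 | 86.5 | 37.5 | 37.0 | 51.0 | 45.0 | 40.5 | 39.0 | 31.0 | 43.5 | 44.5 |
| TRACE-uni | 53.8 | 68.1 | 58.5 | 72.5 | 41.5 | 73.5 | 55.1 | 71.5 | 40.5 | 25.0 | 53.0 | 88.5 | 63.5 | 38.5 | 51.0 | 52.5 | 49.0 | 59.5 | 33.5 | 49.5 | 32.5 |

| VideoMME (w/o subtitle) | Short | Medium | Long | Avg |
|---|---|---|---|---|
| TRACE | 49.5 | 42.5 | 39.3 | 43.8 |
| TRACE-uni | 58.2 | 48.1 | 42.3 | 49.6 |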
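
For readers unfamiliar with the moment-retrieval columns, the sketch below assumes the usual convention: the 0.3, 0.5, and 0.7 columns report the percentage of queries whose predicted span reaches that temporal IoU with the ground truth, and mIOU is the mean IoU over all queries. This is an illustration under that assumption, with hypothetical function names; the repository's evaluation code remains the authority on how the reported numbers are computed.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def summarize(
    preds: List[Span],
    gts: List[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Tuple[Dict[float, float], float]:
    """Return (recall in percent per IoU threshold, mean IoU in percent)."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    recall = {th: 100.0 * sum(iou >= th for iou in ious) / len(ious) for th in thresholds}
    miou = 100.0 * sum(ious) / len(ious)
    return recall, miou


# Two toy queries: a perfect prediction and one with IoU = 1/3.
preds = [(0.0, 10.0), (5.0, 15.0)]
gts = [(0.0, 10.0), (10.0, 20.0)]
recall, miou = summarize(preds, gts)
print(recall, round(miou, 2))
```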

Demo

Demo of TRACE.

Acknowledgement

We are grateful for the following awesome projects:

Contributors:

  • Yongxin Guo
  • Jingyu Liu
  • Mingda Li

Bibliography

If you find this repository helpful for your project, please consider citing:

@misc{guo2024tracetemporalgroundingvideo,
      title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling}, 
      author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen},
      year={2024},
      eprint={2410.05643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.05643}, 
}