Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding


ICLR 2025

University of Central Florida

Overview of CoSPaL: Tubelet Phrase Grounding (TPG) contains two grounding modules, spatial and temporal. The spatial module grounds the correct subject tubelet, while the temporal module predicts the temporal action boundary. The Contextual Referral Grounding (CRG) block shows the breakdown and generation of the local and global queries that aid TPG.

Abstract

In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a multimodal task that localizes specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multimodal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.
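
To make the three components concrete, the snippet below is a minimal, illustrative sketch and not the paper's implementation: it scores candidate tubelets against a local (subject) query and a global (full-sentence) query in the spirit of TPG and CRG, and uses a simple loss-threshold schedule in the spirit of SPS. The feature shapes, the mixing weight alpha, the thresholds, and all function names are assumptions made for illustration only.

    import numpy as np

    def ground_tubelets(tubelet_feats, local_query, global_query,
                        alpha=0.5, temp_thresh=0.4):
        """Sketch of TPG with CRG-style queries.

        tubelet_feats: (num_tubelets, num_frames, dim) per-frame visual features.
        local_query:   (dim,) embedding of the subject phrase, e.g. "the man in red".
        global_query:  (dim,) embedding of the full sentence, carrying action context.
        Returns the index of the grounded tubelet and a (start, end) frame interval.
        """
        def cos(feats, q):
            # Cosine similarity of every frame-level feature to a query embedding.
            return feats @ q / (np.linalg.norm(feats, axis=-1) * np.linalg.norm(q) + 1e-8)

        # Blend local (subject) and global (contextual) similarities per frame.
        frame_scores = alpha * cos(tubelet_feats, local_query) \
                       + (1 - alpha) * cos(tubelet_feats, global_query)

        # Spatial grounding: pick the tubelet whose frames best match the query overall.
        best = int(frame_scores.mean(axis=1).argmax())

        # Temporal grounding: keep the span of frames scoring above a threshold.
        active = np.flatnonzero(frame_scores[best] > temp_thresh)
        span = (int(active[0]), int(active[-1])) if active.size \
               else (0, frame_scores.shape[1] - 1)
        return best, span

    def self_paced_mask(sample_losses, epoch, total_epochs,
                        lam_start=0.2, lam_end=2.0):
        """Classic self-paced selection: admit only easy (low-loss) samples early,
        then raise the threshold so harder scenes enter training (SPS sketch)."""
        lam = lam_start + (lam_end - lam_start) * epoch / max(total_epochs - 1, 1)
        return sample_losses < lam  # boolean mask of samples used this epoch

In the actual method, the tubelets come from detections linked over time and the query embeddings from a grounding foundation model; the fixed linear mix and hard thresholds here are simplifications used only to convey the overall flow.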

Video (coming soon)

Results


Qualitative Analysis


BibTeX

@article{kumar2025contextual,
  title={Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding},
  author={Kumar, Akash and Kira, Zsolt and Rawat, Yogesh Singh},
  journal={arXiv preprint arXiv:2501.17053},
  year={2025}
}