HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
- URL: http://arxiv.org/abs/2510.23043v1
- Date: Mon, 27 Oct 2025 06:13:07 GMT
- Title: HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
- Authors: Joungbin An, Kristen Grauman
- Abstract summary: HieraMamba is a hierarchical architecture that preserves temporal structure and semantic richness across scales. It sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
- Score: 52.10845971383909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
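As a rough illustration of the hierarchical anchor-pooling idea described in the abstract, the sketch below builds multi-level anchor tokens by scanning the frame sequence and keeping every k-th state, and pairs them with an anchor-conditioned InfoNCE-style loss. This is not the authors' implementation: a GRU stands in for Mamba's selective scan, the segment-pooled loss is omitted, and all names (AnchorPoolLevel, anchor_contrastive) are hypothetical.

```python
# Hedged sketch of hierarchical anchor pooling with an anchor-conditioned
# contrastive loss. A GRU is used as a placeholder for a Mamba block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorPoolLevel(nn.Module):
    """One pooling level: scan the sequence, then keep every `stride`-th
    hidden state as a compact 'anchor' token summarizing its neighborhood."""
    def __init__(self, dim, stride):
        super().__init__()
        self.stride = stride
        self.scan = nn.GRU(dim, dim, batch_first=True)  # stand-in for selective scanning

    def forward(self, x):                                # x: (B, T, D)
        h, _ = self.scan(x)                              # per-position summaries
        return h[:, self.stride - 1 :: self.stride]      # (B, T // stride, D)

def anchor_contrastive(anchors, frames, stride, temperature=0.07):
    """Anchor-conditioned InfoNCE: each anchor should match the mean of the
    frames it summarizes better than frames from other segments."""
    B, A, D = anchors.shape
    segs = frames[:, : A * stride].reshape(B, A, stride, D).mean(2)   # (B, A, D)
    a = F.normalize(anchors, dim=-1)
    s = F.normalize(segs, dim=-1)
    logits = torch.einsum("bad,bcd->bac", a, s) / temperature         # (B, A, A)
    target = torch.arange(A, device=anchors.device).expand(B, A)
    return F.cross_entropy(logits.reshape(B * A, A), target.reshape(-1))

# Tiny usage example on random features: two pooling levels at stride 4.
B, T, D = 2, 64, 256
frames = torch.randn(B, T, D)
level1, level2 = AnchorPoolLevel(D, stride=4), AnchorPoolLevel(D, stride=4)
anchors1 = level1(frames)            # 16 fine anchors per video
anchors2 = level2(anchors1)          # 4 coarser anchors
loss = anchor_contrastive(anchors1, frames, stride=4)
print(anchors1.shape, anchors2.shape, loss.item())
```

The two-level call above is only meant to show how coarser anchors can be derived from finer ones; the paper's actual block design, number of levels, and the segment-pooled contrastive objective are not reproduced here.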
Related papers
- Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models [61.11154533305096]
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. We propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts. Our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs.
arXiv Detail & Related papers (2026-03-02T03:06:40Z) - MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos [22.175314789730667]
MultiHateLoc is a framework for weakly-supervised multimodal hate localisation. It produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
arXiv Detail & Related papers (2025-12-11T08:18:22Z) - SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding [6.980340270823506]
We present SceneRAG, a framework to segment videos into narrative-consistent scenes. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines.
arXiv Detail & Related papers (2025-06-09T10:00:54Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing [45.216750448864275]
Learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global video frames.
Our proposed Contrastive Hash-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets.
arXiv Detail & Related papers (2023-10-29T07:36:11Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM substantially surpasses state-of-the-art methods on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z) - Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
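To make the biaffine idea concrete, the following minimal sketch scores every (start, end) index pair of a clip sequence in one shot with a bilinear form. It is a generic illustration under assumed dimensions, not the paper's implementation; the class name BiaffineSpanScorer and the bias-feature trick are assumptions.

```python
# Hedged sketch: biaffine scoring over all (start, end) pairs of a video.
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.start_proj = nn.Linear(dim, hidden)
        self.end_proj = nn.Linear(dim, hidden)
        # Bilinear weight; appending a constant 1 feature folds in linear terms.
        self.W = nn.Parameter(torch.empty(hidden + 1, hidden + 1))
        nn.init.xavier_uniform_(self.W)

    def forward(self, clips):                       # clips: (B, T, D)
        s = torch.relu(self.start_proj(clips))      # start representations (B, T, H)
        e = torch.relu(self.end_proj(clips))        # end representations   (B, T, H)
        ones = clips.new_ones(clips.shape[:2] + (1,))
        s = torch.cat([s, ones], dim=-1)            # append bias feature
        e = torch.cat([e, ones], dim=-1)
        # score[b, i, j] = s_i^T W e_j  ->  a T x T map over (start, end) pairs
        return torch.einsum("bih,hk,bjk->bij", s, self.W, e)

scores = BiaffineSpanScorer(dim=256)(torch.randn(2, 50, 256))
print(scores.shape)  # torch.Size([2, 50, 50]); mask pairs with j < i before picking a span
```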
arXiv Detail & Related papers (2021-03-22T03:13:05Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - Short-Term and Long-Term Context Aggregation Network for Video Inpainting [126.06302824297948]
Video inpainting aims to restore missing regions of a video and has many applications such as video editing and object removal.
We present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting.
Experiments show that it outperforms state-of-the-art methods with better inpainting results and fast inpainting speed.
arXiv Detail & Related papers (2020-09-12T03:50:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.