HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
- URL: http://arxiv.org/abs/2510.23043v1
- Date: Mon, 27 Oct 2025 06:13:07 GMT
- Title: HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
- Authors: Joungbin An, Kristen Grauman
- Abstract summary: HieraMamba is a hierarchical architecture that preserves temporal structure and semantic richness across scales. It sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
- Score: 52.10845971383909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
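As a rough illustration of the hierarchical anchor-pooling idea described in the abstract, the sketch below builds multi-level anchor tokens by scanning the frame sequence and keeping every k-th state, and pairs them with an anchor-conditioned InfoNCE-style loss. This is not the authors' implementation: a GRU stands in for Mamba's selective scan, the segment-pooled loss is omitted, and all names (AnchorPoolLevel, anchor_contrastive) are hypothetical.

```python
# Hedged sketch of hierarchical anchor pooling with an anchor-conditioned
# contrastive loss. A GRU is used as a placeholder for a Mamba block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorPoolLevel(nn.Module):
    """One pooling level: scan the sequence, then keep every `stride`-th
    hidden state as a compact 'anchor' token summarizing its neighborhood."""
    def __init__(self, dim, stride):
        super().__init__()
        self.stride = stride
        self.scan = nn.GRU(dim, dim, batch_first=True)  # stand-in for selective scanning

    def forward(self, x):                                # x: (B, T, D)
        h, _ = self.scan(x)                              # per-position summaries
        return h[:, self.stride - 1 :: self.stride]      # (B, T // stride, D)

def anchor_contrastive(anchors, frames, stride, temperature=0.07):
    """Anchor-conditioned InfoNCE: each anchor should match the mean of the
    frames it summarizes better than frames from other segments."""
    B, A, D = anchors.shape
    segs = frames[:, : A * stride].reshape(B, A, stride, D).mean(2)   # (B, A, D)
    a = F.normalize(anchors, dim=-1)
    s = F.normalize(segs, dim=-1)
    logits = torch.einsum("bad,bcd->bac", a, s) / temperature         # (B, A, A)
    target = torch.arange(A, device=anchors.device).expand(B, A)
    return F.cross_entropy(logits.reshape(B * A, A), target.reshape(-1))

# Tiny usage example on random features: two pooling levels at stride 4.
B, T, D = 2, 64, 256
frames = torch.randn(B, T, D)
level1, level2 = AnchorPoolLevel(D, stride=4), AnchorPoolLevel(D, stride=4)
anchors1 = level1(frames)            # 16 fine anchors per video
anchors2 = level2(anchors1)          # 4 coarser anchors
loss = anchor_contrastive(anchors1, frames, stride=4)
print(anchors1.shape, anchors2.shape, loss.item())
```

The two-level call above is only meant to show how coarser anchors can be derived from finer ones; the paper's actual block design, number of levels, and the segment-pooled contrastive objective are not reproduced here.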
Related papers
- Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models [61.11154533305096]
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. We propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts. Our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs.
arXiv Detail & Related papers (2026-03-02T03:06:40Z) - MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos [22.175314789730667]
MultiHateLoc is a framework for weakly-supervised multimodal hate localisation. It produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
arXiv Detail & Related papers (2025-12-11T08:18:22Z) - SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding [6.980340270823506]
We present SceneRAG, a framework to segment videos into narrative-consistent scenes. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines.
arXiv Detail & Related papers (2025-06-09T10:00:54Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing [45.216750448864275]
Learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global video frames.
Our proposed Contrastive Hash-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets.
arXiv Detail & Related papers (2023-10-29T07:36:11Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM substantially surpasses state-of-the-art methods on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z) - Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
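To make the biaffine idea concrete, the following minimal sketch scores every (start, end) index pair of a clip sequence in one shot with a bilinear form. It is a generic illustration under assumed dimensions, not the paper's implementation; the class name BiaffineSpanScorer and the bias-feature trick are assumptions.

```python
# Hedged sketch: biaffine scoring over all (start, end) pairs of a video.
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.start_proj = nn.Linear(dim, hidden)
        self.end_proj = nn.Linear(dim, hidden)
        # Bilinear weight; appending a constant 1 feature folds in linear terms.
        self.W = nn.Parameter(torch.empty(hidden + 1, hidden + 1))
        nn.init.xavier_uniform_(self.W)

    def forward(self, clips):                       # clips: (B, T, D)
        s = torch.relu(self.start_proj(clips))      # start representations (B, T, H)
        e = torch.relu(self.end_proj(clips))        # end representations   (B, T, H)
        ones = clips.new_ones(clips.shape[:2] + (1,))
        s = torch.cat([s, ones], dim=-1)            # append bias feature
        e = torch.cat([e, ones], dim=-1)
        # score[b, i, j] = s_i^T W e_j  ->  a T x T map over (start, end) pairs
        return torch.einsum("bih,hk,bjk->bij", s, self.W, e)

scores = BiaffineSpanScorer(dim=256)(torch.randn(2, 50, 256))
print(scores.shape)  # torch.Size([2, 50, 50]); mask pairs with j < i before picking a span
```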
arXiv Detail & Related papers (2021-03-22T03:13:05Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - Short-Term and Long-Term Context Aggregation Network for Video Inpainting [126.06302824297948]
Video inpainting aims to restore missing regions of a video and has many applications such as video editing and object removal.
We present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting.
Experiments show that it outperforms state-of-the-art methods with better inpainting results and fast inpainting speed.
arXiv Detail & Related papers (2020-09-12T03:50:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.