Sim-DETR: Unlock DETR for Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2509.23867v1
- Date: Sun, 28 Sep 2025 13:21:10 GMT
- Title: Sim-DETR: Unlock DETR for Temporal Sentence Grounding
- Authors: Jiajin Tang, Zhengxuan Wei, Yuchen Zhu, Cheng Shi, Guanbin Li, Liang Lin, Sibei Yang
- Abstract summary: Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query. We find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We propose Sim-DETR, which extends the standard DETR with two minor modifications.
- Score: 104.78823923373784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.
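The first of the two decoder modifications — constraining self-attention between queries based on their semantic and positional overlap — can be sketched in a minimal, hypothetical form. This is not the authors' code: the thresholds `sim_thresh` and `iou_thresh`, and the use of cosine similarity plus temporal IoU, are assumed placeholders for whatever overlap measures the paper actually uses. The sketch builds an additive mask that lets a query attend only to queries that overlap it both semantically and temporally.

```python
import numpy as np

def temporal_iou(spans):
    """Pairwise IoU between 1-D temporal spans; spans has shape (N, 2) as (start, end)."""
    s, e = spans[:, 0], spans[:, 1]
    inter = np.maximum(0.0, np.minimum(e[:, None], e[None, :])
                            - np.maximum(s[:, None], s[None, :]))
    union = (e - s)[:, None] + (e - s)[None, :] - inter
    return inter / np.maximum(union, 1e-8)

def constrained_attention_mask(query_feats, query_spans,
                               sim_thresh=0.7, iou_thresh=0.5):
    """Additive self-attention mask: 0 where interaction is allowed, -inf otherwise.
    Interaction is allowed only between queries that are both semantically similar
    (cosine similarity above sim_thresh) and temporally overlapping (IoU above iou_thresh)."""
    f = query_feats / np.linalg.norm(query_feats, axis=-1, keepdims=True)
    sem = f @ f.T                       # pairwise cosine similarity
    iou = temporal_iou(query_spans)
    allowed = (sem > sim_thresh) & (iou > iou_thresh)
    np.fill_diagonal(allowed, True)     # a query always attends to itself
    return np.where(allowed, 0.0, -np.inf)
```

The mask would be added to the attention logits before the softmax in each decoder layer, so conflicting queries from dissimilar or non-overlapping moments stop suppressing one another.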
Related papers
- Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Spatio-Temporal Video Grounding aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG in an open-world, training-free scenario. Specifically, two specialized agents, the SRA (Spatial Reasoning Agent) and the TRA (Temporal Reasoning Agent), are constructed by leveraging modern Multimodal Large Language Models (MLLMs). Experiments on popular benchmarks demonstrate the superiority of the proposed approach, which outperforms existing weakly-supervised and zero-shot approaches by a clear margin.
arXiv Detail & Related papers (2026-02-10T10:16:27Z)
- FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation [0.0]
We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. We conduct experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MuSiQue. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems.
arXiv Detail & Related papers (2025-10-25T15:59:33Z)
- Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding [30.223279362023337]
Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. Existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. We propose DualGround, a dual-branch architecture that explicitly separates global and local semantics.
arXiv Detail & Related papers (2025-10-23T05:53:01Z)
- TimeExpert: Boosting Long Time Series Forecasting with Temporal Mix of Experts [11.53964887034519]
We propose the Temporal Mix of Experts (TMOE), a novel attention-level mechanism that reimagines key-value (K-V) pairs as local experts. TMOE performs adaptive expert selection for each query via localized filtering of irrelevant timestamps. We then replace the vanilla attention mechanism in popular time-series Transformer frameworks (i.e., PatchTST and Timer) with TMOE, without extra structural modifications.
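The "adaptive expert selection via localized filtering" described above can be sketched as per-query top-k key selection inside a standard attention step. This is a hypothetical illustration, not the paper's implementation; `top_k` is an assumed hyperparameter, and the actual TMOE selection rule may differ.

```python
import numpy as np

def tmoe_attention(q, k, v, top_k=4):
    """Each query keeps only its top_k highest-scoring key-value pairs
    (its selected local 'experts') and masks the rest before the softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (Tq, Tk) attention logits
    # per-query threshold: the top_k-th largest score in each row
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)    # filter irrelevant timestamps
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                # softmax over the kept experts
    return w @ v
```

Because the change lives entirely at the attention level, a sketch like this could in principle be dropped into an existing Transformer block without other structural modifications, which is the property the abstract emphasizes.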
arXiv Detail & Related papers (2025-09-27T06:22:09Z)
- Re3: Learning to Balance Relevance & Recency for Temporal Information Retrieval [10.939002113975706]
Temporal Information Retrieval is a critical yet unresolved task for modern search systems. Re3 is a framework that balances semantic and temporal information through a query-aware gating mechanism. On Re2Bench, Re3 achieves state-of-the-art results, leading in R@1 across all three subsets.
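A query-aware gate of the kind described can be sketched as a learned scalar in (0, 1) that mixes a semantic-relevance score with a recency score. The linear parameterization (`w_gate`, `b_gate`) and the convex combination are assumptions for illustration, not Re3's exact design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_score(sem_score, recency_score, query_feat, w_gate, b_gate=0.0):
    """Query-aware gating: a gate g in (0, 1), computed from the query features,
    decides how much the final ranking score leans on semantic relevance
    versus recency."""
    g = sigmoid(query_feat @ w_gate + b_gate)
    return g * sem_score + (1.0 - g) * recency_score
```

Intuitively, a query like "latest transformer results" should push the gate toward recency, while a timeless query should push it toward pure semantic relevance.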
arXiv Detail & Related papers (2025-09-01T09:44:01Z)
- Respecting Temporal-Causal Consistency: Entity-Event Knowledge Graphs for Retrieval-Augmented Generation [69.45495166424642]
We develop a robust and discriminative QA benchmark to measure temporal, causal, and character consistency understanding in narrative documents. We then introduce Entity-Event RAG (E2RAG), a dual-graph framework that keeps separate entity and event subgraphs linked by a bipartite mapping. Across ChronoQA, our approach outperforms state-of-the-art unstructured and KG-based RAG baselines, with notable gains on causal and character consistency queries.
arXiv Detail & Related papers (2025-06-06T10:07:21Z)
- Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding [30.33362992577831]
Temporal sentence grounding is a challenging task that aims to localize the moment spans relevant to a language description. Recent DETR-based models have achieved notable progress by leveraging multiple learnable moment queries. We present a Region-Guided TRansformer (RGTR) for temporal sentence grounding.
arXiv Detail & Related papers (2024-05-31T19:13:09Z)
- Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection [48.429555904690595]
We introduce spatially decoupled DETR, which includes a task-aware query generation module and a disentangled feature learning process.
We demonstrate that our approach achieves a significant improvement on the MSCOCO dataset compared to previous work.
arXiv Detail & Related papers (2023-10-24T15:54:11Z)
- Semi-DETR: Semi-Supervised Object Detection with Detection Transformers [105.45018934087076]
We analyze the DETR-based framework on semi-supervised object detection (SSOD).
We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector.
Our method outperforms all state-of-the-art methods by clear margins.
arXiv Detail & Related papers (2023-07-16T16:32:14Z)
- DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection [34.14603473160207]
This paper presents a DETR-based method for cross-domain weakly supervised object detection (CDWSOD).
We think DETR has strong potential for CDWSOD due to an insight: the encoder and the decoder in DETR are both based on the attention mechanism.
The aggregation results, i.e., image-level predictions, can naturally exploit the weak supervision for domain alignment.
arXiv Detail & Related papers (2023-04-14T12:16:42Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.