Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2406.00143v2
- Date: Thu, 19 Dec 2024 08:58:15 GMT
- Title: Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding
- Authors: Xiaolong Sun, Liushuai Shi, Le Wang, Sanping Zhou, Kun Xia, Yabing Wang, Gang Hua
- Abstract summary: Temporal sentence grounding is a challenging task that aims to localize the moment spans relevant to a language description.
Recent DETR-based models have achieved notable progress by leveraging multiple learnable moment queries.
We present a Region-Guided TRansformer (RGTR) for temporal sentence grounding.
- Score: 30.33362992577831
- Abstract: Temporal sentence grounding is a challenging task that aims to localize the moment spans relevant to a language description. Although recent DETR-based models have achieved notable progress by leveraging multiple learnable moment queries, they suffer from overlapped and redundant proposals, leading to inaccurate predictions. We attribute this limitation to the lack of task-related guidance for the learnable queries to serve a specific mode. Furthermore, the complex solution space generated by variable and open-vocabulary language descriptions complicates optimization, making it harder for learnable queries to distinguish each other adaptively. To tackle this limitation, we present a Region-Guided TRansformer (RGTR) for temporal sentence grounding, which diversifies moment queries to eliminate overlapped and redundant predictions. Instead of using learnable queries, RGTR adopts a set of anchor pairs as moment queries to introduce explicit regional guidance. Each anchor pair takes charge of moment prediction for a specific temporal region, which reduces the optimization difficulty and ensures the diversity of the final predictions. In addition, we design an IoU-aware scoring head to improve proposal quality. Extensive experiments demonstrate the effectiveness of RGTR, outperforming state-of-the-art methods on QVHighlights, Charades-STA and TACoS datasets. Codes are available at https://github.com/TensorsSun/RGTR
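The abstract's two key components — anchor pairs that give each moment query an explicit temporal region, and an IoU-aware scoring head for ranking proposals — can be sketched as follows. This is a minimal illustration, not the authors' implementation (see the linked repository for that); the uniform anchor layout, the width set, and all function names are assumptions made for the example.

```python
def make_anchor_pairs(num_regions, widths=(0.1, 0.3)):
    """One (center, width) anchor pair per region per width, on a
    normalized [0, 1] timeline, so each query covers a distinct region."""
    anchors = []
    for i in range(num_regions):
        center = (i + 0.5) / num_regions  # midpoint of region i
        for w in widths:
            anchors.append((center, w))
    return anchors

def to_span(center, width):
    """Convert a (center, width) query to a clipped (start, end) span."""
    return max(0.0, center - width / 2), min(1.0, center + width / 2)

def temporal_iou(a, b):
    """Temporal IoU of two (start, end) spans; an IoU-aware head would be
    trained to predict this overlap with the ground-truth moment."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

anchors = make_anchor_pairs(num_regions=5)
gt = (0.42, 0.58)  # hypothetical ground-truth moment
# Rank proposals by temporal IoU; only anchors near the target region score well.
best = max(anchors, key=lambda cw: temporal_iou(to_span(*cw), gt))
```

Because every anchor pair is tied to its own region, two queries cannot collapse onto the same span, which is the paper's mechanism for avoiding overlapped and redundant proposals.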
Related papers
- Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks [11.053340674721005]
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources.
This paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval.
arXiv Detail & Related papers (2024-12-20T06:58:32Z)
- Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization [60.899082019130766]
We introduce a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization.
FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions.
PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN.
arXiv Detail & Related papers (2024-07-23T15:07:52Z)
- Geode: A Zero-shot Geospatial Question-Answering Agent with Explicit Reasoning and Precise Spatio-Temporal Retrieval [0.0]
We introduce a pioneering system designed to tackle zero-shot geospatial question-answering tasks with high precision.
Our approach represents a significant improvement in addressing the limitations of current large language models.
arXiv Detail & Related papers (2024-06-26T21:59:54Z)
- TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression [25.180317527112372]
Normalized coordinate expression is a key factor in removing the reliance on hand-crafted components in query-based detectors for temporal action detection (TAD).
We propose TE-TAD, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression.
Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors.
arXiv Detail & Related papers (2024-04-03T02:16:30Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation stems from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Team DETR: Guide Queries as a Professional Team in Detection Transformers [31.521916994653235]
We propose Team DETR, which leverages query collaboration and position constraints to embrace objects of interest more precisely.
We also dynamically cater to each query member's prediction preference, offering the query better scale and spatial priors.
In addition, the proposed Team DETR is flexible enough to be adapted to other existing DETR variants without increasing parameters and calculations.
arXiv Detail & Related papers (2023-02-14T15:21:53Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Progressive Localization Networks for Language-based Moment Localization [56.54450664871467]
This paper focuses on the task of language-based moment localization.
Most existing methods first sample sufficient candidate moments of various temporal lengths, then match them against the given query to determine the target moment.
We propose a novel multi-stage Progressive localization Network (PLN) which progressively localizes the target moment in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-02-02T03:45:59Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.