Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2109.06400v1
- Date: Tue, 14 Sep 2021 02:08:23 GMT
- Title: Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding
- Authors: Daizong Liu, Xiaoye Qu, Pan Zhou
- Abstract summary: We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
- Score: 53.377028000325424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key solution to temporal sentence grounding (TSG) exists in how to learn
effective alignment between vision and language features extracted from an
untrimmed video and a sentence description. Existing methods mainly leverage
vanilla soft attention to perform the alignment in a single-step process.
However, such single-step attention is insufficient in practice, since the
complicated inter- and intra-modality relations are usually captured only
through multi-step reasoning. In this paper, we propose an Iterative Alignment
Network (IA-Net) for the TSG task, which iteratively interacts inter- and
intra-modal features over multiple steps for more accurate grounding.
Specifically, during the iterative reasoning process, we pad multi-modal
features with learnable parameters to alleviate the nowhere-to-attend problem
of non-matched frame-word pairs, and enhance the basic co-attention mechanism
in a parallel manner. To further calibrate the misaligned attention caused by
each reasoning step, we also devise a calibration module following each
attention module to refine the alignment knowledge. With such iterative
alignment scheme, our IA-Net can robustly capture the fine-grained relations
between vision and language domains step-by-step for progressively reasoning
the temporal boundaries. Extensive experiments conducted on three challenging
benchmarks demonstrate that our proposed model performs better than the
state-of-the-arts.
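The core mechanism the abstract describes, padding the attended features with a learnable "nowhere-to-attend" slot so that non-matched frame-word pairs are not forced onto unrelated items, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the fixed `pad_key`/`pad_value` vectors below are hypothetical stand-ins for IA-Net's learnable padding parameters, and only a single attention pass (not the full iterative co-attention with calibration modules) is shown.

```python
import math

def dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend_with_pad(queries, keys, values, pad_key, pad_value):
    """One cross-modal attention pass with a 'nowhere-to-attend' slot.

    The pad key/value pair is appended to the keys/values, so a query with
    no good match among the real keys can place its attention mass on the
    pad slot instead of being forced to attend to an unrelated item. In
    IA-Net this pad would be a learnable parameter; here it is fixed.
    """
    keys_p = keys + [pad_key]
    values_p = values + [pad_value]
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores over real keys plus the pad slot.
        scores = [dot(q, k) / math.sqrt(dim) for k in keys_p]
        weights = softmax(scores)
        # Weighted sum of values (pad value absorbs the unmatched mass).
        out.append([sum(w * v[i] for w, v in zip(weights, values_p))
                    for i in range(len(values_p[0]))])
    return out
```

For example, a frame-feature query attending over two word features plus a zero pad slot receives an output dominated by the matching word, while part of its attention mass is absorbed by the pad rather than spread onto the non-matching word.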
Related papers
- Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that effectively utilizes semantically rich textual features from visual descriptions to complement the visual features. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement (PTR) module and a Point-based Multimodal Alignment (PMA) module.
arXiv Detail & Related papers (2026-02-01T14:35:46Z) - ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding [6.310226357092042]
Open-vocabulary grounding requires accurate vision-language alignment under weak supervision. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation.
arXiv Detail & Related papers (2026-01-30T07:38:04Z) - REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport [7.952582509792969]
Real-world instructional data often contains background segments, repeated actions, and steps presented out of order. We introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme.
arXiv Detail & Related papers (2025-09-29T07:32:14Z) - Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation [48.488114831677166]
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. We propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention.
arXiv Detail & Related papers (2025-08-27T20:47:03Z) - Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation [62.14692332209628]
"Interaction Distillation" is a novel training framework for more adequate preference modeling through attention-level optimization. It provides more stable and generalizable reward signals compared to state-of-the-art RM optimization methods.
arXiv Detail & Related papers (2025-08-04T17:06:23Z) - OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition [2.3093110834423616]
This paper presents OptiCorNet, a novel sequence modeling framework. It unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
arXiv Detail & Related papers (2025-07-19T04:29:43Z) - Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment.
Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - TS-TCD: Triplet-Level Cross-Modal Distillation for Time-Series Forecasting Using Large Language Models [15.266543423942617]
We present a novel framework, TS-TCD, which introduces a comprehensive three-tiered cross-modal knowledge distillation mechanism.
Unlike prior work that focuses on isolated alignment techniques, our framework systematically integrates all three tiers.
Experiments on benchmark time-series demonstrate that TS-TCD achieves state-of-the-art results, outperforming traditional methods in both accuracy and robustness.
arXiv Detail & Related papers (2024-09-23T12:57:24Z) - Introducing Gating and Context into Temporal Action Detection [0.8987776881291144]
Temporal Action Detection (TAD) remains challenging due to action overlaps and variable action durations.
Recent findings suggest that TAD performance is dependent on the structural design of transformers rather than on the self-attention mechanism.
We propose a refined feature extraction process through lightweight, yet effective operations.
arXiv Detail & Related papers (2024-09-06T11:52:42Z) - Temporally Grounding Instructional Diagrams in Unconstrained Videos [51.85805768507356]
We study the challenging problem of simultaneously localizing a sequence of queries in instructional diagrams in a video.
Most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries.
We propose composite queries constructed by exhaustively pairing up the visual content features of the step diagrams.
We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries.
arXiv Detail & Related papers (2024-07-16T05:44:30Z) - RESTORE: Towards Feature Shift for Vision-Language Prompt Learning [33.13407089704543]
We show that prompt tuning along only one branch of CLIP is the reason why the misalignment occurs.
Without proper regularization across the learnable parameters in different modalities, prompt learning violates the original pre-training constraints.
We propose RESTORE, a multi-modal prompt learning method that exerts explicit constraints on cross-modal consistency.
arXiv Detail & Related papers (2024-03-10T08:52:48Z) - Understanding and Constructing Latent Modality Structures in Multi-modal
Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z) - Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video
Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT)
To generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art approaches with clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
arXiv Detail & Related papers (2021-06-30T15:42:08Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.