Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2109.06400v1
- Date: Tue, 14 Sep 2021 02:08:23 GMT
- Title: Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding
- Authors: Daizong Liu, Xiaoye Qu, Pan Zhou
- Abstract summary: We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
- Score: 53.377028000325424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The key to temporal sentence grounding (TSG) lies in learning
effective alignment between the vision and language features extracted from
an untrimmed video and a sentence description. Existing methods mainly rely
on vanilla soft attention to perform this alignment in a single step.
However, such single-step attention is insufficient in practice, since the
complicated inter- and intra-modal relations typically require multi-step
reasoning to capture. In this paper, we propose an Iterative Alignment
Network (IA-Net) for the TSG task, which iteratively interacts inter- and
intra-modal features over multiple steps for more accurate grounding.
Specifically, during the iterative reasoning process, we pad multi-modal
features with learnable parameters to alleviate the nowhere-to-attend problem
of non-matched frame-word pairs, and enhance the basic co-attention mechanism
in a parallel manner. To further calibrate the misaligned attention caused by
each reasoning step, we also devise a calibration module following each
attention module to refine the alignment knowledge. With such an iterative
alignment scheme, IA-Net robustly captures the fine-grained relations between
the vision and language domains step by step and progressively reasons about
the temporal boundaries. Extensive experiments on three challenging
benchmarks demonstrate that our model outperforms state-of-the-art methods.
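The abstract describes the iterative alignment mechanism only at a high level. Below is a minimal, illustrative PyTorch sketch of one alignment step: a learnable padding slot gives non-matched frames and words somewhere to attend, co-attention runs over both modalities in parallel, and a gated calibration refines the attended features before the next step. The module structure, gating form, and dimensions are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IterativeAlignmentBlock(nn.Module):
    """One co-attention + calibration step; stacked T times for multi-step reasoning.

    Illustrative sketch only: a learnable padding token gives non-matched
    frame/word positions somewhere to attend, and a gated calibration
    refines the attended features before the next iteration.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.pad_v = nn.Parameter(torch.zeros(1, 1, dim))   # learnable "no-match" slot, video side
        self.pad_q = nn.Parameter(torch.zeros(1, 1, dim))   # learnable "no-match" slot, query side
        self.scale = dim ** -0.5
        self.calib_v = nn.Linear(2 * dim, dim)               # calibration gate for video features
        self.calib_q = nn.Linear(2 * dim, dim)               # calibration gate for query features

    def cross_attend(self, x, y, pad_y):
        # Prepend the learnable padding slot to the attended modality.
        b = y.size(0)
        y_pad = torch.cat([pad_y.expand(b, -1, -1), y], dim=1)
        attn = F.softmax(x @ y_pad.transpose(1, 2) * self.scale, dim=-1)
        return attn @ y_pad                                   # features of x attended over y

    def calibrate(self, x, attended, gate_fn):
        # Gated refinement: keep already-aligned content, absorb newly attended content.
        g = torch.sigmoid(gate_fn(torch.cat([x, attended], dim=-1)))
        return g * attended + (1.0 - g) * x

    def forward(self, video, query):
        v_att = self.cross_attend(video, query, self.pad_q)   # frames attend to words
        q_att = self.cross_attend(query, video, self.pad_v)   # words attend to frames (parallel)
        video = self.calibrate(video, v_att, self.calib_v)
        query = self.calibrate(query, q_att, self.calib_q)
        return video, query


if __name__ == "__main__":
    blocks = nn.ModuleList(IterativeAlignmentBlock(256) for _ in range(3))  # 3 reasoning steps
    v, q = torch.randn(2, 64, 256), torch.randn(2, 12, 256)   # frame and word features
    for blk in blocks:
        v, q = blk(v, q)
    print(v.shape, q.shape)  # torch.Size([2, 64, 256]) torch.Size([2, 12, 256])
```

Stacking several such blocks corresponds to the multi-step reasoning the abstract refers to; the refined features would then feed a boundary-prediction head.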
Related papers
- TS-TCD: Triplet-Level Cross-Modal Distillation for Time-Series Forecasting Using Large Language Models [15.266543423942617] (arXiv 2024-09-23)
We present a novel framework, TS-TCD, which introduces a comprehensive three-tiered cross-modal knowledge distillation mechanism.
Unlike prior work that focuses on isolated alignment techniques, our framework systematically integrates alignment across the three tiers.
Experiments on benchmark time-series demonstrate that TS-TCD achieves state-of-the-art results, outperforming traditional methods in both accuracy and robustness.
- Introducing Gating and Context into Temporal Action Detection [0.8987776881291144] (arXiv 2024-09-06)
Temporal Action Detection (TAD) remains challenging due to action overlaps and variable action durations.
Recent findings suggest that TAD performance is dependent on the structural design of transformers rather than on the self-attention mechanism.
We propose a refined feature extraction process through lightweight, yet effective operations.
- Temporally Grounding Instructional Diagrams in Unconstrained Videos [51.85805768507356] (arXiv 2024-07-16)
We study the challenging problem of simultaneously localizing a sequence of queries, given in the form of instructional diagrams, in a video.
Most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries.
We propose composite queries constructed by exhaustively pairing up the visual content features of the step diagrams.
We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries.
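The composite-query construction above is described in a single sentence. The sketch below shows one plausible reading of it, assuming ordered pairing of step-diagram features composed by concatenation plus a linear projection; the paper's actual composition function and pairing scheme may differ.

```python
import torch
import torch.nn as nn


def build_composite_queries(step_feats: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Exhaustively pair step-diagram features into composite queries.

    step_feats: (N, D) visual features of N step diagrams.
    Returns (N * N, D) composite queries, one per ordered pair (i, j).
    Concatenation + linear projection is an assumed composition function.
    """
    n, d = step_feats.shape
    a = step_feats.unsqueeze(1).expand(n, n, d)   # feature of step i, broadcast over j
    b = step_feats.unsqueeze(0).expand(n, n, d)   # feature of step j, broadcast over i
    pairs = torch.cat([a, b], dim=-1).reshape(n * n, 2 * d)
    return proj(pairs)                             # (N*N, D) composite queries


steps = torch.randn(5, 512)                        # 5 step diagrams
proj = nn.Linear(1024, 512)
queries = build_composite_queries(steps, proj)
print(queries.shape)                               # torch.Size([25, 512])
```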
- RESTORE: Towards Feature Shift for Vision-Language Prompt Learning [33.13407089704543] (arXiv 2024-03-10)
We show that prompt tuning along only one branch of CLIP is the reason why the misalignment occurs.
Without proper regularization across the learnable parameters in different modalities, prompt learning violates the original pre-training constraints.
We propose RESTORE, a multi-modal prompt learning method that exerts explicit constraints on cross-modal consistency.
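The summary states that RESTORE constrains cross-modal consistency of the feature shift caused by prompt tuning. The sketch below is a minimal, assumed formulation of such a constraint: each branch's shift is measured against its frozen (pre-trained) features, and the loss penalizes asymmetric drift between the two branches. The paper's exact shift measure and objective may differ.

```python
import torch
import torch.nn.functional as F


def feature_shift_consistency(img_tuned, img_frozen, txt_tuned, txt_frozen):
    """Penalize asymmetric drift of the two CLIP branches under prompt tuning.

    Each *_tuned / *_frozen tensor holds (B, D) embeddings from the prompted
    and the original (frozen) encoder. The per-branch shift is the distance
    to the frozen features; the loss asks both branches to drift comparably.
    This is an illustrative formulation, not RESTORE's exact objective.
    """
    shift_img = 1.0 - F.cosine_similarity(img_tuned, img_frozen, dim=-1)  # (B,)
    shift_txt = 1.0 - F.cosine_similarity(txt_tuned, txt_frozen, dim=-1)  # (B,)
    return (shift_img - shift_txt).abs().mean()


img_t, img_f = torch.randn(8, 512), torch.randn(8, 512)
txt_t, txt_f = torch.randn(8, 512), torch.randn(8, 512)
print(feature_shift_consistency(img_t, img_f, txt_t, txt_f))
```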
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254] (arXiv 2023-03-10)
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
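The three regularizers are named but not defined here. As a rough illustration of the underlying idea, the sketch below implements a simple geometric-consistency-style term that asks the intra-modal similarity structure of an image batch to match that of the paired text batch; the paper's actual deep feature separation, Brownian-bridge, and geometric consistency losses are formulated differently and are not reproduced.

```python
import torch
import torch.nn.functional as F


def geometric_consistency_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Match intra-modal similarity structures of paired image/text embeddings.

    img_emb, txt_emb: (B, D) embeddings of B aligned image-text pairs.
    If the two modalities share the same latent geometry, their intra-modal
    cosine-similarity matrices should agree. Illustrative sketch only.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim_img = img @ img.t()            # (B, B) image-image similarities
    sim_txt = txt @ txt.t()            # (B, B) text-text similarities
    return F.mse_loss(sim_img, sim_txt)


loss = geometric_consistency_loss(torch.randn(16, 256), torch.randn(16, 256))
print(loss)
```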
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975] (arXiv 2022-09-27)
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art approaches by clear margins on two challenging video benchmarks.
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018] (arXiv 2022-04-04)
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
- Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497] (arXiv 2021-06-30)
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network within a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination through self-supervision.
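WSTAN is described as working in a MIL paradigm with only weak video-sentence labels. The sketch below shows the generic MIL aggregation such a setup relies on, with top-k pooling as an assumed aggregator; the temporal adjacent network, complementary branch, and self-discriminating loss from the summary are not modeled here.

```python
import torch
import torch.nn.functional as F


def mil_video_score(proposal_scores: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Aggregate per-proposal matching scores into a video-level score.

    proposal_scores: (B, P) sentence-matching scores for P temporal proposals.
    Top-k mean pooling is one common MIL aggregator (an assumed choice here).
    """
    topk = proposal_scores.topk(k, dim=-1).values   # (B, k)
    return topk.mean(dim=-1)                        # (B,)


scores = torch.rand(4, 32)                          # 4 videos, 32 proposals each
labels = torch.tensor([1.0, 0.0, 1.0, 1.0])         # weak video-sentence match labels
loss = F.binary_cross_entropy(mil_video_score(scores).clamp(0, 1), labels)
print(loss)
```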
- Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505] (arXiv 2021-05-28)
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
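ISD is described only as a semantic distance between linguistic and visual relations. The sketch below is one assumed instantiation: it takes the "relations" to be intra-modal self-attention matrices and maps the visual one into token space through a cross-modal alignment matrix before comparison. The paper's exact definition of ISD and the IAIS training procedure may differ.

```python
import torch
import torch.nn.functional as F


def intra_modal_self_attention_distance(attn_txt, attn_img, cross_align):
    """Sketch of an ISD-style score between linguistic and visual relations.

    attn_txt: (T, T) self-attention over text tokens.
    attn_img: (R, R) self-attention over image regions.
    cross_align: (T, R) text-to-region alignment weights (rows sum to 1).
    The visual relation matrix is pulled into token space via the alignment,
    then compared to the textual one; smaller means more consistent relations.
    """
    img_in_txt_space = cross_align @ attn_img @ cross_align.t()   # (T, T)
    img_in_txt_space = img_in_txt_space / img_in_txt_space.sum(-1, keepdim=True).clamp_min(1e-8)
    return F.l1_loss(attn_txt, img_in_txt_space)


t, r = 12, 36
attn_txt = torch.softmax(torch.randn(t, t), dim=-1)
attn_img = torch.softmax(torch.randn(r, r), dim=-1)
cross = torch.softmax(torch.randn(t, r), dim=-1)
print(intra_modal_self_attention_distance(attn_txt, attn_img, cross))
```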