Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2109.06400v1
- Date: Tue, 14 Sep 2021 02:08:23 GMT
- Title: Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding
- Authors: Daizong Liu, Xiaoye Qu, Pan Zhou
- Abstract summary: We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
- Score: 53.377028000325424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key solution to temporal sentence grounding (TSG) exists in how to learn
effective alignment between vision and language features extracted from an
untrimmed video and a sentence description. Existing methods mainly leverage
vanilla soft attention to perform the alignment in a single-step process.
However, such single-step attention is insufficient in practice, since the
complicated inter- and intra-modality relations are usually captured only
through multi-step reasoning. In this paper, we propose an Iterative Alignment
Network (IA-Net) for the TSG task, which iteratively interacts inter- and
intra-modal features over multiple steps for more accurate grounding.
Specifically, during the iterative reasoning process, we pad multi-modal
features with learnable parameters to alleviate the nowhere-to-attend problem
of non-matched frame-word pairs, and enhance the basic co-attention mechanism
in a parallel manner. To further calibrate the misaligned attention caused by
each reasoning step, we also devise a calibration module following each
attention module to refine the alignment knowledge. With such iterative
alignment scheme, our IA-Net can robustly capture the fine-grained relations
between vision and language domains step-by-step for progressively reasoning
the temporal boundaries. Extensive experiments conducted on three challenging
benchmarks demonstrate that our proposed model performs better than the
state-of-the-arts.
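The core mechanism the abstract describes, padding the attended features with a learnable "nowhere-to-attend" slot so that non-matched frame-word pairs are not forced onto unrelated items, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the fixed `pad_key`/`pad_value` vectors below are hypothetical stand-ins for IA-Net's learnable padding parameters, and only a single attention pass (not the full iterative co-attention with calibration modules) is shown.

```python
import math

def dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend_with_pad(queries, keys, values, pad_key, pad_value):
    """One cross-modal attention pass with a 'nowhere-to-attend' slot.

    The pad key/value pair is appended to the keys/values, so a query with
    no good match among the real keys can place its attention mass on the
    pad slot instead of being forced to attend to an unrelated item. In
    IA-Net this pad would be a learnable parameter; here it is fixed.
    """
    keys_p = keys + [pad_key]
    values_p = values + [pad_value]
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores over real keys plus the pad slot.
        scores = [dot(q, k) / math.sqrt(dim) for k in keys_p]
        weights = softmax(scores)
        # Weighted sum of values (pad value absorbs the unmatched mass).
        out.append([sum(w * v[i] for w, v in zip(weights, values_p))
                    for i in range(len(values_p[0]))])
    return out
```

For example, a frame-feature query attending over two word features plus a zero pad slot receives an output dominated by the matching word, while part of its attention mass is absorbed by the pad rather than spread onto the non-matching word.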
Related papers
- Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that effectively utilizes semantically rich textual features from visual descriptions to complement the visual features. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement (PTR) module and a Point-based Multimodal Alignment (PMA) module.
arXiv Detail & Related papers (2026-02-01T14:35:46Z) - ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding [6.310226357092042]
Open-vocabulary grounding requires accurate vision-language alignment under weak supervision. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation.
arXiv Detail & Related papers (2026-01-30T07:38:04Z) - REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport [7.952582509792969]
Real-world instructional data often contains background segments, repeated actions, and steps presented out of order. We introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme.
arXiv Detail & Related papers (2025-09-29T07:32:14Z) - Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation [48.488114831677166]
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. We propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention.
arXiv Detail & Related papers (2025-08-27T20:47:03Z) - Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation [62.14692332209628]
"Interaction Distillation" is a novel training framework for more adequate preference modeling through attention-level optimization. It provides more stable and generalizable reward signals compared to state-of-the-art RM optimization methods.
arXiv Detail & Related papers (2025-08-04T17:06:23Z) - OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition [2.3093110834423616]
This paper presents OptiCorNet, a novel sequence modeling framework. It unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
arXiv Detail & Related papers (2025-07-19T04:29:43Z) - Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment.
Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - TS-TCD: Triplet-Level Cross-Modal Distillation for Time-Series Forecasting Using Large Language Models [15.266543423942617]
We present a novel framework, TS-TCD, which introduces a comprehensive three-tiered cross-modal knowledge distillation mechanism.
Unlike prior work that focuses on isolated alignment techniques, our framework systematically integrates all three tiers.
Experiments on benchmark time-series demonstrate that TS-TCD achieves state-of-the-art results, outperforming traditional methods in both accuracy and robustness.
arXiv Detail & Related papers (2024-09-23T12:57:24Z) - Introducing Gating and Context into Temporal Action Detection [0.8987776881291144]
Temporal Action Detection (TAD) remains challenging due to action overlaps and variable action durations.
Recent findings suggest that TAD performance is dependent on the structural design of transformers rather than on the self-attention mechanism.
We propose a refined feature extraction process through lightweight, yet effective operations.
arXiv Detail & Related papers (2024-09-06T11:52:42Z) - Temporally Grounding Instructional Diagrams in Unconstrained Videos [51.85805768507356]
We study the challenging problem of simultaneously localizing a sequence of queries in instructional diagrams in a video.
Most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries.
We propose composite queries constructed by exhaustively pairing up the visual content features of the step diagrams.
We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries.
arXiv Detail & Related papers (2024-07-16T05:44:30Z) - RESTORE: Towards Feature Shift for Vision-Language Prompt Learning [33.13407089704543]
We show that prompt tuning along only one branch of CLIP is the reason why the misalignment occurs.
Without proper regularization across the learnable parameters in different modalities, prompt learning violates the original pre-training constraints.
We propose RESTORE, a multi-modal prompt learning method that exerts explicit constraints on cross-modal consistency.
arXiv Detail & Related papers (2024-03-10T08:52:48Z) - Understanding and Constructing Latent Modality Structures in Multi-modal
Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z) - Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video
Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT)
To generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art approaches with clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
arXiv Detail & Related papers (2021-06-30T15:42:08Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.