Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
- URL: http://arxiv.org/abs/2505.12499v5
- Date: Thu, 23 Oct 2025 10:15:32 GMT
- Title: Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
- Authors: Jian Xiao, Zijie Song, Jialong Hu, Hao Cheng, Jia Li, Zhenzhen Hu, Richang Hong
- Abstract summary: Gap-Aware Retrieval framework introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness.
- Score: 48.85977777168096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $\Delta$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
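Below is a minimal, hypothetical sketch of how a pair-specific increment and a relaxed variational bottleneck could be wired into a standard InfoNCE objective for text-video retrieval. It is not the paper's implementation: the module, function, and hyperparameter names (GapConditionedIncrement, gap_aware_infonce, tau, beta) are invented for illustration, and the trust-region / Taylor-expansion derivation that determines the increment in GARE is not reproduced; see the linked repository for the actual method.

```python
# Illustrative sketch only; all names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GapConditionedIncrement(nn.Module):
    """Predicts a pair-specific increment delta_ij from the text-video gap (v_j - t_i)."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 2 * dim))

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # text: (B, D), video: (B, D) -> pairwise gaps: (B, B, D), gap[i, j] = v_j - t_i
        gap = video.unsqueeze(0) - text.unsqueeze(1)
        mu, logvar = self.net(gap).chunk(2, dim=-1)               # variational parameters for delta
        delta = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised sample
        # Relaxed information-bottleneck penalty: KL(q(delta | gap) || N(0, I))
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).mean()
        return delta, kl

def gap_aware_infonce(text, video, increment_module, tau=0.05, beta=1e-3):
    """Symmetric InfoNCE over cosine similarities, with the text anchor shifted by delta_ij."""
    text = F.normalize(text, dim=-1)
    video = F.normalize(video, dim=-1)
    delta, kl = increment_module(text, video)
    shifted = F.normalize(text.unsqueeze(1) + delta, dim=-1)      # (B, B, D) shifted anchors
    logits = (shifted * video.unsqueeze(0)).sum(-1) / tau         # (B, B) similarity matrix
    labels = torch.arange(text.size(0), device=text.device)
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    return loss + beta * kl
```

The design choice illustrated here is that the increment is conditioned on the text-video gap and sampled via the reparameterisation trick, so the KL term acts as a relaxed compression bottleneck that keeps the increments small and semantically consistent while they absorb gradient pressure that would otherwise fall entirely on the text anchor.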
Related papers
- DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition [51.80782323686666]
We introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art.
arXiv Detail & Related papers (2025-12-12T10:39:10Z) - Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition [41.77490816513839]
We propose a novel method for zero-shot skeleton action recognition, termed Flora. Specifically, we attune textual semantics by incorporating direction-aware regional semantics and a cross-modal consistency objective. Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
arXiv Detail & Related papers (2025-11-12T14:54:53Z) - Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment [19.204800655283744]
Pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) is central to web-scale applications such as search, recommendation, and knowledge discovery. Existing CLIP-style graph-text alignment approaches face two key limitations: they assume strict one-to-one correspondences between nodes and texts, and they rely on static alignment objectives that cannot adapt to varying data quality, making them brittle under noisy supervision. We propose ADAligner, a quality-aware graph-text alignment framework that dynamically adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality.
arXiv Detail & Related papers (2025-10-22T09:01:17Z) - TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs [7.125400292079228]
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus.
arXiv Detail & Related papers (2025-08-04T05:03:35Z) - GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval [12.483734449829235]
GAID is a framework that integrates audio and visual features under textual guidance. DASP injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. Experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results with notable efficiency gains.
arXiv Detail & Related papers (2025-08-03T10:44:24Z) - Global Variational Inference Enhanced Robust Domain Adaptation [7.414646586981638]
We propose a framework that learns continuous, class-conditional global priors via variational inference to enable structure-aware cross-domain alignment. GVI-DA minimizes domain gaps through latent feature reconstruction, and mitigates posterior collapse using global codebook learning with randomized sampling. It further improves robustness by discarding low-confidence pseudo-labels and generating reliable target-domain samples.
arXiv Detail & Related papers (2025-07-04T04:43:23Z) - Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration [8.192590936983347]
Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding. However, they are frequently hampered by hallucination, the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations. This paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to align text generation with visual evidence at inference time.
arXiv Detail & Related papers (2025-06-26T17:35:40Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment [3.2873782624127834]
We present a self-supervised method for representation learning based on aligning temporal video sequences. We introduce the novel Local-Alignment Contrastive (LAC) loss, which incorporates a differentiable local alignment loss to capture local temporal dependencies. We show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
arXiv Detail & Related papers (2024-09-06T20:32:53Z) - GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval [60.70901959953688]
We present GMMFormer v2, an uncertainty-aware framework for PRVR.
For clip modeling, we improve a strong baseline GMMFormer with a novel temporal consolidation module.
We propose a novel optimal matching loss for fine-grained text-clip alignment.
arXiv Detail & Related papers (2024-05-22T16:55:31Z) - Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation [11.195959019678314]
Consistency learning is a central strategy to tackle unlabeled data in semi-supervised medical image segmentation.
In this paper, we propose an Adaptive Bidirectional Displacement approach to solve the above challenge.
arXiv Detail & Related papers (2024-05-01T08:17:43Z) - Align, Minimize and Diversify: A Source-Free Unsupervised Domain Adaptation Method for Handwritten Text Recognition [11.080302144256164]
The Align, Minimize and Diversify (AMD) method is a Source-Free Unsupervised Domain Adaptation approach for Handwritten Text Recognition (HTR).
Our method explicitly eliminates the need to revisit the source data during adaptation by incorporating three distinct regularization terms.
Experimental results from several benchmarks demonstrated the effectiveness and robustness of AMD, showing it to be competitive and often outperforming DA methods in HTR.
arXiv Detail & Related papers (2024-04-28T17:50:58Z) - GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z) - Bi-level Feature Alignment for Versatile Image Translation and
Manipulation [88.5915443957795]
Generative adversarial networks (GANs) have achieved great success in image translation and manipulation.
High-fidelity image generation with faithful style control remains a grand challenge in computer vision.
This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance.
arXiv Detail & Related papers (2021-07-07T05:26:29Z) - Boosting Continuous Sign Language Recognition via Cross Modality
Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z) - Preventing Posterior Collapse with Levenshtein Variational Autoencoder [61.30283661804425]
We propose to replace the evidence lower bound (ELBO) with a new objective which is simple to optimize and prevents posterior collapse.
We show that Levenshtein VAE produces more informative latent representations than alternative approaches to preventing posterior collapse.
arXiv Detail & Related papers (2020-04-30T13:27:26Z)