Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
- URL: http://arxiv.org/abs/2505.12499v5
- Date: Thu, 23 Oct 2025 10:15:32 GMT
- Title: Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
- Authors: Jian Xiao, Zijie Song, Jialong Hu, Hao Cheng, Jia Li, Zhenzhen Hu, Richang Hong
- Abstract summary: Gap-Aware Retrieval framework introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness.
- Score: 48.85977777168096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $\Delta$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
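Below is a minimal, hypothetical sketch of how a pair-specific increment and a relaxed variational bottleneck could be wired into a standard InfoNCE objective for text-video retrieval. It is not the paper's implementation: the module, function, and hyperparameter names (GapConditionedIncrement, gap_aware_infonce, tau, beta) are invented for illustration, and the trust-region / Taylor-expansion derivation that determines the increment in GARE is not reproduced; see the linked repository for the actual method.

```python
# Illustrative sketch only; all names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GapConditionedIncrement(nn.Module):
    """Predicts a pair-specific increment delta_ij from the text-video gap (v_j - t_i)."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 2 * dim))

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # text: (B, D), video: (B, D) -> pairwise gaps: (B, B, D), gap[i, j] = v_j - t_i
        gap = video.unsqueeze(0) - text.unsqueeze(1)
        mu, logvar = self.net(gap).chunk(2, dim=-1)               # variational parameters for delta
        delta = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised sample
        # Relaxed information-bottleneck penalty: KL(q(delta | gap) || N(0, I))
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).mean()
        return delta, kl

def gap_aware_infonce(text, video, increment_module, tau=0.05, beta=1e-3):
    """Symmetric InfoNCE over cosine similarities, with the text anchor shifted by delta_ij."""
    text = F.normalize(text, dim=-1)
    video = F.normalize(video, dim=-1)
    delta, kl = increment_module(text, video)
    shifted = F.normalize(text.unsqueeze(1) + delta, dim=-1)      # (B, B, D) shifted anchors
    logits = (shifted * video.unsqueeze(0)).sum(-1) / tau         # (B, B) similarity matrix
    labels = torch.arange(text.size(0), device=text.device)
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    return loss + beta * kl
```

The design choice illustrated here is that the increment is conditioned on the text-video gap and sampled via the reparameterisation trick, so the KL term acts as a relaxed compression bottleneck that keeps the increments small and semantically consistent while they absorb gradient pressure that would otherwise fall entirely on the text anchor.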
Related papers
- DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition [51.80782323686666]
We introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art.
arXiv Detail & Related papers (2025-12-12T10:39:10Z) - Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition [41.77490816513839]
We propose a novel method for zero-shot skeleton action recognition, termed Flora. Specifically, we attune textual semantics by incorporating direction-aware regional semantics and a cross-modal consistency objective. Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
arXiv Detail & Related papers (2025-11-12T14:54:53Z) - Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment [19.204800655283744]
Pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) is central to web-scale applications such as search, recommendation, and knowledge discovery. Existing CLIP-style graph-text alignment approaches face two key limitations: they assume strict one-to-one correspondences between nodes and texts, and they rely on static alignment objectives that cannot adapt to varying data quality, making them brittle under noisy supervision. We propose ADAligner, a quality-aware graph-text alignment framework that dynamically adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality.
arXiv Detail & Related papers (2025-10-22T09:01:17Z) - TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs [7.125400292079228]
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus.
arXiv Detail & Related papers (2025-08-04T05:03:35Z) - GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval [12.483734449829235]
GAID is a framework that integrates audio and visual features under textual guidance. DASP injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. Experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results with notable efficiency gains.
arXiv Detail & Related papers (2025-08-03T10:44:24Z) - Global Variational Inference Enhanced Robust Domain Adaptation [7.414646586981638]
We propose a framework that learns continuous, class-conditional global priors via variational inference to enable structure-aware cross-domain alignment. GVI-DA minimizes domain gaps through latent feature reconstruction, and mitigates posterior collapse using global codebook learning with randomized sampling. It further improves robustness by discarding low-confidence pseudo-labels and generating reliable target-domain samples.
arXiv Detail & Related papers (2025-07-04T04:43:23Z) - Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration [8.192590936983347]
Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding. However, they are frequently hampered by hallucination, the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations. This paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to align text generation with visual evidence at inference time.
arXiv Detail & Related papers (2025-06-26T17:35:40Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment [3.2873782624127834]
We present a self-supervised method for representation learning based on aligning temporal video sequences. We introduce the novel Local-Alignment Contrastive (LAC) loss, which incorporates a differentiable local alignment loss to capture local temporal dependencies. We show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
arXiv Detail & Related papers (2024-09-06T20:32:53Z) - GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval [60.70901959953688]
We present GMMFormer v2, an uncertainty-aware framework for PRVR.
For clip modeling, we improve a strong baseline GMMFormer with a novel temporal consolidation module.
We propose a novel optimal matching loss for fine-grained text-clip alignment.
arXiv Detail & Related papers (2024-05-22T16:55:31Z) - Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation [11.195959019678314]
Consistency learning is a central strategy to tackle unlabeled data in semi-supervised medical image segmentation.
In this paper, we propose an Adaptive Bidirectional Displacement approach to solve the above challenge.
arXiv Detail & Related papers (2024-05-01T08:17:43Z) - Align, Minimize and Diversify: A Source-Free Unsupervised Domain Adaptation Method for Handwritten Text Recognition [11.080302144256164]
The Align, Minimize and Diversify (AMD) method is a Source-Free Unsupervised Domain Adaptation approach for Handwritten Text Recognition (HTR).
Our method explicitly eliminates the need to revisit the source data during adaptation by incorporating three distinct regularization terms.
Experimental results from several benchmarks demonstrated the effectiveness and robustness of AMD, showing it to be competitive and often outperforming DA methods in HTR.
arXiv Detail & Related papers (2024-04-28T17:50:58Z) - GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z) - Bi-level Feature Alignment for Versatile Image Translation and
Manipulation [88.5915443957795]
Generative adversarial networks (GANs) have achieved great success in image translation and manipulation.
High-fidelity image generation with faithful style control remains a grand challenge in computer vision.
This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance.
arXiv Detail & Related papers (2021-07-07T05:26:29Z) - Boosting Continuous Sign Language Recognition via Cross Modality
Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z) - Preventing Posterior Collapse with Levenshtein Variational Autoencoder [61.30283661804425]
We propose to replace the evidence lower bound (ELBO) with a new objective which is simple to optimize and prevents posterior collapse.
We show that Levenshtein VAE produces more informative latent representations than alternative approaches to preventing posterior collapse.
arXiv Detail & Related papers (2020-04-30T13:27:26Z)