Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization
- URL: http://arxiv.org/abs/2602.03570v1
- Date: Tue, 03 Feb 2026 14:14:03 GMT
- Title: Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization
- Authors: Bixing Wu, Yuhong Zhao, Zongli Ye, Jiachen Lian, Xiangyu Yue, Gopala Anumanchipalli
- Abstract summary: We propose Asymmetric Hierarchical Anchoring (AHA) to enforce directional information allocation. We replace fragile mutual information estimators with a GRL-based adversarial decoupler that explicitly suppresses semantic leakage. AHA consistently outperforms symmetric baselines in cross-modal transfer.
- Score: 19.721857318111734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic-specific leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with a GRL-based adversarial decoupler that explicitly suppresses semantic leakage in modality-specific branches, and introduce Local Sliding Alignment (LSA) to encourage fine-grained temporal alignment across modalities. Extensive experiments on AVE and AVVP benchmarks demonstrate that AHA consistently outperforms symmetric baselines in cross-modal transfer. Additional analyses on a talking-face disentanglement experiment further validate that the learned representations exhibit improved semantic consistency and disentanglement, indicating the broader applicability of the proposed framework.
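The abstract's anchoring mechanism builds on Residual Vector Quantization, which discretizes a feature coarse-to-fine: each codebook level quantizes the residual left by the previous level. Below is a minimal pure-Python sketch of this hierarchy; the codebooks, dimensions, and values are illustrative toys, not the paper's learned audio codebooks.

```python
# Minimal residual vector quantization (RVQ) sketch.
# Codebooks and vectors are illustrative, not the paper's.

def nearest(codebook, vec):
    """Return (index, code) of the codebook entry closest to vec (squared L2)."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((v - c) ** 2 for v, c in zip(vec, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, codebook[best_i]

def rvq_encode(codebooks, vec):
    """Quantize vec with a stack of codebooks; each level quantizes the
    residual left by the previous one, yielding coarse-to-fine tokens."""
    residual = list(vec)
    tokens = []
    for cb in codebooks:
        idx, code = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return tokens, residual

def rvq_decode(codebooks, tokens):
    """Reconstruct by summing the selected code at every level."""
    out = [0.0] * len(codebooks[0][0])
    for cb, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Two toy 2-entry codebooks over 2-D vectors: coarse level, then fine.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],   # level 1: coarse semantics
    [[0.1, 0.0], [0.0, 0.1]],   # level 2: fine residual detail
]
tokens, _ = rvq_encode(codebooks, [1.1, 0.05])
print(tokens)                         # hierarchical discrete code, e.g. [0, 0]
print(rvq_decode(codebooks, tokens))  # approximate reconstruction of the input
```

The coarse level's tokens are the natural candidate for a shared semantic anchor, since later levels only refine modality-specific residual detail.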
Related papers
- AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models [21.682989096955467]
AG-VAS (Anchor-Guided Visual Anomaly Segmentation) is a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens. AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
arXiv Detail & Related papers (2026-03-01T22:25:23Z) - SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models [73.19044613922911]
Large vision-language models (VLMs) are vulnerable to transfer-based adversarial perturbations. We propose SGHA-Attack, a framework that adopts multiple target references and enforces intermediate-layer consistency. Experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods.
arXiv Detail & Related papers (2026-02-02T03:10:41Z) - Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition [19.200074425090595]
We propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. Built upon fixed, pre-trained backbones, the approach achieves strong zero-shot generalization without parameter updates.
arXiv Detail & Related papers (2026-01-31T18:12:29Z) - Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector [14.027059904924135]
We introduce a representation alignment projector that injects projector-predicted representations into intermediate sampling steps. Experiments on SiTs and REPAs show notable improvements in class-conditional ImageNet synthesis. The proposed method outperforms representative guidance methods when applied to SiT models.
arXiv Detail & Related papers (2026-01-30T02:29:54Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal token interactions.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches. We introduce multi-granularity implicit text (MIT), spanning video, segment, and frame levels, as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z) - Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application [11.385703484113552]
We propose a novel semantic communication framework empowered by generative artificial intelligence (GAI). A latent diffusion model (LDM)-based semantic communication framework is proposed that uses a variational autoencoder for semantic feature extraction. The proposed system is a training-free framework that supports zero-shot generalization and achieves superior performance under low-SNR and out-of-distribution conditions.
arXiv Detail & Related papers (2025-06-06T03:20:32Z) - Sparsification and Reconstruction from the Perspective of Representation Geometry [10.834177456685538]
Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability. This study explains the principles of sparsity from the perspective of representational geometry. It specifically emphasizes the necessity of understanding representations and incorporating representational constraints.
arXiv Detail & Related papers (2025-05-28T15:54:33Z) - Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the retrieved evidence, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We also introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - Prompt-based Logical Semantics Enhancement for Implicit Discourse Relation Recognition [4.7938839332508945]
We propose a Prompt-based Logical Semantics Enhancement (PLSE) method for Implicit Discourse Relation Recognition (IDRR).
Our method seamlessly injects knowledge relevant to discourse relation into pre-trained language models through prompt-based connective prediction.
Experimental results on PDTB 2.0 and CoNLL16 datasets demonstrate that our method achieves outstanding and consistent performance against the current state-of-the-art models.
arXiv Detail & Related papers (2023-11-01T08:38:08Z) - Learning Aligned Cross-Modal Representation for Generalized Zero-Shot Classification [17.177622259867515]
We propose an innovative autoencoder network that learns Aligned Cross-Modal Representations (dubbed ACMR) for Generalized Zero-Shot Classification (GZSC).
Specifically, we propose a novel Vision-Semantic Alignment (VSA) method to strengthen the alignment of cross-modal latent features on the latent subspaces guided by a learned classifier.
In addition, we propose a novel Information Enhancement Module (IEM) to reduce the risk of latent variable collapse while encouraging the discriminative ability of the latent variables.
arXiv Detail & Related papers (2021-12-24T03:35:37Z) - HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones.
We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains.
Experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z) - Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) learns classification models that make predictions for unlabeled data in a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z) - Nonlinear ISA with Auxiliary Variables for Learning Speech Representations [51.9516685516144]
We introduce a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.
We propose an algorithm that learns unsupervised speech representations whose subspaces are independent.
arXiv Detail & Related papers (2020-07-25T14:53:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.