Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
- URL: http://arxiv.org/abs/2601.11995v1
- Date: Sat, 17 Jan 2026 10:13:07 GMT
- Title: Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
- Authors: Donghuo Zeng, Hao Niu, Yanan Wang, Masato Taya,
- Abstract summary: Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity.<n>We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues.
- Score: 4.563533024286367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences - background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visual, because "motorcycle" is not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" -> "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): A student network is trained with both metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on AVE and VEGAS benchmarks show consistent improvements in mean average precision (mAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
Related papers
- Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation [51.67328507400985]
In noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators.<n>This paper focuses on learning from partial labels annotated by pre-trained vision-language models.<n>It proposes an innovative collaborative consistency regularization (Co-Reg) method.
arXiv Detail & Related papers (2025-06-03T12:48:54Z) - Weakly-Supervised Contrastive Learning for Imprecise Class Labels [50.57424331797865]
We introduce the concept of continuous semantic similarity'' to define positive and negative pairs.<n>We propose a graph-theoretic framework for weakly-supervised contrastive learning.<n>Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios.
arXiv Detail & Related papers (2025-05-28T06:50:40Z) - Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning [2.6790916541909437]
Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified.<n>We propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation.<n>We distill audio-visual distribution-based knowledge from annotated labels in a subset of each batch.
arXiv Detail & Related papers (2025-01-16T15:32:41Z) - FECANet: Boosting Few-Shot Semantic Segmentation with Feature-Enhanced
Context-Aware Network [48.912196729711624]
Few-shot semantic segmentation is the task of learning to locate each pixel of a novel class in a query image with only a few annotated support images.
We propose a Feature-Enhanced Context-Aware Network (FECANet) to suppress the matching noise caused by inter-class local similarity.
In addition, we propose a novel correlation reconstruction module that encodes extra correspondence relations between foreground and background and multi-scale context semantic features.
arXiv Detail & Related papers (2023-01-19T16:31:13Z) - Joint Class-Affinity Loss Correction for Robust Medical Image
Segmentation with Noisy Labels [22.721870430220598]
noisy labels prevent medical image segmentation algorithms from learning precise semantic correlations.
We present a novel perspective for noisy mitigation by incorporating both pixel-wise and pair-wise manners.
We propose a robust Joint Class-Affinity (JCAS) framework to combat label noise issues in medical image segmentation.
arXiv Detail & Related papers (2022-06-16T08:19:33Z) - Modelling Neighbor Relation in Joint Space-Time Graph for Video
Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z) - Noise-robust Graph Learning by Estimating and Leveraging Pairwise
Interactions [123.07967420310796]
This paper bridges the gap by proposing a pairwise framework for noisy node classification on graphs.
PI-GNN relies on the PI as a primary learning proxy in addition to the pointwise learning from the noisy node class labels.
Our proposed framework PI-GNN contributes two novel components: (1) a confidence-aware PI estimation model that adaptively estimates the PI labels, and (2) a decoupled training approach that leverages the estimated PI labels.
arXiv Detail & Related papers (2021-06-14T14:23:08Z) - Unified Robust Training for Graph NeuralNetworks against Label Noise [12.014301020294154]
We propose a new framework, UnionNET, for learning with noisy labels on graphs under a semi-supervised setting.
Our approach provides a unified solution for robustly training GNNs and performing label correction simultaneously.
arXiv Detail & Related papers (2021-03-05T01:17:04Z) - Knowledge-Guided Multi-Label Few-Shot Learning for General Image
Recognition [75.44233392355711]
KGGR framework exploits prior knowledge of statistical label correlations with deep neural networks.
It first builds a structured knowledge graph to correlate different labels based on statistical label co-occurrence.
Then, it introduces the label semantics to guide learning semantic-specific features.
It exploits a graph propagation network to explore graph node interactions.
arXiv Detail & Related papers (2020-09-20T15:05:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.