SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification
- URL: http://arxiv.org/abs/2506.17694v1
- Date: Sat, 21 Jun 2025 12:02:53 GMT
- Title: SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification
- Authors: Gnana Praveen Rajasekhar, Jahangir Alam
- Abstract summary: We propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling. We employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs. Our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.
- Score: 3.380873355096444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which are computationally expensive and limit their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audiovisual feature representations. In particular, we employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audiovisual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.
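To make the abstract's core idea concrete, here is a minimal PyTorch sketch of a single shared vision-transformer backbone that encodes both face crops and audio spectrograms (treated as single-channel images), trained with a contrastive objective in which one branch is asymmetrically masked. All module names, patch sizes, masking ratios, and dimensions are illustrative assumptions based only on the abstract, not the authors' released implementation; the masked-data-modeling (reconstruction) branch mentioned in the abstract is omitted for brevity.

```python
# Minimal sketch, assuming a shared ViT backbone with modality-specific patch
# embeddings and a symmetric InfoNCE loss; hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedViTBackbone(nn.Module):
    """One transformer encoder reused for audio and visual patch sequences."""

    def __init__(self, dim=256, depth=4, heads=4, patch=16, img_ch=3, spec_ch=1):
        super().__init__()
        # Modality-specific patch embeddings project inputs into a common token space.
        self.visual_patch = nn.Conv2d(img_ch, dim, kernel_size=patch, stride=patch)
        self.audio_patch = nn.Conv2d(spec_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # weights shared across modalities
        self.proj = nn.Linear(dim, dim)

    def encode(self, x, patch_embed, mask_ratio):
        tokens = patch_embed(x).flatten(2).transpose(1, 2)      # (B, N, dim)
        if mask_ratio > 0:                                      # asymmetric masking:
            keep = int(tokens.size(1) * (1 - mask_ratio))       # one branch sees far
            idx = torch.randperm(tokens.size(1))[:keep]         # fewer tokens than the other
            tokens = tokens[:, idx]
        emb = self.encoder(tokens).mean(dim=1)                  # pooled speaker embedding
        return F.normalize(self.proj(emb), dim=-1)


def contrastive_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE between audio and visual embeddings of the same speakers."""
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = SharedViTBackbone()
    faces = torch.randn(8, 3, 112, 112)   # face crops
    specs = torch.randn(8, 1, 128, 128)   # log-mel spectrograms on a square grid
    v = model.encode(faces, model.visual_patch, mask_ratio=0.75)  # heavily masked branch
    a = model.encode(specs, model.audio_patch, mask_ratio=0.0)    # unmasked branch
    loss = contrastive_loss(a, v)
    loss.backward()
    print(float(loss))
```

Because the encoder weights are shared, the same model can, in principle, score audio-only, visual-only, or audiovisual trials at test time, which is the robustness-to-missing-modalities property the abstract claims.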
Related papers
- Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models [13.63552417613795]
We propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models.
Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations.
arXiv Detail & Related papers (2025-06-06T21:06:35Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audiovisual learning (SCAV), which contrasts examples based on their non-aggregated representation space.
Experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a significant degree of flexibility regarding the metric employed for retrieval.
arXiv Detail & Related papers (2024-07-08T09:45:20Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation [32.68710772281511]
We present a self-supervised framework for audio-visual representation learning that localizes the sound source in videos.
Our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound.
This reveals that the proposed framework learns strong multi-modal representations that benefit sound localisation and generalize to further applications.
arXiv Detail & Related papers (2022-06-26T03:00:02Z) - Learning Audio-Visual Correlations from Variational Cross-Modal Generation [35.07257471319274]
We learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner.
The learned correlations can be readily applied in multiple downstream tasks such as audio-visual cross-modal localization and retrieval.
arXiv Detail & Related papers (2021-02-05T21:27:00Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z) - Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and based on it we achieve comparable performance in sound separation without relying on external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.