FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding
- URL: http://arxiv.org/abs/2504.09516v1
- Date: Sun, 13 Apr 2025 11:04:43 GMT
- Title: FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding
- Authors: Yasar Abbas Ur Rehman, Kin Wai Lau, Yuyang Xie, Ma Lan, JiaJun Shen
- Abstract summary: We introduce FSSUAVL, a single deep model pretrained in Federated Learning. Instead of aligning the audio and image modalities, FSSUAVL jointly discriminates them by projecting them into a common embedding space. Our experiments with CNN and ViT demonstrate that FSSUAVL significantly improves performance across various image- and audio-based downstream tasks.
- Score: 10.897206681590541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have demonstrated that vision models can effectively learn multimodal audio-image representations when the modalities are paired. However, the challenge of enabling deep models to learn representations from unpaired modalities remains unresolved. This issue is especially pertinent in scenarios like Federated Learning (FL), where data is often decentralized, heterogeneous, and lacks a reliable guarantee of paired data. Previous attempts tackled this issue through the use of auxiliary pretrained encoders or generative models on local clients, which invariably raises the computational cost as the number of modalities increases. Unlike these approaches, in this paper we address the task of unpaired audio and image recognition using FSSUAVL, a single deep model pretrained in FL with self-supervised contrastive learning (SSL). Instead of aligning the audio and image modalities, FSSUAVL jointly discriminates them by projecting them into a common embedding space using contrastive SSL. This extends the utility of FSSUAVL to both paired and unpaired audio and image recognition tasks. Our experiments with CNN and ViT demonstrate that FSSUAVL significantly improves performance across various image- and audio-based downstream tasks compared to using separate deep models for each modality. Additionally, FSSUAVL's capacity to learn multimodal feature representations allows for integrating auxiliary information, if available, to enhance recognition accuracy.
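To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how a single vision backbone can jointly discriminate unpaired images and audio in one embedding space with an instance-discrimination contrastive loss. The ResNet backbone, the NT-Xent loss, and treating audio as 3-channel spectrogram "images" are illustrative assumptions; the federated orchestration (client updates, server aggregation) is omitted.

```python
# Hypothetical sketch of joint contrastive discrimination of unpaired audio and
# images in a shared embedding space; names and shapes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SharedModalityEncoder(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # one vision backbone reused for both modalities
        self.backbone = backbone
        self.proj = nn.Sequential(             # projection head into the common embedding space
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, x):                      # x: images or 3-channel audio spectrograms
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def nt_xent(z1, z2, temperature: float = 0.1):
    """SimCLR-style loss over two augmented views of a mixed audio+image batch."""
    z = torch.cat([z1, z2], dim=0)                       # (2N, D)
    sim = z @ z.t() / temperature
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage sketch: on each client, images and audio spectrograms are simply
# concatenated into one batch, so the loss discriminates instances across
# modalities instead of aligning audio-image pairs.
encoder = SharedModalityEncoder()
imgs_v1, imgs_v2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
spec_v1, spec_v2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
loss = nt_xent(encoder(torch.cat([imgs_v1, spec_v1])),
               encoder(torch.cat([imgs_v2, spec_v2])))
```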
Related papers
- Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion [12.212623921747264]
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems.
We propose kNN-based Latent Space Broadening (LSB) to enhance active learning (AL) efficiency, and Vision-Language Modeling with Audio Enhancement (VLMAE).
This system has been deployed in production, leading to significant business gains.
arXiv Detail & Related papers (2025-03-21T21:55:05Z) - Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audiovisual learning (SCAV), which contrasts examples based on their non-aggregated representation space.
Experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a significant degree of flexibility regarding the metric employed for retrieval.
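A hedged sketch of what a sequential (non-aggregated) contrastive objective could look like is given below: audio and visual clips are compared as whole feature sequences, with no temporal pooling, and an InfoNCE loss is applied over the resulting clip-to-clip scores. The mean frame-wise cosine similarity used as the sequence metric is an assumption for illustration, not necessarily the distance SCAV employs.

```python
# Illustrative sequential contrastive loss over non-aggregated (per-frame) features.
import torch
import torch.nn.functional as F

def sequence_similarity(a, v):
    """a, v: (B, T, D) L2-normalized frame features; returns (B, B) clip similarities."""
    # average frame-wise cosine similarity between every audio/visual clip pair
    return torch.einsum('atd,btd->ab', a, v) / a.size(1)

def sequential_contrastive_loss(audio_seq, video_seq, temperature: float = 0.07):
    a = F.normalize(audio_seq, dim=-1)
    v = F.normalize(video_seq, dim=-1)
    logits = sequence_similarity(a, v) / temperature       # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric audio->video and video->audio InfoNCE terms
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = sequential_contrastive_loss(torch.randn(4, 16, 256), torch.randn(4, 16, 256))
```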
arXiv Detail & Related papers (2024-07-08T09:45:20Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Semantic Cross Attention for Few-shot Learning [9.529264466445236]
We propose a multi-task learning approach that treats learning from the semantic features of label text as an auxiliary task.
Our proposed model uses word-embedding representations as semantic features to help train the embedding network, and a semantic cross-attention module to bridge the semantic features into the typical visual modality.
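Below is a small illustrative sketch of the kind of semantic cross-attention block described above, in which visual tokens attend to projected word-embedding features of the label text. The module layout, dimensions, and the use of nn.MultiheadAttention are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical semantic cross-attention: visual tokens query word-embedding features.
import torch
import torch.nn as nn

class SemanticCrossAttention(nn.Module):
    def __init__(self, visual_dim: int = 640, word_dim: int = 300, num_heads: int = 4):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, visual_dim)   # map word embeddings into the visual space
        self.attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(visual_dim)

    def forward(self, visual_tokens, word_embeddings):
        # visual_tokens: (B, N, visual_dim); word_embeddings: (B, L, word_dim)
        sem = self.word_proj(word_embeddings)
        attended, _ = self.attn(query=visual_tokens, key=sem, value=sem)
        return self.norm(visual_tokens + attended)          # residual fusion of semantics into vision

module = SemanticCrossAttention()
out = module(torch.randn(2, 25, 640), torch.randn(2, 8, 300))   # -> (2, 25, 640)
```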
arXiv Detail & Related papers (2022-10-12T15:24:59Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
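The sketch below illustrates the general recipe this line of work points to: condition a denoising network on an encoder of the clean input, so that lowering the denoising loss forces the encoder to pack useful information into its code. The simple corruption process and epsilon-prediction loss shown are simplified assumptions, not the paper's exact objective.

```python
# Hypothetical sketch: an encoder of the clean sample conditions a denoiser,
# turning denoising score matching into a representation-learning signal.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, dim: int = 784, code_dim: int = 32, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, code_dim))       # representation of clean x0
        self.denoiser = nn.Sequential(nn.Linear(dim + code_dim + 1, hidden), nn.ReLU(),
                                      nn.Linear(hidden, dim))           # predicts the added noise

    def loss(self, x0):
        t = torch.rand(x0.size(0), 1)                        # random noise level in (0, 1)
        eps = torch.randn_like(x0)
        x_t = torch.sqrt(1 - t) * x0 + torch.sqrt(t) * eps   # simple variance-mixing corruption
        z = self.encoder(x0)                                  # latent code from the clean sample
        eps_hat = self.denoiser(torch.cat([x_t, z, t], dim=-1))
        return ((eps_hat - eps) ** 2).mean()                  # epsilon-form denoising objective

model = ConditionalDenoiser()
loss = model.loss(torch.randn(16, 784))
```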
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.