Accommodating Audio Modality in CLIP for Multimodal Processing
- URL: http://arxiv.org/abs/2303.06591v1
- Date: Sun, 12 Mar 2023 06:57:01 GMT
- Title: Accommodating Audio Modality in CLIP for Multimodal Processing
- Authors: Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
- Abstract summary: We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
- Score: 48.83906067348211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal processing has attracted much attention lately, especially with the
success of pre-training. However, the exploration has mainly focused on
vision-language pre-training, as introducing more modalities can greatly
complicate model design and optimization. In this paper, we extend the
state-of-the-art Vision-Language model CLIP to accommodate the audio modality
for Vision-Language-Audio multimodal processing. Specifically, we apply
inter-modal and intra-modal contrastive learning to explore the correlation
between audio and other modalities in addition to the inner characteristics of
the audio modality. Moreover, we design an audio type token to
dynamically learn different types of audio information for different scenarios, as
both verbal and nonverbal heterogeneous information is conveyed in general
audio. Our proposed CLIP4VLA model is validated on different downstream tasks
including video retrieval and video captioning, and achieves
state-of-the-art performance on the benchmark datasets MSR-VTT, VATEX, and
AudioCaps.
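
The abstract names inter-modal and intra-modal contrastive learning without giving the loss. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE loss of the kind used to train CLIP; it is not the CLIP4VLA implementation. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for this example. Under this reading, the inter-modal case would pair audio embeddings with text or video embeddings, and the intra-modal case would pair two views of the same audio clip.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match positives[i].

    Every other positives[j] in the batch acts as a negative for anchors[i].
    The temperature value is an assumed default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE terms."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


# Hypothetical toy embeddings: row i of each list describes the same clip.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]

inter_modal = symmetric_contrastive_loss(audio_emb, text_emb)
mismatched = symmetric_contrastive_loss(audio_emb, text_emb[::-1])
```

With these toy values the loss for correctly paired rows is much smaller than the loss for the reversed pairing, which is the behavior a contrastive objective rewards.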
 
      
Related papers
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
 Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.
These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.
We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
 arXiv  Detail & Related papers  (2025-05-26T16:08:41Z)
- Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup [2.80888070977859]
 We propose audio-visual SSL for video action recognition, which uses the visual and audio modalities together.
In experiments on the UCF-51, Kinetics-400, and VGGSound datasets, the proposed framework shows superior performance.
 arXiv  Detail & Related papers  (2025-03-04T05:13:56Z)
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
 We introduce a novel video captioning method trained with multi-modal contrastive loss.
Our approach is designed to capture the dependency between these modalities, resulting in more accurate, thus pertinent captions.
 arXiv  Detail & Related papers  (2024-11-11T11:12:23Z)
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
 Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
 arXiv  Detail & Related papers  (2024-09-12T11:54:25Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
 We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
 arXiv  Detail & Related papers  (2023-09-20T17:59:32Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
 We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
 arXiv  Detail & Related papers  (2023-06-21T20:54:52Z)
- A multimodal dynamical variational autoencoder for audiovisual speech representation learning [23.748108659645844]
 We present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning.
Experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition.
 arXiv  Detail & Related papers  (2023-05-05T14:37:26Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
 We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
 arXiv  Detail & Related papers  (2023-04-17T15:08:15Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
 We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
 arXiv  Detail & Related papers  (2022-03-06T17:31:06Z)
- Leveraging Uni-Modal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition [23.239078852797817]
 We leverage uni-modal self-supervised learning to promote multimodal audio-visual speech recognition (AVSR).
In particular, we first train audio and visual encoders on a large-scale uni-modal dataset, then we integrate components of both encoders into a larger multimodal framework.
Our model is experimentally validated on both word-level and sentence-level AVSR tasks.
 arXiv  Detail & Related papers  (2022-02-24T15:12:17Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
 We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
 arXiv  Detail & Related papers  (2021-11-05T14:35:08Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
 We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
 arXiv  Detail & Related papers  (2021-04-22T09:31:20Z)
- Curriculum Audiovisual Learning [113.20920928789867]
 We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and on that basis we show comparable performance in sound separation without referring to external visual supervision.
 arXiv  Detail & Related papers  (2020-01-26T07:08:47Z)