HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot
Classification with Unimodal Cues
- URL: http://arxiv.org/abs/2309.13470v1
- Date: Sat, 23 Sep 2023 20:05:00 GMT
- Title: HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot
Classification with Unimodal Cues
- Authors: Ankit Jha, Debabrata Pal, Mainak Singha, Naman Agarwal, Biplab
Banerjee
- Abstract summary: Challenges such as occlusion, intra-class variance, and lighting variation can arise when training neural networks on unimodal RS visual input.
We propose a novel few-shot generative framework, Hallucinated Audio-Visual Embeddings-Network (HAVE-Net), to meta-train cross-modal features from limited unimodal data.
- Score: 19.800985243540797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognition of remote sensing (RS) or aerial images is currently of great
interest, and advances in deep learning algorithms have accelerated progress in
recent years. Challenges such as occlusion, intra-class variance, and lighting
variation can arise when training neural networks on unimodal RS visual input. Even though
joint training of audio-visual modalities improves classification performance
in a low-data regime, it has yet to be thoroughly investigated in the RS
domain. Here, we aim to solve a novel problem where both the audio and visual
modalities are present during the meta-training of a few-shot learning (FSL)
classifier; however, one of the modalities might be missing during the
meta-testing stage. This problem formulation is pertinent in the RS domain,
given the difficulties in data acquisition or sensor malfunction. To
mitigate this, we propose a novel few-shot generative framework, Hallucinated
Audio-Visual Embeddings-Network (HAVE-Net), to meta-train cross-modal features
from limited unimodal data. Precisely, these hallucinated features are
meta-learned from base classes and used for few-shot classification on novel
classes during the inference phase. The experimental results on the benchmark
ADVANCE and AudioSetZSL datasets show that our hallucinated modality
augmentation strategy for few-shot classification outperforms a classifier
trained with the real multimodal information by at least 0.8-2%.
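The abstract describes the mechanism only at a high level; the sketch below is a minimal, hypothetical PyTorch rendering of the idea rather than the authors' code: a hallucination network is meta-trained on base classes to generate pseudo audio embeddings from visual embeddings, so that visual-only queries can still be classified with nearest-prototype inference. All module names, dimensions, and loss weights are assumptions.

```python
# Hypothetical sketch of the hallucinated-embedding idea (not the authors' code).
# Assumes precomputed 512-d visual and audio embeddings for each sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hallucinator(nn.Module):
    """Generates a pseudo audio embedding from a visual embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, v):
        return self.net(v)

def class_prototypes(feats, labels, n_way):
    # Mean fused embedding per class (prototypical-network style).
    return torch.stack([feats[labels == c].mean(0) for c in range(n_way)])

def episode_loss(hallucinator, v_sup, a_sup, y_sup, v_qry, y_qry, n_way):
    """One meta-training episode on base classes: audio is available for the
    support set, while the query set is visual-only and gets hallucinated audio."""
    a_hat = hallucinator(v_qry)                        # hallucinated audio
    recon = F.mse_loss(hallucinator(v_sup), a_sup)     # align with real audio
    sup = torch.cat([v_sup, a_sup], dim=-1)            # fused support features
    qry = torch.cat([v_qry, a_hat], dim=-1)            # fused query features
    protos = class_prototypes(sup, y_sup, n_way)
    logits = -torch.cdist(qry, protos)                 # nearest-prototype logits
    return F.cross_entropy(logits, y_qry) + recon
```

At meta-test time the same hallucinator would fill in the missing modality for novel-class queries before prototype classification.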
Related papers
- Policy Gradient-Driven Noise Mask [3.69758875412828]
We propose a novel pretraining pipeline that learns to generate conditional noise masks specifically tailored to improve performance on multi-modal and multi-organ datasets.
A key aspect is that the policy network's role is limited to obtaining an intermediate (or heated) model before fine-tuning.
Results demonstrate that fine-tuning the intermediate models consistently outperforms conventional training algorithms on both classification and generalization to unseen concept tasks.
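The summary only names the components, so the block below is a speculative REINFORCE-style sketch rather than the paper's pipeline: a small policy network predicts per-patch keep probabilities, a sampled binary mask decides where noise is injected, and the classifier's negative loss on the masked input serves as the reward. Patch size, input resolution, and the noise model are all assumptions.

```python
# Speculative REINFORCE-style sketch of a learned noise-mask policy
# (granularity, noise model, and reward are assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskPolicy(nn.Module):
    """Predicts a 14x14 grid of per-patch keep probabilities (assumes 224x224 input)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(14), nn.Conv2d(3, 1, 1))

    def forward(self, x):
        return torch.sigmoid(self.head(x)).flatten(1)           # (B, 196)

def policy_step(policy, classifier, x, y, patch=16):
    probs = policy(x)                                            # keep probabilities
    mask = torch.bernoulli(probs)                                # sampled binary mask
    B, _, H, W = x.shape
    grid = F.interpolate(mask.view(B, 1, H // patch, W // patch),
                         size=(H, W), mode="nearest")
    noisy = x * grid + torch.randn_like(x) * (1 - grid)          # noise where masked out
    ce = F.cross_entropy(classifier(noisy), y, reduction="none")
    reward = -ce.detach()                                        # lower loss = higher reward
    log_prob = (mask * probs.clamp_min(1e-6).log()
                + (1 - mask) * (1 - probs).clamp_min(1e-6).log()).sum(1)
    return -(reward * log_prob).mean(), ce.mean()                # policy loss, classifier loss
```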
arXiv Detail & Related papers (2024-04-29T23:53:42Z) - V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by
Connecting Foundation Models [14.538853403226751]
Building artificial intelligence systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research.
We propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM.
Our method only requires a quick training of the V2A-Mapper to produce high-fidelity and visually-aligned sound.
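A minimal sketch of the mapper idea is given below: a small MLP trained to translate frozen CLIP image embeddings into the CLAP embedding space, which would then condition an audio generator such as AudioLDM. Embedding dimensions and the alignment loss are assumptions, and the code operates on precomputed embeddings rather than calling the foundation models directly.

```python
# Minimal sketch of a vision-to-audio "mapper" between frozen embedding spaces
# (dimensions and training loss are assumptions; not the authors' release).
import torch
import torch.nn as nn
import torch.nn.functional as F

class V2AMapper(nn.Module):
    """Maps a frozen CLIP image embedding into the CLAP audio/text embedding space."""
    def __init__(self, clip_dim=512, clap_dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(clip_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, clap_dim))

    def forward(self, clip_emb):
        return F.normalize(self.mlp(clip_emb), dim=-1)

def mapper_loss(mapper, clip_emb, clap_emb):
    """Train on paired frames/audio: align mapped CLIP embeddings with real CLAP ones."""
    pred = mapper(clip_emb)
    target = F.normalize(clap_emb, dim=-1)
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```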
arXiv Detail & Related papers (2023-08-18T04:49:38Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
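The exact CMFE layout is not given in the summary; the block below is a generic cross-modal attention layer, with audio tokens querying visual tokens, offered as one plausible reading of "audio-guided" fusion. Dimensions and the number of heads are assumptions.

```python
# Generic audio-guided cross-modal attention layer (an illustration of the idea,
# not the paper's CMFE architecture; sizes are assumptions).
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """Audio tokens attend over visual tokens, followed by a feed-forward block."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, audio, visual):
        # audio: (B, Ta, dim) acoustic frames; visual: (B, Tv, dim) lip features
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        x = self.norm1(audio + fused)                  # residual connection
        return self.norm2(x + self.ffn(x))
```

Stacking several such layers on top of a pre-trained audio encoder corresponds to the "multiple cross-modal attention layers" mentioned above.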
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows speech to be better recognized in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits [22.558134249701794]
We propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS).
CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks.
Experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters.
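The circuit-level details are beyond this summary; the sketch below only illustrates the coarse structure it describes, i.e. separate auditory and visual subnetworks whose features are fused before predicting a separation mask. It is a rough, assumed simplification, not CTCNet's actual design.

```python
# Rough two-stream illustration of separate auditory/visual subnetworks with a
# fusion stage (an assumed simplification, not CTCNet's cortico-thalamo-cortical design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAVSS(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Conv1d(1, dim, 16, stride=8, padding=4), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(512, dim), nn.ReLU())
        self.fusion = nn.GRU(2 * dim, dim, batch_first=True)
        self.mask_head = nn.Conv1d(dim, dim, 1)

    def forward(self, wav, lip_emb):
        # wav: (B, 1, T) mixture waveform; lip_emb: (B, Tv, 512) per-frame lip embeddings
        a = self.audio_net(wav)                               # (B, dim, Ta)
        v = self.visual_net(lip_emb).transpose(1, 2)          # (B, dim, Tv)
        v = F.interpolate(v, size=a.shape[-1])                # align time axes
        fused, _ = self.fusion(torch.cat([a, v], 1).transpose(1, 2))
        mask = torch.sigmoid(self.mask_head(fused.transpose(1, 2)))
        return a * mask                                       # masked target-speaker features
```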
arXiv Detail & Related papers (2022-12-21T03:28:30Z) - Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
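As a sketch of the first module, the block below applies a fixed high-pass filter per RGB channel at several scales and stacks the resulting noise maps; the filter choice (a Laplacian-like kernel) and the scales are assumptions rather than the paper's exact design.

```python
# Sketch of multi-scale high-frequency feature extraction with a fixed high-pass
# filter (kernel and scales are assumptions, not the paper's exact module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHighFreq(nn.Module):
    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # Simple 3x3 high-pass (Laplacian-like) kernel applied per RGB channel.
        k = torch.tensor([[-1., -1., -1.],
                          [-1.,  8., -1.],
                          [-1., -1., -1.]]) / 8.0
        self.register_buffer("kernel", k.repeat(3, 1, 1, 1))     # (3, 1, 3, 3)

    def forward(self, x):
        # x: (B, 3, H, W) RGB face crop
        feats = []
        for s in self.scales:
            xs = F.avg_pool2d(x, s) if s > 1 else x
            hf = F.conv2d(xs, self.kernel, padding=1, groups=3)  # high-frequency residual
            feats.append(F.interpolate(hf, size=x.shape[-2:],
                                       mode="bilinear", align_corners=False))
        return torch.cat(feats, dim=1)                           # stacked noise maps
```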
arXiv Detail & Related papers (2021-03-23T08:19:21Z) - Contrastive Prototype Learning with Augmented Embeddings for Few-Shot
Learning [58.2091760793799]
We propose a novel contrastive prototype learning with augmented embeddings (CPLAE) model.
With a class prototype as an anchor, CPL aims to pull the query samples of the same class closer and those of different classes further away.
Extensive experiments on several benchmarks demonstrate that our proposed CPLAE achieves new state-of-the-art.
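The prototype-anchored contrastive objective can be written compactly; the version below is an assumed simplification (temperature and augmentation details are placeholders).

```python
# Minimal prototype-anchored contrastive loss (an assumed simplification of the
# CPLAE objective; temperature and augmentation details are placeholders).
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(query_emb, query_labels, prototypes, tau=0.1):
    """Pull each query toward its own class prototype and away from the others.

    query_emb:    (Q, D) embeddings of (possibly augmented) query samples
    query_labels: (Q,)   class indices in [0, n_way)
    prototypes:   (n_way, D) mean support embeddings per class
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = q @ p.t() / tau                 # cosine similarity to each prototype
    return F.cross_entropy(logits, query_labels)
```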
arXiv Detail & Related papers (2021-01-23T13:22:44Z) - RS-MetaNet: Deep meta metric learning for few-shot remote sensing scene
classification [9.386331325959766]
We propose RS-MetaNet to resolve the issues related to few-shot remote sensing scene classification in the real world.
RS-MetaNet raises the level of learning from the sample to the task by organizing training in a meta-learning fashion, and it learns a metric space that can classify remote sensing scenes well across a series of tasks.
We also propose a new loss function, called Balance Loss, which maximizes the generalization ability of the model to new samples by maximizing the distance between different categories.
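The exact form of Balance Loss is not reproduced in the summary; below is a hedged sketch combining the usual prototype classification loss with an inter-class margin term that pushes category prototypes apart, as one way to read "maximizing the distance between different categories". The margin and weighting are assumptions.

```python
# Hedged sketch of a prototype-metric episode loss with an inter-class separation
# term in the spirit of "Balance Loss" (the exact formulation is not reproduced here).
import torch
import torch.nn.functional as F

def episode_metric_loss(query_emb, query_labels, prototypes, margin=1.0, lam=0.1):
    # Standard prototypical classification over the episode.
    dists = torch.cdist(query_emb, prototypes)              # (Q, n_way)
    cls_loss = F.cross_entropy(-dists, query_labels)

    # Push prototypes of different classes at least `margin` apart.
    pd = torch.cdist(prototypes, prototypes)                # (n_way, n_way)
    off_diag = pd[~torch.eye(len(prototypes), dtype=torch.bool, device=pd.device)]
    separation = F.relu(margin - off_diag).mean()

    return cls_loss + lam * separation
```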
arXiv Detail & Related papers (2020-09-28T14:34:15Z) - One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first stage Matching-FCOS network and a second stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z) - Rectified Meta-Learning from Noisy Labels for Robust Image-based Plant
Disease Diagnosis [64.82680813427054]
Plant diseases are one of the main threats to food security and crop production.
One popular approach is to cast this problem as a leaf image classification task, which can be addressed by powerful convolutional neural networks (CNNs).
We propose a novel framework that incorporates rectified meta-learning module into common CNN paradigm to train a noise-robust deep network without using extra supervision information.
arXiv Detail & Related papers (2020-03-17T09:51:30Z) - CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
There are approximately 3M people with hearing loss who can't perceive events happening around them.
This paper establishes the CURE dataset which contains curated set of specific audio events most relevant for people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)