Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
- URL: http://arxiv.org/abs/2003.03186v3
- Date: Thu, 10 Dec 2020 14:26:22 GMT
- Title: Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
- Authors: Elad Amrani, Rami Ben-Ari, Daniel Rotman and Alex Bronstein
- Abstract summary: We show that noise estimation for multimodal data can be reduced to a multimodal density estimation task.
We demonstrate how our noise estimation can be broadly integrated and achieves results comparable to the state of the art.
- Score: 10.151012770913624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the key factors in enabling machine learning models to comprehend and
solve real-world tasks is to leverage multimodal data. Unfortunately,
annotation of multimodal data is challenging and expensive. Recently,
self-supervised multimodal methods that combine vision and language were
proposed to learn multimodal representations without annotation. However, these
methods often choose to ignore the presence of high levels of noise and thus
yield sub-optimal results. In this work, we show that the problem of noise
estimation for multimodal data can be reduced to a multimodal density
estimation task. Using multimodal density estimation, we propose a noise
estimation building block for multimodal representation learning that is based
strictly on the inherent correlation between different modalities. We
demonstrate how our noise estimation can be broadly integrated and achieves results
comparable to state-of-the-art performance on five different benchmark
datasets for two challenging multimodal tasks: Video Question Answering and
Text-To-Video Retrieval. Furthermore, we provide a theoretical probabilistic
error bound substantiating our empirical results and analyze failure cases.
Code: https://github.com/elad-amrani/ssml.
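The reduction described above, estimating how noisy each video-text pair is from a multimodal density estimate built only from the correlation between the modalities, can be illustrated with a small sketch. The Python snippet below is a rough illustration under simple assumptions and is not the authors' released implementation (see the repository linked above): it concatenates L2-normalized video and text embeddings, scores each pair with a leave-one-out Gaussian kernel density estimate, and turns low joint density into a low soft weight. The function name pair_noise_weights, the bandwidth, and the min-max rescaling are illustrative choices.

```python
# Minimal sketch, assuming paired video/text embeddings are already available.
# NOT the authors' implementation (see https://github.com/elad-amrani/ssml for that);
# it only illustrates "noise estimation via multimodal density estimation":
# pairs whose joint embedding falls in a low-density region get a low soft weight.
import numpy as np


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def pair_noise_weights(video_emb, text_emb, bandwidth=0.5):
    """Return per-pair weights in [0, 1]; a low weight suggests a noisy (misaligned) pair."""
    z = np.concatenate([l2_normalize(video_emb), l2_normalize(text_emb)], axis=1)
    # Pairwise squared Euclidean distances between joint (video, text) embeddings.
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    # Leave-one-out Gaussian kernel density estimate: ignore each pair's own zero distance.
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(kernel, 0.0)
    density = kernel.sum(axis=1) / (len(z) - 1)
    # Rescale densities to [0, 1] so they can serve as soft sample weights in a loss.
    return (density - density.min()) / (density.max() - density.min() + 1e-8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(5, 64))                            # five shared latent "topics"
    topic = rng.integers(0, 5, size=200)
    clean_v = centers[topic] + 0.1 * rng.normal(size=(200, 64))
    clean_t = centers[topic] + 0.1 * rng.normal(size=(200, 64))   # same topic: aligned pair
    tv, tt = rng.integers(0, 5, size=20), rng.integers(0, 5, size=20)
    noisy_v = centers[tv] + 0.1 * rng.normal(size=(20, 64))
    noisy_t = centers[tt] + 0.1 * rng.normal(size=(20, 64))       # topics usually differ: noisy pair
    w = pair_noise_weights(np.vstack([clean_v, noisy_v]), np.vstack([clean_t, noisy_t]))
    print("mean weight, aligned pairs:   ", round(float(w[:200].mean()), 3))
    print("mean weight, mismatched pairs:", round(float(w[200:].mean()), 3))
```

In a training loop, weights like these could down-weight suspect pairs in a contrastive or retrieval objective; the linked repository remains the authoritative reference for how the paper actually realizes the estimator.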
Related papers
- Deep Multimodal Learning with Missing Modality: A Survey [12.873458712005037]
Multimodal learning techniques designed to handle missing modalities can mitigate the degradation caused when some modalities are unavailable at training or inference time.
This survey reviews recent progress in Multimodal Learning with Missing Modality (MLMM).
arXiv Detail & Related papers (2024-09-12T08:15:39Z)
- DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data [16.501973201535442]
We reformulate partially labeled multi-task dense prediction as a pixel-level denoising problem.
We propose a novel multi-task denoising framework coined DiffusionMTL.
It designs a joint diffusion and denoising paradigm to model the potentially noisy distribution in the task predictions or feature maps.
arXiv Detail & Related papers (2024-03-22T17:59:58Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve and find that it struggles with image-based questions compared to text and audio, as well as with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- On Robustness in Multimodal Learning [75.03719000820388]
Multimodal learning is defined as learning over multiple input modalities such as video, audio, and text.
We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods.
arXiv Detail & Related papers (2023-04-10T05:02:07Z)
- Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique.
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality.
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation. (A toy product-of-experts fusion sketch appears after this list.)
arXiv Detail & Related papers (2022-11-07T14:27:38Z)
- Uncertainty-Aware Multi-View Representation Learning [53.06828186507994]
We devise a novel unsupervised multi-view learning approach termed Dynamic Uncertainty-Aware Networks (DUA-Nets).
Guided by the uncertainty of data estimated from the generation perspective, intrinsic information from multiple views is integrated to obtain noise-free representations.
Our model achieves superior performance in extensive experiments and shows robustness to noisy data.
arXiv Detail & Related papers (2022-01-15T07:16:20Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
Our method is also shown to be extremely lightweight and to generalize easily to other tasks and unseen data with only a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity-inducing adversarial loss for learning latent variables and thereby obtain the diversity in output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
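The Generalized Product-of-Experts entry above describes training a separate network per modality to assess how credible that modality's information is before fusing. As a minimal sketch of the underlying fusion rule, assuming each per-modality expert outputs a Gaussian (a mean and a variance) over the same latent vector, the snippet below performs standard Gaussian product-of-experts fusion by precision weighting, so a high-variance (low-credibility) modality contributes little; it is not that paper's generalized variant.

```python
# Minimal Gaussian product-of-experts fusion sketch (illustrative only, not the
# generalized variant from the paper above). Each modality expert reports a mean and
# a variance for the same latent vector; fusing multiplies the Gaussians, which
# amounts to a precision-weighted average.
import numpy as np


def poe_fuse(means, variances, eps=1e-8):
    """means, variances: arrays of shape (num_modalities, dim). Returns fused (mean, variance)."""
    means, variances = np.asarray(means, dtype=float), np.asarray(variances, dtype=float)
    precisions = 1.0 / (variances + eps)          # credibility acts as inverse variance
    fused_var = 1.0 / precisions.sum(axis=0)      # product of Gaussians: precisions add
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var


if __name__ == "__main__":
    video = (np.array([0.9, 0.1]), np.array([0.05, 0.05]))  # confident expert
    audio = (np.array([0.2, 0.8]), np.array([5.0, 5.0]))    # noisy, low-credibility expert
    mean, var = poe_fuse([video[0], audio[0]], [video[1], audio[1]])
    print(mean, var)  # fused estimate stays close to the confident video expert
```

The paper's generalized variant learns how much to trust each expert via a per-modality credibility network; replacing the fixed variances above with such learned estimates would be the natural extension of this sketch.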
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.