Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
- URL: http://arxiv.org/abs/2003.03186v3
- Date: Thu, 10 Dec 2020 14:26:22 GMT
- Title: Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
- Authors: Elad Amrani, Rami Ben-Ari, Daniel Rotman and Alex Bronstein
- Abstract summary: We show that noise estimation for multimodal data can be reduced to a multimodal density estimation task.
We demonstrate how our noise estimation can be broadly integrated and achieves results comparable to the state of the art.
- Score: 10.151012770913624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the key factors in enabling machine learning models to comprehend and
solve real-world tasks is to leverage multimodal data. Unfortunately,
annotation of multimodal data is challenging and expensive. Recently,
self-supervised multimodal methods that combine vision and language were
proposed to learn multimodal representations without annotation. However, these
methods often choose to ignore the presence of high levels of noise and thus
yield sub-optimal results. In this work, we show that the problem of noise
estimation for multimodal data can be reduced to a multimodal density
estimation task. Using multimodal density estimation, we propose a noise
estimation building block for multimodal representation learning that is based
strictly on the inherent correlation between different modalities. We
demonstrate how our noise estimation can be broadly integrated and achieves results
comparable to state-of-the-art performance on five different benchmark
datasets for two challenging multimodal tasks: Video Question Answering and
Text-To-Video Retrieval. Furthermore, we provide a theoretical probabilistic
error bound substantiating our empirical results and analyze failure cases.
Code: https://github.com/elad-amrani/ssml.
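The reduction described above, estimating how noisy each video-text pair is from a multimodal density estimate built only from the correlation between the modalities, can be illustrated with a small sketch. The Python snippet below is a rough illustration under simple assumptions and is not the authors' released implementation (see the repository linked above): it concatenates L2-normalized video and text embeddings, scores each pair with a leave-one-out Gaussian kernel density estimate, and turns low joint density into a low soft weight. The function name pair_noise_weights, the bandwidth, and the min-max rescaling are illustrative choices.

```python
# Minimal sketch, assuming paired video/text embeddings are already available.
# NOT the authors' implementation (see https://github.com/elad-amrani/ssml for that);
# it only illustrates "noise estimation via multimodal density estimation":
# pairs whose joint embedding falls in a low-density region get a low soft weight.
import numpy as np


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def pair_noise_weights(video_emb, text_emb, bandwidth=0.5):
    """Return per-pair weights in [0, 1]; a low weight suggests a noisy (misaligned) pair."""
    z = np.concatenate([l2_normalize(video_emb), l2_normalize(text_emb)], axis=1)
    # Pairwise squared Euclidean distances between joint (video, text) embeddings.
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    # Leave-one-out Gaussian kernel density estimate: ignore each pair's own zero distance.
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(kernel, 0.0)
    density = kernel.sum(axis=1) / (len(z) - 1)
    # Rescale densities to [0, 1] so they can serve as soft sample weights in a loss.
    return (density - density.min()) / (density.max() - density.min() + 1e-8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(5, 64))                            # five shared latent "topics"
    topic = rng.integers(0, 5, size=200)
    clean_v = centers[topic] + 0.1 * rng.normal(size=(200, 64))
    clean_t = centers[topic] + 0.1 * rng.normal(size=(200, 64))   # same topic: aligned pair
    tv, tt = rng.integers(0, 5, size=20), rng.integers(0, 5, size=20)
    noisy_v = centers[tv] + 0.1 * rng.normal(size=(20, 64))
    noisy_t = centers[tt] + 0.1 * rng.normal(size=(20, 64))       # topics usually differ: noisy pair
    w = pair_noise_weights(np.vstack([clean_v, noisy_v]), np.vstack([clean_t, noisy_t]))
    print("mean weight, aligned pairs:   ", round(float(w[:200].mean()), 3))
    print("mean weight, mismatched pairs:", round(float(w[200:].mean()), 3))
```

In a training loop, weights like these could down-weight suspect pairs in a contrastive or retrieval objective; the linked repository remains the authoritative reference for how the paper actually realizes the estimator.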
Related papers
- Deep Multimodal Learning with Missing Modality: A Survey [12.873458712005037]
Multimodal learning techniques designed to handle missing modalities can mitigate the degradation caused when some modalities are unavailable at training or inference time.
This survey reviews recent progress in Multimodal Learning with Missing Modality (MLMM).
arXiv Detail & Related papers (2024-09-12T08:15:39Z)
- DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data [16.501973201535442]
We reformulate partially labeled multi-task dense prediction as a pixel-level denoising problem.
We propose a novel multi-task denoising framework coined DiffusionMTL.
It designs a joint diffusion and denoising paradigm to model the potentially noisy distribution in the task predictions or feature maps.
arXiv Detail & Related papers (2024-03-22T17:59:58Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve and find that it struggles with image-based questions compared to text and audio, as well as with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- On Robustness in Multimodal Learning [75.03719000820388]
Multimodal learning is defined as learning over multiple input modalities such as video, audio, and text.
We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods.
arXiv Detail & Related papers (2023-04-10T05:02:07Z)
- Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique.
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality.
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation. (A toy product-of-experts fusion sketch appears after this list.)
arXiv Detail & Related papers (2022-11-07T14:27:38Z)
- Uncertainty-Aware Multi-View Representation Learning [53.06828186507994]
We devise a novel unsupervised multi-view learning approach termed Dynamic Uncertainty-Aware Networks (DUA-Nets).
Guided by the uncertainty of data estimated from the generation perspective, intrinsic information from multiple views is integrated to obtain noise-free representations.
Our model achieves superior performance in extensive experiments and shows robustness to noisy data.
arXiv Detail & Related papers (2022-01-15T07:16:20Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
Our method is also shown to be extremely lightweight and to generalize easily to other tasks and unseen data with only a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity-inducing adversarial loss for learning latent variables and thereby obtain the diversity in output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
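The Generalized Product-of-Experts entry above describes training a separate network per modality to assess how credible that modality's information is before fusing. As a minimal sketch of the underlying fusion rule, assuming each per-modality expert outputs a Gaussian (a mean and a variance) over the same latent vector, the snippet below performs standard Gaussian product-of-experts fusion by precision weighting, so a high-variance (low-credibility) modality contributes little; it is not that paper's generalized variant.

```python
# Minimal Gaussian product-of-experts fusion sketch (illustrative only, not the
# generalized variant from the paper above). Each modality expert reports a mean and
# a variance for the same latent vector; fusing multiplies the Gaussians, which
# amounts to a precision-weighted average.
import numpy as np


def poe_fuse(means, variances, eps=1e-8):
    """means, variances: arrays of shape (num_modalities, dim). Returns fused (mean, variance)."""
    means, variances = np.asarray(means, dtype=float), np.asarray(variances, dtype=float)
    precisions = 1.0 / (variances + eps)          # credibility acts as inverse variance
    fused_var = 1.0 / precisions.sum(axis=0)      # product of Gaussians: precisions add
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var


if __name__ == "__main__":
    video = (np.array([0.9, 0.1]), np.array([0.05, 0.05]))  # confident expert
    audio = (np.array([0.2, 0.8]), np.array([5.0, 5.0]))    # noisy, low-credibility expert
    mean, var = poe_fuse([video[0], audio[0]], [video[1], audio[1]])
    print(mean, var)  # fused estimate stays close to the confident video expert
```

The paper's generalized variant learns how much to trust each expert via a per-modality credibility network; replacing the fixed variances above with such learned estimates would be the natural extension of this sketch.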
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.