Disentangled Counterfactual Learning for Physical Audiovisual
Commonsense Reasoning
- URL: http://arxiv.org/abs/2310.19559v2
- Date: Thu, 2 Nov 2023 02:36:12 GMT
- Title: Disentangled Counterfactual Learning for Physical Audiovisual
Commonsense Reasoning
- Authors: Changsheng Lv and Shuai Zhang and Yapeng Tian and Mengshi Qi and
Huadong Ma
- Abstract summary: We propose a Disentangled Counterfactual Learning approach for physical audiovisual commonsense reasoning.
Our proposed method is a plug-and-play module that can be incorporated into any baseline.
- Score: 48.559572337178686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a Disentangled Counterfactual Learning (DCL)
approach for physical audiovisual commonsense reasoning. The task aims to infer
objects' physics commonsense based on both video and audio input, where the main
challenge is how to imitate the reasoning ability of humans. Most current
methods fail to take full advantage of the distinct characteristics of
multi-modal data, and the lack of causal reasoning ability in their models
impedes progress in inferring implicit physical knowledge. To address these issues, our
proposed DCL method decouples videos into static (time-invariant) and dynamic
(time-varying) factors in the latent space by the disentangled sequential
encoder, which adopts a variational autoencoder (VAE) to maximize the mutual
information with a contrastive loss function. Furthermore, we introduce a
counterfactual learning module to augment the model's reasoning ability by
modeling physical knowledge relationships among different objects under
counterfactual intervention. Our proposed method is a plug-and-play module that
can be incorporated into any baseline. In experiments, we show that our
proposed method improves baseline methods and achieves state-of-the-art
performance. Our source code is available at https://github.com/Andy20178/DCL.
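To make the abstract concrete, below is a minimal PyTorch sketch of the two
components it names: a sequential VAE-style encoder that splits the latent code
into static and dynamic factors (with a KL term and an InfoNCE-style contrastive
bound standing in for the mutual-information objective), and a counterfactual
intervention on the object-relation graph. All names, dimensions, and loss
choices are assumptions for illustration; the authors' actual implementation is
in the repository linked above.
```python
# Minimal, illustrative sketch; all module names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledSequentialEncoder(nn.Module):
    """Factors a feature sequence into a static (time-invariant) latent and
    a dynamic (time-varying) latent, VAE-style."""
    def __init__(self, feat_dim=512, static_dim=64, dynamic_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True)
        self.static_head = nn.Linear(256, 2 * static_dim)    # -> mu, logvar
        self.dynamic_head = nn.Linear(256, 2 * dynamic_dim)  # per-step mu, logvar

    @staticmethod
    def _sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return z, kl

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)                 # (B, T, 256)
        z_static, kl_s = self._sample(self.static_head(h.mean(dim=1)))
        z_dynamic, kl_d = self._sample(self.dynamic_head(h))
        return z_static, z_dynamic, kl_s + kl_d

def contrastive_mi(z_a, z_b, tau=0.1):
    """InfoNCE-style lower bound, used here as a stand-in for the paper's
    contrastive mutual-information objective (an assumed reading)."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau
    targets = torch.arange(z_a.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def counterfactual_intervention(relations):    # relations: (B, N, N)
    """Swap in another sample's object-relation graph; contrasting factual
    and counterfactual predictions is the assumed training signal."""
    return relations[torch.randperm(relations.size(0))]
```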
Related papers
- Learning Physics-Consistent Material Behavior Without Prior Knowledge [6.691537914484337]
We introduce a machine learning approach called uLED, which overcomes these limitations by using an input convex neural network (ICNN) as a surrogate model.
We demonstrate that it is robust to a significant level of noise and that it converges to the ground truth with increasing data resolution.
arXiv Detail & Related papers (2024-07-25T08:24:04Z)
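For background on the uLED entry above: an input convex neural network keeps its
output convex in its input by constraining the hidden-to-hidden weights to be
non-negative and using convex, non-decreasing activations. A minimal sketch
(layer sizes arbitrary; how uLED attaches this surrogate to material behavior is
not stated in the summary):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input convex neural network: the output is convex in x because the
    z-path weights are kept non-negative and softplus is convex and
    non-decreasing."""
    def __init__(self, in_dim=4, hidden=64, depth=3):
        super().__init__()
        self.x_layers = nn.ModuleList(nn.Linear(in_dim, hidden) for _ in range(depth))
        self.z_layers = nn.ModuleList(
            nn.Linear(hidden, hidden, bias=False) for _ in range(depth - 1))
        self.out = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = F.softplus(self.x_layers[0](x))
        for x_lin, z_lin in zip(self.x_layers[1:], self.z_layers):
            # clamping keeps the z-path weights non-negative, preserving convexity
            z = F.softplus(x_lin(x) + F.linear(z, z_lin.weight.clamp(min=0)))
        return F.linear(z, self.out.weight.clamp(min=0))
```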
- DynaMMo: Dynamic Model Merging for Efficient Class Incremental Learning for Medical Images [0.8213829427624407]
Continual learning, the ability to acquire knowledge from new data while retaining previously learned information, is a fundamental challenge in machine learning.
We propose Dynamic Model Merging, DynaMMo, a method that merges multiple networks at different stages of model training to achieve better computational efficiency.
We evaluate DynaMMo on three publicly available datasets, demonstrating its effectiveness compared to existing approaches.
arXiv Detail & Related papers (2024-04-22T11:37:35Z)
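The DynaMMo summary above does not specify the merge rule; a common baseline is
a weighted average of the parameters of checkpoints taken at different training
stages, sketched here purely as an assumption:
```python
import copy
import torch

def merge_checkpoints(models, weights=None):
    """Weighted-average the parameters of several snapshots of one
    architecture. A generic stand-in: the summary does not say which merge
    rule DynaMMo actually uses."""
    k = len(models)
    weights = weights or [1.0 / k] * k
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, p in merged.named_parameters():
            p.zero_()
            for w, m in zip(weights, models):
                p.add_(dict(m.named_parameters())[name].detach(), alpha=w)
    return merged
```
Merging snapshots into a single network to deploy is where the claimed
computational efficiency over keeping separate per-stage models would come from.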
- Towards Principled Representation Learning from Videos for Reinforcement Learning [23.877731515619868]
We study pre-training representations for decision-making using video data.
We focus on learning the latent state representations of the underlying MDP using video data.
arXiv Detail & Related papers (2024-03-20T17:28:17Z)
- Diffusion-Generative Multi-Fidelity Learning for Physical Simulation [24.723536390322582]
We develop a diffusion-generative multi-fidelity learning method based on stochastic differential equations (SDEs), where generation is a continuous denoising process.
By conditioning on additional inputs (temporal or spatial variables), our model can efficiently learn and predict multi-dimensional solution arrays.
arXiv Detail & Related papers (2023-11-09T18:59:05Z)
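For the diffusion-generative entry above, a denoiser conditioned on diffusion
time plus the extra temporal/spatial inputs is the natural reading; this generic
sketch assumes an MLP denoiser and omits the SDE sampling loop:
```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in a solution array given the diffusion time t and
    conditioning variables (e.g. fidelity level, spatial/temporal coords)."""
    def __init__(self, sol_dim=128, cond_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sol_dim + 1 + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, sol_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, sol_dim) noisy solution, t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([x_t, t, cond], dim=-1))
```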
- MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
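One simple way to realize the multi-view idea in the MinT entry above is to
prefix each training example with an explicit view tag, so one model learns
every annotation style; the tags and helper below are hypothetical:
```python
# Hypothetical data preparation for multi-view fine-tuning: the same problem
# is rendered in several annotation styles, each marked with a view tag so
# the model learns style-conditioned solutions. Tag names are made up.
def to_multiview_examples(problem, solutions_by_style):
    return [
        {"prompt": f"<view:{style}> {problem}", "target": solution}
        for style, solution in solutions_by_style.items()
    ]

examples = to_multiview_examples(
    "If 3x + 5 = 20, what is x?",
    {"cot": "3x = 20 - 5 = 15, so x = 5.",
     "program": "x = (20 - 5) / 3  # evaluates to 5.0"},
)
```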
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual, and audio-visual event instances and to identify the corresponding event categories with only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
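For the WS-AVVP entry above, the standard weakly-supervised scaffold is
attention-based multiple-instance pooling, which lets video-level labels
supervise per-segment predictions; a generic sketch (not the paper's
gradient-modulation mechanism itself):
```python
import torch
import torch.nn as nn

class MILPooling(nn.Module):
    """Attention-based multiple-instance pooling: per-segment logits are
    aggregated into a video-level prediction so that video-level labels can
    supervise segment-level parsing."""
    def __init__(self, feat_dim=256, num_classes=25):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)
        self.att = nn.Linear(feat_dim, num_classes)

    def forward(self, segments):                      # (B, T, feat_dim)
        seg_logits = self.cls(segments)               # per-segment class scores
        attn = torch.softmax(self.att(segments), dim=1)  # temporal weights
        video_logits = (attn * seg_logits).sum(dim=1)    # (B, num_classes)
        return video_logits, seg_logits
```
Training would apply nn.BCEWithLogitsLoss to video_logits against the
video-level labels; the paper's contribution, per the summary, is modulating
gradients across the imbalanced modalities on top of such a pipeline.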
- Accelerating exploration and representation learning with offline pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
arXiv Detail & Related papers (2023-03-31T18:03:30Z)
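The offline pre-training entry above names noise-contrastive estimation for
state representations; a minimal reading is binary NCE that scores real
transitions against shuffled ones (architecture and names assumed):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionNCE(nn.Module):
    """Binary noise-contrastive estimation over transitions: real (s, s')
    pairs are scored against pairs with shuffled successors, shaping the
    encoder into a state representation."""
    def __init__(self, obs_dim=64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.score = nn.Bilinear(latent_dim, latent_dim, 1)

    def forward(self, obs, next_obs):
        z, z_next = self.encoder(obs), self.encoder(next_obs)
        pos = self.score(z, z_next)                               # real pairs
        neg = self.score(z, z_next[torch.randperm(len(z_next))])  # noise pairs
        labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
        return F.binary_cross_entropy_with_logits(torch.cat([pos, neg]), labels)
```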
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- Deep Active Learning with Noise Stability [24.54974925491753]
Uncertainty estimation for unlabeled data is crucial to active learning.
We propose a novel algorithm that leverages noise stability to estimate data uncertainty.
Our method is generally applicable in various tasks, including computer vision, natural language processing, and structured data analysis.
arXiv Detail & Related papers (2022-05-26T13:21:01Z)
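For the noise-stability entry above, one plausible reading is that uncertainty
is measured by how much predictions drift under small random weight
perturbations; sketched below under that assumption, not as the paper's exact
algorithm:
```python
import copy
import torch

@torch.no_grad()
def noise_stability_score(model, x, sigma=0.01, trials=8):
    """Uncertainty proxy for active learning: perturb the weights with small
    Gaussian noise and measure how far the outputs move; the least stable
    inputs become labeling candidates."""
    base = model(x)
    drift = torch.zeros(x.size(0))
    for _ in range(trials):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
        drift += (noisy(x) - base).norm(dim=-1).cpu()
    return drift / trials  # higher = less stable = more uncertain
```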
- Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization [76.68866368409216]
We propose learning to dynamically select discretization tightness conditioned on inputs.
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.
arXiv Detail & Related papers (2022-02-02T23:54:26Z)
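Finally, for the dynamic vector quantization entry above, "discretization
tightness" can be read as the number of codebook entries an input is allowed to
use; the sketch below gates between code subsets per input (the gate's training
signal, e.g. a Gumbel-softmax relaxation, is omitted, and the paper's actual
mechanism may differ):
```python
import torch
import torch.nn as nn

class DynamicVQ(nn.Module):
    """Vector quantization whose tightness (number of usable codes) is
    chosen per input by a small gate."""
    def __init__(self, dim=64, num_codes=256, levels=(16, 64, 256)):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.levels = levels
        self.gate = nn.Linear(dim, len(levels))  # picks a discretization level

    def forward(self, z):                         # z: (B, dim)
        level_idx = self.gate(z).argmax(dim=-1)   # (B,) chosen tightness
        out = torch.empty_like(z)
        for i, n in enumerate(self.levels):
            mask = level_idx == i
            if mask.any():
                codes = self.codebook.weight[:n]              # first n codes only
                d = torch.cdist(z[mask], codes)               # (b, n) distances
                q = codes[d.argmin(dim=-1)]                   # nearest code
                out[mask] = z[mask] + (q - z[mask]).detach()  # straight-through
        return out
```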