Causality-based Cross-Modal Representation Learning for
Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2403.03405v1
- Date: Wed, 6 Mar 2024 02:01:38 GMT
- Title: Causality-based Cross-Modal Representation Learning for
Vision-and-Language Navigation
- Authors: Liuyi Wang, Zongtao He, Ronghao Dang, Huiyi Chen, Chengju Liu, Qijun
Chen
- Abstract summary: Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios.
Existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments.
We propose CausalVLN, a unified framework based on the causal learning paradigm, to train a robust navigator that learns unbiased feature representations.
- Score: 15.058687283978077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) has gained significant research interest
in recent years due to its potential applications in real-world scenarios.
However, existing VLN methods struggle with the issue of spurious associations,
resulting in poor generalization with a significant performance gap between
seen and unseen environments. In this paper, we tackle this challenge by
proposing CausalVLN, a unified framework based on the causal learning paradigm,
to train a robust navigator capable of learning unbiased feature
representations. Specifically, we establish reasonable assumptions about
confounders for vision and language in VLN using the structural causal model
(SCM). Building upon this, we propose an iterative backdoor-based
representation learning (IBRL) method that allows for the adaptive and
effective intervention on confounders. Furthermore, we introduce the visual and
linguistic backdoor causal encoders to enable unbiased feature expression for
both modalities during training and validation, enhancing the agent's
capability to generalize across different environments. Experiments on three
VLN datasets (R2R, RxR, and REVERIE) showcase the superiority of our proposed
method over previous state-of-the-art approaches. Moreover, detailed
visualization analysis demonstrates the effectiveness of CausalVLN in
significantly narrowing the performance gap between seen and unseen
environments, underscoring its strong generalization capability.
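At the core of backdoor-based intervention is replacing the biased conditional P(Y|X) with P(Y|do(X)) = sum_z P(Y|X, z) P(z), where z ranges over the assumed confounders. The snippet below is a minimal sketch of how such an adjustment is often approximated in representation learning, using a fixed dictionary of confounder prototypes and attention over it; it is not the authors' IBRL implementation, and the module name, shapes, prior estimate, and fusion scheme are all illustrative assumptions.

```python
# Minimal sketch of a backdoor-adjusted ("deconfounded") encoder.
# Assumptions: confounders are represented by a pre-computed dictionary of K
# prototype vectors (e.g., clustered visual regions or word embeddings) with an
# estimated prior P(z); the expectation over z is approximated by a
# prior-weighted attention pooling rather than exact marginalization.
import torch
import torch.nn as nn


class BackdoorCausalEncoder(nn.Module):
    def __init__(self, feat_dim: int, confounder_dict: torch.Tensor, prior: torch.Tensor):
        super().__init__()
        self.register_buffer("confounders", confounder_dict)  # (K, D) prototypes
        self.register_buffer("prior", prior)                  # (K,) estimated P(z)
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D) visual or linguistic feature for one modality.
        q = self.query(x)                                              # (B, D)
        k = self.key(self.confounders)                                 # (K, D)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)   # ~ P(z | x)
        # Re-weight by the prior P(z) so the intervention does not simply follow
        # the (possibly spurious) co-occurrence statistics of the current input.
        w = attn * self.prior.unsqueeze(0)
        w = w / w.sum(dim=-1, keepdim=True)
        z_hat = w @ self.confounders                                   # (B, D) pooled confounder
        return self.fuse(torch.cat([x, z_hat], dim=-1))                # deconfounded feature
```

In a CausalVLN-style pipeline, one such encoder would presumably sit in front of each modality's branch before cross-modal fusion; the actual iterative intervention procedure (IBRL) is described in the paper itself.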
Related papers
- Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation [4.506099292980221]
We evaluate and analyze the OoDD capabilities of various proprietary and open-source LVLMs.
We propose a self-guided prompting approach, termed Reflexive Guidance (ReGuide), aimed at enhancing the OoDD capability of LVLMs.
Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks.
arXiv Detail & Related papers (2024-10-19T04:46:51Z)
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
- Vision-and-Language Navigation via Causal Learning [13.221880074458227]
The cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference.
Its BACL and FACL modules promote unbiased learning by comprehensively mitigating potential spurious correlations.
To capture global confounder features, we propose a cross-modal feature pooling module supervised by contrastive learning.
arXiv Detail & Related papers (2024-04-16T02:40:35Z)
- Visual In-Context Learning for Large Vision-Language Models [62.5507897575317]
In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities.
We introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition.
Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarizes images with task intent and task-specific visual parsing, and composes language-based demonstrations.
arXiv Detail & Related papers (2024-02-18T12:43:38Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and a modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
A lack of interpretability, robustness, and out-of-distribution generalization is becoming a central challenge for existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
- Confounder Identification-free Causal Visual Feature Learning [84.28462256571822]
We propose a novel Confounder Identification-free Causal Visual Feature Learning (CICF) method, which obviates the need for identifying confounders.
CICF models the interventions among different samples based on the front-door criterion, and then approximates the global-scope intervening effect upon the instance-level interventions (a toy numerical sketch of front-door adjustment appears after this list).
We uncover the relation between CICF and the popular meta-learning strategy MAML, and provide an interpretation of why MAML works from the theoretical perspective.
arXiv Detail & Related papers (2021-11-26T10:57:47Z)
- SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments [7.5606260987453116]
This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments.
Existing end-to-end learning-based methods struggle at this task as they focus mostly on raw visual observations.
We present a hybrid transformer-recurrence model that combines classical semantic mapping techniques with a learning-based method.
arXiv Detail & Related papers (2021-08-26T17:57:02Z)
- Farewell to Mutual Information: Variational Distillation for Cross-Modal Person Re-Identification [41.02729491273057]
The Information Bottleneck (IB) provides an information theoretic principle for representation learning.
We present a new strategy, Variational Self-Distillation (VSD), which provides a scalable, flexible and analytic solution.
We also introduce two other strategies, Variational Cross-Distillation (VCD) and Variational Mutual-Learning (VML).
arXiv Detail & Related papers (2021-04-07T02:19:41Z)
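For contrast with the backdoor adjustment used by CausalVLN, the front-door criterion mentioned in the CICF entry above conditions on a mediator M instead of the confounder itself: P(Y|do(X=x)) = sum_m P(m|x) sum_x' P(Y|m, x') P(x'). The toy computation below evaluates this formula on made-up discrete distributions purely to show the arithmetic; it has no connection to CICF's actual feature-level approximation.

```python
# Toy front-door adjustment on a discrete model X -> M -> Y with an unobserved
# confounder between X and Y. All probabilities here are invented for illustration.
import numpy as np

p_x = np.array([0.6, 0.4])                    # P(X)
p_m_given_x = np.array([[0.8, 0.2],           # P(M | X), rows indexed by x
                        [0.3, 0.7]])
p_y_given_mx = np.array([[[0.9, 0.1],         # P(Y | M, X), indexed [m, x, y]
                          [0.6, 0.4]],
                         [[0.5, 0.5],
                          [0.2, 0.8]]])


def front_door(x: int) -> np.ndarray:
    """P(Y | do(X=x)) = sum_m P(m|x) * sum_x' P(Y|m, x') * P(x')."""
    p_y = np.zeros(2)
    for m in range(2):
        inner = sum(p_y_given_mx[m, xp] * p_x[xp] for xp in range(2))
        p_y += p_m_given_x[x, m] * inner
    return p_y


print(front_door(0), front_door(1))  # each output is a distribution summing to 1
```

CICF's stated contribution is to approximate this kind of intervention at the feature level without explicitly identifying confounders; the toy arithmetic only shows what the adjustment formula computes.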
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.