Using Left and Right Brains Together: Towards Vision and Language Planning
- URL: http://arxiv.org/abs/2402.10534v1
- Date: Fri, 16 Feb 2024 09:46:20 GMT
- Title: Using Left and Right Brains Together: Towards Vision and Language Planning
- Authors: Jun Cen, Chenfei Wu, Xiao Liu, Shengming Yin, Yixuan Pei, Jinglong
Yang, Qifeng Chen, Nan Duan, Jianguo Zhang
- Abstract summary: We introduce a novel vision-language planning framework to perform concurrent visual and language planning for tasks with inputs of any form.
We evaluate the effectiveness of our framework across vision-language tasks, vision-only tasks, and language-only tasks.
- Score: 95.47128850991815
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) and Large Multi-modality Models (LMMs) have
demonstrated remarkable decision-making capabilities on a variety of tasks.
However, they inherently plan within the language space and lack the ability to
imagine visual and spatial content. In contrast, humans use both the
left and right hemispheres of the brain for language and visual planning during
the thinking process. Therefore, we introduce a novel vision-language planning
framework in this work to perform concurrent visual and language planning for
tasks with inputs of any form. Our framework incorporates visual planning to
capture intricate environmental details, while language planning enhances the
logical coherence of the overall system. We evaluate the effectiveness of our
framework across vision-language tasks, vision-only tasks, and language-only
tasks. The results demonstrate the superior performance of our approach,
indicating that the integration of visual and language planning yields more
contextually aware task execution.
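The abstract describes the framework only at a high level. As a rough illustration of what concurrent visual and language planning could look like in code, the minimal Python sketch below runs a textual planner and a visual planner side by side and pairs their outputs step by step; every class name, interface, and the stubbed behaviour are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    """One step of a plan: a textual instruction plus an (imagined) visual state."""
    text: str
    image: str  # placeholder for an imagined image or frame


class LanguagePlanner:
    """Hypothetical stand-in for an LLM that decomposes a task into textual sub-steps."""
    def plan(self, task: str) -> List[str]:
        # A real system would prompt an LLM here; this stub returns canned steps.
        return [f"step {i + 1} of '{task}'" for i in range(3)]


class VisualPlanner:
    """Hypothetical stand-in for a generative visual model that imagines intermediate scenes."""
    def imagine(self, task: str, step_text: str) -> str:
        # A real system would generate or retrieve an image; this stub returns a tag.
        return f"<imagined scene for: {step_text}>"


def vision_language_plan(task: str) -> List[PlanStep]:
    """Run language and visual planning together and pair their outputs step by step."""
    lang, vis = LanguagePlanner(), VisualPlanner()
    steps = lang.plan(task)                      # language branch: logical structure
    return [PlanStep(s, vis.imagine(task, s))    # visual branch: environmental detail
            for s in steps]


if __name__ == "__main__":
    for step in vision_language_plan("make a cup of tea"):
        print(step.text, "|", step.image)
```

The point of the pairing is that each textual sub-step is grounded in an imagined visual state, which is the property the abstract attributes to combining the two planners.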
Related papers
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the general spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- Contextual Emotion Recognition using Large Vision Language Models [0.6749750044497732]
Achieving human-level recognition of the apparent emotion of a person in real-world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z)
- MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Its MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z)
- Learning Concept-Based Causal Transition and Symbolic Reasoning for Visual Planning [36.131648635051334]
Visual planning simulates how humans make decisions to achieve desired goals.
We propose an interpretable and generalizable visual planning framework.
We show that our framework can generalize to unseen task trajectories, unseen object categories, and real-world data.
arXiv Detail & Related papers (2023-10-05T05:41:21Z)
- Tackling Vision Language Tasks Through Learning Inner Monologues [10.795616787372625]
We propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems.
IMMO simulates the inner monologue process, a cognitive process in which an individual engages in silent verbal communication with themselves.
The results suggest IMMO can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models.
arXiv Detail & Related papers (2023-08-19T10:10:49Z)
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
- Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics [29.393661499333284]
We propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic.
We then use these discretized visual semantics as self-supervised ground truths for our Masked Image Modeling objective (a minimal sketch of this idea appears after this list).
Experiments validate the effectiveness of our approach across common vision-language benchmarks.
arXiv Detail & Related papers (2022-07-31T17:36:09Z)
- Context-Aware Language Modeling for Goal-Oriented Dialogue Systems [84.65707332816353]
We formulate goal-oriented dialogue as a partially observed Markov decision process.
We derive a simple and effective method to finetune language models in a goal-aware way.
We evaluate our method on a practical flight-booking task using AirDialogue.
arXiv Detail & Related papers (2022-04-18T17:23:11Z)
- Vision and Language: from Visual Perception to Content Creation [100.36776435627962]
"vision to language" is probably one of the most popular topics in the past five years.
This paper reviews the recent advances along these two dimensions: "vision to language" and "language to vision"
arXiv Detail & Related papers (2019-12-26T14:07:20Z)
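Among the related papers above, the codebook entry describes the most concrete algorithmic step: discretize visual tokens against a learned codebook and use the resulting indices as masked-image-modeling targets. The sketch below illustrates only that step under loose assumptions (random patch embeddings and codebook, invented sizes, no training loop); it is not that paper's implementation.

```python
import numpy as np

# Hypothetical sizes; the actual model and codebook dimensions are not given in the summary.
NUM_PATCHES, EMB_DIM, CODEBOOK_SIZE = 16, 8, 32
rng = np.random.default_rng(0)

patch_embeddings = rng.normal(size=(NUM_PATCHES, EMB_DIM))  # visual tokens from an encoder (random here)
codebook = rng.normal(size=(CODEBOOK_SIZE, EMB_DIM))        # jointly learned codewords (random here)

# "Discretize" each visual token: index of its nearest codeword under L2 distance.
dists = np.linalg.norm(patch_embeddings[:, None, :] - codebook[None, :, :], axis=-1)
token_ids = dists.argmin(axis=1)                            # shape: (NUM_PATCHES,)

# Masked Image Modeling targets: for masked patches, predict their codeword index.
mask = rng.random(NUM_PATCHES) < 0.4                        # randomly mask roughly 40% of patches
mim_targets = token_ids[mask]                               # self-supervised ground truth

print("discrete visual tokens:", token_ids)
print("masked positions:", np.flatnonzero(mask), "-> targets:", mim_targets)
```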
This list is automatically generated from the titles and abstracts of the papers on this site.