Image Manipulation via Multi-Hop Instructions -- A New Dataset and
Weakly-Supervised Neuro-Symbolic Approach
- URL: http://arxiv.org/abs/2305.14410v2
- Date: Tue, 24 Oct 2023 20:44:09 GMT
- Title: Image Manipulation via Multi-Hop Instructions -- A New Dataset and
Weakly-Supervised Neuro-Symbolic Approach
- Authors: Harman Singh, Poorva Garg, Mohit Gupta, Kevin Shah, Ashish Goswami,
Satyam Modi, Arnab Kumar Mondal, Dinesh Khandelwal, Dinesh Garg, Parag Singla
- Abstract summary: We are interested in image manipulation via natural language text.
Our system, referred to as NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes.
- Score: 31.380435286215757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are interested in image manipulation via natural language text -- a task
that is useful for multiple AI applications but requires complex reasoning over
multi-modal spaces. We extend the recently proposed Neuro Symbolic Concept Learning
(NSCL) framework, which has been quite effective for the task of Visual Question
Answering (VQA), to the task of image manipulation. Our system, referred to as
NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes and
requires only weak supervision in the form of annotated VQA data. NeuroSIM
parses an instruction into a symbolic program, based on a Domain Specific
Language (DSL) comprising object attributes and manipulation operations, that
guides its execution. We create a new dataset for the task, and extensive
experiments demonstrate that NeuroSIM is highly competitive with, or beats, SOTA
baselines that make use of supervised data for manipulation.
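To make the parse-then-execute idea concrete, here is a minimal toy sketch of how a multi-hop instruction such as "Change the color of the small cube behind the red sphere to blue" could be decomposed into attribute filters, one relational hop, and a manipulation operation. The operator names (scene, filter_attr, relate, change_attr) and the ground-truth attribute tables are illustrative assumptions, not NeuroSIM's actual DSL or executor.

```python
# Toy sketch only: a hand-written symbolic program for the instruction
# "Change the color of the small cube behind the red sphere to blue".
# Operator names and the ground-truth attribute tables are illustrative
# assumptions, not NeuroSIM's actual DSL or executor.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Obj:
    attrs: Dict[str, str]                                          # e.g. {"shape": "cube", "color": "green", "size": "small"}
    relations: Dict[str, List[int]] = field(default_factory=dict)  # e.g. {"behind": [1]}: object 1 is behind this object


def scene(objs: List[Obj]) -> List[int]:
    """Start from all object indices in the scene."""
    return list(range(len(objs)))


def filter_attr(objs: List[Obj], ids: List[int], key: str, value: str) -> List[int]:
    """Keep objects whose attribute `key` equals `value`."""
    return [i for i in ids if objs[i].attrs.get(key) == value]


def relate(objs: List[Obj], ids: List[int], anchor: int, relation: str) -> List[int]:
    """Keep objects standing in `relation` to the anchor object (one reasoning hop)."""
    return [i for i in ids if i in objs[anchor].relations.get(relation, [])]


def change_attr(objs: List[Obj], target: int, key: str, value: str) -> List[Obj]:
    """Manipulation operation: edit the symbolic scene; a generator would then render it."""
    objs[target].attrs[key] = value
    return objs


def run_program(objs: List[Obj]) -> List[Obj]:
    red_sphere = filter_attr(objs, filter_attr(objs, scene(objs), "shape", "sphere"), "color", "red")[0]
    small_cubes = filter_attr(objs, filter_attr(objs, scene(objs), "shape", "cube"), "size", "small")
    target = relate(objs, small_cubes, red_sphere, "behind")[0]
    return change_attr(objs, target, "color", "blue")


if __name__ == "__main__":
    demo = [
        Obj({"shape": "sphere", "color": "red", "size": "large"}, {"behind": [1]}),
        Obj({"shape": "cube", "color": "green", "size": "small"}),
    ]
    print(run_program(demo)[1].attrs)  # {'shape': 'cube', 'color': 'blue', 'size': 'small'}
```

In the system described by the abstract, the program would instead be produced by a parser and executed over scene representations learned with only VQA-style weak supervision, with the edited scene presumably rendered back into an image by a generation module.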
Related papers
- VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis [9.937830036053871]
VoxelPrompt tackles diverse radiological tasks through joint modeling of natural language, image volumes, and analytical metrics.
We show that VoxelPrompt can delineate hundreds of anatomical and pathological features, measure many complex morphological properties, and perform open-language analysis of lesion characteristics.
arXiv Detail & Related papers (2024-10-10T22:11:43Z)
- A Survey on Vision-Language-Action Models for Embodied AI [71.16123093739932]
Vision-language-action models (VLAs) have become a foundational element in robot learning.
Various methods have been proposed to enhance traits such as versatility, dexterity, and generalizability.
VLAs serve as high-level task planners capable of decomposing long-horizon tasks into executable subtasks.
arXiv Detail & Related papers (2024-05-23T01:43:54Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- The Role of Foundation Models in Neuro-Symbolic Learning and Reasoning [54.56905063752427]
Neuro-Symbolic AI (NeSy) holds promise to ensure the safe deployment of AI systems.
Existing pipelines that train the neural and symbolic components sequentially require extensive labelling.
A new architecture, NeSyGPT, fine-tunes a vision-language foundation model to extract symbolic features from raw data.
arXiv Detail & Related papers (2024-02-02T20:33:14Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- SNeL: A Structured Neuro-Symbolic Language for Entity-Based Multimodal Scene Understanding [0.0]
We introduce SNeL (Structured Neuro-symbolic Language), a versatile query language designed to facilitate nuanced interactions with neural networks processing multimodal data.
SNeL's expressive interface enables the construction of intricate queries, supporting logical and arithmetic operators, comparators, nesting, and more.
Our evaluations demonstrate SNeL's potential to reshape the way we interact with complex neural networks.
arXiv Detail & Related papers (2023-06-09T17:01:51Z)
- Visual Programming: Compositional visual reasoning without training [24.729624386851388]
VISPROG is a neuro-symbolic approach to solving complex and compositional visual tasks.
It uses the in-context learning ability of large language models to generate python-like modular programs (a hypothetical sketch of this pattern appears after this list).
arXiv Detail & Related papers (2022-11-18T18:50:09Z)
- Learning Neuro-Symbolic Skills for Bilevel Planning [63.388694268198655]
Decision-making is challenging in robotics environments with continuous object-centric states, continuous actions, long horizons, and sparse feedback.
Hierarchical approaches, such as task and motion planning (TAMP), address these challenges by decomposing decision-making into two or more levels of abstraction.
Our main contribution is a method for learning parameterized policies in combination with operators and samplers.
arXiv Detail & Related papers (2022-06-21T19:01:19Z)
- Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose Unified multimodal pre-training for both Vision-Language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)
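The Visual Programming entry above refers to python-like modular programs generated in context by an LLM; the sketch below (referenced from that entry) illustrates the general "generate a program, then execute it line by line against named modules" pattern. The SEG/REPLACE modules, the program text, and the naive argument parsing are hypothetical assumptions made for illustration, not VISPROG's actual module set or syntax.

```python
# Hypothetical sketch of executing an LLM-generated, python-like modular
# program line by line. Module names (SEG, REPLACE), the program text, and the
# naive argument parser are illustrative assumptions, not VISPROG's API.
import re
from typing import Any, Dict


def segment(image: str, label: str) -> str:
    """Stand-in for a segmentation module."""
    return f"mask_of({label})"


def replace_region(image: str, mask: str, prompt: str) -> str:
    """Stand-in for an inpainting / region-replacement module."""
    return f"edit({image}, {mask}, '{prompt}')"


MODULES = {"SEG": segment, "REPLACE": replace_region}

# A program of this shape would be produced by the LLM from an instruction
# such as "replace the dog with a corgi", guided by in-context examples.
PROGRAM = """\
MASK0=SEG(image=IMAGE,label='dog')
IMAGE0=REPLACE(image=IMAGE,mask=MASK0,prompt='a corgi')
"""


def execute(program: str, image: str) -> Dict[str, Any]:
    env: Dict[str, Any] = {"IMAGE": image}
    for line in program.strip().splitlines():
        out, call = line.split("=", 1)
        name, arg_str = re.match(r"(\w+)\((.*)\)", call).groups()
        kwargs = {}
        for arg in arg_str.split(","):  # naive: would break on commas inside quoted strings
            key, val = arg.split("=", 1)
            val = val.strip()
            # Resolve variable names from the environment, else treat as a literal.
            kwargs[key.strip()] = env.get(val, val.strip("'"))
        env[out.strip()] = MODULES[name](**kwargs)
    return env


if __name__ == "__main__":
    print(execute(PROGRAM, "input.png")["IMAGE0"])
```

The appeal of this pattern, as the summary notes, is that composition happens in the generated program itself, so no task-specific training is needed beyond the pretrained modules and a few in-context examples.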
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.