A modular vision language navigation and manipulation framework for long
horizon compositional tasks in indoor environment
- URL: http://arxiv.org/abs/2101.07891v1
- Date: Tue, 19 Jan 2021 23:05:43 GMT
- Title: A modular vision language navigation and manipulation framework for long
horizon compositional tasks in indoor environment
- Authors: Homagni Saha, Fateme Fotouhif, Qisai Liu, Soumik Sarkar
- Abstract summary: MoViLan is a new framework for execution of visually grounded natural language instructions.
We propose a modular approach to deal with the combined navigation and object interaction problem.
Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments.
- Score: 9.159670926457975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we propose a new framework - MoViLan (Modular Vision and
Language) for execution of visually grounded natural language instructions for
day-to-day indoor household tasks. While several data-driven, end-to-end
learning frameworks have been proposed for targeted navigation tasks based on
the vision and language modalities, performance on recent benchmark data sets
has revealed a gap in comprehensive techniques for long-horizon,
compositional tasks (involving manipulation and navigation) with diverse object
categories, realistic instructions and visual scenarios with non-reversible
state changes. We propose a modular approach to deal with the combined
navigation and object interaction problem without the need for strictly aligned
vision and language training data (e.g., in the form of expert demonstrated
trajectories). Such an approach is a significant departure from the traditional
end-to-end techniques in this space and allows for a more tractable training
process with separate vision and language data sets. Specifically, we propose a
novel geometry-aware mapping technique for cluttered indoor environments, and a
language understanding model generalized for household instruction following.
We demonstrate a significant increase in success rates for long-horizon,
compositional tasks over the baseline on the recently released benchmark data
set, ALFRED.
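The abstract describes the approach only at a high level. As a rough illustration of the kind of modular decomposition it argues for (a language-understanding module producing subgoals, a geometry-aware mapper grounding them, and low-level navigation/manipulation controllers executing them), here is a minimal Python sketch. All class and function names are hypothetical placeholders, not the MoViLan code or API.

```python
# Minimal sketch of a modular vision-language pipeline in the spirit of the
# abstract above. All names are hypothetical placeholders, not the MoViLan API;
# each module would be trained or configured independently.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Subgoal:
    action: str          # e.g. "navigate", "pickup", "heat", "place"
    target_object: str   # e.g. "mug", "microwave"

class LanguageUnderstanding:
    """Maps a household instruction to an ordered list of subgoals."""
    def parse(self, instruction: str) -> List[Subgoal]:
        # A trained instruction-following model would go here; this stub
        # only illustrates the interface.
        raise NotImplementedError

class GeometryAwareMapper:
    """Builds an occupancy/semantic map of a cluttered indoor scene from RGB-D frames."""
    def update(self, rgb, depth, pose) -> None: ...
    def locate(self, object_name: str) -> Optional[Tuple[float, float]]:
        """Return an estimated map coordinate for the named object, if it has been seen."""
        ...

class Controller:
    """Low-level navigation and manipulation primitives exposed by the simulator or robot."""
    def navigate_to(self, coordinate) -> None: ...
    def interact(self, action: str, object_name: str) -> None: ...

def run_episode(instruction: str, nlu: LanguageUnderstanding,
                mapper: GeometryAwareMapper, ctrl: Controller) -> None:
    # Decompose the instruction once, then ground and execute each subgoal in order.
    for subgoal in nlu.parse(instruction):
        target = mapper.locate(subgoal.target_object)
        if target is not None:
            ctrl.navigate_to(target)
        if subgoal.action != "navigate":
            ctrl.interact(subgoal.action, subgoal.target_object)
```

The point of the sketch is the decoupling the abstract emphasizes: because the interfaces between modules are explicit, the language module and the mapper can be trained on separate language-only and vision-only data rather than on aligned expert trajectories.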
Related papers
- A Navigation Framework Utilizing Vision-Language Models [0.0]
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI.
Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding.
We propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning.
arXiv Detail & Related papers (2025-06-11T20:51:58Z)
- UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface [25.898592418636603]
UFO is a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface.
UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model.
Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies.
arXiv Detail & Related papers (2025-03-03T09:27:24Z)
- Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models [19.054780489639793]
This paper introduces Progressive multi-granular Vision-Language alignments (PromViL).
Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts.
By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning.
arXiv Detail & Related papers (2024-12-11T06:21:33Z)
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
arXiv Detail & Related papers (2023-05-26T13:38:33Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution [0.0]
We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description.
Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains.
We introduce a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations.
arXiv Detail & Related papers (2022-05-24T14:12:32Z)
- Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
arXiv Detail & Related papers (2021-01-09T21:49:41Z)
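The LangNav entry above describes the most directly illustrable idea in this list: replace raw pixels with text by captioning each egocentric view and letting a language model choose the next action. The sketch below shows that idea in minimal form; the choice of captioning model, the panorama format, and the action prompt are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch of "language as a perceptual representation" for navigation,
# in the spirit of the LangNav entry above. The captioning model and prompt format
# are assumptions for illustration, not the paper's implementation.

from transformers import pipeline

# Off-the-shelf captioner that turns each egocentric view into text.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def describe_panorama(view_images) -> str:
    """Convert a list of (heading, PIL image) pairs into one text observation."""
    captions = []
    for heading, image in view_images:  # e.g. [("ahead", img0), ("left", img1), ...]
        caption = captioner(image)[0]["generated_text"]
        captions.append(f"To your {heading}: {caption}")
    return " ".join(captions)

def choose_action(instruction: str, observation_text: str, llm) -> str:
    """Ask a language-model policy for the next discrete action, given text-only inputs."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Current observation: {observation_text}\n"
        "Next action (one of: move_forward, turn_left, turn_right, stop):"
    )
    return llm(prompt)  # `llm` is any text-completion callable; hypothetical here.
```

The appeal of this representation is that the policy never sees pixels, only text, so any instruction-tuned text model can serve as the `llm` callable in the sketch.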