Related papers: Core Challenges in Embodied Vision-Language Planning

URL: http://arxiv.org/abs/2304.02738v1
Date: Wed, 5 Apr 2023 20:37:13 GMT
Title: Core Challenges in Embodied Vision-Language Planning
Authors: Jonathan Francis, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, Jean Oh
Abstract summary: Embodied Vision-Language Planning tasks leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an analysis and comparison of the current and new algorithmic approaches. We advocate for task construction that enables model generalisability and furthers real-world deployment.
Score: 11.896110519868545
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in the areas of Multimodal Machine Learning and Artificial Intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Robotics. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the current and new algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalisability and furthers real-world deployment.

Related papers

Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems [2.1797343876622097]
This survey systematically reviews the emerging field of LLM-augmented image segmentation. We highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. We identify key challenges, including real-time performance and safety-critical reliability.
arXiv Detail & Related papers (2025-06-17T01:20:50Z)
Vision Generalist Model: A Survey [87.49797517847132]
We provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. We take a brief excursion into related domains, shedding light on their interconnections and potential synergies.
arXiv Detail & Related papers (2025-06-11T17:23:41Z)
GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering [23.459190671283487]
In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. We propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs). We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration.
arXiv Detail & Related papers (2024-12-19T03:04:34Z)
VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding. We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [129.08019405056262]
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI). MLMs and WMs have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI.
arXiv Detail & Related papers (2024-07-09T14:14:47Z)
A Survey on Vision-Language-Action Models for Embodied AI [71.16123093739932]
Embodied AI is widely recognized as a key element of artificial general intelligence. A new category of multimodal models has emerged to address language-conditioned robotic tasks in embodied AI. We present the first survey on vision-language-action models for embodied AI.
arXiv Detail & Related papers (2024-05-23T01:43:54Z)
A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond [84.95530356322621]
This survey presents a systematic review of the advancements in code intelligence. It covers over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence.
arXiv Detail & Related papers (2024-03-21T08:54:56Z)
Vision-Language Navigation with Embodied Intelligence: A Survey [19.049590467248255]
Vision-language navigation (VLN) is a critical research path to achieve embodied intelligence. VLN integrates artificial intelligence, natural language processing, computer vision, and robotics. This survey systematically reviews the research progress of VLN and details the research direction of VLN with embodied intelligence.
arXiv Detail & Related papers (2024-02-22T05:45:17Z)
A Survey on Robotics with Foundation Models: toward Embodied AI [30.999414445286757]
Recent advances in computer vision, natural language processing, and multi-modality learning have shown that the foundation models have superhuman capabilities for specific tasks. This survey aims to provide a comprehensive and up-to-date overview of foundation models in robotics, focusing on autonomous manipulation and encompassing high-level planning and low-level control.
arXiv Detail & Related papers (2024-02-04T07:55:01Z)
Interactive Natural Language Processing [67.87925315773924]
Interactive Natural Language Processing (iNLP) has emerged as a novel paradigm within the field of NLP. This paper offers a comprehensive survey of iNLP, starting by proposing a unified definition and framework of the concept.
arXiv Detail & Related papers (2023-05-22T17:18:29Z)
AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges [60.56413461109281]
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions.
arXiv Detail & Related papers (2023-04-10T15:38:12Z)
VQA and Visual Reasoning: An Overview of Recent Datasets, Methods and Challenges [1.565870461096057]
The integration of vision and language has attracted considerable attention. The tasks have been created in such a way that they properly exemplify the concepts of deep learning.
arXiv Detail & Related papers (2022-12-26T20:56:01Z)
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [32.142076223602906]
This paper presents a comprehensive survey of vision-language intelligence from the perspective of time. We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
arXiv Detail & Related papers (2022-03-03T18:54:59Z)
Core Challenges in Embodied Vision-Language Planning [9.190245973578698]
We discuss Embodied Vision-Language Planning tasks, a family of prominent embodied navigation and manipulation problems. We propose a taxonomy to unify these tasks and provide an analysis and comparison of the new and current algorithmic approaches. We advocate for task construction that enables model generalizability and furthers real-world deployment.
arXiv Detail & Related papers (2021-06-26T05:18:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.