Core Challenges in Embodied Vision-Language Planning
- URL: http://arxiv.org/abs/2106.13948v1
- Date: Sat, 26 Jun 2021 05:18:58 GMT
- Title: Core Challenges in Embodied Vision-Language Planning
- Authors: Jonathan Francis, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, Jean Oh
- Abstract summary: We discuss Embodied Vision-Language Planning tasks, a family of prominent embodied navigation and manipulation problems.
We propose a taxonomy to unify these tasks and provide an analysis and comparison of the new and current algorithmic approaches.
We advocate for task construction that enables model generalizability and furthers real-world deployment.
- Score: 9.190245973578698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in the areas of multimodal machine learning and artificial
intelligence (AI) have led to the development of challenging tasks at the
intersection of Computer Vision, Natural Language Processing, and Embodied AI.
Whereas many approaches and previous survey pursuits have characterised one or
two of these dimensions, there has not been a holistic analysis at the center
of all three. Moreover, even when combinations of these topics are considered,
more focus is placed on describing, e.g., current architectural methods, as
opposed to also illustrating high-level challenges and opportunities for the
field. In this survey paper, we discuss Embodied Vision-Language Planning
(EVLP) tasks, a family of prominent embodied navigation and manipulation
problems that jointly use computer vision and natural language. We propose a
taxonomy to unify these tasks and provide an in-depth analysis and comparison
of the new and current algorithmic approaches, metrics, simulated environments,
as well as the datasets used for EVLP tasks. Finally, we present the core
challenges that we believe new EVLP works should seek to address, and we
advocate for task construction that enables model generalizability and furthers
real-world deployment.
Related papers
- VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of the data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- Recent Advances in Generative AI and Large Language Models: Current Status, Challenges, and Perspectives [10.16399860867284]
The emergence of Generative Artificial Intelligence (AI) and Large Language Models (LLMs) has marked a new era of Natural Language Processing (NLP).
This paper explores the current state of these cutting-edge technologies, demonstrating their remarkable advancements and wide-ranging applications.
arXiv Detail & Related papers (2024-07-20T18:48:35Z)
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [129.08019405056262]
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI).
Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities.
In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI.
arXiv Detail & Related papers (2024-07-09T14:14:47Z)
- A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond [84.95530356322621]
This survey presents a systematic review of the advancements in code intelligence.
It covers over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works.
Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence.
arXiv Detail & Related papers (2024-03-21T08:54:56Z)
- A Survey on Robotics with Foundation Models: toward Embodied AI [30.999414445286757]
Recent advances in computer vision, natural language processing, and multi-modality learning have shown that foundation models have superhuman capabilities for specific tasks.
This survey aims to provide a comprehensive and up-to-date overview of foundation models in robotics, focusing on autonomous manipulation and encompassing high-level planning and low-level control.
arXiv Detail & Related papers (2024-02-04T07:55:01Z)
- Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models [98.72986679502871]
Chat systems powered by large language models (LLMs) have emerged and rapidly become a promising direction for achieving artificial general intelligence (AGI).
But the path towards AGI in computer vision (CV) remains unclear.
We imagine a pipeline that puts a CV algorithm in world-scale, interactable environments, pre-trains it to predict future frames with respect to its action, and then fine-tunes it with instruction to accomplish various tasks.
arXiv Detail & Related papers (2023-06-14T17:15:01Z)
- AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges [60.56413461109281]
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes.
We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful.
We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis, and automated actions.
arXiv Detail & Related papers (2023-04-10T15:38:12Z)
- Core Challenges in Embodied Vision-Language Planning [11.896110519868545]
Embodied Vision-Language Planning tasks leverage computer vision and natural language for interaction in physical environments.
We propose a taxonomy to unify these tasks and provide an analysis and comparison of the current and new algorithmic approaches.
We advocate for task construction that enables model generalisability and furthers real-world deployment.
arXiv Detail & Related papers (2023-04-05T20:37:13Z)
- VQA and Visual Reasoning: An Overview of Recent Datasets, Methods and Challenges [1.565870461096057]
The integration of vision and language has consequently attracted a great deal of attention.
These tasks are designed so that they properly exemplify the concepts of deep learning.
arXiv Detail & Related papers (2022-12-26T20:56:01Z)
- Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [32.142076223602906]
This paper presents a comprehensive survey of vision-language intelligence from the perspective of time.
We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
arXiv Detail & Related papers (2022-03-03T18:54:59Z)