Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
- URL: http://arxiv.org/abs/2407.07035v2
- Date: Sun, 29 Dec 2024 23:16:37 GMT
- Title: Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
- Authors: Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, Parisa Kordjamshidi
- Abstract summary: Vision-and-Language Navigation (VLN) has gained increasing attention over recent years.
Foundation models have shaped the challenges and proposed methods for VLN research.
- Score: 79.04590934264235
- Abstract: Vision-and-Language Navigation (VLN) has gained increasing attention over recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have shaped the challenges and proposed methods in VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes current methods and future opportunities for leveraging foundation models to address VLN challenges. We hope our in-depth discussion provides valuable resources and insights: on the one hand, to milestone the progress and explore opportunities and potential roles for foundation models in this field, and on the other, to organize the different challenges and solutions in VLN for foundation model researchers.
Related papers
- Spatio-Temporal Foundation Models: Vision, Challenges, and Opportunities [48.45951497996322]
Foundation models have revolutionized artificial intelligence, setting new benchmarks in performance and enabling transformative capabilities across a wide range of vision and language tasks.
In this paper, we articulate a vision for the future of spatio-temporal foundation models (STFMs), outlining the essential characteristics and generalization capabilities necessary for broad applicability.
We explore potential opportunities and research directions toward effective and broadly applicable STFMs.
arXiv Detail & Related papers (2025-01-15T08:52:28Z) - How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond [73.5546464126465]
We present a thorough review of human-model cooperation, exploring its principles, formalizations, and open challenges.
We introduce a new taxonomy that provides a unified perspective to summarize existing approaches.
Also, we discuss potential frontier areas and their corresponding challenges.
arXiv Detail & Related papers (2025-01-10T05:15:14Z) - How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey [59.23394353614928]
In recent years, the rise of pre-trained models has driven research on vision-language tasks.
Building on the powerful capabilities of pre-trained models, new paradigms have emerged to address the classic challenges in this area.
arXiv Detail & Related papers (2024-12-11T07:29:04Z) - A Survey of Reasoning with Foundation Models [235.7288855108172]
Reasoning plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation.
We introduce seminal foundation models proposed or adaptable for reasoning.
We then discuss potential future directions concerning the emergence of reasoning abilities within foundation models.
arXiv Detail & Related papers (2023-12-17T15:16:13Z) - Foundation Models Meet Visualizations: Challenges and Opportunities [23.01218856618978]
This paper divides the intersection into two areas: visualizations for foundation models (VIS4FM) and foundation models for visualizations (FM4VIS).
In VIS4FM, we explore the primary role of visualizations in understanding, refining, and evaluating these intricate models.
In FM4VIS, we highlight how foundation models can be utilized to advance the visualization field itself.
arXiv Detail & Related papers (2023-10-09T14:57:05Z) - Survey of Social Bias in Vision-Language Models [65.44579542312489]
This survey aims to provide researchers with high-level insight into the similarities and differences of social bias studies in pre-trained models across NLP, CV, and VL.
The findings and recommendations presented here can benefit the ML community, fostering the development of fairer and less biased AI models.
arXiv Detail & Related papers (2023-09-24T15:34:56Z) - Towards Reasoning in Large Language Models: A Survey [11.35055307348939]
It is not yet clear to what extent large language models (LLMs) are capable of reasoning.
This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs.
arXiv Detail & Related papers (2022-12-20T16:29:03Z) - Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions [23.389491536958772]
Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic toward building intelligent agents that follow natural-language instructions in real-world environments.
VLN receives increasing attention from natural language processing, computer vision, robotics, and machine learning communities.
This paper serves as a thorough reference for the VLN research community.
arXiv Detail & Related papers (2022-03-22T16:58:10Z)