Towards AGI in Computer Vision: Lessons Learned from GPT and Large
Language Models
- URL: http://arxiv.org/abs/2306.08641v1
- Date: Wed, 14 Jun 2023 17:15:01 GMT
- Title: Towards AGI in Computer Vision: Lessons Learned from GPT and Large
Language Models
- Authors: Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu,
Jianlong Chang, Qi Tian
- Abstract summary: Chat systems powered by large language models (LLMs) have emerged and rapidly become a promising direction for achieving artificial general intelligence (AGI).
But the path towards AGI in computer vision (CV) remains unclear.
We imagine a pipeline that puts a CV algorithm in world-scale, interactable environments, pre-trains it to predict future frames with respect to its action, and then fine-tunes it with instructions to accomplish various tasks.
- Score: 98.72986679502871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The AI community has been pursuing algorithms known as artificial general
intelligence (AGI) that apply to any kind of real-world problem. Recently, chat
systems powered by large language models (LLMs) have emerged and rapidly become a
promising direction to achieve AGI in natural language processing (NLP), but
the path towards AGI in computer vision (CV) remains unclear. One may attribute the
dilemma to the fact that visual signals are more complex than language signals,
yet we are interested in finding concrete reasons, as well as absorbing
experiences from GPT and LLMs to solve the problem. In this paper, we start
with a conceptual definition of AGI and briefly review how NLP solves a wide
range of tasks via a chat system. The analysis suggests that unification is
the next important goal of CV. Yet, despite various efforts in this direction,
CV is still far from a system like GPT that naturally integrates all tasks. We
point out that the essential weakness of CV lies in the lack of a paradigm for
learning from environments, whereas NLP has accomplished this in the text world. We then
imagine a pipeline that puts a CV algorithm (i.e., an agent) in world-scale,
interactable environments, pre-trains it to predict future frames with respect
to its action, and then fine-tunes it with instructions to accomplish various
tasks. We expect substantial research and engineering efforts to push the idea
forward and scale it up, for which we share our perspectives on future research
directions.
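The proposed pipeline lends itself to a two-stage recipe analogous to LLM training: self-supervised pre-training via action-conditioned future-frame prediction in an interactive environment, followed by instruction fine-tuning. Below is a minimal sketch of stage one in PyTorch; the environment interface, model architecture, and all names are illustrative assumptions, not code from the paper.

```python
# Stage 1 sketch (illustrative, not from the paper): pre-train an agent to
# predict the next frame conditioned on its own action, the visual analogue
# of next-token prediction in LLMs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConditionedWorldModel(nn.Module):
    """Predicts the next frame from the current frame and the agent's action."""
    def __init__(self, action_dim: int = 8):
        super().__init__()
        self.encoder = nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1)
        self.action_proj = nn.Linear(action_dim, 32)
        self.decoder = nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        h = self.encoder(frame)                         # (B, 32, H/2, W/2)
        a = self.action_proj(action)[:, :, None, None]  # inject the action code
        return self.decoder(h + a)                      # predicted next frame

def pretrain_step(model, optimizer, frame, action, next_frame) -> float:
    """One self-supervised update: penalize the error of the predicted frame."""
    loss = F.mse_loss(model(frame, action), next_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors; in the envisioned pipeline, `next_frame`
# would instead come from stepping an interactive environment with `action`.
model = ActionConditionedWorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame, next_frame = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(pretrain_step(model, opt, frame, torch.rand(2, 8), next_frame))
```

Stage two, by analogy with instruction tuning in NLP, would reuse the pre-trained encoder and optimize it on observation-instruction-behavior triples.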
Related papers
- ChatGPT Alternative Solutions: Large Language Models Survey [0.0]
Large Language Models (LLMs) have ignited a surge in research contributions within this domain.
Recent years have witnessed a dynamic synergy between academia and industry, propelling the field of LLM research to new heights.
This survey furnishes a well-rounded perspective on the current state of generative AI, shedding light on opportunities for further exploration, enhancement, and innovation.
arXiv Detail & Related papers (2024-03-21T15:16:50Z) - MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
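To make the "compact point-based representation of affordance" concrete, the sketch below shows one plausible shape for such a structure: a handful of 2D keypoints a VLM can mark on the observed image, which a controller then lifts into a 3D motion. The field names and the depth-based lifting are assumptions for illustration, not MOKA's actual code.

```python
# Hypothetical point-based affordance (illustrative, not MOKA's code): a few
# 2D keypoints marked by a VLM on the observed image, consumed by the robot.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point2D = Tuple[float, float]        # pixel coordinates in the observed image
Point3D = Tuple[float, float, float]

@dataclass
class PointAffordance:
    grasp_point: Point2D                # where the gripper should grasp
    target_point: Point2D               # where the motion should end
    waypoint: Optional[Point2D] = None  # optional intermediate free-space point

def to_waypoints(aff: PointAffordance,
                 lift: Callable[[Point2D], Point3D]) -> List[Point3D]:
    """Lift the marked 2D points into 3D (e.g., via a depth image) so a
    low-level controller can execute them as a motion."""
    path = [aff.grasp_point]
    if aff.waypoint is not None:
        path.append(aff.waypoint)
    path.append(aff.target_point)
    return [lift(p) for p in path]
```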
arXiv Detail & Related papers (2024-03-05T18:08:45Z) - Look Before You Leap: Unveiling the Power of GPT-4V in Robotic
Vision-Language Planning [32.045840007623276]
We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
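ViLa's key move, as summarized here, is to give the planner direct access to the image rather than a text-only scene description. A hedged sketch of that loop, with a hypothetical VLM interface standing in for GPT-4V:

```python
# Illustrative sketch (hypothetical `vlm` interface, not the authors' code):
# feed the raw observation and the goal to a vision-language model and parse
# its reply into a step-by-step plan.
def plan_with_vlm(vlm, image, goal: str) -> list[str]:
    prompt = (
        f"You are a robot arm. Goal: {goal}\n"
        "Look at the image and list the next actions, one per line."
    )
    reply = vlm.generate(image=image, prompt=prompt)  # assumed VLM call
    return [line.strip() for line in reply.splitlines() if line.strip()]
```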
arXiv Detail & Related papers (2023-11-29T17:46:25Z) - Brain in a Vat: On Missing Pieces Towards Artificial General
Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z) - UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z) - OpenAGI: When LLM Meets Domain Experts [51.86179657467822]
Human Intelligence (HI) excels at combining basic skills to solve complex tasks.
This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents.
We introduce OpenAGI, an open-source platform designed for solving multi-step, real-world tasks.
arXiv Detail & Related papers (2023-04-10T03:55:35Z) - Core Challenges in Embodied Vision-Language Planning [11.896110519868545]
Embodied Vision-Language Planning tasks leverage computer vision and natural language for interaction in physical environments.
We propose a taxonomy to unify these tasks and provide an analysis and comparison of the current and new algorithmic approaches.
We advocate for task construction that enables model generalisability and furthers real-world deployment.
arXiv Detail & Related papers (2023-04-05T20:37:13Z) - Vision-Language Intelligence: Tasks, Representation Learning, and Large
Models [32.142076223602906]
This paper presents a comprehensive survey of vision-language intelligence from the perspective of time.
We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
arXiv Detail & Related papers (2022-03-03T18:54:59Z) - Core Challenges in Embodied Vision-Language Planning [9.190245973578698]
We discuss Embodied Vision-Language Planning tasks, a family of prominent embodied navigation and manipulation problems.
We propose a taxonomy to unify these tasks and provide an analysis and comparison of the new and current algorithmic approaches.
We advocate for task construction that enables model generalizability and furthers real-world deployment.
arXiv Detail & Related papers (2021-06-26T05:18:58Z) - Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike
Common Sense [142.53911271465344]
We argue that the next generation of AI must embrace "dark" humanlike common sense for solving novel tasks.
We identify functionality, physics, intent, causality, and utility (FPICU) as the five core domains of cognitive AI with humanlike common sense.
arXiv Detail & Related papers (2020-04-20T04:07:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.