Thinker: A vision-language foundation model for embodied intelligence
- URL: http://arxiv.org/abs/2601.21199v1
- Date: Thu, 29 Jan 2026 02:52:08 GMT
- Title: Thinker: A vision-language foundation model for embodied intelligence
- Authors: Baiyu Pan, Daqin Luo, Junpeng Yang, Jiyuan Wang, Yixuan Zhang, Hailin Shi, Jichao Jiao,
- Abstract summary: We propose Thinker, a large vision-language foundation model for embodied intelligence. We construct a large-scale dataset tailored for robotic perception and reasoning. We introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension.
- Score: 9.661713829767605
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information near the end of videos during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. First, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Second, we introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.
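The second contribution described in the abstract, jointly feeding key frames and the full (subsampled) video sequence to the model, can be pictured with a minimal sketch. Everything below is an assumption for illustration only: the paper does not specify its key-frame selector, frame budgets, or any function names, so a simple frame-difference heuristic and uniform subsampling stand in for them.

```python
import numpy as np

def select_key_frames(frames: np.ndarray, k: int = 8) -> np.ndarray:
    """Keep frame 0 plus the k-1 frames that change most from their predecessor.

    frames: (T, H, W, C) uint8 array. The frame-difference heuristic is an
    assumed stand-in; the paper does not describe its key-frame selector.
    """
    if k >= len(frames):
        return frames
    change = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    top = 1 + np.argsort(change)[::-1][: k - 1]   # frames with the largest jumps
    idx = np.sort(np.concatenate(([0], top)))     # restore temporal order
    return frames[idx]

def uniform_subsample(frames: np.ndarray, n: int = 32) -> np.ndarray:
    """Uniformly subsample the full sequence to a fixed frame budget."""
    idx = np.linspace(0, len(frames) - 1, num=n).round().astype(int)
    return frames[idx]

def build_joint_input(frames: np.ndarray, k: int = 8, n: int = 32) -> dict:
    """Combine salient key frames with the subsampled full sequence,
    so the model sees both key moments and the global temporal context."""
    return {
        "key_frames": select_key_frames(frames, k),
        "full_video": uniform_subsample(frames, n),
    }

if __name__ == "__main__":
    # Synthetic 300-frame clip standing in for an ego-view video.
    video = np.random.randint(0, 256, size=(300, 224, 224, 3), dtype=np.uint8)
    joint = build_joint_input(video)
    print(joint["key_frames"].shape, joint["full_video"].shape)
```

The intuition, per the abstract, is that dedicated key frames preserve salient details (including those near the end of the video) that aggressive uniform subsampling can dilute, while the full-sequence branch retains the global temporal context.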
Related papers
- MentisOculi: Revealing the Limits of Reasoning with Mental Imagery [63.285794947638614]
We develop MentisOculi, a suite of multi-step reasoning problems amenable to visual solution. Evaluating visual strategies ranging from latent tokens to explicitly generated imagery, we find they generally fail to improve performance. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning.
arXiv Detail & Related papers (2026-02-02T18:49:06Z) - Video models are zero-shot learners and reasoners [33.694362486721865]
Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
arXiv Detail & Related papers (2025-09-24T17:17:27Z) - Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z) - EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z) - Can Generative Video Models Help Pose Estimation? [42.10672365565019]
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, which uses a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition.
arXiv Detail & Related papers (2024-12-20T18:58:24Z) - LATTE: Learning to Think with Vision Specialists [110.43838069105998]
We propose LATTE, a family of vision-language models that offload perception to state-of-the-art vision specialists, enabling the vision-language model to focus solely on reasoning over high-quality perceptual information.
arXiv Detail & Related papers (2024-12-07T00:42:04Z) - Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z) - Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [37.44286562901589]
We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning.
We conduct a comprehensive evaluation of competitive language and vision-language models.
Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
arXiv Detail & Related papers (2024-06-21T03:53:37Z) - Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)