Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
- URL: http://arxiv.org/abs/2503.15558v2
- Date: Wed, 02 Apr 2025 17:11:13 GMT
- Title: Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
- Authors: NVIDIA: Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Alice Luo, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Xiaodong Yang, Zhuolin Yang, Xiaohui Zeng, Zhe Zhang
- Abstract summary: Physical AI systems need to perceive, understand, and perform complex actions in the physical world. We present models that can understand the physical world and generate appropriate embodied decisions. We use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments.
- Score: 76.94237859217469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as the post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements. To facilitate the development of Physical AI, we will make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
Related papers
- Digital Gene: Learning about the Physical World through Analytic Concepts [54.21005370169846]
AI systems still struggle when it comes to understanding and interacting with the physical world.
This research introduces the idea of analytic concept.
It provides machine intelligence a portal to perceive, reason about, and interact with the physical world.
arXiv Detail & Related papers (2025-04-05T13:22:11Z)
- Neural Force Field: Learning Generalized Physical Representation from a Few Examples [24.651024239605288]
Current AI models, despite extensive training, still struggle to achieve similar generalization.
We present Neural Force Field (NFF), a modeling framework built on Neural Ordinary Differential Equation (NODE).
NFF captures fundamental physical concepts such as gravity, support, and collision in an interpretable manner.
arXiv Detail & Related papers (2025-02-13T05:50:13Z)
- Generative Physical AI in Vision: A Survey [25.867330158975932]
Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication.
As generative AI evolves to increasingly integrate physical realism and dynamic simulation, its potential to function as a "world simulator"
This survey systematically reviews this emerging field of physics-aware generative AI in computer vision.
arXiv Detail & Related papers (2025-01-19T03:19:47Z)
- Discover physical concepts and equations with machine learning [7.565272546753481]
We propose a model that combines Variational Autoencoders (VAE) with Neural Ordinary Differential Equations (Neural ODEs).
This allows us to simultaneously discover physical concepts and governing equations from simulated experimental data.
We apply the model to several examples inspired by the history of physics, including Copernicus' heliocentrism, Newton's law of gravity, Schrödinger's wave mechanics, and Pauli's spin-magnetic formulation.
arXiv Detail & Related papers (2024-12-11T15:30:21Z)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z)
- ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that marries the particle-based physical dynamic models with the recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z)
- Visual cognition in multimodal large language models [12.603212933816206]
Recent advancements have rekindled interest in the potential to emulate human-like cognitive abilities.
This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
arXiv Detail & Related papers (2023-11-27T18:58:34Z)
- Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models [86.25460882547581]
We introduce the PHYsical Concepts Inference NEtwork (PHYCINE), a system that infers physical concepts in different abstract levels without supervision.
We show that object representations containing the discovered physical concepts variables could help achieve better performance in causal reasoning tasks.
arXiv Detail & Related papers (2023-03-03T11:52:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.