RynnEC: Bringing MLLMs into Embodied World
- URL: http://arxiv.org/abs/2508.14160v1
- Date: Tue, 19 Aug 2025 18:00:01 GMT
- Title: RynnEC: Bringing MLLMs into Embodied World
- Authors: Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao,
- Abstract summary: We introduce RynnEC, a video multimodal large language model designed for embodied cognition. RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning.
- Score: 20.393755405283365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
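The abstract describes a region encoder that lets the model reason over a user-specified region of a video frame. As a rough illustration only (this is not RynnEC's actual implementation; `region_token`, the patch grid, and the feature dimension are all invented for the sketch), region-level interaction can be pictured as mask-weighted pooling over patch features:

```python
import numpy as np

def region_token(frame_feats: np.ndarray, region_mask: np.ndarray) -> np.ndarray:
    """Average the feature vectors of the patches covered by a binary mask.

    frame_feats: (H, W, D) patch features from a vision encoder (stand-in here).
    region_mask: (H, W) binary mask marking the queried object region.
    Returns a single (D,) region-level token.
    """
    assert frame_feats.shape[:2] == region_mask.shape
    weights = region_mask.astype(np.float32)
    pooled = (frame_feats * weights[..., None]).sum(axis=(0, 1))
    return pooled / max(weights.sum(), 1e-6)

# Toy example: an 8x8 patch grid with 16-dim features and a 3x3 object region.
feats = np.random.rand(8, 8, 16)
mask = np.zeros((8, 8))
mask[2:5, 3:6] = 1
tok = region_token(feats, mask)
print(tok.shape)  # (16,)
```

In a full model, such a region token would be interleaved with text tokens so the language model can answer questions about that specific object; the mask decoder would run the opposite direction, producing a segmentation mask from the model's output.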
Related papers
- VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model [29.52176445302312]
3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. We propose VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities. Our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities.
arXiv Detail & Related papers (2026-02-10T10:36:57Z)
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation [69.30586607892842]
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation. GE integrates policy learning, evaluation, and simulation within a single video-generative framework.
arXiv Detail & Related papers (2025-08-07T17:59:44Z)
- EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World? [52.99661576320663]
Multimodal large language models (MLLMs) have driven breakthroughs in egocentric vision applications. EOC-Bench is an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. We conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs based on EOC-Bench.
arXiv Detail & Related papers (2025-06-05T17:44:12Z)
- ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark [44.64084739916821]
ECBench is a benchmark designed to systematically evaluate the embodied cognitive abilities of large vision-language models (LVLMs). ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. We conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs.
arXiv Detail & Related papers (2025-01-09T07:43:49Z)
- Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z)
- Spherical World-Locking for Audio-Visual Localization in Egocentric Videos [53.658928180166534]
We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation.
Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion.
We design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation.
arXiv Detail & Related papers (2024-08-09T22:29:04Z)
- HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take an inspiration from human perception and explore a compositional approach for ego video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z)
- Monocular Per-Object Distance Estimation with Masked Object Modeling [33.59920084936913]
Our paper draws inspiration from Masked Image Modeling (MIM) and extends it to multi-object tasks. Our strategy, termed Masked Object Modeling (MoM), enables a novel application of masking techniques. We evaluate the effectiveness of MoM on a novel reference architecture (DistFormer) on the standard KITTI, NuScenes, and MOTSynth datasets.
arXiv Detail & Related papers (2024-01-06T10:56:36Z)
- REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
arXiv Detail & Related papers (2023-11-27T20:48:54Z)
- NeRF-SOS: Any-View Self-supervised Object Segmentation from Complex Real-World Scenes [80.59831861186227]
This paper carries out the exploration of self-supervised learning for object segmentation using NeRF for complex real-world scenes.
Our framework, NeRF with Self-supervised Object Segmentation (NeRF-SOS), encourages NeRF models to distill compact geometry-aware segmentation clusters.
It consistently surpasses other 2D-based self-supervised baselines and predicts finer semantic masks than existing supervised counterparts.
arXiv Detail & Related papers (2022-09-19T06:03:17Z)
- Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle the challenges of video-based person re-identification (re-ID).
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z)
- See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks [184.4379622593225]
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task.
We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism.
We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
arXiv Detail & Related papers (2020-01-19T11:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.