Related papers: Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

URL: http://arxiv.org/abs/2511.12365v1
Date: Sat, 15 Nov 2025 21:57:25 GMT
Title: Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning
Authors: Yiqing Shen, Mathias Unberath
Abstract summary: We propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex visual inputs. We show that DT-R1 consistently achieves improvements over state-of-the-art task-specific models.
Score: 9.529907786822115
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations in six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently achieves improvements over state-of-the-art task-specific models. DT-R1 opens a new direction where visual reasoning emerges from reinforcement learning with digital twin representations.
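The abstract says DT-R1 is trained with GRPO using a reward that checks both structural integrity and output accuracy, but it does not give the reward's form. The following is a minimal sketch of that idea, not the authors' implementation. The JSON schema (`objects`, `answer`), the equal weighting of the two reward terms, the exact-match accuracy check, and all function names are assumptions made for illustration. The advantage computation follows the commonly described GRPO formulation, which normalizes each reward against the mean and standard deviation of its sampling group.

```python
import json
from statistics import mean, pstdev

# Hypothetical schema for a digital twin output; not taken from the paper.
REQUIRED_KEYS = {"objects", "answer"}


def parse_completion(completion: str):
    """Return the parsed JSON object, or None if the text is not a JSON object."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def structure_reward(completion: str) -> float:
    """1.0 if the completion is a JSON object containing the assumed keys, else 0.0."""
    parsed = parse_completion(completion)
    if parsed is not None and REQUIRED_KEYS.issubset(parsed):
        return 1.0
    return 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the 'answer' field matches the reference (case-insensitive), else 0.0."""
    parsed = parse_completion(completion)
    if parsed is None:
        return 0.0
    predicted = str(parsed.get("answer", "")).strip().lower()
    return 1.0 if predicted == reference.strip().lower() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Sum of the two terms; equal weighting is an assumption."""
    return structure_reward(completion) + accuracy_reward(completion, reference)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three sampled completions for one query whose reference answer is "cat".
    group = [
        '{"objects": [], "answer": "cat"}',  # well-formed and correct
        '{"objects": [], "answer": "dog"}',  # well-formed but wrong
        "not json",                          # malformed
    ]
    rewards = [total_reward(c, "cat") for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch a malformed completion earns nothing, a well-formed but wrong one earns partial credit, and only a well-formed, correct one earns the full reward, so the group-normalized advantages push the policy toward outputs that are both valid and accurate.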

Related papers

ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations [48.98219448782818]
Reasoning Segmentation (RS) is a multimodal vision-text task that requires segmenting objects based on implicit text queries. Current RS approaches rely on fine-tuning vision-language models (VLMs) for both perception and reasoning. We introduce DTwinSeger, a novel RS approach that leverages Digital Twin representation as an intermediate layer to decouple perception from reasoning.
arXiv Detail & Related papers (2025-06-09T17:05:02Z)
RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z)
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning [47.592351387052545]
GoT-R1 is a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. We propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark.
arXiv Detail & Related papers (2025-05-22T17:59:58Z)
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning [37.194825644787294]
We train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs. Our model, named Visionary-R1, outperforms strong multimodal models on multiple visual reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T17:58:35Z)
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [26.757458496178437]
We introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. We construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL.
arXiv Detail & Related papers (2025-03-13T17:56:05Z)
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach, such as CLIP. We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language. We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language. We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
Referring Transformer: A One-step Approach to Multi-task Visual Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. We show that our model benefits greatly from contextualized information and multi-task training.
arXiv Detail & Related papers (2021-06-06T10:53:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.