Related papers: TRACE: Textual Reasoning for Affordance Coordinate Extraction

TRACE: Textual Reasoning for Affordance Coordinate Extraction

URL: http://arxiv.org/abs/2511.01999v1
Date: Mon, 03 Nov 2025 19:13:26 GMT
Title: TRACE: Textual Reasoning for Affordance Coordinate Extraction
Authors: Sangyun Park, Jin Kim, Yuchen Cui, Matthew S. Brown,
Abstract summary: Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation.<n>We introduce TRACE, a novel methodology that integrates a textual Chain of Reasoning into the affordance prediction process.<n>Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place benchmark.
Score: 4.374024319540872
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR's effectiveness. Furthermore, analysis of the model's attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at https://github.com/jink-ucla/TRACE

Related papers

Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance [92.72779885657373]
We propose a framework that grounds model selection in the internal functional dynamics of the visual encoder.<n>Our approach represents each task via layer wise conductance and derives a target-conditioned block importance distribution through entropy regularized alignment.<n>Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target's salient functional blocks.
arXiv Detail & Related papers (2026-02-01T17:29:43Z)
What Language Models Know But Don't Say: Non-Generative Prior Extraction for Generalization [5.663538370244175]
We propose LoID, a deterministic method for extracting informative prior distributions for Bayesian logistic regression.<n>Rather than relying on generated text, we probe the model's confidence in opposing semantic directions through carefully constructed sentences.<n>We evaluate LoID on ten real-world datasets under synthetic out-of-distribution (OOD) settings.
arXiv Detail & Related papers (2026-01-24T22:05:01Z)
AI Generated Text Detection [0.0]
This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures.<n>We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage.<n>Results indicate that contextual modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization.
arXiv Detail & Related papers (2026-01-07T11:18:10Z)
Detect Anything via Next Point Prediction [51.55967987350882]
Rex- Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance.<n>On benchmarks like COCO and LVIS, Rex- Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z)
4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration [31.111439909825627]
Existing methods typically model the dataset's action distribution using simple observations as inputs.<n>We propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to these sources of chaos.<n>Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
arXiv Detail & Related papers (2025-06-27T14:09:29Z)
Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. We show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z)
Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods [29.141145775835106]
Given a fixed FLOP budget, what are the best datasets, models, and (self-supervised) training methods for obtaining high accuracy on representative visual tasks? We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised) Our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data.
arXiv Detail & Related papers (2022-09-30T17:04:55Z)
Scene Text Recognition with Permuted Autoregressive Sequence Models [15.118059441365343]
Context-aware STR methods typically use internal autoregressive (AR) language models (LM) Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling. It achieves context-free non-AR and context-aware AR inference, and iterative refinement using bidirectional context.
arXiv Detail & Related papers (2022-07-14T14:51:50Z)
Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making. Instead of aligning this imagined state with a real state returned by the environment, VCR applies a $Q$-value head on both states and obtains two distributions of action values. It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms.
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
Injecting Knowledge in Data-driven Vehicle Trajectory Predictors [82.91398970736391]
Vehicle trajectory prediction tasks have been commonly tackled from two perspectives: knowledge-driven or data-driven. In this paper, we propose to learn a "Realistic Residual Block" (RRB) which effectively connects these two perspectives. Our proposed method outputs realistic predictions by confining the residual range and taking into account its uncertainty.
arXiv Detail & Related papers (2021-03-08T16:03:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.