Related papers: LTD-Bench: Evaluating Large Language Models by Letting Them Draw

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

URL: http://arxiv.org/abs/2511.02347v1
Date: Tue, 04 Nov 2025 08:11:23 GMT
Title: LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Authors: Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin, Yan Zhang, Xing Sun, Rongrong Ji,
Abstract summary: LTD-Bench is a breakthrough benchmark for large language models (LLMs)<n>It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code.<n> LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
Score: 57.237152905238084
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research--relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concept--a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.

Related papers

FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models [20.47311573790516]
We propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging.<n>Experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities.
arXiv Detail & Related papers (2026-01-29T02:36:19Z)
LVLM-Aided Alignment of Task-Specific Vision Models [49.96265491629163]
Small task-specific vision models are crucial in high-stakes domains.<n>We introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge.<n>Our method demonstrates substantial improvement in aligning model behavior with human specifications.
arXiv Detail & Related papers (2025-12-26T11:11:25Z)
KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge [1.5833270109954136]
We propose KnowDR-REC, built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image.<n>We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks.
arXiv Detail & Related papers (2025-08-12T19:43:44Z)
Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning [12.054910727620154]
Eye-tracking data reveals valuable insights into users' cognitive states but is difficult to analyze due to its structured, non-linguistic nature.<n>This paper presents a multimodal human-AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals.
arXiv Detail & Related papers (2025-07-24T09:49:53Z)
SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization [44.427830927596204]
SpatialViz-Bench is a comprehensive benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems.<n>Our evaluation of 33 state-of-the-art MLLMs reveals wide performance variations and uncovers counter-intuitive findings.
arXiv Detail & Related papers (2025-07-10T10:27:20Z)
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space.<n>Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.452699232071495]
We introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) through the medium of crossword puzzles.<n>Our evaluation reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints.<n>Our findings highlight limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
arXiv Detail & Related papers (2025-03-30T20:03:36Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.