TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning
- URL: http://arxiv.org/abs/2602.14482v1
- Date: Mon, 16 Feb 2026 05:46:47 GMT
- Title: TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning
- Authors: Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu, Lei Zhao,
- Abstract summary: We introduce TikArt, an aperture-guided agent that casts vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory.
- Score: 6.656456191281567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
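The Think-Aperture-Observe loop described in the abstract can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption, not the authors' implementation: the names Action, zoom, segment, and run_episode, the policy and observe callables, the nested-list toy image, and mask_fn (a stand-in for a SAM2 call) are all hypothetical. The sketch also assumes each aperture action is applied to the full image, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]        # toy grayscale image: rows of pixel values
Mask = List[List[bool]]        # same shape as the image


@dataclass
class Action:
    kind: str                                      # "zoom", "segment", or "answer"
    box: Tuple[int, int, int, int] = (0, 0, 0, 0)  # (left, top, right, bottom), zoom only
    text: str = ""                                 # final answer, answer only


def zoom(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Rectangular crop of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def segment(image: Image, mask_fn: Callable[[Image], Mask]) -> Image:
    """Mask-based crop: tight box around the mask, pixels outside the mask zeroed.

    mask_fn is a hypothetical placeholder for a segmentation model such as SAM2.
    """
    mask = mask_fn(image)
    coords = [(y, x) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    if not coords:
        return []
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    top, bottom, left, right = min(ys), max(ys) + 1, min(xs), max(xs) + 1
    return [
        [image[y][x] if mask[y][x] else 0 for x in range(left, right)]
        for y in range(top, bottom)
    ]


def run_episode(image, policy, observe, mask_fn, max_steps=4):
    """Alternate think -> aperture action -> observation until the policy answers."""
    memory: List[str] = []                 # persistent linguistic memory
    for _ in range(max_steps):
        action = policy(image, memory)     # "think": choose the next action
        if action.kind == "answer":
            return action.text, memory
        if action.kind == "zoom":
            crop = zoom(image, action.box)
        else:
            crop = segment(image, mask_fn)
        # Every aperture action must be followed by an explicit observation.
        memory.append(observe(action, crop))
    return "", memory                      # step budget exhausted without an answer
```

In this sketch the policy only ever sees the image and the accumulated text observations, which mirrors the abstract's point that local visual cues are turned into linguistic memory that persists across steps.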
Related papers
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation [44.89730606641666]
IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
arXiv Detail & Related papers (2026-01-06T14:37:50Z)
- FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
- OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion [89.98812408058336]
We introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin.
arXiv Detail & Related papers (2025-10-21T03:24:12Z)
- GRASP: Geospatial pixel Reasoning viA Structured Policy learning [16.023628299873494]
GRASP is a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. PRIME is a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives. We release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks.
arXiv Detail & Related papers (2025-08-23T18:05:06Z)
- LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs [4.478610052538001]
We propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling.
arXiv Detail & Related papers (2025-07-01T07:20:11Z)
- VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new
arXiv Detail & Related papers (2025-05-22T03:50:13Z)
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
- Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation [56.001484215308075]
We present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
arXiv Detail & Related papers (2024-11-28T19:00:03Z)
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration [39.2654025469784]
We propose Zoom Eye, a training-free, model-agnostic tree search algorithm tailored for vision-level reasoning. The algorithm enables MLLMs to simulate human-like zooming behavior by navigating from root to leaf nodes in search of task-relevant visual evidence.
arXiv Detail & Related papers (2024-11-25T02:15:30Z)
- EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that comprehends arbitrary referring visual prompts with less training effort than existing approaches. Our approach embeds referring visual prompts as spatial concepts that convey specific spatial areas in a form the MLLM can comprehend. We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z)
- Fine-Grained Visual Prompting [35.032567257651515]
Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset.
arXiv Detail & Related papers (2023-06-07T11:39:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.