AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning
- URL: http://arxiv.org/abs/2603.00171v1
- Date: Thu, 26 Feb 2026 15:41:26 GMT
- Title: AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning
- Authors: Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Chengjun Xie, Xuanhua He, Jie Zhang
- Abstract summary: We propose AdaFocus, a training-free framework for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. Experimentally, AdaFocus delivers substantial performance gains while achieving an approximately 4.0× inference speedup.
- Score: 17.455916323311683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which adds overhead and noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose AdaFocus, a novel training-free framework designed for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. This enables adaptive visual reasoning without additional training. Experimentally, AdaFocus delivers substantial performance gains while achieving an approximately 4.0× inference speedup over the SOTA method ZoomEyes, representing a significant advance in both accuracy and efficiency.
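The two-stage pipeline described in the abstract can be pictured roughly as follows. This is a minimal sketch under assumed interfaces: the `score_answer` and `locate_region` helpers on a frozen MLLM wrapper and the confidence threshold are illustrative placeholders, not the authors' released implementation.

```python
# Hypothetical sketch of a "when/where to crop" pipeline around a frozen MLLM.
# Helper methods and the threshold value are assumptions for illustration only.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FocusDecision:
    needs_crop: bool
    region: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1)

def answer_confidence(mllm, image, question) -> float:
    """Hypothetical: how confident the model is when answering directly on the
    full image (e.g. derived from answer-token probabilities)."""
    return mllm.score_answer(image, question)

def semantic_guided_localization(mllm, image, question) -> Tuple[int, int, int, int]:
    """Hypothetical: map the question's semantic intent to a bounding box
    over the image region it refers to."""
    return mllm.locate_region(image, question)

def adaptive_focus(mllm, image, question, tau: float = 0.7) -> FocusDecision:
    # Stage 1 (when to crop): skip cropping if the direct answer is confident.
    if answer_confidence(mllm, image, question) >= tau:
        return FocusDecision(needs_crop=False)
    # Stage 2 (where to crop): localize the question-relevant region to crop.
    box = semantic_guided_localization(mllm, image, question)
    return FocusDecision(needs_crop=True, region=box)
```

In this reading, cropping is only paid for when the model's direct answer looks uncertain, which is how a training-free pipeline could avoid the perceptual redundancy of cropping every input.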
Related papers
- Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception [93.20637973889434]
We introduce AdaptiveNN, a general framework aiming to drive a paradigm shift from 'passive' to 'active' vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, and processing images from real driving and medical scenarios.
arXiv Detail & Related papers (2025-09-18T18:25:43Z) - ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning [53.19029595226767]
The slot attention-based framework has emerged as a leading approach in object-centric learning. Current methods require a stable feature space throughout training to enable reconstruction from slots. We propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models.
arXiv Detail & Related papers (2025-09-02T07:19:25Z) - SIFThinker: Spatially-Aware Image Focus for Visual Reasoning [22.922568123298934]
We introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. SIFThinker enables attention correction and image-region focusing by interleaving depth-enhanced bounding boxes and natural language. In experiments, SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception.
arXiv Detail & Related papers (2025-08-08T12:26:20Z) - VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [22.907814548315468]
We propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations. Our method consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs.
arXiv Detail & Related papers (2025-03-10T16:49:35Z) - Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z) - SparseFocus: Learning-based One-shot Autofocus for Microscopy with Sparse Content [21.268550523841117]
Autofocus is necessary for high-throughput and real-time scanning in microscopic imaging. Recent learning-based approaches have demonstrated remarkable efficacy in a one-shot setting. We propose a content-based solution, named SparseFocus, featuring a novel two-stage pipeline.
arXiv Detail & Related papers (2025-02-10T13:31:32Z) - Learning 1D Causal Visual Representation with De-focus Attention Networks [108.72931590504406]
This paper explores the feasibility of representing images using 1D causal modeling.
We propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns.
arXiv Detail & Related papers (2024-06-06T17:59:56Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition [23.12743642910384]
This work reformulates the training of AdaFocus as a simple one-stage algorithm.
We present an improved training scheme to address the issues introduced by the one-stage formulation.
Our model significantly outperforms the original AdaFocus and other competitive baselines.
arXiv Detail & Related papers (2021-12-28T17:53:38Z) - Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework achieves significant performance gains compared with existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z)