Related papers: Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

URL: http://arxiv.org/abs/2603.03762v1
Date: Wed, 04 Mar 2026 06:18:45 GMT
Title: Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
Authors: Junhan Chen, Zilu Zhou, Yujun Tong, Dongliang Chang, Yitao Luo, Zhanyu Ma
Abstract summary: We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA). KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative regions localisation by aligning textual knowledge with visual evidence.
Score: 30.498502211349386
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative regions localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.
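The three stages named in the abstract (hypothesis generation, knowledge-to-region grounding, evidence-based reasoning) can be pictured with a small sketch. This is not the authors' implementation: every name below (Evidence, run_three_stage_loop, the toy_* functions, TOY_KB, "species A"/"species B") is hypothetical, the detector, retriever, grounder and multimodal model are replaced by trivial stand-in functions, and the loop is run for a single pass only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    """Hypothetical container for what the three stages collect."""
    hypotheses: List[str] = field(default_factory=list)           # stage 1
    knowledge: Dict[str, List[str]] = field(default_factory=dict)  # stage 1
    regions: Dict[str, List[str]] = field(default_factory=dict)    # stage 2


def run_three_stage_loop(image, propose, retrieve, ground, reason) -> str:
    """One pass: hypotheses -> grounded evidence -> reasoning.

    The four callables are assumptions standing in for an open-vocabulary
    detector, a retriever, a grounding module and a multimodal model.
    """
    ev = Evidence()
    # Stage 1: generate category hypotheses and fetch knowledge for each.
    ev.hypotheses = propose(image)
    for category in ev.hypotheses:
        ev.knowledge[category] = retrieve(category)
    # Stage 2: turn retrieved knowledge into evidence found in the image.
    for category, attributes in ev.knowledge.items():
        ev.regions[category] = ground(image, attributes)
    # Stage 3: reason over all collected evidence.
    return reason(image, ev)


# Toy stand-ins (made-up data) so the sketch runs end to end.
TOY_KB = {
    "species A": ["red crown", "white wing bar"],
    "species B": ["yellow crown", "white wing bar"],
}


def toy_propose(image):
    return list(TOY_KB)


def toy_retrieve(category):
    return TOY_KB[category]


def toy_ground(image, attributes):
    # Keep only the described attributes that are visible in the image.
    return [a for a in attributes if a in image["visible_parts"]]


def toy_reason(image, ev):
    # Pick the hypothesis with the most grounded attributes.
    best = max(ev.hypotheses, key=lambda c: len(ev.regions[c]))
    return f"{best} (supported by: {', '.join(ev.regions[best])})"


toy_image = {"visible_parts": {"red crown", "white wing bar"}}
answer = run_three_stage_loop(
    toy_image, toy_propose, toy_retrieve, toy_ground, toy_reason
)
print(answer)
```

Passing the stages in as callables keeps the sketch independent of any particular detector, search backend or model; the point is only that the final answer is chosen from evidence that was both retrieved and located in the image.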

Related papers

Specificity-aware reinforcement learning for fine-grained open-world classification [54.85385270439992]
Classifying fine-grained visual concepts under open-world settings demands models to be both accurate and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification.
arXiv Detail & Related papers (2026-03-03T17:52:39Z)
Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation [5.191980417814362]
LLM agents excel when environments are mostly static and the needed information fits in a model's context window. ReAct-style agents are especially brittle in this regime. We propose EoG, a framework in which an LLM performs bounded local evidence mining and labeling (cause vs. symptom) while a deterministic controller manages state and belief propagation to compute a minimal explanatory frontier.
arXiv Detail & Related papers (2026-01-25T17:27:19Z)
Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set. We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z)
Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking [0.2864713389096699]
This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER). CER restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
arXiv Detail & Related papers (2025-12-04T17:24:35Z)
Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation [46.03923254984181]
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). Existing approaches to improving contextual faithfulness rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. We propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that decomposes context into fine-grained sentence-level knowledge.
arXiv Detail & Related papers (2025-10-14T12:48:24Z)
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge. It falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z)
CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions. We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z)
ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search [46.7782420285593]
ARise is a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG). Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%.
arXiv Detail & Related papers (2025-04-15T06:06:50Z)
Disentangling Representations through Multi-task Learning [0.0]
We provide experimental and theoretical results guaranteeing the emergence of disentangled representations in agents that optimally solve classification tasks. We experimentally validate these predictions in RNNs trained to multi-task, which learn disentangled representations in the form of continuous attractors. We find that transformers are particularly suited for disentangling representations, which might explain their unique world understanding abilities.
arXiv Detail & Related papers (2024-07-15T21:32:58Z)
Nested Counterfactual Identification from Arbitrary Surrogate Experiments [95.48089725859298]
We study the identification of nested counterfactuals from an arbitrary combination of observations and experiments. Specifically, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones.
arXiv Detail & Related papers (2021-07-07T12:51:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.