Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs
- URL: http://arxiv.org/abs/2510.10426v1
- Date: Sun, 12 Oct 2025 03:22:33 GMT
- Title: Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs
- Authors: Suyang Xi, Chenxi Yang, Hong Ding, Yiqing Ni, Catherine C. Liu, Yunhao Liu, Chengqi Zhang
- Abstract summary: Multimodal large language models (MLLMs) often fail in fine-grained visual question answering. We present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a ``what--where--reweight'' cascade.
- Score: 23.638717678491986
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a ``what--where--reweight'' cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
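The three stages of the cascade described in the abstract can be sketched as a pipeline. This is an illustrative outline only, assuming toy stand-ins for each stage: substring matching stands in for open-vocabulary detection, bounding boxes stand in for SAM-derived masks, and the similarity scores are placeholder numbers, none of which come from the authors' implementation.

```python
# Hedged sketch of the "what--where--reweight" cascade. All function
# names, fields, and scores are illustrative assumptions, not HuLiRAG's code.

def what_stage(query, detections):
    # "What": anchor the query to candidate referents. A real system would
    # run an open-vocabulary detector; here we match labels as substrings.
    return [d for d in detections if d["label"] in query]

def where_stage(candidates):
    # "Where": attach spatial evidence to each candidate. A real system
    # would derive a SAM mask; here the bounding box stands in for it.
    return [{**c, "mask": c["box"]} for c in candidates]

def reweight_stage(candidates, alpha=0.5):
    # "Reweight": trade off local (mask-level) against global (image-level)
    # alignment and rank candidates by the combined score.
    for c in candidates:
        c["score"] = alpha * c["local_sim"] + (1 - alpha) * c["global_sim"]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

detections = [
    {"label": "dog", "box": (10, 20, 80, 90), "local_sim": 0.9, "global_sim": 0.4},
    {"label": "cat", "box": (5, 5, 40, 40), "local_sim": 0.3, "global_sim": 0.8},
]
ranked = reweight_stage(where_stage(what_stage("where is the dog", detections)))
print(ranked[0]["label"])  # the query-anchored referent ranks first
```

The ranked, mask-carrying candidates would then condition generation, which is where the abstract's mask-guided fine-tuning turns grounding into an explicit constraint rather than a passive bias.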
Related papers
- ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval [64.14282916266998]
Composed Image Retrieval aims to retrieve target images based on a hybrid query comprising a reference image and a modification text.
We propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline.
Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-02T04:52:54Z) - Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them.
Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set.
We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z) - Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models [49.435669307386156]
Multi-stage Prompt Refinement (MPR) is a framework designed to systematically improve ill-formed prompts across multiple stages.
MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input.
Results on hallucination benchmarks show that MPR achieves over an 85% win rate compared to the original prompts.
arXiv Detail & Related papers (2025-10-14T00:31:36Z) - Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection [5.0106565473767075]
Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language.
A fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information.
We propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text.
This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content.
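The simplest instance of the n-gram co-occurrence structure this summary describes is a bigram matrix over a vocabulary. The sketch below is a minimal illustration under that assumption, not the paper's actual tensor construction, which the abstract does not fully specify.

```python
# Minimal sketch: a bigram co-occurrence matrix, the 2-gram special case
# of the n-gram frequency structure the summary describes (illustrative,
# not the paper's construction).

def bigram_matrix(text):
    tokens = text.lower().split()
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    # mat[i][j] counts how often vocab[i] is immediately followed by vocab[j].
    mat = [[0] * len(vocab) for _ in vocab]
    for a, b in zip(tokens, tokens[1:]):
        mat[index[a]][index[b]] += 1
    return vocab, mat

vocab, mat = bigram_matrix("the cat sat on the mat")
print(vocab)  # sorted vocabulary; rows/columns of the matrix
```

Extending the window beyond adjacent pairs yields higher-order tensors, whose patterns the paper's approach uses to separate factual from hallucinated text.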
arXiv Detail & Related papers (2025-09-03T18:52:24Z) - Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization [55.543583937522804]
Multimodal Large Language Models (MLLMs) have emerged as a unified interface to address a multitude of tasks.
Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate.
In this paper, we address hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations.
arXiv Detail & Related papers (2025-08-27T18:02:04Z) - Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models [0.0]
This paper introduces Semantic Divergence Metrics (SDM), a novel framework for detecting Faithfulness Hallucinations.
A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue.
arXiv Detail & Related papers (2025-08-13T20:55:26Z) - PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs [23.69973859198496]
Multimodal Large Language Models (MLLMs) excel in vision-language tasks, such as image captioning and visual question answering.
They often suffer from over-reliance on spurious correlations, primarily due to linguistic priors that distract the model from leveraging actual visual information.
We introduce MMed-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities and mitigate the hallucinations of MLLMs.
arXiv Detail & Related papers (2025-06-22T05:11:46Z) - MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM [58.2298313720146]
Multimodal hallucinations are multi-sourced and arise from diverse causes.
Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations.
arXiv Detail & Related papers (2025-05-30T05:54:36Z) - Towards General Visual-Linguistic Face Forgery Detection(V2) [90.6600794602029]
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust.
Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection.
We propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification.
arXiv Detail & Related papers (2025-02-28T04:15:36Z) - See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z) - Detecting and Preventing Hallucinations in Large Vision Language Models [4.7264116948935975]
M-HalDetect is the first multi-modal hallucination detection dataset for detailed image descriptions.
We train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling.
We find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively.
arXiv Detail & Related papers (2023-08-11T21:35:20Z) - MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
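The retrieval step this summary describes, scoring a query against a non-parametric external memory and feeding the best matches to the generator, can be sketched as follows. The embeddings and snippets here are toy placeholders, not outputs of MuRAG's multimodal encoder, and the retrieval is plain cosine similarity over 2-d vectors for illustration.

```python
import math

# Hedged sketch of retrieval augmentation in the spirit of MuRAG:
# rank (embedding, snippet) pairs in an external memory by cosine
# similarity to the query embedding, then prepend the top matches
# to the generator input. All data below is a toy assumption.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, memory, k=2):
    # Non-parametric lookup: no learned retriever here, just similarity.
    ranked = sorted(memory, key=lambda m: cosine(query_emb, m[0]), reverse=True)
    return [snippet for _, snippet in ranked[:k]]

memory = [
    ((1.0, 0.0), "caption: a red bus"),
    ((0.5, 0.5), "caption: a bus stop sign"),
    ((0.0, 1.0), "caption: a bowl of soup"),
]
augmented = retrieve((1.0, 0.1), memory) + ["question: what color is the bus?"]
print(augmented[0])  # most similar memory entry leads the generator input
```

In the actual system the memory holds image-text pairs and the augmented sequence conditions a generator; the sketch only shows the retrieve-then-augment control flow.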
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.