Related papers: Unveiling the Tapestry of Consistency in Large Vision-Language Models

Related papers

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty.<n>We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser.<n>This sparsity--difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z)
Do Reasoning Models Ask Better Questions? A Formal Information-Theoretic Analysis on Multi-Turn LLM Games [0.0]
Large Language Models (LLMs) excel at many tasks but struggle with a critical ability for resolving ambiguity in user requests.<n>We propose a multi-turn dialogue framework that quantitatively measures how effectively LLMs gather information through yes/no questions.<n>Our experiments demonstrate that, among the evaluated models, the ones with explicit reasoning capabilities achieve higher IG per turn and reach solutions in fewer steps.
arXiv Detail & Related papers (2026-01-25T06:38:15Z)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization.<n>We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models.<n>We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency.<n>The resulting model, Video R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization [9.410181019585822]
We operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs.<n>We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them.<n>Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context.
arXiv Detail & Related papers (2025-10-05T03:14:05Z)
Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection [71.8243083897721]
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability.<n>We present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training.
arXiv Detail & Related papers (2025-09-27T10:37:11Z)
Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models [2.393011821499345]
We investigate the presence and nature of selection bias in Large Vision-Language Models (LVLMs)<n>We propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts.<n>Our method mitigates bias without retraining and is compatible with frozen LVLMs.
arXiv Detail & Related papers (2025-09-20T20:45:47Z)
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [69.68265487134686]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work distinguishes from existing video benchmarks through the following key features. Answers are crafted as unambiguous and definitively correct in a short format.
arXiv Detail & Related papers (2025-03-24T17:46:09Z)
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models [8.477985931416303]
We show that Large Vision-Language Models (LVLMs) struggle to robustly utilize information across multiple images. We propose SoFt Attention (SoFA), a training-free approach that mitigates this bias. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
arXiv Detail & Related papers (2025-03-18T00:45:02Z)
Debiased Prompt Tuning in Vision-Language Model without Annotations [14.811475313694041]
Vision-Language Models (VLMs) may suffer from the problem of spurious correlations. By leveraging pseudo-spurious attribute annotations, we propose a method to automatically adjust the training weights of different groups. Our approach efficiently improves the worst-group accuracy on CelebA, Waterbirds, and MetaShift datasets.
arXiv Detail & Related papers (2025-03-11T12:24:54Z)
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models [6.52323086990482]
Vision language models (VLM) demonstrate sophisticated multimodal reasoning yet are prone to hallucination when confronted with knowledge conflicts.<n>This research introduces segsub, a framework for applying targeted image perturbations to investigate VLM resilience against knowledge conflicts.
arXiv Detail & Related papers (2025-02-19T00:26:38Z)
Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration [22.39558434131574]
Large Vision-Language Models (LVLMs) generate responses that are not factually aligned with the visual content. We introduce a training-free solution, Uniform Attention (UAC), that estimates the bias from single meaningless input image. We also introduce a fine-tuning solution, Dynamic Attention (DAC), that enforces the consistent outputs wherever the object locates in the image.
arXiv Detail & Related papers (2025-02-04T03:27:38Z)
Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models [33.76903352835436]
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. These models are prone to parametric knowledge conflicts, which arise from inconsistencies of represented knowledge between their vision and language components. We present a systematic approach to detect, interpret, and mitigate them.
arXiv Detail & Related papers (2024-10-04T17:59:28Z)
Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts [27.66626125248612]
We empirically investigate visual fairness in several mainstream large vision-language models (LVLMs) Our fairness evaluation framework employs direct and single-choice question prompt on visual question-answering/classification tasks. We propose a potential multi-modal Chain-of-thought (CoT) based strategy for bias mitigation, applicable to both open-source and closed-source LVLMs.
arXiv Detail & Related papers (2024-06-25T23:11:39Z)
MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs [38.93090238335506]
Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe pitfall in deep learning models trained on single modality data. We introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations. Our findings illuminate the persistence of the reliance on spurious correlations from these models and underscore the urge for new methodologies to mitigate spurious biases.
arXiv Detail & Related papers (2024-06-24T20:29:16Z)
Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective [9.633811630889237]
We propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. We introduce a novel dataset with 12,000 challenging VQA instances requiring multi-hop reasoning. Our experiments show that MLLMs perform poorly on MORE, indicating strong unimodal biases and limited semantic understanding.
arXiv Detail & Related papers (2024-03-27T08:38:49Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge. We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning. This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation. We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building a LVLM evaluation Hub (LVLM-eHub) Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The study reveals several innovative findings. First, instruction-tuned LVLM with massive in-domain data such as InstructBLIP heavily overfits many existing tasks, generalizing poorly in the open-world scenario.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video. Recent studies have found that current benchmark datasets may have obvious moment annotation biases. We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.