Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
- URL: http://arxiv.org/abs/2412.16232v3
- Date: Sat, 08 Feb 2025 22:04:37 GMT
- Title: Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
- Authors: Yue Zhang, Liqiang Jing, Vibhav Gogate
- Abstract summary: We introduce a new task called Defeasible Visual Entailment (DVE). The goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. At a high level, DVE enables models to refine their initial interpretations, leading to improved accuracy and reliability in various applications.
- Score: 19.32714581384729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new task called Defeasible Visual Entailment (DVE), where the goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. While this concept is well-established in Natural Language Inference, it remains unexplored in visual entailment. At a high level, DVE enables models to refine their initial interpretations, leading to improved accuracy and reliability in various applications such as detecting misleading information in images, enhancing visual question answering, and refining decision-making processes in autonomous systems. Existing metrics do not adequately capture the change in the entailment relationship brought by updates. To address this, we propose a novel inference-aware evaluator designed to capture changes in entailment strength induced by updates, using pairwise contrastive learning and categorical information learning. Additionally, we introduce a reward-driven update optimization method to further enhance the quality of updates generated by multimodal models. Experimental results demonstrate the effectiveness of our proposed evaluator and optimization method.
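The evaluator is described only at a high level above; as a rough illustration of the pairwise contrastive idea, the sketch below assumes a generic scorer that maps an (image premise, hypothesis, update) triple to a scalar entailment-strength score and trains it so that strengthening updates outrank weakening ones. The scorer, the margin value, and the variable names are assumptions rather than the paper's implementation, and the categorical-information component is omitted.

```python
import torch
import torch.nn.functional as F

def pairwise_update_loss(score_strengthener: torch.Tensor,
                         score_weakener: torch.Tensor,
                         margin: float = 0.5) -> torch.Tensor:
    """Pairwise contrastive objective (sketch): for the same image premise and
    text hypothesis, an update that strengthens entailment should score higher
    than one that weakens it, by at least `margin`."""
    return F.relu(margin - (score_strengthener - score_weakener)).mean()

# Hypothetical usage, where `evaluator` maps (image, hypothesis, update) -> scalar score:
# s_pos = evaluator(image, hypothesis, strengthening_update)
# s_neg = evaluator(image, hypothesis, weakening_update)
# loss = pairwise_update_loss(s_pos, s_neg)
```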
Related papers
- Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation [26.580361841501514]
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration.
This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information.
We propose a novel Confidence through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for object-centric queries.
arXiv Detail & Related papers (2025-04-21T04:01:22Z)
- HF4Rec: Human-Like Feedback-Driven Optimization Framework for Explainable Recommendation [8.532115411106068]
We propose a novel human-like feedback-driven optimization framework for explainable recommendations.
This framework employs a dynamic interactive optimization mechanism for achieving human-centered explainable requirements without incurring high labor costs.
In particular, we propose to utilize large language models (LLMs) as human simulators to predict human-like feedback for guiding the learning process.
arXiv Detail & Related papers (2025-04-19T02:46:10Z)
- A Meaningful Perturbation Metric for Evaluating Explainability Methods [55.09730499143998]
We introduce a novel approach, which harnesses image generation models to perform targeted perturbation.
Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity.
This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results.
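As a rough sketch of this evaluation recipe, one could mask the most relevant pixels according to an attribution map, fill them with an inpainting model, and measure how much the prediction for the original class drops. The `predict` and `inpaint` callables and the 10% relevance threshold below are placeholders, not the paper's actual pipeline.

```python
import numpy as np

def perturbation_score(image: np.ndarray,
                       relevance: np.ndarray,
                       predict,          # image -> class probabilities (placeholder)
                       inpaint,          # (image, mask) -> inpainted image (placeholder)
                       top_fraction: float = 0.1) -> float:
    """Targeted-perturbation check (sketch): inpaint only the most relevant pixels
    and report the drop in probability of the originally predicted class."""
    threshold = np.quantile(relevance, 1.0 - top_fraction)
    mask = relevance >= threshold                  # high-relevance pixels to replace
    original = predict(image)
    label = int(np.argmax(original))
    perturbed = predict(inpaint(image, mask))      # in-distribution replacement
    return float(original[label] - perturbed[label])
```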
arXiv Detail & Related papers (2025-04-09T11:46:41Z)
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
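A bare-bones sketch of this verifier-guided loop, where `generate`, `verify`, and `refine` are hypothetical stand-ins for the agent, the verifier, and the refinement step; plain Best-of-N is the special case of a single round with no refinement.

```python
def iterative_decode(task, generate, verify, refine, n: int = 4, rounds: int = 3):
    """Verifier-guided decoding (sketch): sample N candidates, keep the one the
    verifier scores highest, refine it, and repeat for a fixed number of rounds."""
    best = None
    for _ in range(rounds):
        candidates = [generate(task, hint=best) for _ in range(n)]
        best = max(candidates, key=lambda c: verify(task, c))
        best = refine(task, best)   # feed the selected candidate back for refinement
    return best
```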
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models [25.51735861729728]
Modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors.
We exploit vision-language models (VLMs), leveraging text information to establish an efficient and general-purpose misclassification detection framework.
By harnessing the power of VLMs, we construct FSMisD, a few-shot prompt learning framework for misclassification detection (MisD) that avoids training from scratch and thereby improves tuning efficiency.
arXiv Detail & Related papers (2025-03-26T12:31:04Z)
- Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training [66.48331530995786]
We propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context.
Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage.
Experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations.
arXiv Detail & Related papers (2025-02-25T03:03:35Z)
- Enhancing Recommendation Explanations through User-Centric Refinement [7.640281193938638]
We propose a novel paradigm that refines initial explanations generated by existing explainable recommender models.
Specifically, we introduce a multi-agent collaborative refinement framework based on large language models.
arXiv Detail & Related papers (2025-02-17T12:08:18Z)
- Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning [0.0]
Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data.
This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training model through adaptive prompt tuning.
arXiv Detail & Related papers (2024-12-19T08:51:01Z)
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
- Knowledge Editing in Language Models via Adapted Direct Preference Optimization [50.616875565173274]
Large Language Models (LLMs) can become outdated over time.
Knowledge Editing aims to overcome this challenge using weight updates that do not require expensive retraining.
arXiv Detail & Related papers (2024-06-14T11:02:21Z)
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks.
Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results.
We propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies.
arXiv Detail & Related papers (2024-05-24T23:09:27Z)
- Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
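One round of that iterative recipe could be sketched as follows; `generate_candidates`, `reward`, and `preference_finetune` are hypothetical stand-ins for the model's sampling procedure, the calibrated reward, and the preference-tuning step, not the paper's actual code.

```python
def self_rewarding_round(model, images_with_prompts,
                         generate_candidates, reward, preference_finetune,
                         k: int = 4):
    """One self-improvement round (sketch): sample k candidate responses per input,
    score each with the reward, and turn the best/worst pair into preference data."""
    preference_pairs = []
    for image, prompt in images_with_prompts:
        candidates = generate_candidates(model, image, prompt, k)
        ranked = sorted(candidates, key=lambda r: reward(image, prompt, r))
        preference_pairs.append((image, prompt, ranked[-1], ranked[0]))  # (chosen, rejected)
    return preference_finetune(model, preference_pairs)  # curate preferences, then fine-tune
```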
arXiv Detail & Related papers (2024-05-23T14:30:33Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Mean Embeddings with Test-Time Data Augmentation for Ensembling of Representations [8.336315962271396]
We look at the ensembling of representations and propose mean embeddings with test-time augmentation (MeTTA).
MeTTA significantly boosts the quality of linear evaluation on ImageNet for both supervised and self-supervised models.
We believe that extending the success of ensembles to inferring higher-quality representations is an important step that will open many new applications of ensembling.
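A compact sketch of the idea, assuming an arbitrary `encoder` and a list of test-time augmentations (both placeholders):

```python
import torch

def mean_embedding(encoder, image: torch.Tensor, augmentations) -> torch.Tensor:
    """Mean embedding with test-time augmentation (sketch): encode several augmented
    views of the same image and average the resulting representations."""
    with torch.no_grad():
        views = torch.stack([aug(image) for aug in augmentations])  # (A, C, H, W)
        embeddings = encoder(views)                                 # (A, D)
        return embeddings.mean(dim=0)                               # (D,)
```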
arXiv Detail & Related papers (2021-06-15T10:49:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.