CF-VLM: CounterFactual Vision-Language Fine-tuning
- URL: http://arxiv.org/abs/2506.17267v1
- Date: Tue, 10 Jun 2025 20:20:05 GMT
- Title: CF-VLM: CounterFactual Vision-Language Fine-tuning
- Authors: Jusheng Zhang, Kaitong Cai, Yijia Fan, Jian Wang, Keze Wang
- Abstract summary: CounterFactual Vision-Language Fine-tuning (CF-VLM) is a novel framework that enhances the causal reasoning capabilities of vision-language models. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations, and sharpening the model's sensitivity to minimal but critical causal edits.
- Score: 10.299136720220416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language models (VLMs) have greatly improved cross-modal semantic understanding, yet significant limitations remain in fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on superficial statistical correlations, lacking the ability to capture the underlying causal logic between visual and textual content. To address this, we propose CounterFactual Vision-Language Fine-tuning (CF-VLM), a novel framework that enhances the causal reasoning capabilities of VLMs through the targeted use of counterfactual samples. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model's sensitivity to minimal but critical causal edits. Extensive experiments demonstrate that CF-VLM consistently outperforms strong baselines and state-of-the-art methods on compositional reasoning and generalization benchmarks. Furthermore, it shows promise in mitigating visual hallucinations, indicating improved factual consistency. Our CF-VLM provides a robust foundation for deploying VLMs in high-stakes, real-world scenarios requiring reliable reasoning and interpretability.
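Based solely on the three objectives named in the abstract, the sketch below illustrates how such a composite counterfactual fine-tuning loss might be assembled in PyTorch. Every function name, margin, and weight here is an assumption made for illustration; the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def cf_vlm_style_loss(img_emb, fact_emb, cf_emb, edit_emb,
                      temperature=0.07, margin_cf=0.2, margin_edit=0.1,
                      w_align=1.0, w_cf=1.0, w_edit=1.0):
    """Illustrative composite objective for the three goals in the abstract.

    img_emb  : (B, D) image embeddings
    fact_emb : (B, D) embeddings of factual captions
    cf_emb   : (B, D) embeddings of coherent counterfactual captions
    edit_emb : (B, D) embeddings of minimally edited (causal-edit) captions
    All weights, margins, and the use of cosine similarity are assumptions.
    """
    # Normalize so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    fact_emb = F.normalize(fact_emb, dim=-1)
    cf_emb = F.normalize(cf_emb, dim=-1)
    edit_emb = F.normalize(edit_emb, dim=-1)

    # 1) Foundational cross-modal alignment: CLIP-style in-batch contrastive
    #    loss between images and their factual captions.
    logits = img_emb @ fact_emb.t() / temperature            # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_align = 0.5 * (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.t(), targets))

    # 2) Factual vs. coherent counterfactual: the image should score higher
    #    with its factual caption than with a counterfactual one, by a margin.
    sim_fact = (img_emb * fact_emb).sum(-1)                   # (B,)
    sim_cf = (img_emb * cf_emb).sum(-1)
    loss_cf = F.relu(margin_cf - (sim_fact - sim_cf)).mean()

    # 3) Sensitivity to minimal causal edits: even a small edit that changes
    #    the causal content should be separated from the factual caption.
    sim_edit = (img_emb * edit_emb).sum(-1)
    loss_edit = F.relu(margin_edit - (sim_fact - sim_edit)).mean()

    return w_align * loss_align + w_cf * loss_cf + w_edit * loss_edit
```

In this sketch the counterfactual and minimally edited captions are simply assumed to be available as precomputed text embeddings; in the paper they would come from its counterfactual sample construction.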
Related papers
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought [11.538345159297839]
Chain-of-thought (CoT) prompting has been adapted for large vision-language models (LVLMs) to enhance multi-modal reasoning. Existing LVLMs often ignore the contents of generated rationales in CoT reasoning. We propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy.
arXiv Detail & Related papers (2025-07-10T12:07:13Z) - Caption This, Reason That: VLMs Caught in the Middle [3.4820139118440676]
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. They still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. We analyze VLM performance along core cognitive axes: Perception, Attention, and Memory.
arXiv Detail & Related papers (2025-05-24T14:25:48Z) - Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning [23.7096338281261]
This paper shows that Vision Language Models can achieve surprisingly strong decision-making performance when visual scenes are represented as text-only descriptions. We propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making.
arXiv Detail & Related papers (2025-03-21T09:25:23Z) - REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALue of Large Vision-Language Models. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values. We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z) - Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts (a hedged sketch of one reading of this idea follows the related-papers list). Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z) - Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction [0.0]
This paper introduces Contextualized Vision-Language Alignment (CoVLA), a discriminative framework designed to address the challenges of contextual ambiguity and modality discrepancy inherent in this task. Experiments on a benchmark dataset demonstrate that CoVLA significantly outperforms state-of-the-art methods, achieving improvements of 2.3% in accuracy and 2.5% in F1-score.
arXiv Detail & Related papers (2024-12-13T05:29:37Z) - VladVA: Discriminative Fine-tuning of LVLMs [67.14293827774827]
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. We propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs.
arXiv Detail & Related papers (2024-12-05T17:54:27Z) - Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z) - ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs [95.15814662348245]
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order.
Recent Vision-Language Models (VLMs) have demonstrated remarkable proficiency in such reasoning tasks.
arXiv Detail & Related papers (2024-06-12T12:54:27Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
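As referenced in the ComPaSS entry above, the sketch below gives one possible reading of "measuring semantic shifts" for plausibility scoring: embed a context with and without a candidate statement appended and treat a smaller shift (higher cosine similarity) as higher plausibility. The embedding model and the shift-to-score mapping are assumptions made for illustration, not details taken from that paper.

```python
from sentence_transformers import SentenceTransformer
from numpy import dot
from numpy.linalg import norm

# Any sentence-embedding model works for the sketch; this checkpoint is an
# arbitrary choice, not one specified by ComPaSS.
model = SentenceTransformer("all-MiniLM-L6-v2")

def plausibility_score(context: str, statement: str) -> float:
    """Score how plausibly `statement` follows `context` via semantic shift.

    Heuristic assumed here: appending a plausible statement should shift the
    context embedding less than appending an implausible one, so a higher
    cosine similarity (smaller shift) maps to a higher score.
    """
    base, extended = model.encode([context, context + " " + statement])
    return float(dot(base, extended) / (norm(base) * norm(extended)))

# Usage: compare two candidate continuations of the same context.
ctx = "He dropped the glass on the tile floor."
print(plausibility_score(ctx, "It shattered."))        # expected: higher
print(plausibility_score(ctx, "It started to sing."))  # expected: lower
```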