SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
- URL: http://arxiv.org/abs/2503.04858v1
- Date: Thu, 06 Mar 2025 08:33:11 GMT
- Title: SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
- Authors: Kejia Chen, Jiawen Zhang, Jiacong Hu, Jiazhen Yang, Jian Lou, Zunlei Feng, Mingli Song
- Abstract summary: Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability. We present SHAPE, a self-supervised framework capable of transforming the already abundant supervised text-image pairs into holistic preference triplets.
- Score: 35.843587407696006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability, which steers the model behavior via preference fine-tuning on preference data structured as "image - winner text - loser text" triplets. However, existing approaches often suffer from limited diversity and high costs associated with human-annotated preference data, hindering LVLMs from fully achieving their intended alignment capabilities. We present SHAPE, a self-supervised framework capable of transforming the already abundant supervised text-image pairs into holistic preference triplets for more effective and cheaper LVLM alignment, eliminating the need for human preference annotations. Our approach facilitates LVLMs in progressively enhancing alignment capabilities through iterative self-improvement. The key design rationale is to devise preference triplets where the winner text consistently improves in holisticness and outperforms the loser response in quality, thereby pushing the model to "strive to the utmost" of alignment performance through preference fine-tuning. For each given text-image pair, SHAPE introduces multiple visual augmentations and pairs them with a summarized text to serve as the winner response, while designating the original text as the loser response. Experiments across 12 benchmarks on various model architectures and sizes, including LLaVA and DeepSeek-VL, show that SHAPE achieves significant gains, for example, achieving +11.3% on MMVet (comprehensive evaluation), +1.4% on MMBench (general VQA), and +8.0% on POPE (hallucination robustness) over baselines in 7B models. Notably, qualitative analyses confirm enhanced attention to visual details and better alignment with human preferences for holistic descriptions.
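The triplet construction described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PreferenceTriplet`, `build_triplet`, `augmentations`, `describe`, and `summarize` are hypothetical, and the step of describing each augmented view before summarizing is one possible reading of "pairs them with a summarized text". Only the roles come from the abstract: the summarized text is the winner and the original text is the loser.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class PreferenceTriplet:
    """One "image - winner text - loser text" training example."""
    image: str   # identifier or path of the original image (assumption)
    winner: str  # summarized text built from the augmented views
    loser: str   # original supervised text for the image


def build_triplet(
    image: str,
    text: str,
    augmentations: Sequence[Callable[[str], str]],
    describe: Callable[[str], str],
    summarize: Callable[[List[str]], str],
) -> PreferenceTriplet:
    """Turn one supervised text-image pair into a preference triplet.

    All callables are hypothetical placeholders:
      augmentations: visual transforms applied to the image
      describe:      the current model's description of one view
      summarize:     merges the per-view descriptions into one text
    """
    # Apply every visual augmentation to the same source image.
    views = [augment(image) for augment in augmentations]
    # Assumption: each augmented view gets its own candidate description.
    candidates = [describe(view) for view in views]
    # The summarized text serves as the winner; the original text is the loser.
    return PreferenceTriplet(image=image, winner=summarize(candidates), loser=text)
```

In an iterative setting, `describe` would presumably be backed by the model from the previous round, so that the winner text can improve from one round to the next; how the paper schedules these rounds is not stated in the abstract.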
Related papers
- From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models [58.16075709485292]
CAREVL is a novel method for preference reward modeling by reliably using both high- and low-confidence data.
CAREVL achieves performance improvements over traditional distillation-based methods on VL-RewardBench and MLLM-as-a-Judge benchmark.
arXiv Detail & Related papers (2025-03-08T16:13:18Z)
- Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization [19.37373012848517]
Large Vision Language Models (VLMs) are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies.
We introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset.
We also introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning.
arXiv Detail & Related papers (2025-02-18T18:59:57Z)
- Modality-Fair Preference Optimization for Trustworthy MLLM Alignment [11.796170286878056]
Direct Preference Optimization (DPO) is effective for aligning large language models (LLMs).
It often favors text over image information, leading to unreliable outputs and visual hallucinations.
We propose Modality-Fair Preference Optimization (MFPO) to balance text and image preferences.
arXiv Detail & Related papers (2024-10-20T08:56:52Z)
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
arXiv Detail & Related papers (2024-10-12T07:56:47Z)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks.
Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results.
We propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies.
arXiv Detail & Related papers (2024-05-24T23:09:27Z)
- Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
arXiv Detail & Related papers (2024-05-23T14:30:33Z)
- Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent.
However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models.
We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts.
arXiv Detail & Related papers (2024-04-02T11:40:38Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- Text Counterfactuals via Latent Optimization and Shapley-Guided Search [15.919650185010491]
We study the problem of generating counterfactual text for a classification model.
We aim to minimally alter the text to change the model's prediction.
White-box approaches have been successfully applied to similar problems in vision.
arXiv Detail & Related papers (2021-10-22T05:04:40Z)