VideoSAVi: Self-Aligned Video Language Models without Human Supervision
- URL: http://arxiv.org/abs/2412.00624v2
- Date: Sun, 30 Mar 2025 01:19:52 GMT
- Title: VideoSAVi: Self-Aligned Video Language Models without Human Supervision
- Authors: Yogesh Kulkarni, Pooyan Fazli
- Abstract summary: VideoSAVi is a self-training pipeline that enables Video-LLMs to reason over video content without external supervision. VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements across other benchmarks. Our model-agnostic approach is computationally efficient, requiring only 32 frames.
- Score: 0.6854849895338531
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in video-large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or ground-truth captions to generate preference data (i.e., pairs of model outputs ranked based on their quality or alignment with human judgment), which is then used to train models for video-language alignment. This approach is both costly and labor-intensive. To address this limitation, we introduce VideoSAVi (Self-Aligned Video Language Model), a self-training pipeline that enables Video-LLMs to reason over video content without external supervision. Our approach includes a self-critiquing mechanism that identifies reasoning errors in the model's initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi then applies Direct Preference Optimization (DPO), which uses the preference data to iteratively train the model, enhancing temporal and spatial reasoning in video understanding. Experiments show that VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements across other benchmarks, including a 3.9% gain on PerceptionTest and a substantial 6.8% improvement on the challenging EgoSchema dataset compared to baseline models. Our model-agnostic approach is computationally efficient, requiring only 32 frames, offering a promising direction for self-aligned video understanding without reliance on external models or annotations.
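A minimal sketch of how the self-critiquing step and the DPO objective described above could fit together. This is an illustrative sketch, not the authors' released code: the Video-LLM interface below (`answer`, `critique_and_revise`) is a hypothetical placeholder, and only the DPO loss itself follows the standard formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: prefer the self-revised answer (chosen) over
    the model's initial answer (rejected), relative to a frozen reference."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def build_preference_pair(video_llm, frames, question):
    """Self-critiquing step (hypothetical method names): the model answers,
    critiques its own answer, and the (revised, initial) responses become
    the (chosen, rejected) pair -- no external supervision involved."""
    initial = video_llm.answer(frames, question)
    revised = video_llm.critique_and_revise(frames, question, initial)
    return {"chosen": revised, "rejected": initial}

# Toy usage with scalar sequence log-probabilities (summed over tokens):
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.5))
```

Iterating this loop (generate pairs with the current model, then run DPO on them) is what makes the pipeline self-aligned rather than dependent on proprietary APIs or ground-truth captions.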
Related papers
- VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment [0.6854849895338531]
Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity.
We introduce VideoPASTA, a framework that enhances Video-LLMs through targeted preference optimization.
arXiv Detail & Related papers (2025-04-18T22:28:03Z) - Learning from Streaming Video with Orthogonal Gradients [62.51504086522027]
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner.
This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch.
We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks.
arXiv Detail & Related papers (2025-04-02T17:59:57Z) - VPO: Aligning Text-to-Video Generation Models with Prompt Optimization [80.86205966195593]
Video generation models are typically trained on text-to-video pairs with highly detailed and carefully crafted descriptions.
We introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness.
Our experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods.
arXiv Detail & Related papers (2025-03-26T12:28:20Z) - Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [26.866184981409607]
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead.
Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders.
Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:56Z) - OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization [30.6130504613716]
We introduce OnlineVPO, a preference learning approach tailored specifically for video diffusion models.
By employing a video reward model to provide concise video feedback on the fly, OnlineVPO delivers effective and efficient preference guidance.
arXiv Detail & Related papers (2024-12-19T18:34:50Z) - VideoDPO: Omni-Preference Alignment for Video Diffusion Generation [48.36302380755874]
Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation.
We propose a VideoDPO pipeline by making several key adjustments.
Our experiments demonstrate substantial improvements in both visual quality and semantic alignment.
arXiv Detail & Related papers (2024-12-18T18:59:49Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [118.65089648651308]
This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content.
We show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video Question Answering (QA) tasks.
arXiv Detail & Related papers (2024-04-01T17:28:16Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning [0.0]
Video Annotator (VA) is a framework for annotating, managing, and iterating on video classification datasets.
VA allows for a continuous annotation process, seamlessly integrating data collection and model training.
VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline.
arXiv Detail & Related papers (2024-02-09T17:19:05Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy that employs a multimodal AI system to oversee itself, called Reinforcement Learning from AI Feedback (RLAIF).
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z) - Distilling Vision-Language Models on Millions of Videos [62.92789440875999]
We fine-tune a video-language model from a strong image-language baseline with synthesized instructional data.
The resulting video-instruction-tuned (VIIT) model is then used to auto-label millions of videos, generating high-quality captions.
As a side product, we generate the largest video caption dataset to date.
arXiv Detail & Related papers (2024-01-11T18:59:53Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future research on more advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)