video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
- URL: http://arxiv.org/abs/2502.11775v1
- Date: Mon, 17 Feb 2025 13:07:40 GMT
- Title: video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
- Authors: Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun Ma, Chao Zhang
- Abstract summary: We propose video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. We develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs.
- Score: 33.70837005629285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent advances in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have largely been limited to mathematical problems and visual or graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs spanning scenarios such as stand-up comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks, and pDPO achieves 6-8% improvements over the supervised fine-tuning model on RivaBench. The enhanced reasoning also equips video-SALMONN-o1 with zero-shot synthetic video detection capabilities.
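As a rough illustration only (the abstract does not specify how pDPO scores or selects contrastive steps), the sketch below shows a DPO-style loss applied to a preferred/rejected pair of individual reasoning steps rather than to whole responses; all names and values are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one preferred/rejected pair of reasoning steps.

    Each argument is the summed log-probability of a candidate step under
    the policy (logp_*) or a frozen reference model (ref_logp_*).
    Illustrative sketch only; not the paper's pDPO implementation.
    """
    chosen_margin = beta * (logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_margin - rejected_margin)

# Toy usage with placeholder log-probabilities (not real model outputs).
loss = step_dpo_loss(torch.tensor(-12.3), torch.tensor(-15.8),
                     torch.tensor(-13.0), torch.tensor(-15.5))
print(loss.item())
```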
Related papers
- Video-R1: Reinforcing Video Reasoning in MLLMs [27.99261687064233]
Video-R1 is the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models.
We first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning.
Instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process.
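The summary names T-GRPO without describing it; assuming it builds on the group-relative advantage normalisation of standard GRPO (the temporal-specific component is not specified here), a minimal sketch of that shared step:

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in standard GRPO: each sampled
    response's reward is normalised by the mean and std of its group.

    rewards: (group_size,) tensor of scalar rewards for responses sampled
    from the same video-question pair. Illustrative assumption only.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of four sampled answers for one question.
print(grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0])))
```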
arXiv Detail & Related papers (2025-03-27T17:59:51Z) - ReasVQA: Advancing VideoQA with Imperfect Reasoning Process [38.4638171723351]
ReasVQA (Reasoning-enhanced Video Question Answering) is a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models.
We evaluate ReasVQA on three popular benchmarks, and our results establish new state-of-the-art performance with significant improvements of +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA.
arXiv Detail & Related papers (2025-01-23T10:35:22Z) - Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware. We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z) - VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a novel self-training pipeline for vision-language models (VLMs). It generates its own training data without extensive manual annotation. VideoSAVi shows significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2024-12-01T00:33:05Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit spatio-temporal inference across object relations, interactions and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
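The blurb does not detail the semantic-aware redundancy reduction; one common way to realise such a step, shown purely as an assumed sketch (not VideoEspresso's actual pipeline), is to drop frames whose embeddings are nearly identical to the previously kept frame:

```python
import numpy as np

def deduplicate_frames(frame_embeddings, threshold=0.9):
    """Keep a frame only if it differs enough from the last kept frame.

    frame_embeddings: (num_frames, dim) array of per-frame features from
    any vision encoder. Hypothetical illustration, not the paper's method.
    """
    kept = [0]
    for i in range(1, len(frame_embeddings)):
        a, b = frame_embeddings[kept[-1]], frame_embeddings[i]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if cos < threshold:  # sufficiently dissimilar -> keep it
            kept.append(i)
    return kept

print(deduplicate_frames(np.random.randn(16, 512)))
```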
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [118.65089648651308]
This paper introduces a novel framework that utilizes detailed video captions as a proxy of video content.
We show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video Question Answering (QA) tasks.
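For context, the standard DPO objective that such caption-derived preference pairs would be plugged into (the construction of the pairs is specific to the paper and omitted here) is:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is a frozen reference model, and $\beta$ controls the strength of the implicit KL regularisation.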
arXiv Detail & Related papers (2024-04-01T17:28:16Z) - VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs).
Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework.
arXiv Detail & Related papers (2024-03-21T18:00:00Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)