AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
- URL: http://arxiv.org/abs/2506.05328v2
- Date: Tue, 22 Jul 2025 07:00:35 GMT
- Title: AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
- Authors: Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu
- Abstract summary: We introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. We propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released on https://av-reasoner.github.io.
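The abstract describes training with GRPO, which optimizes a policy using rewards normalized within a group of sampled responses to the same question. As a minimal illustrative sketch (the reward shape and function names are assumptions, not the paper's actual implementation), a counting reward and group-relative advantage could look like this:

```python
def counting_reward(predicted: int, target: int) -> float:
    """Reward in [0, 1]: 1.0 for an exact count, decaying with relative error.

    This soft shaping is a hypothetical choice; the paper may use a
    different (e.g. exact-match) reward.
    """
    if predicted == target:
        return 1.0
    rel_err = abs(predicted - target) / max(target, 1)
    return max(0.0, 1.0 - rel_err)


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score each sampled response's reward
    against the other responses to the same question, as in GRPO."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        # All responses scored equally; no learning signal for this group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

For example, if four sampled answers to "how many times does the bell ring?" score `[1.0, 0.8, 0.0, 0.0]`, the exact answer receives a positive advantage and the wrong ones negative, steering the policy toward accurate counts without a learned value model.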
Related papers
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores for the QA task on our proposed AVHaystacks benchmark.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos [22.10711693948861]
We propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. Experiments on the benchmark reveal that current models still struggle with multimodal reasoning.
arXiv Detail & Related papers (2025-06-04T16:33:41Z) - VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation [32.91687961164014]
Large multimodal models (LMMs) have emerged as a powerful tool for long video understanding (LVU). Most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated by the possibility of guessing the correct answer. We propose VideoEval-Pro, a realistic LVU benchmark containing questions with open-ended short answers.
arXiv Detail & Related papers (2025-05-20T17:26:32Z) - Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs [72.5567678952768]
AURELIA is a novel actor-critic based audio-visual (AV) reasoning framework. It distills structured, step-by-step reasoning into AVLLMs at test time. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness.
arXiv Detail & Related papers (2025-03-29T20:42:29Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training, built on a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models. QuoTA strategically allocates frame-level importance scores based on query relevance. We decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z) - SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [56.78088668917983]
We introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains. We design a semi-automated annotation pipeline to obtain 49,979 question-answer (QA) pairs over 1,353 streaming videos. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding.
arXiv Detail & Related papers (2025-02-15T14:29:44Z) - CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding [43.858197893052115]
CG-Bench is a novel benchmark for clue-grounded question answering in long videos. It features 1,219 manually curated videos categorized by a granular system with 14 primary, 171 secondary, and 638 tertiary categories. The benchmark includes 12,129 QA pairs across three major question types: perception, reasoning, and hallucination.
arXiv Detail & Related papers (2024-12-16T18:46:45Z) - VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding [48.26536049440913]
Video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data. Traditional remedies, such as fine-tuning on OOD datasets, are impractical due to high computational costs. We propose VideoICL, a novel video in-context learning framework for OOD tasks.
arXiv Detail & Related papers (2024-12-03T05:54:43Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs). We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.