MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
- URL: http://arxiv.org/abs/2510.08508v1
- Date: Thu, 09 Oct 2025 17:42:51 GMT
- Title: MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
- Authors: Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
- Abstract summary: Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions. We propose MoA-VR, which mimics the reasoning and processing procedures of human professionals through three coordinated agents. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier.
- Score: 62.929029990341796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.
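The identify-route-restore-assess loop described in the abstract maps onto a simple orchestration pattern. Below is a minimal sketch of that loop, not the authors' implementation: the agent interfaces, tool registry, acceptance threshold, and retry count are all hypothetical stand-ins for illustration.

```python
# Minimal sketch of a mixture-of-agents restoration loop (hypothetical API).
from typing import Callable, Dict, List

Video = object  # placeholder for an actual video/tensor type

def identify_degradations(video: Video) -> List[str]:
    """VLM-driven identifier: returns degradation labels, e.g. ['noise', 'low_light']."""
    raise NotImplementedError

def route(degradations: List[str], history: List[dict]) -> List[str]:
    """LLM-driven router: picks an ordered sequence of restoration tools,
    informed by previously observed tool-usage outcomes."""
    raise NotImplementedError

def assess_quality(video: Video) -> float:
    """VQA model scoring intermediate and final outputs (cf. Res-VQ)."""
    raise NotImplementedError

def restore_all_in_one(video: Video,
                       tools: Dict[str, Callable[[Video], Video]],
                       accept: float = 0.8, max_rounds: int = 3) -> Video:
    history: List[dict] = []
    for _ in range(max_rounds):
        degs = identify_degradations(video)           # Agent 1: identify
        plan = route(degs, history)                   # Agent 2: route
        for name in plan:                             #          and restore
            video = tools[name](video)
        score = assess_quality(video)                 # Agent 3: assess
        history.append({"degradations": degs, "plan": plan, "score": score})
        if score >= accept:  # quality is acceptable; stop iterating
            return video
    return video
```

The history list is what would let a router "learn effective restoration strategies by observing tool usage patterns": each round records which plan was tried and what quality it achieved.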
Related papers
- Restoration Adaptation for Semantic Segmentation on Low Quality Images [29.60165376603045]
In real-world scenarios, the performance of semantic segmentation often deteriorates when processing low-quality (LQ) images. We propose a Semantic-trained Restoration (SCR) model, which injects segmentation priors into the restoration model. RASS then transfers semantic restoration knowledge into segmentation through LoRA-based module merging and task-specific fine-tuning.
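The "LoRA-based module merging" mentioned in this summary generally means folding a low-rank update back into frozen base weights. A generic sketch of that operation (the rank, scaling, and shapes are illustrative assumptions, not values from the paper):

```python
import torch

def merge_lora(base_weight: torch.Tensor, lora_A: torch.Tensor,
               lora_B: torch.Tensor, alpha: float, rank: int) -> torch.Tensor:
    """Fold a LoRA update into the base weight: W' = W + (alpha / r) * B @ A.

    Shapes: base_weight (out, in), lora_A (rank, in), lora_B (out, rank).
    """
    return base_weight + (alpha / rank) * (lora_B @ lora_A)

# Toy usage with illustrative shapes.
W = torch.randn(512, 512)
A = torch.randn(8, 512) * 0.01   # rank-8 down-projection
B = torch.zeros(512, 8)          # standard LoRA init: B starts at zero
W_merged = merge_lora(W, A, B, alpha=16.0, rank=8)
assert torch.allclose(W, W_merged)  # B == 0, so merging is a no-op here
```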
arXiv Detail & Related papers (2026-02-15T08:13:23Z)
- LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models [4.497411606350301]
Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition. We propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL.
arXiv Detail & Related papers (2026-01-14T03:32:55Z) - SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal.<n>Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise.<n>We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z) - Towards Unified Video Quality Assessment [9.94708780294855]
Unified-VQA is a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats.<n>A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss.<n> Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks.
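A "ranking-inspired loss" for quality models is typically a pairwise margin objective: if one video is annotated as higher quality than another, its predicted score should exceed the other's by a margin. A generic sketch of that idea (the margin and pairing scheme are assumptions, not details from the paper):

```python
import torch

def pairwise_ranking_loss(scores_hi: torch.Tensor, scores_lo: torch.Tensor,
                          margin: float = 0.5) -> torch.Tensor:
    """Hinge loss encouraging scores_hi > scores_lo + margin per pair.

    scores_hi / scores_lo: predicted quality for the preferred / dispreferred
    video in each annotated pair.
    """
    return torch.clamp(margin - (scores_hi - scores_lo), min=0.0).mean()

# Toy usage: three annotated pairs; only the second violates the ordering.
hi = torch.tensor([0.9, 0.7, 0.6])
lo = torch.tensor([0.2, 0.8, 0.1])
loss = pairwise_ranking_loss(hi, lo)
```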
arXiv Detail & Related papers (2025-12-01T21:43:38Z) - CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video [9.172799792564009]
We propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large models.<n>Our approach introduces a quality-aware video metadata mechanism that integrates key fragments extracted from inter-frame variations.<n>Our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations.
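One plausible reading of "key fragments extracted from inter-frame variations" is selecting the temporally most active region by frame differencing. The sketch below implements only that generic idea; it is not CAMP-VQA's actual mechanism, and the patch size is an arbitrary choice:

```python
import numpy as np

def most_active_fragment(frames: np.ndarray, patch: int = 64) -> tuple:
    """Return the (row, col) of the patch with the largest mean absolute
    inter-frame difference. frames: (T, H, W) grayscale array."""
    diff = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=0)
    best, best_rc = -1.0, (0, 0)
    H, W = diff.shape
    for r in range(0, H - patch + 1, patch):
        for c in range(0, W - patch + 1, patch):
            energy = float(diff[r:r + patch, c:c + patch].mean())
            if energy > best:
                best, best_rc = energy, (r, c)
    return best_rc

# Toy usage: 8 frames of near-static noise with motion in one patch.
frames = np.random.rand(8, 128, 128) * 0.01
frames[:, 64:, :64] += np.random.rand(8, 64, 64)  # high-variation region
print(most_active_fragment(frames))  # -> (64, 0)
```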
arXiv Detail & Related papers (2025-11-10T16:37:47Z) - OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation [91.45421429922506]
OneCAT is a unified multimodal model that seamlessly integrates understanding, generation, and editing.<n>Our framework eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference.
arXiv Detail & Related papers (2025-09-03T17:29:50Z) - Team of One: Cracking Complex Video QA with Model Synergy [24.75732964829523]
We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios.<n>Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries.
arXiv Detail & Related papers (2025-07-18T11:12:44Z) - EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment [68.77813885751308]
EyeSimVQA is a novel VQA framework that incorporates free-energy-based self-repair.<n>We show EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods.
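In free-energy-guided quality assessment, perceived quality is commonly related to the residual between an input and the visual system's internal prediction of it, with a simple restorer standing in for that predictor. Below is a generic sketch of such a residual feature, assuming a median filter as the predictor; it is not EyeSim-VQA's actual module:

```python
import numpy as np
from scipy.ndimage import median_filter

def free_energy_feature(frame: np.ndarray, size: int = 3) -> float:
    """Proxy for free energy: mean absolute residual between a frame and a
    simple internal 'prediction' of it (a median-filtered version here).
    Larger residuals indicate content the predictor cannot explain, which
    free-energy-based quality models associate with perceived distortion."""
    predicted = median_filter(frame.astype(np.float32), size=size)
    return float(np.abs(frame.astype(np.float32) - predicted).mean())

# Toy usage: noise raises the residual relative to a smooth gradient frame.
clean = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
noisy = clean + np.random.normal(0.0, 0.1, clean.shape)
print(free_energy_feature(clean), free_energy_feature(noisy))
```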
arXiv Detail & Related papers (2025-06-13T08:00:54Z) - VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios.<n>We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z) - Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding [67.57487747508179]
Multiple-in-one image restoration (IR) has made significant progress, aiming to handle all types of single-degradation image restoration with a single model.
In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations.
arXiv Detail & Related papers (2024-11-25T09:26:34Z) - Training-Free Large Model Priors for Multiple-in-One Image Restoration [24.230376300759573]
We present LMDIR, a training-free Large Model Driven Image Restoration framework. Our architecture comprises a query-based prompt encoder and a degradation-aware transformer block that injects global degradation knowledge. This design facilitates a single-stage training paradigm that addresses various degradations while supporting both automatic and user-guided restoration.
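A "degradation-aware transformer block that injects global degradation knowledge" suggests conditioning token features on a global degradation embedding. The sketch below shows one generic way to do this (FiLM-style modulation); the shapes and the injection mechanism are assumptions, not LMDIR's design:

```python
import torch
import torch.nn as nn

class DegradationAwareBlock(nn.Module):
    """Transformer block whose features are modulated by a global degradation
    embedding via FiLM-style scale and shift. Illustrative only."""

    def __init__(self, dim: int = 256, heads: int = 4, deg_dim: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.film = nn.Linear(deg_dim, 2 * dim)  # produces (scale, shift)

    def forward(self, x: torch.Tensor, deg: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) image tokens; deg: (B, deg_dim) degradation embedding
        scale, shift = self.film(deg).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

# Toy usage.
block = DegradationAwareBlock()
tokens = torch.randn(2, 196, 256)   # 14x14 token grid, batch of 2
deg_emb = torch.randn(2, 64)        # global degradation descriptor
out = block(tokens, deg_emb)        # -> (2, 196, 256)
```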
arXiv Detail & Related papers (2024-07-18T05:40:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.