From Waveforms to Pixels: A Survey on Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2508.03724v1
- Date: Tue, 29 Jul 2025 22:20:51 GMT
- Title: From Waveforms to Pixels: A Survey on Audio-Visual Segmentation
- Authors: Jia Li, Yapeng Tian
- Abstract summary: Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. We present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies.
- Score: 43.79010208565961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion strategies, and training paradigms on performance. Finally, we outline the current challenges, such as limited temporal modeling, modality bias toward vision, lack of robustness in complex environments, and high computational demands, and propose promising future directions, including improving temporal reasoning and multimodal fusion, leveraging foundation models for better generalization and few-shot learning, reducing reliance on labeled data through self- and weakly supervised learning, and incorporating higher-level reasoning for more intelligent AVS systems.
Related papers
- Advanced Unsupervised Learning: A Comprehensive Overview of Multi-View Clustering Techniques [10.97758170701855]
Multi-view clustering (MVC) is a class of unsupervised multi-view learning.
MVC compensates for the shortcomings of single-view methods.
The semantically rich nature of multi-view data increases its practical utility.
arXiv Detail & Related papers (2025-12-04T16:32:02Z)
- Chunking Strategies for Multimodal AI Systems [0.0]
This survey provides a comprehensive taxonomy and technical analysis of chunking strategies tailored for each modality.
We examine classical and modern approaches such as fixed-size token windowing, object-centric visual chunking, silence-based audio segmentation, and scene detection in videos.
We explore emerging cross-modal chunking strategies that aim to preserve alignment and semantic consistency across disparate data types.
arXiv Detail & Related papers (2025-11-28T19:48:14Z)
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision.
The recent emergence of Video-Large Multimodal Models has demonstrated remarkable capabilities in video understanding tasks.
This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z)
- Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis [11.373305523732718]
Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems.
Masked Autoencoders (MAE) have gained momentum, with growing adaptations in audio-visual contexts.
AVF-MAE++ is a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA.
arXiv Detail & Related papers (2025-09-29T02:53:49Z)
- A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects [53.15503034595476]
Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes.
arXiv Detail & Related papers (2025-06-16T14:39:03Z)
- A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution [30.62413133817583]
This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution.
We introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos.
We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset.
arXiv Detail & Related papers (2025-06-07T08:24:44Z)
- Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models [13.63552417613795]
We propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models.
Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations.
arXiv Detail & Related papers (2025-06-06T21:06:35Z)
- Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives [36.297745473653166]
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language.
Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress.
arXiv Detail & Related papers (2025-05-20T13:47:40Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [49.073964142139495]
We systematically review the applications and advancements of multimodal fusion methods and vision-language models.
For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks.
We identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation.
arXiv Detail & Related papers (2025-04-03T10:53:07Z)
- Multimodal Alignment and Fusion: A Survey [11.3029945633295]
This survey provides a comprehensive overview of advances in multimodal alignment and fusion within the field of machine learning.
We systematically categorize and analyze key approaches to alignment and fusion through both structural perspectives.
This survey highlights critical challenges such as cross-modal misalignment, computational bottlenecks, data quality issues, and the modality gap.
arXiv Detail & Related papers (2024-11-26T02:10:27Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Diffusion-based Visual Counterfactual Explanations -- Towards Systematic Quantitative Evaluation [64.0476282000118]
Latest methods for visual counterfactual explanations (VCE) harness the power of deep generative models to synthesize new examples of high-dimensional images of impressive quality.
It is currently difficult to compare the performance of these VCE methods as the evaluation procedures largely vary and often boil down to visual inspection of individual examples and small scale user studies.
We propose a framework for systematic, quantitative evaluation of the VCE methods and a minimal set of metrics to be used.
arXiv Detail & Related papers (2023-08-11T12:22:37Z)
- Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer architecture.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
arXiv Detail & Related papers (2021-05-19T13:30:34Z)