Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
- URL: http://arxiv.org/abs/2602.03891v2
- Date: Thu, 05 Feb 2026 03:35:12 GMT
- Title: Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
- Authors: Seohyun Joo, Yoori Oh
- Abstract summary: We propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). DAViHD is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. We achieve new state-of-the-art performance on the large-scale MrHiSum benchmark.
- Score: 3.6453477876255502
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale MrHiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
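The abstract describes the architecture only at a high level, so the snippet below is a minimal, hypothetical PyTorch sketch of a dual-pathway audio encoder along those lines. All class names, feature dimensions, and the specific choice of a sigmoid frequency gate plus a temporal convolution for the dynamic pathway are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a dual-pathway audio encoder; not the authors' code.
import torch
import torch.nn as nn


class DynamicPathway(nn.Module):
    """Models spectro-temporal dynamics with a frequency-adaptive gate."""

    def __init__(self, n_mels: int = 64, dim: int = 256):
        super().__init__()
        # Per-frame gate over mel bands: emphasizes salient spectral bands.
        self.freq_gate = nn.Sequential(nn.Linear(n_mels, n_mels), nn.Sigmoid())
        # Temporal convolution responds to rapid energy changes across frames.
        self.temporal = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) log-mel spectrogram
        gated = mel * self.freq_gate(mel)              # frequency-adaptive weighting
        feats = self.temporal(gated.transpose(1, 2))   # (batch, dim, time)
        return feats.transpose(1, 2)                   # (batch, time, dim)


class DualPathwayAudioEncoder(nn.Module):
    """Fuses a semantic pathway with the dynamic pathway above."""

    def __init__(self, semantic_dim: int = 527, n_mels: int = 64, dim: int = 256):
        super().__init__()
        # Semantic pathway: project frame-level embeddings from a pretrained
        # audio tagger (e.g., an AudioSet-style classifier) -- an assumption here.
        self.semantic_proj = nn.Linear(semantic_dim, dim)
        self.dynamic = DynamicPathway(n_mels, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, semantic_feats: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # semantic_feats: (batch, time, semantic_dim); mel: (batch, time, n_mels)
        sem = self.semantic_proj(semantic_feats)
        dyn = self.dynamic(mel)
        # Per-frame audio tokens to be fed into the audio-visual highlight model.
        return self.fuse(torch.cat([sem, dyn], dim=-1))


if __name__ == "__main__":
    enc = DualPathwayAudioEncoder()
    tokens = enc(torch.randn(2, 100, 527), torch.randn(2, 100, 64))
    print(tokens.shape)  # torch.Size([2, 100, 256])
```

In this sketch the gate re-weights mel bands per frame (salient spectral bands) while the temporal convolution reacts to rapid energy changes, mirroring the two properties the dynamic pathway is said to capture; how the paper actually implements its frequency-adaptive mechanism may differ.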
Related papers
- Semantic visually-guided acoustic highlighting with large vision-language models [34.707752102338816]
Current audio mixing remains largely manual and labor-intensive. It remains unclear which visual aspects are most effective as conditioning signals. We identify which visual-semantic cues most strongly support coherent and visually aligned audio remixing.
arXiv Detail & Related papers (2026-01-12T01:30:15Z) - SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
arXiv Detail & Related papers (2025-08-01T17:05:04Z) - Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics [26.399212357764576]
We propose Dynamic Derivation and Elimination (DDESeg), a novel audio-visual segmentation framework. To mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal. To reduce the matching difficulty, we introduce a discriminative feature learning module.
arXiv Detail & Related papers (2025-03-17T05:38:05Z) - AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
We propose a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models.
arXiv Detail & Related papers (2024-12-19T18:57:21Z) - MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z) - Relevance-guided Audio Visual Fusion for Video Saliency Prediction [23.873134951154704]
We propose a novel relevance-guided audio-visual saliency prediction network.
The Fusion module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements.
The Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales.
arXiv Detail & Related papers (2024-11-18T10:42:27Z) - Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis [28.172213291270868]
Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience.
Video-to-Audio (V2A) presents inherent challenges related to audio-visual synchronization.
We construct a controllable video-to-audio model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals.
arXiv Detail & Related papers (2024-09-10T01:07:20Z) - Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition [29.414663568089292]
Audio-visual speech recognition aims to transcribe human speech using both audio and video modalities.
In this study, we strengthen the video features by learning three temporal dynamics in video data.
We achieve the state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks for the noise-dominant settings.
arXiv Detail & Related papers (2024-07-04T01:25:20Z) - EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving [64.58258341591929]
Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving.
We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
We establish the first set of large-scale AR-MOT benchmarks.
arXiv Detail & Related papers (2024-02-28T12:50:16Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Audiovisual SlowFast Networks for Video Recognition [140.08143162600354]
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts.
We report results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.
arXiv Detail & Related papers (2020-01-23T18:59:46Z)