AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling
- URL: http://arxiv.org/abs/2406.11619v1
- Date: Mon, 17 Jun 2024 15:04:15 GMT
- Title: AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling
- Authors: Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang
- Abstract summary: This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation.
AV-CrossNet extends the CrossNet architecture, a recently proposed network that performs complex spectral mapping for speech separation.
Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
- Score: 48.23652686272613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet extends the CrossNet architecture, a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before being fed to the AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and the COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
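The abstract outlines the system's front end: pre-extracted visual embeddings pass through a visual encoder made of temporal convolutional layers, and audio and visual features are combined in an early fusion layer before the AV-CrossNet blocks. The following is a minimal PyTorch sketch of such a front end, not the authors' implementation; the feature dimensions, layer counts, and nearest-neighbor upsampling of visual features to the audio frame rate are illustrative assumptions.

```python
# Hedged sketch of the audiovisual front end described in the abstract:
# a temporal-convolutional encoder over pre-extracted visual embeddings,
# followed by an early fusion layer that combines audio and visual features
# before the AV-CrossNet blocks. All dimensions below are assumptions.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Temporal convolutions over pre-extracted visual embeddings (B, T_v, D_v)."""

    def __init__(self, embed_dim=512, hidden_dim=256, num_layers=3):
        super().__init__()
        layers, in_ch = [], embed_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_ch, hidden_dim, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, v):                                    # v: (B, T_v, D_v)
        return self.net(v.transpose(1, 2)).transpose(1, 2)   # (B, T_v, H)


class EarlyFusion(nn.Module):
    """Align visual features to the audio frame count and fuse by concatenation."""

    def __init__(self, audio_dim, visual_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, out_dim)

    def forward(self, a, v):                                  # a: (B, T_a, F), v: (B, T_v, H)
        # Upsample visual features (e.g., 25 fps lip embeddings) to T_a audio frames.
        v = nn.functional.interpolate(v.transpose(1, 2), size=a.shape[1],
                                      mode="nearest").transpose(1, 2)
        return self.proj(torch.cat([a, v], dim=-1))           # (B, T_a, out_dim)


if __name__ == "__main__":
    audio = torch.randn(2, 200, 514)    # e.g., stacked real/imaginary STFT features
    visual = torch.randn(2, 50, 512)    # pre-extracted visual (lip) embeddings
    fused = EarlyFusion(514, 256, 256)(audio, VisualEncoder()(visual))
    print(fused.shape)                  # torch.Size([2, 200, 256]) -> AV-CrossNet blocks
```

In this sketch, the fused features would then feed the AV-CrossNet blocks, which perform complex spectral mapping with global attention and positional encoding as described in the abstract.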
Related papers
- Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning [44.518249924335045]
Perception Audiovisual, PE-AV, is a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and supports joint embeddings across audio-video, audio-text, and video-text modalities.
arXiv Detail & Related papers (2025-12-22T18:59:07Z) - Learning Visual Affordance from Audio [29.90423475741895]
We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. We also propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder.
arXiv Detail & Related papers (2025-12-01T18:58:56Z) - AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task.
Our framework incorporates two key components for video understanding and cross-modal learning.
Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) that devotes the main training parameters to multiple cross-modal attention layers; see the cross-modal attention sketch after this list.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - AVSegFormer: Audio-Visual Segmentation with Transformer [42.24135756439358]
A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video.
This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges.
We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
arXiv Detail & Related papers (2023-07-03T16:37:10Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonized and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation [30.756247389435803]
Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks.
We propose AV-SAM, a framework built on SAM that can generate sounding object masks corresponding to the audio.
We conduct extensive experiments on Flickr-SoundNet and AVSBench datasets.
arXiv Detail & Related papers (2023-05-03T00:33:52Z) - Attentional Graph Convolutional Network for Structure-aware Audio-Visual Scene Classification [15.559827597608466]
We present an end-to-end framework, namely attentional graph convolutional network (AGCN) for structure-aware audio-visual scene representation.
To better represent the salient regions and contextual information of audio-visual inputs, the salient acoustic graph (SAG) and contextual acoustic graph (CAG) are constructed.
Finally, the constructed graphs pass through a graph convolutional network for structure-aware audio-visual scene recognition.
arXiv Detail & Related papers (2022-12-31T07:56:00Z) - Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
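Several of the related systems above fuse audio and visual streams with cross-modal attention; the CMFE entry, for example, devotes its main training parameters to multiple cross-modal attention layers. Below is a minimal, hedged PyTorch sketch of that general fusion mechanism, not the implementation of any paper listed here; shapes and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of audio-guided cross-modal attention fusion (illustrative only).
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Audio frames query visual features; the attended result is added back."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):            # audio: (B, T_a, D), visual: (B, T_v, D)
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)          # residual connection, (B, T_a, D)


if __name__ == "__main__":
    a = torch.randn(2, 200, 256)                 # audio frame features
    v = torch.randn(2, 50, 256)                  # visual (lip) features
    print(CrossModalFusion()(a, v).shape)        # torch.Size([2, 200, 256])
```

Because the audio frames act as queries, the fused representation keeps the audio frame rate, which is convenient when the downstream task (recognition or separation) operates on audio frames.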