Real-Time Target Sound Extraction
- URL: http://arxiv.org/abs/2211.02250v3
- Date: Wed, 19 Apr 2023 09:43:32 GMT
- Title: Real-Time Target Sound Extraction
- Authors: Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya
Yoshioka, Shyamnath Gollakota
- Abstract summary: We present the first neural network model to achieve real-time and streaming target sound extraction.
We propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first neural network model to achieve real-time and streaming
target sound extraction. To accomplish this, we propose Waveformer, an
encoder-decoder architecture with a stack of dilated causal convolution layers
as the encoder, and a transformer decoder layer as the decoder. This hybrid
architecture uses dilated causal convolutions for processing large receptive
fields in a computationally efficient manner while also leveraging the
generalization performance of transformer-based architectures. Our evaluations
show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models
for this task while having a 1.2-4x smaller model size and a 1.5-2x lower
runtime. We provide code, dataset, and audio samples:
https://waveformer.cs.washington.edu/.
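The key property of the dilated causal convolution encoder is that its receptive field grows exponentially with depth while per-sample computation stays linear, and causality means no future samples are needed, which is what permits streaming. The following is a minimal pure-Python sketch of that idea (not the authors' implementation; kernel values, layer count, and dilation schedule are illustrative assumptions):

```python
def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: the output at time t depends only on
    x[t], x[t - d], x[t - 2d], ... — never on future samples, which is
    the property that makes streaming inference possible."""
    out = []
    for t in range(len(x)):
        s = 0.0
        for i, w in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:
                s += w * x[idx]
        out.append(s)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of causal convolutions
    with the given per-layer dilation factors."""
    return 1 + (kernel_size - 1) * sum(dilations)


# A hypothetical 10-layer stack, kernel size 3, dilations 1, 2, 4, ..., 512:
dilations = [2 ** i for i in range(10)]
print(receptive_field(3, dilations))  # 2047 samples, from only 10 layers

# Causality check: an impulse at t=4 cannot affect any output before t=4.
y = causal_dilated_conv([0, 0, 0, 0, 1.0, 0, 0, 0], [0.5, 0.25, 0.25], dilation=2)
print(y[:4])  # all zeros
```

Doubling the dilation at each layer is the standard way to cover a large context cheaply; a transformer decoder layer can then attend over the encoder's compact representation rather than raw samples.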
Related papers
- Extreme Encoder Output Frame Rate Reduction: Improving Computational
Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z) - Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z) - I3D: Transformer architectures with input-dependent dynamic depth for
speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
arXiv Detail & Related papers (2023-03-14T04:47:00Z) - Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with
Very Low Computational Complexity [23.49462995118466]
The Framewise WaveGAN vocoder achieves higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet, at a very low complexity of 1.2 GFLOPS.
This makes GAN vocoders more practical on edge and low-power devices.
arXiv Detail & Related papers (2022-12-08T19:38:34Z) - FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech
Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z) - Yformer: U-Net Inspired Transformer Architecture for Far Horizon Time
Series Forecasting [0.0]
Yformer model is based on a novel Y-shaped encoder-decoder architecture that uses direct connection from the downscaled encoder layer to the corresponding upsampled decoder layer in a U-Net inspired architecture.
Experiments with relevant baselines on four benchmark datasets demonstrate average improvements of 19.82 and 18.41 percent in MSE, and 13.62 and 11.85 percent in MAE.
arXiv Detail & Related papers (2021-10-13T13:35:54Z) - Non-autoregressive End-to-end Speech Translation with Parallel
Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z) - Scalable and Efficient Neural Speech Coding [24.959825692325445]
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression.
The proposed CNN autoencoder also defines quantization and coding as a trainable module.
Compared to other autoregressive decoder-based neural speech codecs, our decoder has a significantly smaller architecture.
arXiv Detail & Related papers (2021-03-27T00:10:16Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective
with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR)
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.