Rethinking Speech Recognition with A Multimodal Perspective via Acoustic
and Semantic Cooperative Decoding
- URL: http://arxiv.org/abs/2305.14049v1
- Date: Tue, 23 May 2023 13:25:44 GMT
- Title: Rethinking Speech Recognition with A Multimodal Perspective via Acoustic
and Semantic Cooperative Decoding
- Authors: Tian-Hao Zhang, Hai-Bo Qin, Zhi-Hao Lai, Song-Lu Chen, Qi Liu, Feng
Chen, Xinyuan Qian, Xu-Cheng Yin
- Abstract summary: We propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR.
Unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively.
We show that ASCD significantly improves the performance by leveraging both the acoustic and semantic information cooperatively.
- Score: 29.80299587861207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based encoder-decoder (AED) models have shown impressive
performance in ASR. However, most existing AED methods neglect to
simultaneously leverage both acoustic and semantic features in the decoder, which
is crucial for generating more accurate and informative semantic states. In
this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for
ASR. In particular, unlike vanilla decoders that process acoustic and semantic
features in two separate stages, ASCD integrates them cooperatively. To prevent
information leakage during training, we design a Causal Multimodal Mask.
Moreover, a variant Semi-ASCD is proposed to balance accuracy and computational
cost. Our proposal is evaluated on the publicly available AISHELL-1 and
aidatatang_200zh datasets using Transformer, Conformer, and Branchformer as
encoders, respectively. The experimental results show that ASCD significantly
improves the performance by leveraging both the acoustic and semantic
information cooperatively.
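The Causal Multimodal Mask can be pictured as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes the decoder attends over a concatenation of acoustic frames followed by semantic (text) tokens, where every position may see all acoustic frames, but semantic positions see only themselves and earlier semantic tokens, so future labels cannot leak during training.

```python
import numpy as np

def causal_multimodal_mask(num_acoustic: int, num_semantic: int) -> np.ndarray:
    """Boolean attention mask over [acoustic frames; semantic tokens].

    True means attention is allowed. Illustrative only; the layout and
    naming are assumptions, not the paper's exact formulation.
    """
    total = num_acoustic + num_semantic
    mask = np.zeros((total, total), dtype=bool)
    # Every position may attend to all acoustic frames (no leakage risk there).
    mask[:, :num_acoustic] = True
    # Semantic positions attend causally among themselves:
    # each token sees itself and earlier tokens only.
    causal = np.tril(np.ones((num_semantic, num_semantic), dtype=bool))
    mask[num_acoustic:, num_acoustic:] = causal
    return mask
```

For example, with 3 acoustic frames and 2 semantic tokens, the first semantic token can attend to all 3 frames and to itself, but not to the second semantic token.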
Related papers
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- Agent-driven Generative Semantic Communication with Cross-Modality and Prediction [57.335922373309074]
We propose a novel agent-driven generative semantic communication framework based on reinforcement learning.
In this work, we develop an agent-assisted semantic encoder with cross-modality capability, which can track semantic changes and channel conditions to perform adaptive semantic extraction and sampling.
The effectiveness of the designed models has been verified using the UA-DETRAC dataset, demonstrating the performance gains of the overall A-GSC framework.
arXiv Detail & Related papers (2024-04-10T13:24:27Z)
- An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement [9.28943772676672]
The code-switching phenomenon remains a major obstacle that hinders automatic speech recognition.
We introduce a novel disentanglement loss to enable the lower-layer of the encoder to capture inter-lingual acoustic information.
We verify that our proposed method outperforms the prior-art methods using pretrained dual-encoders.
arXiv Detail & Related papers (2024-02-27T04:08:59Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Encoder-decoder multimodal speaker change detection [15.290910973040152]
Speaker change detection (SCD) is essential for several applications.
Multimodal SCD models, which utilise the text modality in addition to audio, have shown improved performance.
This study builds upon two main proposals: a novel mechanism for modality fusion and the adoption of an encoder-decoder architecture.
arXiv Detail & Related papers (2023-06-01T13:55:23Z)
- Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks [28.440232737011453]
We propose a solution combining Transducer and Attention-based Encoder-Decoder (TAED) modeling for speech-to-text tasks.
The new method leverages AED's strength in non-monotonic sequence-to-sequence learning while retaining Transducer's streaming property.
We evaluate the proposed approach on the MuST-C dataset, and the findings demonstrate that TAED performs significantly better than Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks.
arXiv Detail & Related papers (2023-05-04T18:34:50Z)
- String-based Molecule Generation via Multi-decoder VAE [56.465033997245776]
We investigate the problem of string-based molecular generation via variational autoencoders (VAEs).
We propose a simple, yet effective idea to improve the performance of VAE for the task.
In our experiments, the proposed VAE model particularly performs well for generating a sample from out-of-domain distribution.
arXiv Detail & Related papers (2022-08-23T03:56:30Z)
- Joint Encoder-Decoder Self-Supervised Pre-training for ASR [0.0]
Self-supervised learning has shown tremendous success in various speech-related downstream tasks.
In this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning.
arXiv Detail & Related papers (2022-06-09T12:45:29Z)
- LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
arXiv Detail & Related papers (2022-03-10T16:41:12Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- On the Encoder-Decoder Incompatibility in Variational Text Modeling and Beyond [82.18770740564642]
Variational autoencoders (VAEs) combine latent variables with amortized variational inference.
We observe the encoder-decoder incompatibility that leads to poor parameterizations of the data manifold.
We propose Coupled-VAE, which couples a VAE model with a deterministic autoencoder with the same structure.
arXiv Detail & Related papers (2020-04-20T10:34:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.