A novel multimodal dynamic fusion network for disfluency detection in
spoken utterances
- URL: http://arxiv.org/abs/2211.14700v1
- Date: Sun, 27 Nov 2022 01:54:22 GMT
- Title: A novel multimodal dynamic fusion network for disfluency detection in
spoken utterances
- Authors: Sreyan Ghosh and Utkarsh Tyagi and Sonal Kumar and Manan Suri and
Rajiv Ratn Shah
- Abstract summary: We propose a novel multimodal architecture for disfluency detection from individual utterances.
Our architecture leverages a multimodal dynamic fusion network that adds minimal parameters over an existing text encoder.
We show that our proposed model achieves state-of-the-art results on the widely used English Switchboard corpus for disfluency detection.
- Score: 43.79216238760557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Disfluency, though originating from human spoken utterances, is primarily
studied as a uni-modal text-based Natural Language Processing (NLP) task. Based
on early-fusion and self-attention-based multimodal interaction between text
and acoustic modalities, in this paper, we propose a novel multimodal
architecture for disfluency detection from individual utterances. Our
architecture leverages a multimodal dynamic fusion network that adds minimal
parameters over an existing text encoder commonly used in prior art to leverage
the prosodic and acoustic cues hidden in speech. Through experiments, we show
that our proposed model achieves state-of-the-art results on the widely used
English Switchboard corpus for disfluency detection and outperforms prior unimodal and
multimodal systems in literature by a significant margin. In addition, we make
a thorough qualitative analysis and show that, unlike text-only systems, which
suffer from spurious correlations in the data, our system overcomes this
problem through additional cues from speech signals. We make all our code
publicly available on GitHub.
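The early fusion the abstract describes, mixing acoustic cues into per-token text representations before a shared encoder, can be illustrated with a minimal gated-fusion sketch. The dimensions, weights, and the `gated_fuse` helper below are hypothetical toy values, far simpler than the paper's actual dynamic fusion network:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(text_vec, audio_vec, gate_w, gate_b):
    """Fuse a text and an acoustic embedding for one token.

    A scalar gate, computed from the concatenated features, decides how
    much acoustic information to mix into the text representation.
    """
    concat = text_vec + audio_vec  # list concatenation: [text; audio]
    g = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)
    # Convex combination: each output dim lies between its two inputs.
    return [g * t + (1.0 - g) * a for t, a in zip(text_vec, audio_vec)]

# Toy 4-dimensional embeddings for a single token (illustrative values).
text_vec = [0.2, -0.1, 0.5, 0.3]
audio_vec = [0.9, 0.4, -0.2, 0.1]
gate_w = [0.1] * 8   # hypothetical gate weights
gate_b = 0.0

fused = gated_fuse(text_vec, audio_vec, gate_w, gate_b)
print(fused)
```

In a full model the fused vectors would feed the text encoder's transformer layers, so the extra parameter cost is just the gate, which matches the abstract's claim of adding minimal parameters over an existing text encoder.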
Related papers
- Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance [10.079930398169205]
Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms.
Extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances.
Multimodal fusion often suffers from redundancy and imbalance.
arXiv Detail & Related papers (2026-02-11T05:44:30Z)
- Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection [71.59834293521074]
We develop a framework to distinguish between human-authored and machine-generated text.
Our method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset.
Code, pretrained weights, and demo will be released.
arXiv Detail & Related papers (2025-10-07T08:14:45Z)
- From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training [19.396162898865864]
Text-to-Talk (TtT) is a unified audio-text framework that integrates autoregressive (AR) text generation with non-autoregressive (NAR) audio diffusion in a single Transformer.
To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text.
During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs.
arXiv Detail & Related papers (2025-09-24T12:44:26Z)
- MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data.
Experimental results on four public datasets demonstrate that MGCR-Net achieves superior performance compared to mainstream change detection (CD) methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z)
- SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations.
Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z)
- Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation [5.528860524494717]
This paper presents an innovative approach called BGTAI to simplify multimodal understanding by utilizing gloss-based annotation.
By representing text and audio as gloss notations that omit complex semantic nuances, a better alignment with images can potentially be achieved.
arXiv Detail & Related papers (2024-10-04T04:59:50Z)
- TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis [34.28164104577455]
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities.
Past research predominantly focused on improving representation learning techniques and feature fusion strategies.
We introduce a Text-oriented Cross-Attention Network (TCAN) emphasizing the predominant role of the text modality in MSA.
arXiv Detail & Related papers (2024-04-06T07:56:09Z)
- AI-generated text boundary detection with RoFT [7.2286849324485445]
We study how to detect the boundary between human-written and machine-generated parts of texts.
In particular, we find that perplexity-based approaches to boundary detection tend to be more robust to peculiarities of domain-specific data than supervised fine-tuning of the RoBERTa model.
arXiv Detail & Related papers (2023-11-14T17:48:19Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
A recent text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Cross-stitched Multi-modal Encoders [17.387919594858463]
We combine pretrained speech and text encoders using multi-headed cross-modal attention.
The resultant architecture can be used for continuous token-level classification or utterance-level prediction.
Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
arXiv Detail & Related papers (2022-04-20T05:09:36Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent space during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
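The perplexity-based boundary detection discussed in the RoFT entry above admits a minimal sketch. The scoring rule below (picking the split that maximizes the gap in mean per-token perplexity) and the toy numbers are illustrative assumptions, not that paper's exact method:

```python
def detect_boundary(perplexities):
    """Return the split index that maximizes the gap between the mean
    per-token perplexity before and after the split.

    One simple variant of perplexity-based boundary detection; it relies
    on machine continuations typically scoring lower perplexity under an
    LM than the human-written prefix.
    """
    best_idx, best_gap = 1, float("-inf")
    for i in range(1, len(perplexities)):
        left, right = perplexities[:i], perplexities[i:]
        gap = sum(left) / len(left) - sum(right) / len(right)
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx

# Toy per-token perplexities: a human-written prefix (high, varied)
# followed by a machine-generated continuation (low, flat).
ppl = [42.0, 55.0, 38.0, 61.0, 12.0, 10.0, 11.0, 9.0]
boundary = detect_boundary(ppl)
print(boundary)  # → 4: the generated part is estimated to begin at token 4
```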
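The vector-quantized latent spaces in the TTS/VC study above rest on one core operation: snapping a continuous latent vector to its nearest codebook entry. The 2-D codebook and latent vector below are toy values for illustration only:

```python
def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to `vec`
    (squared Euclidean distance), as in vector-quantized latents."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D codes
latent = [0.9, 0.1]  # hypothetical encoder output
index = quantize(latent, codebook)
print(index, codebook[index])  # → 1 [1.0, 0.0]
```

Because every utterance is forced through the same small set of codes, downstream decoding sees a discrete, consistent linguistic embedding, which is the property the study exploits for stable TTS/VC performance.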