An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering
- URL: http://arxiv.org/abs/2005.12142v1
- Date: Mon, 25 May 2020 14:41:28 GMT
- Title: An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering
- Authors: Chia-Chih Kuo, Shang-Bao Luo, Kuan-Yu Chen
- Abstract summary: In a spoken multiple-choice question answering (SMCQA) task, given a passage, a question, and multiple choices all in the form of speech, the machine needs to pick the correct choice to answer the question.
This study concentrates on designing a BERT-based SMCQA framework, which not only inherits the advantages of contextualized language representations learned by BERT, but also integrates complementary acoustic-level information distilled from audio with the text-level information.
- Score: 11.307739925111944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a spoken multiple-choice question answering (SMCQA) task, given a passage,
a question, and multiple choices all in the form of speech, the machine needs
to pick the correct choice to answer the question. While the audio could
contain useful cues for SMCQA, usually only the auto-transcribed text is
utilized in system development. Thanks to large-scale pre-trained language
representation models, such as bidirectional encoder representations from
transformers (BERT), systems using only auto-transcribed text can still achieve
a certain level of performance. However, previous studies have shown that
acoustic-level statistics can offset text inaccuracies caused by automatic
speech recognition systems or by representation inadequacies lurking in word
embedding generators, thereby making an SMCQA system more robust. Along this
line of research, this study concentrates on designing a BERT-based SMCQA
framework, which not only inherits the advantages of contextualized language
representations learned by BERT, but also integrates complementary
acoustic-level information distilled from audio with the text-level
information. Consequently, an audio-enriched BERT-based SMCQA framework is
proposed. A series of experiments demonstrates remarkable improvements in
accuracy over selected baselines and SOTA systems on a published Chinese SMCQA
dataset.
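To make the fusion idea concrete, the sketch below shows one plausible reading of it in PyTorch with the Hugging Face transformers library: BERT encodes each auto-transcribed passage-question-choice triple, a pooled acoustic vector for the spoken choice is concatenated with the [CLS] output, and a linear head scores each choice. The class name, audio dimensionality, and pooling are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch, not the paper's exact model: fuse a pooled acoustic
# vector with BERT's [CLS] text representation and score each choice.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class AudioEnrichedMCQA(nn.Module):  # hypothetical name
    def __init__(self, bert_name="bert-base-chinese", audio_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.scorer = nn.Linear(hidden + audio_dim, 1)  # one logit per triple

    def forward(self, input_ids, attention_mask, audio_feats):
        # input_ids/attention_mask: (n_choices, seq_len) for one question;
        # audio_feats: (n_choices, audio_dim), e.g. mean-pooled frame-level
        # acoustic embeddings of each spoken choice.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]              # [CLS] vectors
        fused = torch.cat([cls, audio_feats], dim=-1)  # text + audio
        return self.scorer(fused).squeeze(-1)          # (n_choices,)

# Usage: pick among four transcribed choices for one passage/question pair.
tok = BertTokenizer.from_pretrained("bert-base-chinese")
model = AudioEnrichedMCQA()
texts = [f"passage [SEP] question [SEP] choice {i}" for i in range(4)]
enc = tok(texts, padding=True, return_tensors="pt")
audio = torch.randn(4, 128)  # stand-in acoustic vectors
pred = model(enc["input_ids"], enc["attention_mask"], audio).argmax().item()
```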
Related papers
- Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems [0.0]
This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts.
Two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases.
The developed toolkit is written in Python and is available on GitHub to any interested researcher.
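As a toy illustration of the lexicon-lookup step such an accentuation pipeline performs (the wordlist below is invented for the example; real systems add morphological rules and context to handle homographs):

```python
# Dictionary-based accentuation in miniature: replace each word with its
# stressed form from a (tiny, invented) lexicon, leaving unknown words as-is.
STRESS = {"молоко": "молоко\u0301", "гора": "гора\u0301"}  # combining acute accent

def accentuate(sentence: str) -> str:
    return " ".join(STRESS.get(w.lower(), w) for w in sentence.split())

print(accentuate("молоко и хлеб"))  # -> "молоко́ и хлеб"
```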
arXiv Detail & Related papers (2024-10-03T14:43:43Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
Our solution was declared the winner of the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
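One hedged sketch of what such audio-textual fusion can look like, with a single cross-attention layer letting transcript-token embeddings attend over audio-frame embeddings (dimensions, the one-layer design, and the binary head are assumptions for illustration, not the paper's cascaded architecture):

```python
# Illustrative cross-modal fusion: ASR-transcript token embeddings (queries)
# attend over audio frame embeddings (keys/values); the pooled result feeds
# a small classifier head.
import torch
import torch.nn as nn

text = torch.randn(1, 20, 256)    # (batch, transcript tokens, dim)
audio = torch.randn(1, 120, 256)  # (batch, audio frames, dim)

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
fused, _ = cross_attn(query=text, key=audio, value=audio)

logits = nn.Linear(256, 2)(fused.mean(dim=1))  # e.g. request vs. no request
```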
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- Topic Identification For Spontaneous Speech: Enriching Audio Features With Embedded Linguistic Information [10.698093106994804]
Traditional topic identification solutions from audio rely on an automatic speech recognition (ASR) system to produce transcripts.
We compare audio-only techniques with hybrid techniques that jointly utilise text and audio features.
The models evaluated on spontaneous Finnish speech demonstrate that purely audio-based solutions are a viable option when ASR components are not available.
arXiv Detail & Related papers (2023-07-21T09:30:46Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that can deal with conversational questions based on audio recordings, and to explore whether additional cues from different modalities can aid information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning [66.71308154398176]
Spoken Question Answering (SQA) has gained research attention and made remarkable progress in recent years.
Existing SQA methods rely on Automatic Speech Recognition (ASR) transcripts, which are time-consuming and costly to collect.
This work proposes an ASR transcript-free SQA framework named Discrete Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned by the SQA downstream task.
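The "textless" core of such an approach can be sketched in a few lines: cluster continuous self-supervised speech features with k-means and use the cluster ids as pseudo-text units. Random vectors stand in for real features (e.g. HuBERT activations), and the cluster count and run-length deduplication are illustrative choices:

```python
# Sketch of speech-to-discrete-unit conversion via k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 768))  # stand-in for SSL speech features

kmeans = KMeans(n_clusters=128, n_init=10, random_state=0).fit(frames)

def to_units(utterance_feats):
    """Map frame-level features to a deduplicated discrete unit sequence."""
    ids = kmeans.predict(utterance_feats)
    # Collapse consecutive repeats, as is common for unit sequences.
    return [int(u) for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]

units = to_units(rng.normal(size=(300, 768)))  # e.g. [37, 91, 5, ...]
```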
arXiv Detail & Related papers (2022-03-09T17:46:22Z)
- An Initial Investigation of Non-Native Spoken Question-Answering [36.89541375786233]
We show that a simple text-based ELECTRA MC model trained on SQuAD2.0 transfers well for spoken question answering tests.
One significant challenge is the lack of appropriately annotated speech corpora to train systems for this task.
Mismatches between text documents and spoken responses, as well as between non-native spoken grammar and written grammar, must be considered.
arXiv Detail & Related papers (2021-07-09T21:59:16Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the transcription with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
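The naive cascade reduces to a two-step pipeline, sketched below; `recognize` and `synthesize` are hypothetical stand-ins for real ASR and TTS inference calls (the baseline itself builds both with ESPnet):

```python
# Cascade voice conversion in outline: source speech -> text -> target voice.
from typing import Callable
import numpy as np

def cascade_vc(source_wav: np.ndarray,
               recognize: Callable[[np.ndarray], str],
               synthesize: Callable[[str], np.ndarray]) -> np.ndarray:
    text = recognize(source_wav)  # ASR: transcribe the source utterance
    return synthesize(text)       # TTS: re-speak it in the target voice

# Usage with trivial stand-ins (1 s of 16 kHz silence in, silence out):
dummy_asr = lambda wav: "hello world"
dummy_tts = lambda text: np.zeros(16000, dtype=np.float32)
converted = cascade_vc(np.zeros(16000, dtype=np.float32), dummy_asr, dummy_tts)
```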
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)