An Audio-enriched BERT-based Framework for Spoken Multiple-choice
Question Answering
- URL: http://arxiv.org/abs/2005.12142v1
- Date: Mon, 25 May 2020 14:41:28 GMT
- Title: An Audio-enriched BERT-based Framework for Spoken Multiple-choice
Question Answering
- Authors: Chia-Chih Kuo, Shang-Bao Luo, Kuan-Yu Chen
- Abstract summary: In a spoken multiple-choice question answering (SMCQA) task, given a passage, a question, and multiple choices all in the form of speech, the machine needs to pick the correct choice to answer the question.
This study concentrates on designing a BERT-based SMCQA framework that not only inherits the advantages of the contextualized language representations learned by BERT but also integrates complementary acoustic-level information distilled from the audio with the text-level information.
- Score: 11.307739925111944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a spoken multiple-choice question answering (SMCQA) task, given a passage,
a question, and multiple choices all in the form of speech, the machine needs
to pick the correct choice to answer the question. While the audio could
contain useful cues for SMCQA, usually only the auto-transcribed text is
utilized in system development. Thanks to large-scale pre-trained language
representation models, such as the bidirectional encoder representations from
transformers (BERT), systems built on auto-transcribed text alone can still
achieve a certain level of performance. However, previous studies have shown
that acoustic-level statistics can offset text inaccuracies caused by automatic
speech recognition systems, as well as representation inadequacies lurking in
word embedding generators, thereby making an SMCQA system more robust. Along
this line of research, this study concentrates on designing a BERT-based SMCQA
framework that not only inherits the advantages of the contextualized language
representations learned by BERT but also integrates complementary
acoustic-level information distilled from the audio with the text-level
information. Consequently, an audio-enriched BERT-based SMCQA framework is
proposed. A series of experiments demonstrates remarkable improvements in
accuracy over selected baselines and state-of-the-art systems on a published Chinese SMCQA
dataset.
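To make the fusion idea concrete, the following is a minimal sketch of the general recipe the abstract describes, not the authors' exact architecture: each (passage, question, choice) transcript is scored with BERT, the [CLS] text vector is concatenated with a pooled acoustic feature vector for the corresponding audio, and the choice with the highest fused score wins. The model name, the acoustic feature dimension, and the mean-pooled acoustic features are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class AudioEnrichedChoiceScorer(nn.Module):
    """Scores one (passage, question, choice) transcript enriched with audio."""

    def __init__(self, bert_name="bert-base-chinese", acoustic_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        # Fuse the [CLS] text vector with a pooled acoustic vector.
        self.fusion = nn.Linear(hidden + acoustic_dim, hidden)
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, acoustic_feats):
        # input_ids / attention_mask: (batch * n_choices, seq_len), the ASR
        # transcript of "passage [SEP] question [SEP] choice" per choice.
        # acoustic_feats: (batch * n_choices, acoustic_dim), e.g. mean-pooled
        # frame-level features of the matching audio (an assumption here).
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # [CLS] text representation
        fused = torch.tanh(self.fusion(torch.cat([cls_vec, acoustic_feats], dim=-1)))
        return self.scorer(fused).squeeze(-1)  # one scalar score per choice
```

In this sketch, the per-choice scores would be reshaped to (batch, n_choices) and trained with cross-entropy against the index of the correct choice; at inference, the argmax over choices is the answer.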
Related papers
- Language Modelling for Speaker Diarization in Telephonic Interviews [13.851959980488529]
The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER.
The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.
arXiv Detail & Related papers (2025-01-28T18:18:04Z)
- Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT [81.99600765234285]
We propose an end-to-end framework to predict the pronunciation of a polyphonic character.
The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier; a minimal sketch of this recipe appears after this list.
arXiv Detail & Related papers (2025-01-02T06:51:52Z)
- Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems [0.0]
This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts.
Two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases.
The developed toolkit is written in Python and is accessible on GitHub for any interested researcher.
arXiv Detail & Related papers (2024-10-03T14:43:43Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
We were declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- Topic Identification For Spontaneous Speech: Enriching Audio Features With Embedded Linguistic Information [10.698093106994804]
Traditional topic identification solutions from audio rely on an automatic speech recognition (ASR) system to produce transcripts.
We compare audio-only and hybrid techniques of jointly utilising text and audio features.
The models evaluated on spontaneous Finnish speech demonstrate that purely audio-based solutions are a viable option when ASR components are not available.
arXiv Detail & Related papers (2023-07-21T09:30:46Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning [66.71308154398176]
Spoken Question Answering (SQA) has gained research attention and made remarkable progress in recent years.
Existing SQA methods rely on Automatic Speech Recognition (ASR) transcripts, which are time- and cost-prohibitive to collect.
This work proposes an ASR transcript-free SQA framework named Discrete Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned by the SQA downstream task.
arXiv Detail & Related papers (2022-03-09T17:46:22Z)
- An Initial Investigation of Non-Native Spoken Question-Answering [36.89541375786233]
We show that a simple text-based ELECTRA MC model trained on SQuAD2.0 transfers well for spoken question answering tests.
One significant challenge is the lack of appropriately annotated speech corpora to train systems for this task.
Mismatches must be considered between text documents and spoken responses, and between non-native spoken grammar and written grammar.
arXiv Detail & Related papers (2021-07-09T21:59:16Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach to voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then re-synthesize it in the target voice with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
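The polyphone-disambiguation entry above describes a recipe that is easy to picture in code: take the contextual embedding that a pre-trained BERT produces for the polyphonic character and feed it to a small classifier over candidate pronunciations. Below is a minimal, hypothetical sketch of that general idea; the model name, the number of candidate pronunciations, and the position-indexing scheme are illustrative assumptions, not the cited paper's implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class PolyphoneClassifier(nn.Module):
    """Predicts the pronunciation of a polyphonic character from its context."""

    def __init__(self, bert_name="bert-base-chinese", n_pronunciations=40):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.head = nn.Linear(self.bert.config.hidden_size, n_pronunciations)

    def forward(self, input_ids, attention_mask, char_index):
        # char_index: position of the polyphonic character in each sequence.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Take the contextual embedding of the target character only.
        target = hidden[torch.arange(hidden.size(0)), char_index]
        return self.head(target)  # logits over candidate pronunciations
```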
This list is automatically generated from the titles and abstracts of the papers on this site.