Cross-sentence Neural Language Models for Conversational Speech
Recognition
- URL: http://arxiv.org/abs/2106.06922v2
- Date: Tue, 15 Jun 2021 04:44:55 GMT
- Title: Cross-sentence Neural Language Models for Conversational Speech
Recognition
- Authors: Shih-Hsuan Chiu, Tien-Hong Lo and Berlin Chen
- Abstract summary: We propose an effective cross-sentence neural LM approach that reranks the ASR N-best hypotheses of an upcoming sentence.
We also explore extracting task-specific global topical information from the cross-sentence history.
- Score: 17.317583079824423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An important research direction in automatic speech recognition (ASR) has
centered around the development of effective methods to rerank the output
hypotheses of an ASR system with more sophisticated language models (LMs) for
further gains. A current mainstream school of thought for ASR N-best
hypothesis reranking is to employ a recurrent neural network (RNN)-based LM or
its variants, with performance superiority over the conventional n-gram LMs
across a range of ASR tasks. In real scenarios such as a long conversation, a
sequence of consecutive sentences may jointly contain ample cues of
conversation-level information such as topical coherence, lexical entrainment
and adjacency pairs, which, however, remain underexplored. In view of
this, we first formulate ASR N-best reranking as a prediction problem, putting
forward an effective cross-sentence neural LM approach that reranks the ASR
N-best hypotheses of an upcoming sentence by taking into consideration the word
usage in its preceding sentences. Furthermore, we also explore extracting
task-specific global topical information of the cross-sentence history in an
unsupervised manner for better ASR performance. Extensive experiments conducted
on the AMI conversational benchmark corpus indicate the effectiveness and
feasibility of our methods in comparison to several state-of-the-art reranking
methods.
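To make the reranking scheme concrete, here is a minimal sketch in Python: each hypothesis of the upcoming sentence is scored by a cross-sentence LM conditioned on the recognized history and interpolated log-linearly with the ASR score. The `lm_logprob` heuristic below is a hypothetical stand-in for a neural LM, not the authors' model, and the interpolation weight is an illustrative assumption.

```python
# Hedged sketch of cross-sentence N-best reranking (not the paper's exact model).
from typing import List, Tuple

def lm_logprob(history: str, hypothesis: str) -> float:
    """Hypothetical cross-sentence LM: log P(hypothesis | history).
    A real system would use an RNN/Transformer LM here."""
    # Toy heuristic: reward word overlap with the history (topical coherence).
    hist_words = set(history.lower().split())
    hyp_words = hypothesis.lower().split()
    overlap = sum(w in hist_words for w in hyp_words)
    return -len(hyp_words) + 0.5 * overlap  # crude length/coherence trade-off

def rerank(nbest: List[Tuple[str, float]], history: str,
           lm_weight: float = 0.3) -> List[Tuple[str, float]]:
    """Log-linear interpolation of the ASR score and the cross-sentence LM score."""
    scored = [(hyp, (1 - lm_weight) * asr + lm_weight * lm_logprob(history, hyp))
              for hyp, asr in nbest]
    return sorted(scored, key=lambda x: x[1], reverse=True)

history = "we should schedule the project meeting for friday"
nbest = [("the meeting is on friday", -4.1),
         ("the meat thing is on fry day", -3.9)]  # acoustically competitive
print(rerank(nbest, history)[0][0])  # history pushes the coherent hypothesis up
```

The history-conditioned term is what distinguishes this from per-sentence rescoring: an acoustically stronger but topically incoherent hypothesis can be demoted.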
Related papers
- CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation [68.81271028921647]
We introduce CORAL, a benchmark designed to assess RAG systems in realistic multi-turn conversational settings.
CORAL includes diverse information-seeking conversations automatically derived from Wikipedia.
It supports three core tasks of conversational RAG: passage retrieval, response generation, and citation labeling.
arXiv Detail & Related papers (2024-10-30T15:06:32Z)
- Crossmodal ASR Error Correction with Discrete Speech Units [16.58209270191005]
We propose a post-ASR processing approach for ASR Error Correction (AEC).
We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon.
We propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality.
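A minimal sketch of the crossmodal idea, assuming a toy vocabulary, per-word-aligned discrete units, and a simple concatenation-plus-projection fusion; these are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossmodalFusion(nn.Module):
    def __init__(self, word_vocab=1000, unit_vocab=500, dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, dim)
        self.unit_emb = nn.Embedding(unit_vocab, dim)
        self.proj = nn.Linear(2 * dim, dim)  # fuse the two modalities

    def forward(self, word_ids, unit_ids):
        # unit_ids are assumed pre-aligned to word_ids (one unit per word here;
        # real systems align variable-length unit sequences, e.g. by pooling).
        w = self.word_emb(word_ids)
        u = self.unit_emb(unit_ids)
        return self.proj(torch.cat([w, u], dim=-1))

fusion = CrossmodalFusion()
words = torch.randint(0, 1000, (2, 7))  # batch of ASR hypothesis tokens
units = torch.randint(0, 500, (2, 7))   # matching discrete speech units
print(fusion(words, units).shape)       # torch.Size([2, 7, 64])
```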
arXiv Detail & Related papers (2024-05-26T19:58:38Z)
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
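A minimal sketch of how a word confusion network (WCN), a compact lattice representation, might be serialized into an in-context prompt for an LLM; the bin structure and prompt wording are assumptions, not the paper's exact format.

```python
from typing import List, Tuple

# Each bin holds competing words with posterior probabilities.
wcn: List[List[Tuple[str, float]]] = [
    [("set", 0.6), ("sat", 0.4)],
    [("an", 0.7), ("and", 0.3)],
    [("alarm", 0.9), ("a farm", 0.1)],
]

def wcn_to_prompt(wcn, task="intent classification"):
    lines = []
    for i, bin_ in enumerate(wcn):
        alts = " | ".join(f"{w} ({p:.1f})" for w, p in bin_)
        lines.append(f"slot {i}: {alts}")
    network = "\n".join(lines)
    return (f"The ASR output is a confusion network; each slot lists "
            f"alternatives with probabilities:\n{network}\n"
            f"Task: {task}. What did the speaker most likely mean?")

print(wcn_to_prompt(wcn))
```

Exposing the alternatives, rather than only the top hypothesis, lets the LLM recover from errors where the correct word was a close runner-up.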
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list.
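A minimal sketch of generative error correction over an N-best list in this spirit; `query_llm` is a hypothetical placeholder for whatever LLM API is used, and the prompt wording is an assumption, not HyPoradise's format.

```python
from typing import List

def build_correction_prompt(nbest: List[str]) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (f"Below are the {len(nbest)}-best ASR hypotheses for one "
            f"utterance:\n{hyps}\n"
            "Output the most plausible transcription. You may combine the "
            "hypotheses or restore words missing from all of them.")

def query_llm(prompt: str) -> str:
    # Hypothetical stub: replace with a real LLM call.
    return "placeholder transcription"

nbest = ["i scream for ice cream", "eye scream for ice cream",
         "i scream four ice cream"]
print(query_llm(build_correction_prompt(nbest)))
```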
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Cross-utterance ASR Rescoring with Graph-based Label Propagation [14.669201156515891]
We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation.
In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information.
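A minimal sketch of graph-based label propagation for rescoring, illustrative rather than the paper's exact algorithm: hypotheses across utterances are nodes, edges carry acoustic-embedding similarity, and scores are iteratively smoothed over the graph.

```python
import numpy as np

def propagate(scores: np.ndarray, emb: np.ndarray,
              alpha: float = 0.5, iters: int = 10) -> np.ndarray:
    # Row-normalized cosine-similarity affinity matrix over hypotheses.
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    W = np.maximum(e @ e.T, 0.0)
    np.fill_diagonal(W, 0.0)
    P = W / (W.sum(axis=1, keepdims=True) + 1e-8)
    f = scores.copy()
    for _ in range(iters):
        # Blend each node's score with its neighbors' (label propagation).
        f = alpha * (P @ f) + (1 - alpha) * scores
    return f

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 16))  # acoustic embeddings of 6 hypotheses
scores = rng.normal(size=6)     # initial ASR scores
print(propagate(scores, emb))
```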
arXiv Detail & Related papers (2023-03-27T12:08:05Z)
- Factual Consistency Oriented Speech Recognition [23.754107608608106]
The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions.
It is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries.
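A minimal sketch of the training signal described above: an expected factual-consistency score over the N-best list, weighted by the model's own hypothesis probabilities (minimum-risk style). The consistency scorer here is a hypothetical stand-in for a learned model.

```python
import math

def consistency(hyp: str, ref: str) -> float:
    """Hypothetical scorer; a real system uses a learned factual-consistency
    model between hypothesis and ground-truth transcription."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

def expected_consistency(nbest, ref):
    """nbest: list of (hypothesis, log p(hypothesis | audio))."""
    logps = [lp for _, lp in nbest]
    m = max(logps)
    weights = [math.exp(lp - m) for lp in logps]  # softmax over the N-best
    Z = sum(weights)
    return sum(w / Z * consistency(h, ref)
               for (h, _), w in zip(nbest, weights))

nbest = [("the budget was approved", -1.0),
         ("the budget was rejected", -1.2)]
print(expected_consistency(nbest, "the budget was approved today"))
```

Maximizing this expectation pushes probability mass toward hypotheses that preserve the facts of the reference, not merely its surface words.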
arXiv Detail & Related papers (2023-02-24T00:01:41Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach and brings a 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling [12.153618111267514]
We put forward disparate conversation history fusion methods for language modeling in automatic speech recognition.
A novel audio-fusion mechanism is introduced, which manages to fuse and utilize the acoustic embeddings of a current utterance and the semantic content of its corresponding conversation history.
To flesh out our ideas, we frame the ASR N-best hypothesis rescoring task as a prediction problem, leveraging BERT, an iconic pre-trained LM.
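A minimal sketch of framing N-best rescoring as prediction with BERT: the conversation history and each candidate hypothesis are packed as a sentence pair, and a classification head scores the candidate. The pairing scheme and the single-logit head are illustrative assumptions; the paper's fusion mechanisms (including the audio fusion) differ, and the head below is untrained, so its scores are only placeholders.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # regression-style quality score
model.eval()

history = "did you book the room for the design review"
nbest = ["yes i booked it for ten", "yes i baked it for ten"]

with torch.no_grad():
    # Encode (history, hypothesis) sentence pairs in one batch.
    enc = tok([history] * len(nbest), nbest,
              return_tensors="pt", padding=True, truncation=True)
    scores = model(**enc).logits.squeeze(-1)  # one score per hypothesis
print(nbest[int(scores.argmax())])
```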
arXiv Detail & Related papers (2021-11-05T09:07:23Z)
- A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation [59.64193903397301]
Non-autoregressive (NAR) models generate multiple outputs of a sequence simultaneously, which significantly speeds up inference at the cost of an accuracy drop compared to autoregressive baselines.
We conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR).
The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances.
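To illustrate one common NAR decoding style, here is a toy sketch of CTC greedy decoding: all frames are decoded in parallel with a single argmax, then repeats and blanks are collapsed, with no left-to-right dependency between output positions. The tiny vocabulary is an assumption for the example.

```python
import numpy as np

BLANK = 0
VOCAB = {1: "a", 2: "b", 3: "c"}

def ctc_greedy_decode(logits: np.ndarray) -> str:
    ids = logits.argmax(axis=-1)      # one parallel argmax over all frames
    out, prev = [], BLANK
    for i in ids:
        if i != BLANK and i != prev:  # collapse repeats, drop blanks
            out.append(VOCAB[int(i)])
        prev = i
    return "".join(out)

rng = np.random.default_rng(1)
logits = rng.normal(size=(12, 4))     # 12 frames, 4 symbols (incl. blank)
print(ctc_greedy_decode(logits))
```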
arXiv Detail & Related papers (2021-10-11T13:05:06Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
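A minimal sketch of such joint training, under illustrative assumptions (a shared GRU encoder, a token-level correction head, an utterance-level intent head, summed losses); the paper's exact architecture differs.

```python
import torch
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    def __init__(self, vocab=1000, n_intents=10, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.correct = nn.Linear(dim, vocab)      # per-token corrected word
        self.intent = nn.Linear(dim, n_intents)   # utterance-level intent

    def forward(self, asr_ids):
        h, _ = self.enc(self.emb(asr_ids))
        return self.correct(h), self.intent(h.mean(dim=1))

model = JointCorrectionLU()
asr_ids = torch.randint(0, 1000, (4, 9))   # noisy ASR token ids
gold_ids = torch.randint(0, 1000, (4, 9))  # reference tokens
gold_intent = torch.randint(0, 10, (4,))
tok_logits, int_logits = model(asr_ids)
# Multi-task objective: correction loss + understanding loss.
loss = (nn.functional.cross_entropy(tok_logits.transpose(1, 2), gold_ids)
        + nn.functional.cross_entropy(int_logits, gold_intent))
loss.backward()
print(float(loss))
```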
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.