Seed Words Based Data Selection for Language Model Adaptation
- URL: http://arxiv.org/abs/2107.09433v1
- Date: Tue, 20 Jul 2021 12:08:27 GMT
- Title: Seed Words Based Data Selection for Language Model Adaptation
- Authors: Roberto Gretter, Marco Matassoni, Daniele Falavigna
- Abstract summary: We present an approach for automatically selecting sentences, from a text corpus, that match, both semantically and morphologically, a glossary of terms furnished by the user.
The vocabulary of the baseline model is expanded and tailored, reducing the resulting OOV rate.
Results using different metrics (OOV rate, WER, precision and recall) show the effectiveness of the proposed techniques.
- Score: 11.59717828860318
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We address the problem of language model customization in applications where
the ASR component needs to manage domain-specific terminology; although current
state-of-the-art speech recognition technology provides excellent results for
generic domains, the adaptation to specialized dictionaries or glossaries is
still an open issue. In this work we present an approach for automatically
selecting sentences, from a text corpus, that match, both semantically and
morphologically, a glossary of terms (words or composite words) furnished by
the user. The final goal is to rapidly adapt the language model of a hybrid
ASR system with a limited amount of in-domain text data in order to
successfully cope with the linguistic domain at hand; the vocabulary of the
baseline model is expanded and tailored, reducing the resulting OOV rate. Data
selection strategies based on shallow morphological seeds and semantic
similarity via word2vec are introduced and discussed; the experimental setting
consists of a simultaneous interpreting scenario, where ASR systems in three
languages are designed to recognize the domain-specific terms (i.e., dentistry). Results
using different metrics (OOV rate, WER, precision and recall) show the
effectiveness of the proposed techniques.
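A minimal sketch of the kind of selection pipeline the abstract describes, combining shallow morphological matching against the glossary with word2vec similarity to the seed words; the file names, prefix-matching heuristic, and scoring weights are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical selection pipeline: score each corpus sentence by shallow
# morphological matches against the glossary (shared prefixes) and by
# word2vec cosine similarity to the seed words; keep the top sentences
# for language model adaptation. File names are placeholders.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
seeds = [w.strip().lower() for w in open("glossary.txt", encoding="utf-8")]

def morph_match(token: str, seed: str, prefix_len: int = 5) -> bool:
    # Shallow morphological seed: approximate a shared stem by a common prefix.
    return len(token) >= prefix_len and token[:prefix_len] == seed[:prefix_len]

def score(sentence: str) -> float:
    tokens = sentence.lower().split()
    morph = sum(any(morph_match(t, s) for s in seeds) for t in tokens)
    sims = [wv.similarity(t, s) for t in tokens for s in seeds
            if t in wv and s in wv]
    return morph + (max(sims) if sims else 0.0)

with open("corpus.txt", encoding="utf-8") as corpus:
    ranked = sorted(corpus, key=score, reverse=True)
selected = ranked[:10000]  # top-scoring sentences become LM adaptation data
```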
Related papers
- Evaluating Shortest Edit Script Methods for Contextual Lemmatization [6.0158981171030685]
Modern contextual lemmatizers often rely on automatically induced Shortest Edit Scripts (SES) to transform a word form into its lemma.
Previous work has not investigated the direct impact of SES in the final lemmatization performance.
We show that computing the casing and edit operations separately is beneficial overall, but much more clearly so for languages with highly inflected morphology.
arXiv Detail & Related papers (2024-03-25T17:28:24Z)
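For illustration, a minimal sketch of inducing an edit script from a word form to its lemma with Python's difflib, handling casing as a separate operation as the paper recommends; this is not the authors' SES induction algorithm:

```python
# A minimal sketch of inducing an edit script from a word form to its lemma,
# in the spirit of SES-based lemmatizers (not the paper's exact algorithm).
from difflib import SequenceMatcher

def edit_script(form: str, lemma: str) -> list[tuple[str, str, str]]:
    """Return (operation, form_slice, lemma_slice) triples."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, form, lemma).get_opcodes():
        if tag != "equal":
            ops.append((tag, form[i1:i2], lemma[j1:j2]))
    return ops

# Casing is handled separately, which the paper finds beneficial:
def split_casing(form: str, lemma: str):
    case_op = "lower" if form[:1].isupper() and lemma[:1].islower() else "keep"
    return case_op, edit_script(form.lower(), lemma.lower())

print(split_casing("Trabajaba", "trabajar"))
# ('lower', [('replace', 'ba', 'r')])
```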
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on the English datasets QQP and MRPC and on the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
However, these models often require substantial amounts of medical text data and show poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)
- Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present the Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that learns domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
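One common strategy in this space is subword segmentation, which lets inflected forms of the same stem share units; the toy byte-pair-encoding learner below illustrates the idea and is not one of the specific systems the paper compares:

```python
# A toy byte-pair-encoding (BPE) learner: subword segmentation is one common
# way to handle rich target-side morphology, since inflected forms of the
# same stem end up sharing subword units.
from collections import Counter

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Represent each word as a sequence of symbols, initially characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Inflected German-like forms come to share merged stems after a few merges:
print(learn_bpe(["spielen", "spielte", "spielst", "gespielt"], 6))
```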
- Contextual Biasing of Language Models for Speech Recognition in Goal-Oriented Conversational Agents [11.193867567895353]
Goal-oriented conversational interfaces are designed to accomplish specific tasks.
We propose a new architecture that utilizes context embeddings derived with BERT from sample utterances provided at inference time.
Our experiments show a word error rate (WER) relative reduction of 7% over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
arXiv Detail & Related papers (2021-03-18T15:38:08Z)
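A hedged sketch of the general idea: embed the sample utterances with BERT, then bias n-best rescoring toward hypotheses similar to that context. The paper integrates context into the NLM itself, so the simple score interpolation below is an illustrative simplification:

```python
# Contextual rescoring sketch: mean-pooled BERT embeddings of sample
# utterances form a context vector; each hypothesis score is biased by its
# cosine similarity to it. Model name and alpha are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    batch = tok(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # mean pooling

context = embed(["book a table for two", "cancel my reservation"]).mean(0)

def rescore(hypotheses: list[tuple[str, float]], alpha: float = 0.3):
    # hypotheses: (text, language-model log-score) pairs from the n-best list
    sims = torch.cosine_similarity(embed([h for h, _ in hypotheses]),
                                   context.unsqueeze(0))
    return sorted(
        ((h, lm + alpha * s.item()) for (h, lm), s in zip(hypotheses, sims)),
        key=lambda x: -x[1],
    )

print(rescore([("book a table for you", -12.3), ("book a table for two", -12.5)]))
```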
- Evolutionary optimization of contexts for phonetic correction in speech recognition systems [0.0]
It is common for general purpose ASR systems to fail in applications that use a domain-specific language.
Various strategies have been used to reduce the error, such as providing a context that modifies the language model.
This article explores the use of an evolutionary process to generate an optimized context for a specific application domain.
arXiv Detail & Related papers (2021-02-23T04:14:51Z)
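A toy evolutionary loop in the paper's spirit: evolve a set of context terms against a fitness function. The held-out term overlap used as fitness here is a stand-in for the real correction-accuracy or WER objective:

```python
# Toy genetic search over context term sets for a domain-specific application.
# DOMAIN_TERMS, HELD_OUT, and the fitness function are illustrative stand-ins.
import random

DOMAIN_TERMS = ["implant", "molar", "crown", "enamel", "caries", "bridge",
                "root", "canal", "gingiva", "filling", "plaque", "denture"]
HELD_OUT = {"implant", "crown", "root", "canal", "caries"}  # illustrative

def fitness(context: frozenset) -> float:
    # Reward matches on held-out domain terms, penalize oversized contexts.
    return len(context & HELD_OUT) - 0.1 * len(context)

def mutate(context: frozenset) -> frozenset:
    terms = set(context)
    if random.random() < 0.5 and terms:
        terms.discard(random.choice(sorted(terms)))
    else:
        terms.add(random.choice(DOMAIN_TERMS))
    return frozenset(terms)

population = [frozenset(random.sample(DOMAIN_TERMS, 3)) for _ in range(20)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                 # elitist selection
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

print(sorted(max(population, key=fitness)))     # best evolved context
```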
- Adapting BERT for Word Sense Disambiguation with Gloss Selection Objective and Example Sentences [18.54615448101203]
Domain adaptation or transfer learning using pre-trained language models such as BERT has proven to be an effective approach for many natural language processing tasks.
We propose to formulate word sense disambiguation as a relevance ranking task, and fine-tune BERT on a sequence-pair ranking task to select the most probable sense definition.
arXiv Detail & Related papers (2020-09-24T16:37:04Z)
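A sketch of gloss selection as sequence-pair ranking: score (context, gloss) pairs with a BERT classifier head and keep the best-scoring definition. The model below is untrained, and the paper's ranking fine-tuning is omitted:

```python
# WSD as relevance ranking over (context, gloss) sequence pairs.
# The checkpoint and glosses are illustrative; scores are meaningless
# until the model is fine-tuned with a ranking objective as in the paper.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)

def rank_glosses(context: str, glosses: list[str]) -> list[tuple[str, float]]:
    batch = tok([context] * len(glosses), glosses,
                padding=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**batch).logits.squeeze(-1)  # one score per pair
    return sorted(zip(glosses, scores.tolist()), key=lambda x: -x[1])

glosses = ["a financial institution", "sloping land beside a body of water"]
print(rank_glosses("I sat on the bank of the river.", glosses))
```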
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
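A minimal sketch of one way to make output embeddings compositional: build each word's vector from hashed character n-grams, fastText-style, so the output layer no longer depends on the training vocabulary. Dimensions and hashing are illustrative, not the paper's exact layer:

```python
# Compositional output embeddings from hashed character n-grams: any word,
# even one never seen in training, gets an output vector of fixed size.
import zlib
import torch

N_BUCKETS, DIM = 100_000, 128
ngram_table = torch.nn.Embedding(N_BUCKETS, DIM)

def char_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> list[str]:
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def compose(word: str) -> torch.Tensor:
    ids = torch.tensor([zlib.crc32(g.encode()) % N_BUCKETS
                        for g in char_ngrams(word)])
    return ngram_table(ids).mean(0)  # average the n-gram vectors

weights = torch.stack([compose(w) for w in ["dentistry", "periodontitis"]])
print(weights.shape)  # torch.Size([2, 128])
```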
- Deep learning models for representing out-of-vocabulary words [1.4502611532302039]
We present a performance evaluation of deep learning models for representing out-of-vocabulary (OOV) words.
Although the best technique for handling OOV words is different for each task, Comick, a deep learning method that infers the embedding based on the context and the morphological structure of the OOV word, obtained promising results.
arXiv Detail & Related papers (2020-07-14T19:31:25Z)
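A simplified, Comick-inspired sketch: approximate an OOV word's embedding by mixing the average embedding of its context words with a vector built from its character n-grams. Comick learns this combination with a neural network, so the fixed 50/50 mix and the toy vocabulary below are assumptions:

```python
# Infer an OOV embedding from context (average of known neighbors) plus
# morphology (character n-grams). All vectors here are random placeholders.
import zlib
import numpy as np

DIM = 100
rng = np.random.default_rng(0)
vocab_emb = {w: rng.standard_normal(DIM) for w in
             ["the", "patient", "needs", "a", "dental", "procedure"]}

def ngram_vector(word: str, n: int = 3) -> np.ndarray:
    grams = [word[i:i + n] for i in range(max(1, len(word) - n + 1))]
    # Map each n-gram to a deterministic pseudo-embedding via a seeded RNG.
    return np.mean([np.random.default_rng(zlib.crc32(g.encode()))
                    .standard_normal(DIM) for g in grams], axis=0)

def oov_embedding(oov: str, context: list[str]) -> np.ndarray:
    ctx = [vocab_emb[w] for w in context if w in vocab_emb]
    ctx_vec = np.mean(ctx, axis=0) if ctx else np.zeros(DIM)
    return 0.5 * ctx_vec + 0.5 * ngram_vector(oov)  # illustrative 50/50 mix

vec = oov_embedding("apicoectomy", ["the", "patient", "needs", "a", "dental"])
print(vec.shape)  # (100,)
```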
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domain without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
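A hedged sketch of such data selection: embed sentences with a pretrained encoder, fit a Gaussian mixture over the embeddings, and keep corpus sentences falling in the same cluster as a small in-domain sample. The checkpoint, cluster count, and toy corpus are illustrative choices:

```python
# Unsupervised domain clustering for data selection: mean-pooled encoder
# embeddings, a Gaussian mixture, and cluster-match filtering.
import numpy as np
import torch
from sklearn.mixture import GaussianMixture
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(sentences: list[str]) -> np.ndarray:
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return enc(**batch).last_hidden_state.mean(1).numpy()  # mean pooling

corpus = ["the root canal was obstructed", "the stock market fell sharply",
          "enamel erosion causes sensitivity", "parliament passed the bill"]
in_domain = ["dental implants require healthy bone"]  # small domain sample

# Fit domain clusters without supervision, then keep corpus sentences that
# land in the same cluster as the in-domain sample.
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(embed(corpus))
target = gmm.predict(embed(in_domain))[0]
selected = [s for s, c in zip(corpus, gmm.predict(embed(corpus))) if c == target]
print(selected)
```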