Deep learning models for representing out-of-vocabulary words
- URL: http://arxiv.org/abs/2007.07318v2
- Date: Tue, 28 Jul 2020 14:40:56 GMT
- Title: Deep learning models for representing out-of-vocabulary words
- Authors: Johannes V. Lochter, Renato M. Silva, Tiago A. Almeida
- Abstract summary: We present a performance evaluation of deep learning models for representing out-of-vocabulary (OOV) words.
Although the best technique for handling OOV words is different for each task, Comick, a deep learning method that infers the embedding based on the context and the morphological structure of the OOV word, obtained promising results.
- Score: 1.4502611532302039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communication has become increasingly dynamic with the popularization of
social networks and applications that allow people to express themselves and
communicate instantly. In this scenario, the quality of distributed
representation models is affected by new words that appear frequently or that
result from spelling errors. These words, which are unknown to the models and
are called out-of-vocabulary (OOV) words, must be handled properly so that they
do not degrade the quality of natural language processing (NLP) applications,
which depend on an appropriate vector representation of the texts. To better
understand this problem and find the best techniques for handling OOV words, in
this study we
present a comprehensive performance evaluation of deep learning models for
representing OOV words. We performed an intrinsic evaluation using a benchmark
dataset and an extrinsic evaluation using different NLP tasks: text
categorization, named entity recognition, and part-of-speech tagging. Although
the results indicated that the best technique for handling OOV words is
different for each task, Comick, a deep learning method that infers the
embedding based on the context and the morphological structure of the OOV word,
obtained promising results.
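The general idea evaluated here, inferring a vector for an unseen word from its surrounding context and from its character-level (morphological) structure, can be illustrated with a minimal sketch. This is not the Comick architecture: it is a toy, hypothetical combination of a context-average vector and fastText-style character n-gram vectors, and every table, dimension, and weight below is an assumption made only for illustration.

```python
# Toy illustration (not the Comick model): estimate an embedding for an
# out-of-vocabulary word by combining (a) the average of its context word
# vectors and (b) the average of character n-gram vectors, fastText-style.
import numpy as np

DIM = 50
rng = np.random.default_rng(0)

# Toy lookup tables standing in for pre-trained embeddings.
word_vectors = {w: rng.normal(size=DIM) for w in ["the", "movie", "was", "hilarious"]}
ngram_vectors = {}  # character n-gram -> vector, filled lazily for this demo


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word, with boundary markers as in fastText."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def oov_embedding(oov_word, context_words):
    """Blend context information with subword (morphological) information."""
    context = [word_vectors[w] for w in context_words if w in word_vectors]
    context_vec = np.mean(context, axis=0) if context else np.zeros(DIM)

    grams = char_ngrams(oov_word)
    for g in grams:                      # toy: random vectors for unseen n-grams
        ngram_vectors.setdefault(g, rng.normal(size=DIM))
    subword_vec = np.mean([ngram_vectors[g] for g in grams], axis=0)

    return 0.5 * context_vec + 0.5 * subword_vec  # fixed equal weighting, arbitrary


vec = oov_embedding("hilarioussss", ["the", "movie", "was"])
print(vec.shape)  # (50,)
```

Methods such as Comick learn this combination with neural networks over the context and the character sequence rather than using a fixed average, which is what allows them to exploit both signals jointly.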
Related papers
- Morphological evaluation of subwords vocabulary used by BETO language model [0.1638581561083717]
Subword tokenization algorithms are more efficient and can independently build the necessary vocabulary of words and subwords without human intervention.
In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language.
By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality.
This evaluation also helps clarify which algorithm the tokenizer actually uses (Wordpiece), given inconsistencies in the authors' claims. (A minimal sketch of this kind of vocabulary-versus-morpheme overlap check appears after this list.)
arXiv Detail & Related papers (2024-10-03T08:07:14Z) - A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations of how each layer of a model maintains the information of the input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z) - Context-based out-of-vocabulary word recovery for ASR systems in Indian
languages [5.930734371401316]
We propose a post-processing technique to improve the performance of context-based OOV recovery.
The effectiveness of the proposed cost function is evaluated at both word-level and sentence-level.
arXiv Detail & Related papers (2022-06-09T06:51:31Z) - Seed Words Based Data Selection for Language Model Adaptation [11.59717828860318]
We present an approach for automatically selecting sentences, from a text corpus, that match, both semantically and morphologically, a glossary of terms furnished by the user.
The vocabulary of the baseline model is expanded and tailored, reducing the resulting OOV rate.
Results using different metrics (OOV rate, WER, precision and recall) show the effectiveness of the proposed techniques.
arXiv Detail & Related papers (2021-07-20T12:08:27Z) - A Comparison of Methods for OOV-word Recognition on a New Public Dataset [0.0]
We propose using the CommonVoice dataset to create test sets for languages with a high out-of-vocabulary ratio.
We then evaluate, within the context of a hybrid ASR system, how much better subword models are at recognizing OOVs.
We propose a new method for modifying a subword-based language model so as to better recognize OOV-words.
arXiv Detail & Related papers (2021-07-16T19:39:30Z) - Training Bi-Encoders for Word Sense Disambiguation [4.149972584899897]
State-of-the-art approaches in Word Sense Disambiguation leverage lexical information along with pre-trained embeddings from these models to achieve results comparable to human inter-annotator agreement on standard evaluation benchmarks.
We further the state of the art in Word Sense Disambiguation through our multi-stage pre-training and fine-tuning pipeline.
arXiv Detail & Related papers (2021-05-21T06:06:03Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z) - On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Methods perform well on images with words within vocabulary but generalize poorly to images with words outside vocabulary.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
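The first related entry above evaluates how well subword vocabularies align with the morphemes of a language. As a rough illustration of that kind of check (not the method from that paper), the sketch below trains a tiny BPE vocabulary with the HuggingFace tokenizers library and reports its overlap with a gold morpheme list; the corpus, vocabulary size, and morpheme set are all toy assumptions.

```python
# Hedged sketch: learn a small BPE vocabulary and measure how many items from a
# hypothetical gold morpheme list appear in it, as a crude proxy for the
# "morphological quality" of the learned words and subwords.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus; a real evaluation would use a large monolingual corpus.
corpus = [
    "unbreakable things are breaking nonetheless",
    "the breaking news was unbelievable",
    "believable stories keep readers reading",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=120, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = set(tokenizer.get_vocab())            # learned words and subwords
morphemes = {"un", "break", "able", "ing"}    # hypothetical gold morphemes
overlap = morphemes & vocab
print(f"morpheme coverage: {len(overlap)}/{len(morphemes)}: {sorted(overlap)}")
```

Swapping BpeTrainer for WordPieceTrainer or UnigramTrainer (with the matching model class) would reproduce the same comparison across the three tokenization algorithms mentioned in that entry.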