DefSent: Sentence Embeddings using Definition Sentences
- URL: http://arxiv.org/abs/2105.04339v2
- Date: Tue, 11 May 2021 14:45:57 GMT
- Title: DefSent: Sentence Embeddings using Definition Sentences
- Authors: Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
- Abstract summary: We propose DefSent, a sentence embedding method that uses definition sentences from a word dictionary.
Since dictionaries are available for many languages, DefSent is more broadly applicable than NLI-based methods and requires no additional dataset construction.
- Score: 8.08585816311037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence embedding methods using natural language inference (NLI) datasets
have been successfully applied to various tasks. However, these methods are
available only for a limited set of languages because they rely heavily on large
NLI datasets. In this paper, we propose DefSent, a sentence embedding method that
uses definition sentences from a word dictionary. Since dictionaries are
available for many languages, DefSent is more broadly applicable than NLI-based
methods and requires no additional dataset construction. We demonstrate
that DefSent performs comparably to methods using large NLI datasets on
unsupervised semantic textual similarity (STS) tasks and slightly better on
SentEval tasks.
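The objective described in the abstract (encode a definition sentence, then predict the word it defines) can be illustrated with a toy model. This is only a sketch under assumed sizes: a random embedding table and a fresh linear classifier stand in for the pretrained masked language model that DefSent actually fine-tunes, and all names and dimensions here are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
VOCAB_SIZE, EMB_DIM = 100, 32

class ToyDefSent(nn.Module):
    """Minimal sketch of the DefSent objective: embed a definition
    sentence, then predict the word it defines.  The real method
    fine-tunes a pretrained language model; here a random embedding
    table and a linear head stand in for both."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.word_head = nn.Linear(EMB_DIM, VOCAB_SIZE)

    def sentence_embedding(self, token_ids):
        # Mean pooling over token embeddings; after training, this
        # vector serves as the sentence embedding.
        return self.tok_emb(token_ids).mean(dim=1)

    def forward(self, token_ids):
        # Logits over the vocabulary for the defined word.
        return self.word_head(self.sentence_embedding(token_ids))

model = ToyDefSent()
defn = torch.randint(0, VOCAB_SIZE, (1, 8))   # one 8-token definition
target_word = torch.tensor([42])              # id of the defined word
loss = nn.functional.cross_entropy(model(defn), target_word)
loss.backward()  # an optimizer step would follow during fine-tuning
```

In the method itself, a pretrained model's own language-modeling head can play the role of `word_head`, so no new prediction layer need be learned from scratch; the abstract only specifies the definition-to-word setup, so this detail is an assumption.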
Related papers
- DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entries [5.317095505067784]
This paper presents a significant improvement on the previous conference paper known as DefSent.
We propose a novel method to progressively build entry embeddings that are not subject to these limitations.
As a result, definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries.
arXiv Detail & Related papers (2024-05-25T09:43:38Z)
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- Learning to Infer from Unlabeled Data: A Semi-supervised Learning Approach for Robust Natural Language Inference [47.293189105900524]
Natural Language Inference (NLI) aims at predicting the relation between a pair of sentences (premise and hypothesis) as entailment, contradiction or semantic independence.
Although deep learning models have shown promising performance for NLI in recent years, they rely on large-scale, expensive human-annotated datasets.
Semi-supervised learning (SSL) is a popular technique for reducing the reliance on human annotation by leveraging unlabeled data for training.
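The SSL recipe described above can be sketched as a minimal self-training loop. Everything here is hypothetical: random feature vectors stand in for sentence-pair encodings, and a nearest-centroid classifier stands in for the neural NLI model; the cited paper's actual procedure may differ.

```python
import numpy as np

# Toy self-training loop for 3-way NLI (entailment / contradiction /
# semantic independence).  Features are random stand-ins for
# sentence-pair encodings.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(30, 5))             # labeled pairs
y_lab = rng.integers(0, 3, size=30)          # gold labels
X_unlab = rng.normal(size=(100, 5))          # unlabeled pairs

def fit_centroids(X, y):
    # One centroid per class: the "model" in this sketch.
    return np.stack([X[y == c].mean(axis=0) for c in range(3)])

def predict(centroids, X):
    # Label = nearest centroid; distance serves as (un)certainty.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

centroids = fit_centroids(X_lab, y_lab)

# Pseudo-label the unlabeled pairs the model is most certain about,
# then retrain on the union of gold and pseudo-labeled data.
pred, dist = predict(centroids, X_unlab)
confident = dist < np.median(dist)
X_aug = np.concatenate([X_lab, X_unlab[confident]])
y_aug = np.concatenate([y_lab, pred[confident]])
centroids = fit_centroids(X_aug, y_aug)
```

The design choice to keep only the most confident pseudo-labels is the standard guard against confirmation bias in self-training; real systems typically iterate this loop and tune the confidence threshold.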
arXiv Detail & Related papers (2022-11-05T20:34:08Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- Lacking the embedding of a word? Look it up into a traditional dictionary [0.2624902795082451]
We propose to use definitions retrieved in traditional dictionaries to produce word embeddings for rare words.
DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods for producing embeddings of unknown words.
arXiv Detail & Related papers (2021-09-24T06:27:58Z)
- DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z)
- Mining Knowledge for Natural Language Inference from Wikipedia Categories [53.26072815839198]
We introduce WikiNLI: a resource for improving model performance on NLI and lexical entailment (LE) tasks.
It contains 428,899 pairs of phrases constructed from naturally annotated category hierarchies in Wikipedia.
We show that we can improve strong baselines such as BERT and RoBERTa by pretraining them on WikiNLI and transferring the models on downstream tasks.
arXiv Detail & Related papers (2020-10-03T00:45:01Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- FarsTail: A Persian Natural Language Inference Dataset [1.3048920509133808]
Natural language inference (NLI) is one of the central tasks in natural language processing (NLP).
We present a new dataset for the NLI task in the Persian language, also known as Farsi.
This dataset, named FarsTail, includes 10,367 samples which are provided in both the Persian language and the indexed format.
arXiv Detail & Related papers (2020-09-18T13:04:04Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information provided and is not responsible for any consequences of its use.