CLSE: Corpus of Linguistically Significant Entities
- URL: http://arxiv.org/abs/2211.02423v2
- Date: Wed, 30 Aug 2023 12:30:33 GMT
- Title: CLSE: Corpus of Linguistically Significant Entities
- Authors: Aleksandr Chuklin, Justin Zhao, Mihir Kale
- Abstract summary: We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
- Score: 58.29901964387952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the biggest challenges of natural language generation (NLG) is the
proper handling of named entities. Named entities are a common source of
grammar mistakes such as wrong prepositions, wrong article handling, or
incorrect entity inflection. Without factoring in linguistic representation, such
errors are often underrepresented when evaluating on a small set of arbitrarily
picked argument values, or when translating a dataset from a linguistically
simpler language, like English, to a linguistically complex language, like
Russian. However, for some applications, precise grammatical
correctness is critical -- native speakers may find entity-related grammar
errors silly, jarring, or even offensive.
To enable the creation of more linguistically diverse NLG datasets, we
release a Corpus of Linguistically Significant Entities (CLSE) annotated by
linguist experts. The corpus includes 34 languages and covers 74 different
semantic types to support various applications from airline ticketing to video
games. To demonstrate one possible use of CLSE, we produce an augmented version
of the Schema-Guided Dialog Dataset, SGD-CLSE. Using the CLSE's entities and a
small number of human translations, we create a linguistically representative
NLG evaluation benchmark in three languages: French (high-resource), Marathi
(low-resource), and Russian (highly inflected language). We establish quality
baselines for neural, template-based, and hybrid NLG systems and discuss the
strengths and weaknesses of each approach.
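To make the augmentation idea concrete, here is a minimal sketch of how a CLSE-like entity table could be used to pick linguistically diverse slot values for a template-based NLG system. The entity records, feature names, and template below are illustrative stand-ins, not the released corpus format.

```python
# Minimal sketch: using a CLSE-style entity table to build linguistically
# diverse test cases for a template-based NLG system. The entity records,
# feature names, and template are illustrative; consult the released CLSE
# corpus for its actual schema.
from dataclasses import dataclass

@dataclass
class Entity:
    value: str          # surface form of the named entity
    language: str       # ISO 639-1 code
    semantic_type: str  # e.g. "city", "movie", "airline"
    features: dict      # grammatical annotations, e.g. gender, number

# A toy slice of a CLSE-like corpus: Russian city names with grammatical gender.
ENTITIES = [
    Entity("Москва", "ru", "city", {"gender": "fem"}),
    Entity("Санкт-Петербург", "ru", "city", {"gender": "masc"}),
    Entity("Иваново", "ru", "city", {"gender": "neut"}),
]

def diverse_test_cases(entities, semantic_type, feature):
    """Pick one entity per value of a grammatical feature, so the test set
    covers each linguistic class instead of repeating one arbitrary value."""
    by_class = {}
    for e in entities:
        if e.semantic_type == semantic_type:
            by_class.setdefault(e.features.get(feature), e)
    return list(by_class.values())

TEMPLATE = "Ваш рейс в {city} подтверждён."  # "Your flight to {city} is confirmed."

for entity in diverse_test_cases(ENTITIES, "city", "gender"):
    # A real system would inflect the entity (e.g. accusative case) here;
    # covering all genders exposes inflection bugs a single value would hide.
    print(TEMPLATE.format(city=entity.value))
```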
Related papers
- From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages [0.5706164516481158]
We propose a model-agnostic cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language.
We performed experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian.
arXiv Detail & Related papers (2024-10-24T15:20:54Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs [2.9521383230206966]
This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP).
RuBLiMP includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon.
We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense.
arXiv Detail & Related papers (2024-06-27T14:55:19Z)
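A minimal sketch of the standard minimal-pair evaluation used by benchmarks like RuBLiMP: score both sentences of a pair with a causal language model and count the pair as passed if the grammatical sentence receives the higher probability. The model name and the sentence pair below are illustrative, not drawn from RuBLiMP itself.

```python
# Minimal sketch of minimal-pair evaluation: a language model "passes" a pair
# if it assigns higher probability to the grammatical sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "ai-forever/rugpt3small_based_on_gpt2"  # any causal LM for Russian
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of a sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean token-level cross-entropy; multiply back by the
        # number of predicted tokens to get a total log-probability
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

grammatical = "Она читает книгу."    # "She reads a book."
ungrammatical = "Она читает книга."  # wrong case on "book"
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))
```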
- How do lexical semantics affect translation? An empirical study [1.0152838128195467]
A distinguishing feature of natural language is that words are typically ordered according to the grammar of a given language.
We investigate how word order in, and lexical similarity between, the source and target languages affect translation performance.
arXiv Detail & Related papers (2021-12-31T23:28:28Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
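A minimal sketch of bitext retrieval as an alignment measure, as in the paper above: embed both sides of a parallel corpus, retrieve the nearest target sentence for each source sentence by cosine similarity, and report accuracy@1. The random embeddings below stand in for a real multilingual encoder.

```python
# Minimal sketch of bitext retrieval as a cross-lingual alignment measure.
import numpy as np

def retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """src[i] and tgt[i] embed the two sides of the i-th parallel sentence."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T               # cosine similarity matrix
    predicted = sims.argmax(axis=1)  # nearest target per source sentence
    return float((predicted == np.arange(len(src))).mean())

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 64))
tgt = src + 0.1 * rng.normal(size=(100, 64))  # noisy "translations"
print(f"accuracy@1 = {retrieval_accuracy(src, tgt):.2f}")
```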
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
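A minimal sketch of the clustering step behind the representation sprachbund idea above: given one representation vector per language (e.g. a mean-pooled encoder embedding over that language's corpus), group languages with k-means. The vectors, language list, and cluster count are illustrative stand-ins.

```python
# Minimal sketch: cluster per-language representation vectors into groups
# ("representation sprachbunds"). Random vectors stand in for real encoder
# statistics.
import numpy as np
from sklearn.cluster import KMeans

LANGUAGES = ["en", "de", "fr", "ru", "uk", "hi", "mr", "zh"]
rng = np.random.default_rng(0)
language_vectors = rng.normal(size=(len(LANGUAGES), 32))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(language_vectors)
for cluster_id in range(3):
    members = [l for l, c in zip(LANGUAGES, kmeans.labels_) if c == cluster_id]
    print(f"representation sprachbund {cluster_id}: {members}")
```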
- On the Difficulty of Translating Free-Order Case-Marking Languages [2.9434930072968584]
We investigate whether free-order case-marking languages are more difficult to translate for state-of-the-art neural machine translation (NMT) models.
We find that word order flexibility in the source language only leads to a very small loss of NMT quality.
In medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.
arXiv Detail & Related papers (2021-07-13T13:09:58Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
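A minimal sketch of approach (ii) from the entry above: standardize embeddings per language so that language-specific means and variances are removed, leaving a more language-neutral space. The data here is synthetic; a real pipeline would use encoder outputs.

```python
# Minimal sketch: remove language-specific means and variances by z-scoring
# each language's embeddings dimension-wise. Synthetic data stands in for
# real encoder embeddings.
import numpy as np

def remove_language_statistics(embeddings: np.ndarray) -> np.ndarray:
    """Z-score the embeddings of one language along each dimension."""
    mean = embeddings.mean(axis=0, keepdims=True)
    std = embeddings.std(axis=0, keepdims=True) + 1e-8  # avoid divide-by-zero
    return (embeddings - mean) / std

rng = np.random.default_rng(0)
per_language = {
    "en": rng.normal(loc=0.5, scale=1.0, size=(200, 64)),
    "ru": rng.normal(loc=-0.3, scale=2.0, size=(200, 64)),
}
aligned = {lang: remove_language_statistics(e) for lang, e in per_language.items()}
for lang, e in aligned.items():
    print(lang, e.mean().round(3), e.std().round(3))  # ~0 mean, ~1 std
```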
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.