Multilingual Nonce Dependency Treebanks: Understanding how Language Models represent and process syntactic structure
- URL: http://arxiv.org/abs/2311.07497v2
- Date: Wed, 12 Jun 2024 13:41:16 GMT
- Title: Multilingual Nonce Dependency Treebanks: Understanding how Language Models represent and process syntactic structure
- Authors: David Arps, Laura Kallmeyer, Younes Samih, Hassan Sajjad,
- Abstract summary: SPUD (Semantically Perturbed Universal Dependencies) is a framework for creating nonce treebanks for the Universal Dependencies (UD) corpora.
We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks.
- Score: 15.564927804136852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by perplexity scores of autoregressive (ALM) and masked language models (MLM). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of M\"uller-Eberstein et al. (2022) on nonce test data and show that the performance declines on both MLMs and ALMs wrt. original test data. However, a majority of the performance is kept, suggesting that the probe indeed learns syntax independently from semantics.
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs)
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO)
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language [0.0]
This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language.
We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and language model (LLM) annotations.
arXiv Detail & Related papers (2024-10-17T08:10:24Z) - Which Syntactic Capabilities Are Statistically Learned by Masked
Language Models for Code? [51.29970742152668]
We highlight relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval in Syntactic Capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z) - Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics
Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z) - Measuring Reliability of Large Language Models through Semantic
Consistency [3.4990427823966828]
We develop a measure of semantic consistency that allows the comparison of open-ended text outputs.
We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions.
arXiv Detail & Related papers (2022-11-10T20:21:07Z) - Multilingual Syntax-aware Language Modeling through Dependency Tree
Conversion [12.758523394180695]
We study the effect on neural language models (LMs) performance across nine conversion methods and five languages.
On average, the performance of our best model represents a 19 % increase in accuracy over the worst choice across all languages.
Our experiments highlight the importance of choosing the right tree formalism, and provide insights into making an informed decision.
arXiv Detail & Related papers (2022-04-19T03:56:28Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - Multilingual Irony Detection with Dependency Syntax and Neural Models [61.32653485523036]
It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme.
The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.
arXiv Detail & Related papers (2020-11-11T11:22:05Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and
Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z) - Cross-Lingual Adaptation Using Universal Dependencies [1.027974860479791]
We show that models trained using UD parse trees for complex NLP tasks can characterize very different languages.
Based on UD parse trees, we develop several models using tree kernels and show that these models trained on the English dataset can correctly classify data of other languages.
arXiv Detail & Related papers (2020-03-24T13:04:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.