Dependency Parsing is More Parameter-Efficient with Normalization
- URL: http://arxiv.org/abs/2505.20215v2
- Date: Fri, 24 Oct 2025 03:08:06 GMT
- Title: Dependency Parsing is More Parameter-Efficient with Normalization
- Authors: Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño
- Abstract summary: We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We train $N$-layer stacked BiLSTMs and evaluate their performance with and without normalizing biaffine scores.
- Score: 12.09105878208262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dependency parsing is the task of inferring natural language structure, often approached by modeling word interactions via attention through biaffine scoring. This mechanism works like self-attention in Transformers, where scores are calculated for every pair of words in a sentence. However, unlike Transformer attention, biaffine scoring does not use normalization prior to taking the softmax of the scores. In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on semantic and syntactic dependency parsing in multiple languages, along with latent graph inference on non-linguistic data, using various settings of a $k$-hop parser. We train $N$-layer stacked BiLSTMs and evaluate the parser's performance with and without normalizing biaffine scores. Normalizing allows us to achieve state-of-the-art performance with fewer samples and trainable parameters. Code: https://github.com/paolo-gajo/EfficientSDP
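The abstract's core claim is that unnormalized biaffine scores have high variance, which saturates the softmax and forces the parser to spend extra parameters compensating. A minimal sketch of biaffine head-selection scoring with optional $1/\sqrt{d}$ scaling (the scaling choice here mirrors Transformer attention and is an illustrative assumption; the paper's exact normalization may differ):

```python
import numpy as np

def biaffine_head_probs(h_dep, h_head, W, normalize=True):
    """Score every (dependent, head) pair and softmax over candidate heads.

    h_dep, h_head: (n, d) word representations from e.g. a BiLSTM encoder.
    W: (d, d) biaffine weight matrix.
    Returns an (n, n) matrix; row i is a distribution over heads for word i.
    """
    scores = h_dep @ W @ h_head.T            # (n, n) raw biaffine scores
    if normalize:
        # Scale down score variance before the softmax, as in scaled
        # dot-product attention; without this, high-variance scores
        # produce near-one-hot (saturated) softmax outputs.
        scores = scores / np.sqrt(h_dep.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 64))                 # 6 words, 64-dim states
W = rng.normal(size=(64, 64))
P_norm = biaffine_head_probs(H, H, W, normalize=True)
P_raw = biaffine_head_probs(H, H, W, normalize=False)
```

With random inputs of this size, the unnormalized distributions are typically far peakier (higher max probability per row) than the normalized ones, which is the saturation effect the paper argues the extra parameters must otherwise counteract.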
Related papers
- Parameter-Efficient Transformer Embeddings [0.0]
We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs. We train standard transformers and our architecture on natural language inference tasks. Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout.
arXiv Detail & Related papers (2025-05-04T21:47:18Z) - Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z) - Language Generation with Strictly Proper Scoring Rules [70.340673452404]
We propose a strategy for adapting scoring rules to language generation, allowing for language modeling with any non-local scoring rules.
We train language generation models using two classic strictly proper scoring rules, the Brier score and the Spherical score, as alternatives to the logarithmic score.
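The Brier and Spherical scores mentioned above are both strictly proper scoring rules, meaning the expected score is optimized only by reporting the true distribution. A minimal sketch of both as token-level losses (the loss formulations here are the standard textbook definitions, not necessarily the exact training objectives used in that paper):

```python
import numpy as np

def brier_loss(probs, target):
    """Brier score as a loss: squared error between the predicted
    distribution and the one-hot gold distribution. Lower is better."""
    y = np.zeros_like(probs)
    y[target] = 1.0
    return np.sum((probs - y) ** 2)

def spherical_score(probs, target):
    """Spherical score: probability of the gold token divided by the
    L2 norm of the whole distribution. Higher is better; negate to
    use as a loss."""
    return probs[target] / np.linalg.norm(probs)

# A uniform distribution over 4 tokens, gold token = 0:
probs = np.full(4, 0.25)
print(brier_loss(probs, 0))       # (0.25-1)^2 + 3*(0.25)^2 = 0.75
print(spherical_score(probs, 0))  # 0.25 / 0.5 = 0.5
```

Unlike the logarithmic score (cross-entropy), both rules are bounded, which is one motivation for using them as alternatives in language modeling.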
arXiv Detail & Related papers (2024-05-29T09:09:00Z) - Provable benefits of score matching [30.317535687908755]
We give the first example of a natural exponential family of distributions such that score matching loss is computationally efficient to optimize.
We show that designing a zeroth-order or first-order oracle for optimizing the likelihood loss is NP-hard.
Minimizing the score matching loss is both computationally and statistically efficient, with complexity in the ambient dimension.
arXiv Detail & Related papers (2023-06-03T03:42:30Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-art accuracy for transformer-based Natural Language Processing models.
This paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism.
arXiv Detail & Related papers (2022-04-07T05:31:13Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z) - A Modest Pareto Optimisation Analysis of Dependency Parsers in 2021 [0.38073142980733]
We evaluate three leading dependency parsing systems from different paradigms on a small yet diverse subset of languages.
As we are interested in efficiency, we evaluate the core parsers without pretrained language models.
Biaffine parsing emerges as a well-balanced default choice.
arXiv Detail & Related papers (2021-06-08T09:55:47Z) - Semantic Parsing with Less Prior and More Monolingual Data [12.715221084359085]
This work investigates whether a generic transformer-based seq2seq model can achieve competitive performance with minimal semantic-parsing specific inductive bias design.
By exploiting a relatively large monolingual corpus of the target programming language, which is cheap to mine from the web (unlike a parallel corpus), we achieved 80.75% exact match accuracy on Django and a 32.57 BLEU score on CoNaLa, both of which are SOTA to the best of our knowledge.
arXiv Detail & Related papers (2021-01-01T16:02:38Z) - Constructing interval variables via faceted Rasch measurement and
multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z) - A Simple Global Neural Discourse Parser [61.728994693410954]
We propose a simple chart-based neural discourse parser that does not require any manually crafted features and is based on learned span representations only.
We empirically demonstrate that our model achieves the best performance among global parsers, and performance comparable to state-of-the-art greedy parsers.
arXiv Detail & Related papers (2020-09-02T19:28:40Z) - Is Supervised Syntactic Parsing Beneficial for Language Understanding?
An Empirical Investigation [71.70562795158625]
Traditional NLP has long held (supervised) syntactic parsing necessary for successful higher-level semantic language understanding (LU).
The recent advent of end-to-end neural models, self-supervised via language modeling (LM), and their success on a wide range of LU tasks, calls this belief into question.
We empirically investigate the usefulness of supervised parsing for semantic LU in the context of LM-pretrained transformer networks.
arXiv Detail & Related papers (2020-08-15T21:03:36Z) - Efficient Second-Order TreeCRF for Neural Dependency Parsing [23.426500262860777]
In the deep learning (DL) era, parsing models have been extremely simplified with little loss in performance.
This paper presents a second-order TreeCRF extension to the biaffine parser.
We propose an effective way to batchify the inside and Viterbi algorithms for direct large matrix operation.
arXiv Detail & Related papers (2020-05-03T03:18:59Z) - Towards Instance-Level Parser Selection for Cross-Lingual Transfer of
Dependency Parsers [59.345145623931636]
We argue for a novel cross-lingual transfer paradigm: instance-level parser selection (ILPS).
We present a proof-of-concept study focused on instance-level selection in the framework of delexicalized transfer.
arXiv Detail & Related papers (2020-04-16T13:18:55Z) - Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream
Tasks [1.6507910904669727]
We show that an optimal combination of hyper-parameters exists and evaluate various combinations.
We obtain better human-assigned WordSim scores, corresponding Spearman correlation and better downstream performances compared to the original model.
arXiv Detail & Related papers (2020-03-23T07:38:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.