The Impact of Positional Encodings on Multilingual Compression
- URL: http://arxiv.org/abs/2109.05388v1
- Date: Sat, 11 Sep 2021 23:22:50 GMT
- Title: The Impact of Positional Encodings on Multilingual Compression
- Authors: Vinit Ravishankar, Anders Søgaard
- Abstract summary: Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture.
We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models.
- Score: 3.454503173118508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to preserve word-order information in a non-autoregressive setting,
transformer architectures tend to include positional knowledge, by (for
instance) adding positional encodings to token embeddings. Several
modifications have been proposed over the sinusoidal positional encodings used
in the original transformer architecture; these include, for instance,
separating position encodings and token embeddings, or directly modifying
attention weights based on the distance between word pairs. We first show that,
surprisingly, while these modifications tend to improve monolingual language
models, none of them result in better multilingual language models. We then
answer why that is: Sinusoidal encodings were explicitly designed to facilitate
compositionality by allowing linear projections over arbitrary time steps.
Higher variances in multilingual training distributions require higher
compression, in which case compositionality becomes indispensable. Learned
absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal
embeddings in multilingual settings, but more complex positional encoding
architectures lack the inductive bias to effectively learn compositionality and
cross-lingual alignment. In other words, while sinusoidal positional encodings
were originally designed for monolingual applications, they are particularly
useful in multilingual language models.
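The compositionality argument above can be made concrete: the sinusoidal encoding of position t+k is a fixed, position-independent linear (block-rotation) transform of the encoding of position t. Below is a minimal NumPy sketch of that property; it is an illustration, not code from the paper, and the helper names and dimensions are arbitrary.
```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_matrix(k: int, d_model: int) -> np.ndarray:
    """Block-diagonal rotation M_k such that PE(t + k) = M_k @ PE(t) for every t."""
    M = np.zeros((d_model, d_model))
    freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # rotation acting on the (sin, cos) pair for frequency w
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pe = sinusoidal_pe(max_len=128, d_model=16)
M5 = shift_matrix(k=5, d_model=16)
# The same linear map relates positions (t, t+5) for every t:
print(np.allclose(pe[5:], pe[:-5] @ M5.T, atol=1e-8))  # True
```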
Related papers
- Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment [50.80949663719335]
Training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of semantic spaces of individual languages.
We train language-specific sentence encoders to avoid negative interference between languages.
We then align all non-English monolingual encoders to the English encoder by training a cross-lingual alignment adapter on top of each.
arXiv Detail & Related papers (2024-07-20T13:56:39Z) - MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose MAGNET, an adaptive gradient-based subword tokenization method that reduces over-segmentation in multilingual settings.
arXiv Detail & Related papers (2024-07-11T18:59:21Z) - A Morphology-Based Investigation of Positional Encodings [46.667985003225496]
Morphology and word order are closely linked, with the latter incorporated into transformer-based models through positional encodings.
This prompts a fundamental inquiry: Is there a correlation between the morphological complexity of a language and the utilization of positional encoding in pre-trained language models?
In pursuit of an answer, we present the first study addressing this question, encompassing 22 languages and 5 downstream tasks.
arXiv Detail & Related papers (2024-04-06T07:10:47Z) - MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap across diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z) - The Locality and Symmetry of Positional Encodings [9.246374019271938]
We conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style).
We uncover the core function of PEs by identifying two common properties, Locality and Symmetry.
We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly.
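As a toy numerical illustration of those two properties (my own sketch using sinusoidal encodings and dot-product similarity, not the paper's probing setup): the position-similarity matrix is symmetric around each position and peaks at zero offset.
```python
import numpy as np

# Toy illustration: dot-product similarity between sinusoidal positional
# encodings depends only on the offset, so it is symmetric around each
# position (Symmetry) and concentrated near the diagonal (Locality).
def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

pe = sinusoidal_pe(64, 32)
sim = pe @ pe.T                                            # (64, 64) similarity matrix
print(np.isclose(sim[32, 32 + 5], sim[32, 32 - 5]))        # symmetric in the offset: True
print(np.all(sim.argmax(axis=1) == np.arange(len(pe))))    # each position is most similar to itself: True
print(sim[10, 10] > sim[10, 20] > sim[10, 40])             # similarity at these offsets decays: True
```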
arXiv Detail & Related papers (2023-10-19T16:15:15Z) - CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling [10.26356931263957]
We introduce CONFLATOR: a neural language modeling approach for code-mixed languages.
We show that rotatory positional encodings along with switching point information yield the best results.
CONFLATOR outperforms the state-of-the-art on two tasks based on code-mixed Hindi and English.
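For reference, a minimal generic rotary ("rotatory") positional-encoding sketch in NumPy is shown below. It is not CONFLATOR's switching-point variant, only the underlying mechanism: each query/key feature pair is rotated by an angle proportional to its position, so attention logits depend only on relative offsets.
```python
import numpy as np

def rotary(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position encoding to x of shape (seq_len, d), with d even."""
    d = x.shape[-1]
    freqs = 1.0 / base ** (np.arange(0, d, 2) / d)      # (d/2,)
    angles = positions[:, None] * freqs[None, :]        # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                     # split features into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                  # rotate each 2-D pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8, 16))                      # toy queries and keys
pos = np.arange(8)
scores = rotary(q, pos) @ rotary(k, pos).T              # attention logits
# Shifting all positions by a constant leaves the logits unchanged,
# since they depend only on relative offsets:
print(np.allclose(scores, rotary(q, pos + 7) @ rotary(k, pos + 7).T))  # True
```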
arXiv Detail & Related papers (2023-09-11T07:02:13Z) - Online Gesture Recognition using Transformer and Natural Language Processing [0.0]
Transformer architecture is shown to provide a powerful machine framework for online gestures corresponding to glyph strokes of natural language sentences.
arXiv Detail & Related papers (2023-05-05T10:17:22Z) - Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z) - Transformer Language Models without Positional Encodings Still Learn Positional Information [45.42248458957122]
We find that transformer language models without any explicit positional encoding are still competitive with standard models.
We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
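A toy illustration of that conjecture (my own, not the paper's analysis): under a causal mask with uniform attention, a token at position t averages over its t+1 visible predecessors, so a value that marks only the first token comes out scaled by 1/(t+1), a deterministic function of absolute position that later layers could in principle decode.
```python
import numpy as np

T = 8
values = np.zeros((T, 1))
values[0] = 1.0                          # only the first token carries a marker

# Uniform causal attention: token t attends equally to positions 0..t.
attn = np.tril(np.ones((T, T)))
attn /= attn.sum(axis=1, keepdims=True)

out = attn @ values                      # out[t] = 1 / (t + 1)
print(out.ravel())                       # [1.  0.5  0.333...  0.25  ...]
print(np.allclose(out.ravel(), 1.0 / (np.arange(T) + 1)))  # True
```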
arXiv Detail & Related papers (2022-03-30T19:37:07Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
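The mechanism can be pictured with a generic single-head cross-attention sketch (an illustrative assumption, not VECO's actual module or parameterization): tokens of one language attend over the hidden states of another, so masked-word prediction is no longer conditioned solely on same-language context.
```python
import numpy as np

def cross_attention(h_src: np.ndarray, h_tgt: np.ndarray,
                    Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: queries from h_src, keys/values from h_tgt."""
    q, k, v = h_src @ Wq, h_tgt @ Wk, h_tgt @ Wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over target-language tokens
    return weights @ v                               # (len_src, d)

rng = np.random.default_rng(0)
d = 16
h_en = rng.normal(size=(5, d))        # hidden states of an English sentence (toy)
h_xx = rng.normal(size=(7, d))        # hidden states of its translation (toy)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))
fused = cross_attention(h_en, h_xx, Wq, Wk, Wv)
print(fused.shape)                    # (5, 16): each English token mixes in cross-lingual context
```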
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.