The Impact of Positional Encodings on Multilingual Compression
- URL: http://arxiv.org/abs/2109.05388v1
- Date: Sat, 11 Sep 2021 23:22:50 GMT
- Title: The Impact of Positional Encodings on Multilingual Compression
- Authors: Vinit Ravishankar, Anders Søgaard
- Abstract summary: Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture.
We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models.
- Score: 3.454503173118508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to preserve word-order information in a non-autoregressive setting,
transformer architectures tend to include positional knowledge, by (for
instance) adding positional encodings to token embeddings. Several
modifications have been proposed over the sinusoidal positional encodings used
in the original transformer architecture; these include, for instance,
separating position encodings and token embeddings, or directly modifying
attention weights based on the distance between word pairs. We first show that,
surprisingly, while these modifications tend to improve monolingual language
models, none of them result in better multilingual language models. We then
answer why that is: Sinusoidal encodings were explicitly designed to facilitate
compositionality by allowing linear projections over arbitrary time steps.
Higher variances in multilingual training distributions require higher
compression, in which case compositionality becomes indispensable. Learned
absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal
embeddings in multilingual settings, but more complex positional encoding
architectures lack the inductive bias to effectively learn compositionality and
cross-lingual alignment. In other words, while sinusoidal positional encodings
were originally designed for monolingual applications, they are particularly
useful in multilingual language models.
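The compositionality argument above can be made concrete: the sinusoidal encoding of position t+k is a fixed, position-independent linear (block-rotation) transform of the encoding of position t. Below is a minimal NumPy sketch of that property; it is an illustration, not code from the paper, and the helper names and dimensions are arbitrary.
```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_matrix(k: int, d_model: int) -> np.ndarray:
    """Block-diagonal rotation M_k such that PE(t + k) = M_k @ PE(t) for every t."""
    M = np.zeros((d_model, d_model))
    freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # rotation acting on the (sin, cos) pair for frequency w
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pe = sinusoidal_pe(max_len=128, d_model=16)
M5 = shift_matrix(k=5, d_model=16)
# The same linear map relates positions (t, t+5) for every t:
print(np.allclose(pe[5:], pe[:-5] @ M5.T, atol=1e-8))  # True
```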
Related papers
- Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment [50.80949663719335]
Training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of semantic spaces of individual languages.
We train language-specific sentence encoders to avoid negative interference between languages.
We then align all non-English monolingual encoders to the English encoder by training a cross-lingual alignment adapter on top of each.
arXiv Detail & Related papers (2024-07-20T13:56:39Z) - MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose MAGNET, an adaptive gradient-based subword tokenization method that reduces over-segmentation in multilingual settings.
arXiv Detail & Related papers (2024-07-11T18:59:21Z) - A Morphology-Based Investigation of Positional Encodings [46.667985003225496]
Morphology and word order are closely linked, with the latter incorporated into transformer-based models through positional encodings.
This prompts a fundamental inquiry: Is there a correlation between the morphological complexity of a language and the utilization of positional encoding in pre-trained language models?
In pursuit of an answer, we present the first study addressing this question, encompassing 22 languages and 5 downstream tasks.
arXiv Detail & Related papers (2024-04-06T07:10:47Z) - MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap across diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z) - The Locality and Symmetry of Positional Encodings [9.246374019271938]
We conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style).
We uncover the core function of PEs by identifying two common properties, Locality and Symmetry.
We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly.
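As a toy numerical illustration of those two properties (my own sketch using sinusoidal encodings and dot-product similarity, not the paper's probing setup): the position-similarity matrix is symmetric around each position and peaks at zero offset.
```python
import numpy as np

# Toy illustration: dot-product similarity between sinusoidal positional
# encodings depends only on the offset, so it is symmetric around each
# position (Symmetry) and concentrated near the diagonal (Locality).
def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

pe = sinusoidal_pe(64, 32)
sim = pe @ pe.T                                            # (64, 64) similarity matrix
print(np.isclose(sim[32, 32 + 5], sim[32, 32 - 5]))        # symmetric in the offset: True
print(np.all(sim.argmax(axis=1) == np.arange(len(pe))))    # each position is most similar to itself: True
print(sim[10, 10] > sim[10, 20] > sim[10, 40])             # similarity at these offsets decays: True
```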
arXiv Detail & Related papers (2023-10-19T16:15:15Z) - CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling [10.26356931263957]
We introduce CONFLATOR: a neural language modeling approach for code-mixed languages.
We show that rotatory positional encodings along with switching point information yield the best results.
CONFLATOR outperforms the state-of-the-art on two tasks based on code-mixed Hindi and English.
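For reference, a minimal generic rotary ("rotatory") positional-encoding sketch in NumPy is shown below. It is not CONFLATOR's switching-point variant, only the underlying mechanism: each query/key feature pair is rotated by an angle proportional to its position, so attention logits depend only on relative offsets.
```python
import numpy as np

def rotary(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position encoding to x of shape (seq_len, d), with d even."""
    d = x.shape[-1]
    freqs = 1.0 / base ** (np.arange(0, d, 2) / d)      # (d/2,)
    angles = positions[:, None] * freqs[None, :]        # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                     # split features into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                  # rotate each 2-D pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8, 16))                      # toy queries and keys
pos = np.arange(8)
scores = rotary(q, pos) @ rotary(k, pos).T              # attention logits
# Shifting all positions by a constant leaves the logits unchanged,
# since they depend only on relative offsets:
print(np.allclose(scores, rotary(q, pos + 7) @ rotary(k, pos + 7).T))  # True
```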
arXiv Detail & Related papers (2023-09-11T07:02:13Z) - Online Gesture Recognition using Transformer and Natural Language Processing [0.0]
Transformer architecture is shown to provide a powerful machine framework for online gestures corresponding to glyph strokes of natural language sentences.
arXiv Detail & Related papers (2023-05-05T10:17:22Z) - Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z) - Transformer Language Models without Positional Encodings Still Learn Positional Information [45.42248458957122]
We find that transformer language models without any explicit positional encoding are still competitive with standard models.
We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
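A toy illustration of that conjecture (my own, not the paper's analysis): under a causal mask with uniform attention, a token at position t averages over its t+1 visible predecessors, so a value that marks only the first token comes out scaled by 1/(t+1), a deterministic function of absolute position that later layers could in principle decode.
```python
import numpy as np

T = 8
values = np.zeros((T, 1))
values[0] = 1.0                          # only the first token carries a marker

# Uniform causal attention: token t attends equally to positions 0..t.
attn = np.tril(np.ones((T, T)))
attn /= attn.sum(axis=1, keepdims=True)

out = attn @ values                      # out[t] = 1 / (t + 1)
print(out.ravel())                       # [1.  0.5  0.333...  0.25  ...]
print(np.allclose(out.ravel(), 1.0 / (np.arange(T) + 1)))  # True
```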
arXiv Detail & Related papers (2022-03-30T19:37:07Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
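The mechanism can be pictured with a generic single-head cross-attention sketch (an illustrative assumption, not VECO's actual module or parameterization): tokens of one language attend over the hidden states of another, so masked-word prediction is no longer conditioned solely on same-language context.
```python
import numpy as np

def cross_attention(h_src: np.ndarray, h_tgt: np.ndarray,
                    Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: queries from h_src, keys/values from h_tgt."""
    q, k, v = h_src @ Wq, h_tgt @ Wk, h_tgt @ Wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over target-language tokens
    return weights @ v                               # (len_src, d)

rng = np.random.default_rng(0)
d = 16
h_en = rng.normal(size=(5, d))        # hidden states of an English sentence (toy)
h_xx = rng.normal(size=(7, d))        # hidden states of its translation (toy)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))
fused = cross_attention(h_en, h_xx, Wq, Wk, Wv)
print(fused.shape)                    # (5, 16): each English token mixes in cross-lingual context
```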
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.