Building Whitespace-Sensitive Languages Using Whitespace-Insensitive Components
- URL: http://arxiv.org/abs/2510.08200v1
- Date: Thu, 09 Oct 2025 13:26:47 GMT
- Title: Building Whitespace-Sensitive Languages Using Whitespace-Insensitive Components
- Authors: Alexander Hellwig, Nico Jansen, Bernhard Rumpe
- Abstract summary: This paper presents a technique for using modular, whitespace-insensitive language modules to construct whitespace-sensitive languages. Our solution aims to increase the reusability of existing language components to reduce development time and increase the overall quality of software languages.
- Score: 42.44842805761906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Software Language Engineering, there is a trend towards reusability by composing modular language components. However, this reusability is severely inhibited by a gap in integrating whitespace-sensitive and whitespace-insensitive languages. There is currently no consistent procedure for seamlessly reusing such language components in both cases, such that libraries often cannot be reused, and whitespace-sensitive languages are developed from scratch. This paper presents a technique for using modular, whitespace-insensitive language modules to construct whitespace-sensitive languages by pre-processing language artifacts before parsing. The approach is evaluated by reconstructing a simplified version of the programming language Python. Our solution aims to increase the reusability of existing language components to reduce development time and increase the overall quality of software languages.
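To make the core idea concrete, here is a minimal sketch (assumed for illustration, not taken from the paper) of a pre-processing pass in the spirit the abstract describes: indentation-based block structure is rewritten into explicit delimiter tokens, so the result can be parsed by a whitespace-insensitive grammar component. The delimiter tokens `{` and `}` are arbitrary choices here.

```python
def delimit_blocks(source: str, open_tok: str = "{", close_tok: str = "}") -> str:
    """Rewrite leading-whitespace block structure into explicit tokens."""
    out, stack = [], [0]
    for line in source.splitlines():
        if not line.strip():                  # blank lines carry no structure
            out.append(line)
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent > stack[-1]:                # deeper indentation opens a block
            stack.append(indent)
            out.append(open_tok)
        while indent < stack[-1]:             # shallower indentation closes blocks
            stack.pop()
            out.append(close_tok)
        out.append(line.strip())
    out.extend(close_tok for _ in stack[1:])  # close blocks still open at EOF
    return "\n".join(out)

print(delimit_blocks("if x:\n    y = 1\n    if z:\n        y = 2\nw = 3"))
```

On the small Python-like snippet above, the pass emits brace-delimited blocks that an ordinary whitespace-insensitive grammar can consume.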
Related papers
- From Separate Compilation to Sound Language Composition [7.697692044735504]
This work introduces nlgcheck, a theoretically sound static analysis tool based on data-flow analysis for the Neverlang language workbench.
nlgcheck detects potential runtime errors -- such as undefined attribute accesses -- at compile time, preserving separate compilation while maintaining strong static correctness guarantees.
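As a rough illustration of the underlying data-flow idea only (nlgcheck itself targets the Neverlang workbench and its implementation is not shown in this summary; the statement encoding below is hypothetical), a single forward pass can flag attribute reads that no earlier statement defines:

```python
def check_attribute_flow(statements):
    """statements: list of ('def', name) or ('use', name) tuples."""
    defined, errors = set(), []
    for i, (kind, name) in enumerate(statements):
        if kind == "def":
            defined.add(name)                 # attribute is defined from here on
        elif kind == "use" and name not in defined:
            errors.append(f"statement {i}: attribute '{name}' may be undefined")
    return errors

print(check_attribute_flow([("use", "type"), ("def", "type"), ("use", "type")]))
# ["statement 0: attribute 'type' may be undefined"]
```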
arXiv Detail & Related papers (2026-02-03T17:38:34Z)
- Evaluating Cross-Lingual Unlearning in Multilingual Language Models [7.530890774798437]
Subspace projection achieves strong cross-lingual forgetting with minimal degradation.
We show that multilingual forgetting depends on geometry in weight space, motivating subspace-based approaches for future unlearning systems.
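A hedged sketch of the projection step (the paper's actual subspace construction is not given in this summary; the basis here is random for illustration): with an orthonormal basis for a "forget" subspace, an update is replaced by its component in the orthogonal complement.

```python
import numpy as np

def project_out(update: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """basis: (d, k) with orthonormal columns spanning the forget subspace."""
    return update - basis @ (basis.T @ update)

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(16, 3)))  # orthonormal (16, 3) basis
update = rng.normal(size=16)
projected = project_out(update, basis)
print(np.allclose(basis.T @ projected, 0))         # True: no forget component left
```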
arXiv Detail & Related papers (2026-01-10T20:27:32Z)
- LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal [34.73949500194166]
Multilingual embeddings encode language identity alongside semantics.
We propose LangSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings.
Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage.
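A minimal sketch of post-hoc sparse-autoencoder editing as the summary describes it (the weights, shapes, and set of language-identity latents below are all assumptions): encode the pooled embedding, zero the latents attributed to language identity, and decode.

```python
import numpy as np

def sae_edit(x, W_enc, b_enc, W_dec, b_dec, language_latents):
    z = np.maximum(W_enc @ x + b_enc, 0.0)    # ReLU encoder -> sparse code
    z[list(language_latents)] = 0.0           # ablate language-identity latents
    return W_dec @ z + b_dec                  # decode the edited embedding

rng = np.random.default_rng(0)
d, k = 8, 32                                  # embedding dim, latent dim
W_enc, W_dec = rng.normal(size=(k, d)), rng.normal(size=(d, k))
edited = sae_edit(rng.normal(size=d), W_enc, np.zeros(k), W_dec, np.zeros(d),
                  language_latents={3, 7})    # hypothetical language latents
print(edited.shape)                           # (8,)
```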
arXiv Detail & Related papers (2026-01-08T09:36:41Z)
- LangSAMP: Language-Script Aware Multilingual Pretraining [48.16511046793275]
We propose Language-Script Aware Multilingual Pretraining (LangSAMP).
LangSAMP incorporates both language and script embeddings to enhance representation learning.
We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages.
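One plausible reading of the mechanism, as a toy sketch (the dimensions and the exact point where the embeddings are injected are assumptions, not details from the paper): token, language, and script embeddings are summed into one representation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
token_emb  = rng.normal(size=(1000, d))   # one vector per vocabulary item
lang_emb   = rng.normal(size=(500, d))    # one vector per language
script_emb = rng.normal(size=(30, d))     # one vector per script

def embed(token_ids, lang_id, script_id):
    return token_emb[token_ids] + lang_emb[lang_id] + script_emb[script_id]

print(embed([3, 17, 42], lang_id=7, script_id=2).shape)  # (3, 8)
```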
arXiv Detail & Related papers (2024-09-26T18:29:10Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
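A toy sketch of the consistent-size idea (MYTE derives its codes from unsupervised morphological analysis; the hand-written table below is purely hypothetical): frequent morphemes map to single code units, so common words take few symbols regardless of their raw byte length.

```python
MORPHEME_CODES = {"ing": 0x100, "tion": 0x101, "un": 0x102}  # hypothetical table

def encode(word: str):
    codes, i = [], 0
    while i < len(word):
        for morph, code in MORPHEME_CODES.items():
            if word.startswith(morph, i):     # greedy morpheme match
                codes.append(code)
                i += len(morph)
                break
        else:
            codes.append(ord(word[i]))        # fall back to raw codepoints
            i += 1
    return codes

print(encode("untying"))  # [0x102, 116, 121, 0x100]: 7 characters, 4 symbols
```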
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Accelerating Multilingual Language Model for Excessively Tokenized Languages [3.5570874721859016]
Tokenizers in large language models (LLMs) often fragment a text into character or Unicode-level tokens in non-Roman alphabetic languages.
We introduce a simple yet effective framework to accelerate text generation in such languages.
arXiv Detail & Related papers (2024-01-19T12:26:57Z)
- Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations [38.56175462620892]
Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer.
We present a novel view of projecting away language-specific factors from a multilingual embedding space.
We show that applying our method consistently leads to improvements over commonly used ML-LMs.
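A hedged sketch of the general recipe (the paper's exact estimator may differ): estimate a low-rank language-identity subspace from centered per-language mean embeddings, then project it out of every sentence embedding.

```python
import numpy as np

def remove_language_subspace(X, lang_ids, rank=2):
    """X: (n, d) embeddings; lang_ids: (n,) integer language labels."""
    means = np.stack([X[lang_ids == l].mean(0) for l in np.unique(lang_ids)])
    means -= means.mean(0)                    # center the per-language means
    _, _, Vt = np.linalg.svd(means, full_matrices=False)
    B = Vt[:rank].T                           # (d, rank) subspace basis
    return X - (X @ B) @ B.T                  # project the subspace away

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 16))
lang_ids = np.repeat(np.arange(4), 10)        # 4 languages, 10 sentences each
print(remove_language_subspace(X, lang_ids).shape)  # (40, 16)
```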
arXiv Detail & Related papers (2024-01-11T09:54:11Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
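A minimal sketch of the projection step as the summary describes it (the parameterization below is an assumption): each language gets its own linear map, applied to the word embeddings before they enter the Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_langs = 8, 4
W = rng.normal(size=(n_langs, d, d)) * 0.1 + np.eye(d)  # per-language projections

def project(word_embs, lang_id):
    """word_embs: (seq_len, d) -> language-specific semantic space."""
    return word_embs @ W[lang_id].T

print(project(rng.normal(size=(5, d)), lang_id=2).shape)  # (5, 8)
```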
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
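Approach (ii) is concrete enough to sketch directly (a minimal, assumed implementation): standardize embeddings per language, removing the language-specific means and variances.

```python
import numpy as np

def standardize_per_language(X, lang_ids, eps=1e-8):
    """X: (n, d) float embeddings; lang_ids: (n,) language labels."""
    X = X.copy()
    for l in np.unique(lang_ids):
        mask = lang_ids == l
        mu, sigma = X[mask].mean(0), X[mask].std(0)
        X[mask] = (X[mask] - mu) / (sigma + eps)  # zero mean, unit variance
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8)) + 5.0            # embeddings with a shared offset
lang_ids = np.repeat([0, 1], 10)
Xs = standardize_per_language(X, lang_ids)
print(np.allclose(Xs[lang_ids == 0].mean(0), 0))  # True: language 0 is centered
```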
arXiv Detail & Related papers (2020-08-20T17:58:56Z)