H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
- URL: http://arxiv.org/abs/2508.05628v1
- Date: Thu, 07 Aug 2025 17:59:01 GMT
- Title: H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
- Authors: Mehrdad Zakershahrak, Samira Ghodratnama
- Abstract summary: H-NET++ is a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results.
- Score: 0.6629765271909505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.
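As a rough illustration of the dynamic-chunking idea in the abstract, the sketch below pairs a byte-level boundary predictor with a small Transformer "context mixer" over pooled chunks. Module names, sizes, and the hard thresholding step are illustrative assumptions, not the released H-NET++ implementation, and the hyper-prior, ZWNJ handling, and curriculum are omitted.

```python
# Toy dynamic chunker: predict per-byte boundary probabilities, pool bytes
# into chunks, then let chunk vectors attend to each other. Not the authors'
# code; sizes are placeholder values.
import torch
import torch.nn as nn

class DynamicChunkerSketch(nn.Module):
    def __init__(self, d_model=256, n_mixer_layers=2):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)            # raw bytes 0-255
        self.byte_encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.boundary_head = nn.Linear(d_model, 1)            # P(chunk ends here)
        mixer_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True)
        self.context_mixer = nn.TransformerEncoder(mixer_layer, n_mixer_layers)

    def forward(self, byte_ids):                              # byte_ids: (T,)
        h, _ = self.byte_encoder(self.byte_emb(byte_ids.unsqueeze(0)))
        h = h.squeeze(0)                                      # (T, d_model)
        p = torch.sigmoid(self.boundary_head(h)).squeeze(-1)  # (T,)
        # Hard thresholding for illustration only; end-to-end training of the
        # boundaries requires a differentiable relaxation of this step.
        ends = (p > 0.5).nonzero(as_tuple=True)[0].tolist()
        if not ends or ends[-1] != len(p) - 1:
            ends.append(len(p) - 1)
        chunks, start = [], 0
        for e in ends:
            chunks.append(h[start:e + 1].mean(dim=0))         # pool each chunk
            start = e + 1
        chunk_seq = torch.stack(chunks).unsqueeze(0)          # (1, n_chunks, D)
        return self.context_mixer(chunk_seq)                  # cross-chunk attention

chunk_states = DynamicChunkerSketch()(torch.randint(0, 256, (128,)))
```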
Related papers
- Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language [0.0]
We present PunGPT2, the first fully open-source suite of Punjabi large language models.
We also present Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever.
We propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching.
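The hybrid sparse-plus-dense retrieval idea can be illustrated with a plain weighted score fusion; the rule below is a generic convex combination over made-up scores, not the paper's quantum-inspired matching.

```python
# Minimal sketch of fusing BM25-style sparse scores with dense similarities.
import numpy as np

def fuse_scores(sparse_scores, dense_scores, alpha=0.5):
    """Min-max normalise each score list, then mix with weight alpha."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(sparse_scores) + (1 - alpha) * norm(dense_scores)

# Example: BM25 scores and dense cosine similarities for 4 candidate documents.
fused = fuse_scores([7.1, 2.3, 0.0, 5.5], [0.62, 0.80, 0.10, 0.55])
top_doc = int(np.argmax(fused))
```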
arXiv Detail & Related papers (2025-08-03T21:03:22Z) - TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models.
Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features.
Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
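The contrastive component referred to here follows the general CLIP recipe; a minimal symmetric InfoNCE-style loss looks like the sketch below, where the embedding size and temperature are placeholder values rather than TULIP's settings.

```python
# Symmetric image-text contrastive (InfoNCE-style) loss sketch.
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(img.size(0))        # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```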
arXiv Detail & Related papers (2025-03-19T17:58:57Z) - Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
Current language models (LMs) use a fixed, static subword tokenizer.
This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English.
We propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text.
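As a toy illustration of input-dependent token boundaries, the snippet below applies a few BPE-style merges computed from the current batch rather than from a fixed merge table; this only sketches the idea and is not the retrofitting procedure proposed in the paper.

```python
# Batch-level token merging: repeatedly merge the most frequent adjacent pair
# observed in the incoming batch itself.
from collections import Counter

def dynamic_merge(batch, n_merges=10):
    """batch: list of token lists (e.g. subwords); returns merged batch."""
    for _ in range(n_merges):
        pairs = Counter()
        for seq in batch:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = []
        for seq in batch:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged.append(out)
        batch = merged
    return batch

print(dynamic_merge([["un", "believ", "able"], ["un", "believ", "ably"]], n_merges=2))
```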
arXiv Detail & Related papers (2024-11-27T17:51:58Z) - Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
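The exponential-moving-average mechanism mentioned above reduces to the standard update sketched below; the decay value and the way the coefficient feeds the higher-order predictor are assumptions for illustration, not the paper's learned scheme.

```python
# Standard EMA update of a running coefficient.
def ema_update(coeff, new_value, decay=0.99):
    """Blend a running coefficient with its latest estimate."""
    return decay * coeff + (1.0 - decay) * new_value

c = 0.0
for estimate in [0.8, 0.85, 0.9, 0.88]:   # per-step coefficient estimates (toy values)
    c = ema_update(c, estimate)
```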
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z) - Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
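Few-shot translation with fuzzy matches can be sketched as retrieving the translation-memory entries most similar to the source sentence and placing them in the prompt; the tiny translation memory, similarity measure, and prompt format below are illustrative assumptions, not the paper's setup.

```python
# Fuzzy-match few-shot prompt construction using a toy translation memory.
import difflib

TM = [("source sentence A", "target sentence A"),
      ("source sentence B", "target sentence B")]   # (source, target) pairs

def build_prompt(src, k=2):
    scored = sorted(TM, key=lambda ex: difflib.SequenceMatcher(
        None, src, ex[0]).ratio(), reverse=True)[:k]
    shots = "\n".join(f"Source: {s}\nTarget: {t}" for s, t in scored)
    return f"{shots}\nSource: {src}\nTarget:"

print(build_prompt("source sentence A, slightly changed"))
```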
arXiv Detail & Related papers (2023-11-24T14:55:23Z) - SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
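SelfSeg itself learns segmentations neurally, but the role of a monolingual dictionary in sub-word segmentation can be illustrated with a standard Viterbi-style split over substring frequencies; the counts below are toy values and the scoring is not SelfSeg's method.

```python
# Dictionary-driven segmentation: pick the split maximising summed log
# frequencies of the pieces.
import math

FREQ = {"un": 500, "believ": 120, "able": 900, "unbeliev": 3}  # toy counts

def segment(word):
    total = sum(FREQ.values())
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)   # (score, backpointer)
    for j in range(1, len(word) + 1):
        for i in range(j):
            piece = word[i:j]
            if piece in FREQ:
                score = best[i][0] + math.log(FREQ[piece] / total)
                if score > best[j][0]:
                    best[j] = (score, i)
    if best[-1][0] == -math.inf:
        return [word]                                  # no segmentation found
    pieces, j = [], len(word)
    while j > 0:
        i = best[j][1]
        pieces.append(word[i:j])
        j = i
    return pieces[::-1]

print(segment("unbelievable"))                         # ['un', 'believ', 'able']
```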
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - MorphPiece : A Linguistic Tokenizer for Large Language Models [3.8073142980733]
I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text.
A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks.
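A hedged sketch of the morphology-first tokenization idea: consult a morpheme table and fall back to a generic subword splitter for unknown words. The table and the fallback below are placeholders, not MorphPiece's actual resources or segmentation rules.

```python
# Morphology-first tokenization with a statistical fallback (toy version).
TOY_MORPH_TABLE = {
    "unhappiness": ["un", "happi", "ness"],
    "walked": ["walk", "ed"],
}

def fallback_subwords(word):
    # Stand-in for a statistical tokenizer such as BPE.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def morph_tokenize(text):
    tokens = []
    for word in text.lower().split():
        tokens.extend(TOY_MORPH_TABLE.get(word, fallback_subwords(word)))
    return tokens

print(morph_tokenize("Unhappiness walked quietly"))
# ['un', 'happi', 'ness', 'walk', 'ed', 'qui', 'etl', 'y']
```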
arXiv Detail & Related papers (2023-07-14T10:35:04Z) - Effects of sub-word segmentation on performance of transformer language models [0.628122931748758]
We compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation.
We show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks.
arXiv Detail & Related papers (2023-05-09T14:30:29Z) - Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows that DeFo significantly improves vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z) - GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
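Sparse expert activation of the kind GLaM uses can be sketched as a top-2 gated mixture-of-experts layer; the layer below uses toy sizes and a naive routing loop for clarity, not GLaM's configuration or an efficient dispatch implementation.

```python
# Top-2 mixture-of-experts layer: each token activates only two experts.
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, d_ff=128):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                        # x: (n_tokens, d_model)
        logits = self.gate(x)                    # (n_tokens, n_experts)
        top_vals, top_idx = logits.topk(2, dim=-1)
        weights = torch.softmax(top_vals, dim=-1)
        out = torch.zeros_like(x)
        # Route each token to its two highest-scoring experts only, so most
        # expert parameters stay inactive for any given token.
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] = out[mask] + weights[mask, k:k + 1] * expert(x[mask])
        return out

y = Top2MoE()(torch.randn(5, 64))                # (5, 64)
```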
arXiv Detail & Related papers (2021-12-13T18:58:19Z)