HalleluBERT: Let every token that has meaning bear its weight
- URL: http://arxiv.org/abs/2510.21372v1
- Date: Fri, 24 Oct 2025 11:52:29 GMT
- Title: HalleluBERT: Let every token that has meaning bear its weight
- Authors: Raphael Scheible-Schmitt
- Abstract summary: We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have advanced NLP, yet Hebrew still lacks an extensively trained, large-scale RoBERTa encoder. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.
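The Hebrew-specific byte-level BPE vocabulary is the component most directly reusable elsewhere. Below is a minimal sketch of training such a tokenizer with the Hugging Face `tokenizers` library; the corpus path and vocabulary size are illustrative assumptions, not details from the paper:

```python
# Sketch: train a byte-level BPE tokenizer on a Hebrew corpus,
# in the spirit of the HalleluBERT vocabulary (details assumed).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["hebrew_corpus.txt"],   # placeholder path
    vocab_size=50_265,             # RoBERTa-style size; assumed, not from the paper
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("hallelubert-tokenizer")  # writes vocab.json + merges.txt
```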
Related papers
- mmBERT: A Modern Multilingual Encoder with Annealed Language Learning [57.58071656545661]
mmBERT is an encoder-only language model pretrained on 3T tokens of multilingual text.
We add over 1700 low-resource languages to the data mix only during the decay phase.
We show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks.
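The annealed schedule, where the ~1700 low-resource languages enter the data mix only during the decay phase, amounts to a phase-dependent sampling distribution. A toy sketch under assumed phase boundaries and uniform weights (not the paper's actual schedule):

```python
import random

HIGH_RESOURCE = ["en", "de", "zh", "ar"]   # illustrative subset
LOW_RESOURCE = ["gv", "kw", "nap"]         # stand-ins for the ~1700 tail languages

def language_weights(step: int, decay_start: int = 900_000) -> dict:
    """Uniform sampling over high-resource languages until the decay
    phase begins, then over the full language set (weights are assumed)."""
    langs = HIGH_RESOURCE if step < decay_start else HIGH_RESOURCE + LOW_RESOURCE
    return {lang: 1.0 / len(langs) for lang in langs}

def sample_language(step: int) -> str:
    weights = language_weights(step)
    return random.choices(list(weights), weights=list(weights.values()))[0]

print(sample_language(step=100))       # only high-resource languages possible
print(sample_language(step=950_000))   # tail languages now in the mix
```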
arXiv Detail & Related papers (2025-09-08T17:08:42Z)
- GeistBERT: Breathing Life into German NLP [0.22099217573031676]
GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus.
The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens.
It achieved strong results across all tasks, leading among base models and setting a new state of the art (SOTA) in GermEval 2018 fine-grained text classification.
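Dynamic masking here means the masked positions are re-drawn each time a sequence is served, rather than frozen at preprocessing time. A minimal sketch using `transformers` (the checkpoint is a stand-in, not GeistBERT's):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in checkpoint
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = [tokenizer("Dynamic masking re-samples masks per epoch.",
                   truncation=True, max_length=512)]
# Each call re-draws the masked positions, so every epoch sees a fresh mask.
print(collator(batch)["input_ids"][0][:12])
print(collator(batch)["input_ids"][0][:12])  # usually differs from the first draw
```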
arXiv Detail & Related papers (2025-06-13T15:53:17Z)
- SuperBPE: Space Travel for Language Models [103.09169510391972]
We introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm.
SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average.
Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks.
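The pretokenization curriculum can be illustrated at toy scale: early merges respect whitespace boundaries, while later merges may cross them and form "superword" tokens. A self-contained sketch (toy corpus and merge counts are illustrative, not the paper's procedure):

```python
from collections import Counter

def bpe_merges(words: list, num_merges: int, cross_space_after: int):
    """Toy BPE over a space-joined corpus. Merges may cross whitespace
    only after `cross_space_after` steps, mimicking SuperBPE's curriculum."""
    seq = list(" ".join(words))
    merges = []
    for step in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if step < cross_space_after:  # stage 1: respect word boundaries
            pairs = Counter({p: c for p, c in pairs.items() if " " not in p})
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return merges, seq

merges, seq = bpe_merges(["in the", "in the", "on the"],
                         num_merges=8, cross_space_after=4)
print(merges)  # later merges span spaces, yielding "superword" tokens
```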
arXiv Detail & Related papers (2025-03-17T17:53:23Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
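The sparsification can be sketched as gathering only the masked positions before an expensive block, so its cost scales with the number of masked tokens rather than the sequence length. A toy PyTorch sketch (shapes and the single FFN block are illustrative; the paper restructures full encoder layers):

```python
import torch

def narrow_ffn(hidden: torch.Tensor, masked_idx: torch.Tensor,
               ffn: torch.nn.Module) -> torch.Tensor:
    """Apply the (expensive) feedforward block only at masked positions,
    leaving other positions untouched -- the core NarrowBERT idea, sketched."""
    # hidden: (batch, seq, dim); masked_idx: (batch, n_masked) long indices
    idx = masked_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    gathered = torch.gather(hidden, 1, idx)   # (batch, n_masked, dim)
    updated = ffn(gathered)                   # cost scales with n_masked, not seq
    return hidden.scatter(1, idx, updated)

ffn = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(),
                          torch.nn.Linear(256, 64))
hidden = torch.randn(2, 128, 64)              # toy batch
masked_idx = torch.randint(0, 128, (2, 19))   # ~15% of 128 positions
print(narrow_ffn(hidden, masked_idx, ffn).shape)  # torch.Size([2, 128, 64])
```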
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
- Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language [3.0663766446277845]
We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel.
Unlike existing Hebrew PLMs, which are trained on modern Hebrew texts that diverge substantially from Rabbinic Hebrew in their lexicographical, morphological, syntactic, and orthographic norms, Berel is trained on Rabbinic texts.
We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs.
arXiv Detail & Related papers (2022-08-03T06:59:04Z)
- fBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
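Since SOLID's labels are semi-supervised aggregate scores, threshold selection is a simple filtering step. A sketch under assumed column names and file layout (not necessarily SOLID's released format):

```python
import pandas as pd

# Sketch: pick training instances from a SOLID-style file by confidence
# threshold. The path and the "average" column name are assumptions.
solid = pd.read_csv("solid_train.tsv", sep="\t")
for threshold in (0.7, 0.8, 0.9):  # candidate thresholds to test
    kept = solid[solid["average"] >= threshold]
    print(f"threshold={threshold}: kept {len(kept)} of {len(solid)} instances")
```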
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- belabBERT: a Dutch RoBERTa-based language model applied to psychiatric classification [0.0]
We present belabBERT, a new Dutch language model extending the RoBERTa architecture.
belabBERT is trained on a large Dutch corpus (+32 GB) of web crawled texts.
We evaluate the strength of text-based classification using belabBERT and compare the results to the existing RobBERT model.
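Such a comparison is a standard sequence-classification fine-tuning loop run once per encoder. A skeletal sketch; the belabBERT path is a placeholder, and the RobBERT ID is its commonly used Hugging Face release:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The RobBERT ID below is its public Hugging Face release; the belabBERT
# path is a placeholder assumption -- substitute the actual checkpoint.
for checkpoint in ["pdelobelle/robbert-v2-dutch-base", "path/to/belabBERT"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2
    )
    # ... fine-tune both on the same split and compare held-out scores ...
```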
arXiv Detail & Related papers (2021-06-02T11:50:49Z)
- AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With [7.345047237652976]
Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology.
While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between.
arXiv Detail & Related papers (2021-04-08T20:51:29Z)
- GottBERT: a pure German Language Model [0.0]
No German single-language RoBERTa model has yet been published; we introduce one in this work (GottBERT).
In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD, with existing German single-language BERT models and two multilingual ones.
GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture.
arXiv Detail & Related papers (2020-12-03T17:45:03Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
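One simple way to expose such translation information, in the spirit of (but not identical to) the paper's methods, is nearest-neighbor search over mBERT word embeddings with no fine-tuning:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def embed(word: str) -> torch.Tensor:
    """Mean-pool the last hidden states of a word's subtokens (no fine-tuning)."""
    with torch.no_grad():
        out = model(**tok(word, return_tensors="pt")).last_hidden_state
    return out[0, 1:-1].mean(dim=0)  # drop [CLS]/[SEP]

en_words = ["dog", "house", "water"]
de_words = ["Wasser", "Hund", "Haus"]
de_vecs = torch.stack([embed(w) for w in de_words])
for w in en_words:
    sims = torch.nn.functional.cosine_similarity(embed(w)[None, :], de_vecs)
    print(w, "->", de_words[int(sims.argmax())])  # nearest German neighbor
```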
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
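As one illustration of an explicit alignment objective on parallel data, a sentence-level contrastive loss pulls each source sentence toward its translation against in-batch negatives (a generic sketch, not AMBER's exact formulation):

```python
import torch
import torch.nn.functional as F

def sentence_alignment_loss(src_vecs: torch.Tensor, tgt_vecs: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    """Contrastive sentence-level alignment over a batch of parallel pairs:
    the i-th source should match the i-th target against in-batch negatives.
    Illustrates an explicit alignment objective, not AMBER's exact loss."""
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    logits = src @ tgt.T / temperature        # (batch, batch) similarities
    labels = torch.arange(src.size(0))        # diagonal = parallel pairs
    return F.cross_entropy(logits, labels)

# Toy usage with random sentence vectors standing in for encoder outputs.
loss = sentence_alignment_loss(torch.randn(8, 768), torch.randn(8, 768))
print(float(loss))
```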
arXiv Detail & Related papers (2020-10-15T18:34:13Z)