On the Effect of (Near) Duplicate Subwords in Language Modelling
- URL: http://arxiv.org/abs/2404.06508v3
- Date: Wed, 17 Jul 2024 17:39:39 GMT
- Title: On the Effect of (Near) Duplicate Subwords in Language Modelling
- Authors: Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel
- Abstract summary: We study the impact of near duplicate subwords on LM training efficiency.
We find that LMs need roughly 17% more data when trained in a fully duplicated setting.
Although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to less sample-efficient LM training: because it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound on how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
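As a rough illustration of the fully duplicated setting described in the abstract, the sketch below (our own approximation, not the authors' code) clones every subword id i into a pair {i, i + |V|} and remaps each occurrence in the data uniformly at random to one member of its pair, producing perfectly equivalent classes of subwords:

```python
import random

def duplicate_vocabulary(vocab_size: int, token_ids: list[int], seed: int = 0) -> tuple[int, list[int]]:
    """Simulate the 'fully duplicated' setting: every subword i gets a
    perfectly equivalent twin i + vocab_size, and each occurrence in the
    corpus is remapped to one of the two ids uniformly at random."""
    rng = random.Random(seed)
    remapped = [t + vocab_size * rng.randint(0, 1) for t in token_ids]
    return 2 * vocab_size, remapped

# Toy usage: a 5-token "corpus" over a 3-subword vocabulary.
new_size, corpus = duplicate_vocabulary(3, [0, 1, 2, 1, 0])
print(new_size)  # 6
print(corpus)    # e.g. [3, 1, 5, 4, 0]; ids t and t + 3 are interchangeable
```

Because the two ids in each pair carry identical information, the extra ~17% of data the duplicated model needs can be read as an upper bound on what perfect generalisation across near duplicates could save.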
Related papers
- Language Models are Symbolic Learners in Arithmetic [8.34588487873447]
Large Language Models (LLMs) are thought to struggle with arithmetic learning due to inherent differences between language modeling and numerical computation.
We first investigate whether LLMs leverage partial products during arithmetic learning.
We find that although LLMs can identify some partial products after learning, they fail to leverage them for arithmetic tasks.
arXiv Detail & Related papers (2024-10-21T01:57:16Z) - From Tokens to Words: On the Inner Lexicon of LLMs [7.148628740938674]
Natural language is composed of words, but modern LLMs process sub-words as input.
We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations.
Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
arXiv Detail & Related papers (2024-10-08T09:53:35Z) - Bridging the Gap between Different Vocabularies for LLM Ensemble [10.669552498083709]
Vocabulary discrepancies among various large language models (LLMs) have constrained previous studies.
We propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA).
EVA bridges the lexical gap among various LLMs, enabling fine-grained ensembling at each generation step.
arXiv Detail & Related papers (2024-04-15T06:28:20Z) - BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models [2.2863439039616127]
Probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training.
Previous approaches rely on the objective function used in pre-training LMs.
We propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement.
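As a hedged sketch of that idea (the names and the `logprob` callable below are ours, not BEAR's actual interface), candidate answers for a relational query can be ranked by the log-likelihood the LM assigns to each fully instantiated statement:

```python
from typing import Callable, Sequence

def rank_candidates(logprob: Callable[[str], float],
                    template: str,
                    candidates: Sequence[str]) -> list[tuple[str, float]]:
    """Score each candidate by the LM's log-likelihood for the full
    statement, then sort from most to least probable."""
    scored = [(c, logprob(template.format(answer=c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical usage: `logprob` would wrap any causal or masked LM's
# sentence-level log-likelihood estimate.
# rank_candidates(logprob, "The capital of France is {answer}.",
#                 ["Paris", "Berlin", "Madrid"])
```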
arXiv Detail & Related papers (2024-04-05T14:13:55Z) - Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems [76.69936664916061]
We study how the number of LM calls affects the performance of Vote and Filter-Vote.
We find, surprisingly, that across multiple language tasks, the performance of both Vote and Filter-Vote can first increase but then decrease as a function of the number of LM calls.
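For reference, the two strategies can be sketched roughly as follows (our reading of them; `call_lm` and `keep` are placeholder callables, not the paper's code): Vote takes a majority over k independent LM calls, and Filter-Vote drops answers that fail a filter before voting.

```python
from collections import Counter
from typing import Callable

def vote(call_lm: Callable[[str], str], prompt: str, k: int) -> str:
    """'Vote': query the LM k times and return the most common answer."""
    answers = [call_lm(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def filter_vote(call_lm: Callable[[str], str], keep: Callable[[str], bool],
                prompt: str, k: int) -> str:
    """'Filter-Vote': discard answers failing the filter, then vote."""
    answers = [a for a in (call_lm(prompt) for _ in range(k)) if keep(a)]
    return Counter(answers).most_common(1)[0][0] if answers else ""
```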
arXiv Detail & Related papers (2024-03-04T19:12:48Z) - MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models [64.70153487607172]
Language Models (LMs) have shown impressive performance in various natural language tasks.
When it comes to natural language reasoning, LMs still face challenges such as hallucination, generating incorrect intermediate reasoning steps, and making mathematical errors.
Recent research has focused on enhancing LMs through self-improvement using feedback.
In this work, we propose Multi-Aspect Feedback, an iterative refinement framework that integrates multiple feedback modules, including frozen LMs and external tools, each focusing on a specific error category.
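A minimal, generic sketch of such an iterative multi-aspect loop (all callables are placeholders; the paper's actual feedback modules, prompts, and stopping criteria are not reproduced here):

```python
from typing import Callable, Sequence

def iterative_refine(generate: Callable[[str], str],
                     feedback_modules: Sequence[Callable[[str], str]],
                     refine: Callable[[str, list[str]], str],
                     task: str,
                     max_iters: int = 3) -> str:
    """Draft an answer, gather one piece of feedback per module (each
    targeting a different error category), revise, and repeat."""
    answer = generate(task)
    for _ in range(max_iters):
        feedback = [module(answer) for module in feedback_modules]
        answer = refine(answer, feedback)
    return answer
```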
arXiv Detail & Related papers (2023-10-19T02:32:39Z) - You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM [65.74934004876914]
Retrieval-enhanced language models (LMs) condition their predictions on text retrieved from large external datastores.
One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model.
We empirically measure the effectiveness of our approach on two English language modeling datasets.
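The interpolation step itself is simple; the sketch below assumes the standard $k$NN-LM formulation p(y|x) = λ·p_kNN(y|x) + (1−λ)·p_LM(y|x), and leaves out how p_kNN is built from retrieved datastore neighbours:

```python
import numpy as np

def knn_lm_interpolate(p_lm: np.ndarray, p_knn: np.ndarray, lam: float) -> np.ndarray:
    """Mix the base LM's next-token distribution with the kNN distribution:
    p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)."""
    assert p_lm.shape == p_knn.shape
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy usage over a 3-token vocabulary with lam = 0.25.
p_lm = np.array([0.7, 0.2, 0.1])
p_knn = np.array([0.1, 0.8, 0.1])
print(knn_lm_interpolate(p_lm, p_knn, 0.25))  # roughly [0.55, 0.35, 0.1]
```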
arXiv Detail & Related papers (2022-10-28T02:57:40Z) - Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models [106.65127123304842]
Branch-Train-Merge (BTM) is an efficient algorithm for parallel training of large language models (LLMs).
BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain.
Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs.
arXiv Detail & Related papers (2022-08-05T17:46:38Z) - Sort by Structure: Language Model Ranking as Dependency Probing [25.723591566201343]
Making an informed choice of pre-trained language model (LM) is critical for performance, yet environmentally costly, and as such widely underexplored.
We propose probing to rank LMs, specifically for parsing dependencies in a given language, by measuring the degree to which labeled trees are recoverable from an LM's contextualized embeddings.
Across 46 typologically and architecturally diverse LM-language pairs, our approach predicts the best LM choice 79% of the time using orders of magnitude less compute than full training.
arXiv Detail & Related papers (2022-06-10T08:10:29Z) - Language Model Prior for Low-Resource Neural Machine Translation [85.55729693003829]
We propose a novel approach to incorporate an LM as a prior in a neural translation model (TM).
We add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior.
Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
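A minimal sketch of such a regulariser, assuming it takes the form of a KL divergence from the TM's output distribution to the frozen LM prior, added to the usual cross-entropy loss (the paper's exact formulation, e.g. temperature scaling, may differ):

```python
import torch
import torch.nn.functional as F

def tm_loss_with_lm_prior(tm_logits: torch.Tensor, lm_logits: torch.Tensor,
                          targets: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Translation loss plus a term pushing the TM's output distribution
    to stay probable under the (frozen) LM prior.
    Shapes: logits are (batch, seq, vocab); targets are (batch, seq)."""
    ce = F.cross_entropy(tm_logits.transpose(1, 2), targets)  # standard NLL
    # KL(P_TM || P_LM): penalise TM probability mass the LM finds unlikely.
    kl = F.kl_div(F.log_softmax(lm_logits, dim=-1),
                  F.softmax(tm_logits, dim=-1),
                  reduction="batchmean")
    return ce + lam * kl
```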
arXiv Detail & Related papers (2020-04-30T16:29:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.