Making Hard Problems Easier with Custom Data Distributions and Loss Regularization: A Case Study in Modular Arithmetic
- URL: http://arxiv.org/abs/2410.03569v2
- Date: Mon, 25 Aug 2025 16:43:03 GMT
- Title: Making Hard Problems Easier with Custom Data Distributions and Loss Regularization: A Case Study in Modular Arithmetic
- Authors: Eshika Saxena, Alberto Alfarano, François Charton, Zeyuan Allen-Zhu, Emily Wenger, Kristin Lauter
- Abstract summary: We develop techniques that significantly boost the performance of ML models on modular arithmetic tasks. Our core innovation is the use of custom training data distributions and a carefully designed loss function. Our techniques also help ML models learn other well-studied problems better, including copy, associative recall, and parity.
- Score: 30.93087957720688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work showed that ML-based attacks on Learning with Errors (LWE), a hard problem used in post-quantum cryptography, outperform classical algebraic attacks in certain settings. Although promising, ML attacks struggle to scale to more complex LWE settings. Prior work connected this issue to the difficulty of training ML models to do modular arithmetic, a core feature of the LWE problem. To address this, we develop techniques that significantly boost the performance of ML models on modular arithmetic tasks, enabling the models to sum up to $N=128$ elements modulo $q \le 974269$. Our core innovation is the use of custom training data distributions and a carefully designed loss function that better represents the problem structure. We apply an initial proof of concept of our techniques to LWE specifically and find that they allow recovery of 2x harder secrets than prior work. Our techniques also help ML models learn other well-studied problems better, including copy, associative recall, and parity, motivating further study.
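As a concrete illustration of the task in the abstract, the sketch below frames the problem as predicting $(a_1 + \dots + a_N) \bmod q$ for $N=128$ and a large modulus, and shows one plausible way to realize a "custom training data distribution" by controlling how sparse the summands are. The density schedule is an illustrative assumption; the paper's actual distribution and loss function are not reproduced here.

```python
import numpy as np

# Sketch of the modular-sum task: predict (a_1 + ... + a_N) mod q for
# vectors of N = 128 elements. The `density` knob is an illustrative way
# to skew the training distribution toward sparser sums; it is not the
# paper's actual distribution or loss.
N, Q = 128, 974269

def sample_batch(batch_size, density, rng):
    """Draw (inputs, targets) where roughly `density` of entries are nonzero."""
    a = rng.integers(0, Q, size=(batch_size, N))
    keep = rng.random((batch_size, N)) < density
    a = a * keep
    return a, a.sum(axis=1) % Q

rng = np.random.default_rng(0)
# Example curriculum: start on sparse sums, anneal toward dense ones.
for density in (0.1, 0.5, 1.0):
    x, y = sample_batch(4, density, rng)
    print(density, y)
```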
Related papers
- LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations [5.275682987885503]
We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks. We show that models encode a model-specific notion of difficulty that is distinct from human difficulty. We demonstrate that routing queries across a pool of models can exceed the best-performing model.
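A linear probe of the kind this summary describes can be sketched in a few lines; everything below (activation shapes, synthetic labels) is illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins: `activations` for hidden states captured before
# generation, `succeeded` for per-prompt success labels on math/coding tasks.
rng = np.random.default_rng(0)
activations = rng.standard_normal((500, 768))
succeeded = (activations[:, 0] + 0.1 * rng.standard_normal(500)) > 0

probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:400], succeeded[:400])
print("held-out probe accuracy:", probe.score(activations[400:], succeeded[400:]))
```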
arXiv Detail & Related papers (2026-02-10T15:57:00Z) - DéjàQ: Open-Ended Evolution of Diverse, Learnable and Verifiable Problems [19.381443841718596]
We introduce DéjàQ, a framework that evolves a diverse set of synthetic mathematical problems alongside model training. This evolutionary process adapts to the model's ability throughout training, optimising problems for learnability. We find that the model can generate novel and meaningful problems, and that these LLM-driven mutations improve RL training.
arXiv Detail & Related papers (2026-01-05T09:27:49Z) - Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.497499123166804]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z) - Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent [15.291830857281015]
We investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods.
Our analysis demonstrates that even after $\mathrm{poly}(d)$ gradient queries, the generalization error of the Transformer model still remains substantially large.
arXiv Detail & Related papers (2025-04-07T03:08:12Z) - Provably Overwhelming Transformer Models with Designed Inputs [0.0]
We say that $\mathcal{M}$ is "overwhelmed" by $s$ when the output of the model evaluated on this string plus any additional string $t$, $\mathcal{M}(s + t)$, is completely insensitive to the value of the string $t$ whenever $\mathrm{length}(t) \leq n_{\mathrm{free}}$.
We prove a particularly strong worst-case form of "over-squashing", which we use to bound the model's behavior.
arXiv Detail & Related papers (2025-02-09T21:21:57Z) - The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z) - Can a Large Language Model Learn Matrix Functions In Context? [3.7478782183628634]
Large Language Models (LLMs) have demonstrated the ability to solve complex tasks through In-Context Learning (ICL)
This paper explores the capacity of LLMs to solve non-linear numerical computations, with specific emphasis on functions of the Singular Value Decomposition.
arXiv Detail & Related papers (2024-11-24T00:33:43Z) - Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^2$ or $10^3$ system across training data scale.
We identify that base $100$ and base $1000$ systems struggle on token-level discernment and token-level operations.
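For reference, the numeral systems compared here differ only in how an integer is split into digit tokens; a minimal sketch of that split (my own illustration, not the paper's tokenizer):

```python
def to_base(n: int, base: int) -> list[int]:
    """Digits of n in the given base, most significant first."""
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits[::-1] or [0]

# The same number becomes fewer tokens drawn from a larger vocabulary
# as the base grows from 10 to 10^2 to 10^3.
for b in (10, 100, 1000):
    print(b, to_base(987654, b))
```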
arXiv Detail & Related papers (2024-09-25T22:08:31Z) - MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
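The low-rank compression that this summary says can harm task-specific performance is easy to state concretely; the sketch below compresses a fine-tuning delta with a truncated SVD. This is the generic low-rank baseline, not Delta-CoMe's mixed-precision method.

```python
import numpy as np

def low_rank_delta(w_finetuned, w_base, rank):
    """Keep only the top `rank` singular directions of the fine-tuning delta."""
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]  # two small factors replace the full delta

rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 64))
w_ft = w_base + 0.01 * rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
a, b = low_rank_delta(w_ft, w_base, rank=8)
print(np.allclose(a @ b, w_ft - w_base, atol=1e-8))  # exact here because the delta is rank 8
```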
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - RKLD: Reverse KL-Divergence-based Knowledge Distillation for Unlearning Personal Information in Large Language Models [23.91608718129775]
We propose RKLD, a novel Reverse KL-Divergence-based Knowledge Distillation unlearning algorithm for large language models (LLMs).
We achieve significant forget quality and effectively maintain the model utility in our experiments.
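The summary does not spell out the objective, but the "reverse" direction of the KL divergence it builds on can be written down directly: it averages the log-ratio under the student's own distribution, KL(student || teacher), rather than the teacher-weighted forward KL used in standard distillation. A minimal sketch of that quantity (how RKLD applies it for unlearning is defined in the paper, not here):

```python
import torch
import torch.nn.functional as F

def reverse_kl(student_logits, teacher_logits):
    """KL(p_student || p_teacher), averaged over the batch."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1).mean()

student = torch.randn(2, 10)
teacher = torch.randn(2, 10)
print(reverse_kl(student, teacher))
```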
arXiv Detail & Related papers (2024-06-04T05:51:43Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model which has far fewer parameters, and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z) - Salsa Fresca: Angular Embeddings and Pre-Training for ML Attacks on Learning With Errors [10.800552110718714]
Learning with Errors (LWE) is a hard math problem underlying post-quantum cryptography systems for key exchange and digital signatures.
Prior work proposed new machine learning (ML)-based attacks on LWE problems with small, sparse secrets, but these attacks require millions of LWE samples to train on and take days to recover secrets.
We propose three key methods -- better preprocessing, angular embeddings and model pre-training -- to improve these attacks.
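Of the three methods listed, the angular embedding is the most self-contained to illustrate: the usual reading (and my assumption here) is that each residue mod $q$ is placed on the unit circle, so that $q-1$ and $0$ embed close together, matching the wrap-around structure of modular arithmetic. The exact embedding used in the paper may differ.

```python
import numpy as np

def angular_embedding(a, q):
    """Map residues mod q to (cos, sin) points on the unit circle."""
    theta = 2 * np.pi * (np.asarray(a) % q) / q
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

q = 3329
print(angular_embedding([0, 1, q - 1], q))  # 0 and q-1 land next to each other
```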
arXiv Detail & Related papers (2024-02-02T00:48:27Z) - In-Context Learning Creates Task Vectors [40.990432572831885]
In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm.
Here we show that the functions learned by ICL often have a very simple structure.
We support the above claim via comprehensive experiments across a range of models and tasks.
arXiv Detail & Related papers (2023-10-24T15:17:14Z) - Length Generalization in Arithmetic Transformers [41.62455986786115]
We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
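Train set priming, as summarized, is purely a data-mixing recipe; the sketch below builds a training set of 5-digit by 3-digit products and seeds it with a few 35-digit by 3-digit examples. The string encoding of each example is an illustrative placeholder, not the paper's tokenization.

```python
import random

def example(n_digits_a, n_digits_b, rng):
    a = rng.randrange(10 ** (n_digits_a - 1), 10 ** n_digits_a)
    b = rng.randrange(10 ** (n_digits_b - 1), 10 ** n_digits_b)
    return f"{a}*{b}=", str(a * b)

def primed_train_set(n_short=100_000, n_long=50, seed=0):
    """Mostly short multiplications plus a handful of long 'priming' examples."""
    rng = random.Random(seed)
    data = [example(5, 3, rng) for _ in range(n_short)]
    data += [example(35, 3, rng) for _ in range(n_long)]
    rng.shuffle(data)
    return data

print(primed_train_set(n_short=5, n_long=2))
```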
arXiv Detail & Related papers (2023-06-27T11:53:25Z) - Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z) - On the Provable Generalization of Recurrent Neural Networks [7.115768009778412]
We analyze the training and generalization for Recurrent Neural Network (RNN)
We prove a generalization error bound to learn functions without normalized conditions.
We also prove a novel result to learn N-variables functions of input sequence.
arXiv Detail & Related papers (2021-09-29T02:06:33Z) - Halving the width of Toffoli based constant modular addition to n+3 qubits [69.43216268165402]
We present an arithmetic circuit performing constant modular addition having $\mathcal{O}(n)$ depth of Toffoli gates.
This is an improvement by a factor of two compared to the width of the state-of-the-art Toffoli-based constant modular adder.
arXiv Detail & Related papers (2021-02-06T17:07:48Z) - Improving Robustness and Generality of NLP Models Using Disentangled
Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z) - On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t\, C(\mathcal{F})$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z) - On the Modularity of Hypernetworks [103.1147622394852]
We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
arXiv Detail & Related papers (2020-02-23T22:51:52Z)