Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
- URL: http://arxiv.org/abs/2602.22617v1
- Date: Thu, 26 Feb 2026 04:45:07 GMT
- Title: Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
- Authors: Hai Huang, Yann LeCun, Randall Balestriero
- Abstract summary: We introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. We show this constraint improves signal-to-noise ratio and preserves diversity by preventing trajectory collisions during inference. We demonstrate that geometric priors can surpass brute-force scaling.
- Score: 50.494504099850325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) obey consistent scaling laws -- empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws -- which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves signal-to-noise ratio, and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16$\times$ less training data on the NL-RX-SYNTH dataset, directly violating the data term of Chinchilla-style scaling laws and demonstrating that principled geometric priors can surpass brute-force scaling. Code is available at https://github.com/galilai-group/llm-jepa#stp.
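To make the tubular-neighborhood constraint concrete, here is a minimal sketch of what an STP-style regularizer could look like, assuming the local geodesic is approximated by the straight chord between the first and last hidden states of a span. The function name `tube_regularizer`, the fixed tube radius, the quadratic hinge, and the 0.1 weighting are illustrative assumptions, not the authors' released loss (see the linked repository for the actual implementation).

```python
# A minimal sketch (not the paper's code) of a tube-style regularizer:
# penalize hidden states that drift outside a tube of fixed radius around
# the straight chord between the trajectory endpoints, which stands in for
# the locally linear geodesic of the Geodesic Hypothesis.
import torch


def tube_regularizer(hidden: torch.Tensor, radius: float = 0.1) -> torch.Tensor:
    """hidden: (batch, seq_len, dim) hidden-state trajectory from the LM."""
    start, end = hidden[:, :1, :], hidden[:, -1:, :]          # trajectory endpoints
    t = torch.linspace(0.0, 1.0, hidden.size(1), device=hidden.device)
    chord = start + t.view(1, -1, 1) * (end - start)          # linear geodesic proxy
    dist = (hidden - chord).norm(dim=-1)                      # distance of each state to the chord
    return torch.clamp(dist - radius, min=0.0).pow(2).mean()  # quadratic hinge outside the tube


# Hypothetical usage alongside the usual next-token loss:
# loss = ce_loss + 0.1 * tube_regularizer(hidden_states)
```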
Related papers
- Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training [46.54209378000497]
Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM.
We propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss.
Our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
arXiv Detail & Related papers (2025-12-25T05:40:46Z)
- Relative Scaling Laws for LLMs [91.73497548097775]
Scaling laws describe how language models improve with additional data, parameters, and compute.
We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale.
These results show that although scaling improves overall performance, it is not a universal equalizer.
arXiv Detail & Related papers (2025-10-28T16:55:22Z)
- Bayesian scaling laws for in-context learning [85.34114399339741]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner, which gives rise to a novel Bayesian scaling law for ICL.
Our scaling law matches existing scaling laws in accuracy while also offering interpretable terms for task priors, learning efficiency, and per-example probabilities.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Temporal Scaling Law for Large Language Models [70.74571133406958]
We propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up.
In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position.
We derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law.
arXiv Detail & Related papers (2024-04-27T05:49:11Z)
- Scaling Laws for Downstream Task Performance in Machine Translation [27.278023091494507]
We study how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by metrics such as BLEU and COMET scores.
With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data.
arXiv Detail & Related papers (2024-02-06T17:31:20Z)
- Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT); a hedged sketch of the general idea appears after this list.
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
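As a companion to the SIFT entry above, here is a small illustrative sketch of gradient-sparse fine-tuning: keep only the largest-magnitude gradient entries in each parameter tensor and zero the rest before the optimizer step. The top-k-by-magnitude criterion, the `sparsify_gradients` name, and the 1% density are assumptions made for illustration, not the paper's exact selection rule.

```python
# Illustrative sketch of sparse fine-tuning in the spirit of SIFT (assumed
# top-k-by-magnitude selection, not the paper's exact rule): zero all but a
# small fraction of gradient entries in each tensor before the optimizer step.
import torch


def sparsify_gradients(model: torch.nn.Module, density: float = 0.01) -> None:
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad
        k = max(1, int(density * grad.numel()))
        # the k-th largest absolute value serves as the keep/zero threshold
        threshold = grad.abs().flatten().kthvalue(grad.numel() - k + 1).values
        grad.mul_((grad.abs() >= threshold).to(grad.dtype))


# Usage inside a fine-tuning loop:
#   loss.backward()
#   sparsify_gradients(model)
#   optimizer.step(); optimizer.zero_grad()
```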