You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
- URL: http://arxiv.org/abs/2510.10223v1
- Date: Sat, 11 Oct 2025 14:00:39 GMT
- Title: You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
- Authors: Yijie Xu, Huizai Yao, Zhiyu Guo, Weiyu Guo, Pengteng Li, Aiwei Liu, Xuming Hu, Hui Xiong
- Abstract summary: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision.
- Score: 50.54173262572369
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.
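The abstract does not give SyTTA's adaptation rule, but the two uncertainty signals it couples are standard quantities. A minimal sketch of computing them (the helper names and the toy log-probabilities/distribution below are illustrative, not taken from the paper):

```python
import math

def perplexity(token_logprobs):
    # Input-side signal: perplexity of the prompt under the model.
    # High values suggest mismatch with domain-specific terminology.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def predictive_entropy(probs):
    # Output-side signal: entropy (in nats) of the next-token distribution.
    # High values indicate diffuse, unstable token probabilities.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical log-probabilities for a 4-token prompt
prompt_logprobs = [-1.2, -0.3, -2.5, -0.8]
ppl = perplexity(prompt_logprobs)

# Hypothetical next-token distribution over a tiny 4-word vocabulary
next_token_probs = [0.4, 0.3, 0.2, 0.1]
ent = predictive_entropy(next_token_probs)

print(f"input perplexity: {ppl:.3f}")
print(f"output entropy: {ent:.3f} nats")
```

How the framework combines these two signals into an adaptation update is not described in the abstract; the sketch only shows the quantities themselves.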
Related papers
- Zero-Shot Anomaly Detection with Dual-Branch Prompt Selection [17.263625932911534]
Zero-shot anomaly detection (ZSAD) enables identifying and localizing defects in unseen categories. Existing ZSAD methods, whether using fixed or learned prompts, struggle under domain shifts because their training data are derived from limited training domains. We introduce PILOT, a framework designed to overcome these challenges through two key innovations.
arXiv Detail & Related papers (2025-08-01T17:00:12Z) - Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding [55.2480439325792]
In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
arXiv Detail & Related papers (2025-04-29T06:33:13Z) - Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation [55.73341401764367]
We introduce DCSQE, a novel framework for alleviating distribution shift in synthetic QE data. DCSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes. Experiments demonstrate that DCSQE outperforms SOTA baselines in both supervised and unsupervised settings.
arXiv Detail & Related papers (2025-02-27T10:11:53Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Unsupervised Domain Adaptation for Action Recognition via Self-Ensembling and Conditional Embedding Alignment [2.06242362470764]
We propose a novel joint optimization architecture comprised of three functions: consistency regularizer, temporal ensemble and conditional distribution alignment.
μDAR results in an average macro-F1 score improvement of approximately 4-12% over six state-of-the-art UDA methods in four benchmark wHAR datasets.
arXiv Detail & Related papers (2024-10-23T00:59:27Z) - Adaptive Test-Time Personalization for Federated Learning [51.25437606915392]
We introduce a novel setting called test-time personalized federated learning (TTPFL).
In TTPFL, clients locally adapt a global model in an unsupervised way without relying on any labeled data during test-time.
We propose a novel algorithm called ATP to adaptively learn the adaptation rates for each module in the model from distribution shifts among source domains.
arXiv Detail & Related papers (2023-10-28T20:42:47Z) - Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs, only with unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z) - Addressing Distribution Shift at Test Time in Pre-trained Language Models [3.655021726150369]
State-of-the-art pre-trained language models (PLMs) outperform other models when applied to the majority of language processing tasks.
PLMs have been found to degrade in performance under distribution shift.
We present an approach that improves the performance of PLMs at test-time under distribution shift.
arXiv Detail & Related papers (2022-12-05T16:04:54Z) - Bridging Few-Shot Learning and Adaptation: New Challenges of Support-Query Shift [4.374837991804085]
Few-Shot Learning algorithms have made substantial progress in learning novel concepts with just a handful of labelled data.
To classify query instances from novel classes encountered at test-time, they only require a support set composed of a few labelled samples.
In a realistic setting, data distribution is plausibly subject to change, a situation referred to as Distribution Shift (DS).
arXiv Detail & Related papers (2021-05-25T10:10:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.