Related papers: Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention

Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention

URL: http://arxiv.org/abs/2511.06682v1
Date: Mon, 10 Nov 2025 04:01:46 GMT
Title: Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention
Authors: Shibing Mo, Haoyang Ruan, Kai Wu, Jing Liu,
Abstract summary: This paper proposes the Textual Self-Attention Network (TSAN), a new paradigm for test-time preference optimization.<n>TSAN emulates self-attention entirely in natural language to overcome this gap.<n>With just three test-time iterations on a base SFT model, TSAN outperforms supervised models like Llama-3.1-70B-Instruct.
Score: 11.162559089998576
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities, but aligning their outputs with human preferences typically requires expensive supervised fine-tuning. Recent test-time methods leverage textual feedback to overcome this, but they often critique and revise a single candidate response, lacking a principled mechanism to systematically analyze, weigh, and synthesize the strengths of multiple promising candidates. Such a mechanism is crucial because different responses may excel in distinct aspects (e.g., clarity, factual accuracy, or tone), and combining their best elements may produce a far superior outcome. This paper proposes the Textual Self-Attention Network (TSAN), a new paradigm for test-time preference optimization that requires no parameter updates. TSAN emulates self-attention entirely in natural language to overcome this gap: it analyzes multiple candidates by formatting them into textual keys and values, weighs their relevance using an LLM-based attention module, and synthesizes their strengths into a new, preference-aligned response under the guidance of the learned textual attention. This entire process operates in a textual gradient space, enabling iterative and interpretable optimization. Empirical evaluations demonstrate that with just three test-time iterations on a base SFT model, TSAN outperforms supervised models like Llama-3.1-70B-Instruct and surpasses the current state-of-the-art test-time alignment method by effectively leveraging multiple candidate solutions.

Related papers

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
textbfRAPO++ is a cross-stage prompt optimization framework.<n>It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning.<n> RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z)
Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization [6.3914079241545885]
We present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval-augmented reasoning process.<n>Our approach retrieves top k reference prompt-response pairs from the HelpSteer2 dataset.<n>By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail.
arXiv Detail & Related papers (2025-09-02T08:45:29Z)
Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.<n>GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution.<n>We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
Test-Time Alignment for Large Language Models via Textual Model Predictive Control [63.508812485566374]
Textual Model Predictive Control (TMPC) is a novel predictive planning framework adapted for aligning Large Language Models at inference time.<n>TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis.<n>Results demonstrate that TMPC consistently improves performance, highlighting the generality.
arXiv Detail & Related papers (2025-02-28T07:24:33Z)
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback [40.01227095901647]
Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining.<n>We introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference.<n>Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly.
arXiv Detail & Related papers (2025-01-22T14:15:46Z)
Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning.<n>SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT.<n>SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)
In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.<n>To handle these challenges, a direct solution is to generate high-confidence'' data from unsupervised downstream tasks.<n>We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
Enhancing LLM-Based Text Classification in Political Science: Automatic Prompt Optimization and Dynamic Exemplar Selection for Few-Shot Learning [1.6967824074619953]
Large language models (LLMs) offer substantial promise for text classification in political science.<n>Our framework enhances LLM performance through automatic prompt optimization, dynamic exemplar selection, and a consensus mechanism.<n>An open-source Python package (PoliPrompt) is available on GitHub.
arXiv Detail & Related papers (2024-09-02T21:05:31Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.