Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation
- URL: http://arxiv.org/abs/2509.25144v1
- Date: Mon, 29 Sep 2025 17:51:55 GMT
- Title: Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation
- Authors: Yen-Ju Lu, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba,
- Abstract summary: Paired by the Teacher (PbT) is a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. We evaluate PbT on five benchmarks: document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD). An 8B student trained only on PbT data outperforms models trained on 70B teacher-generated corpora.
- Score: 17.879274676067784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Paired by the Teacher (PbT), a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR) and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks: document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD), as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the mismatch limiting direct synthesis.
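The abstract describes the pipeline concretely enough to sketch. Below is a minimal, hypothetical rendering of the two stages in Python; `teacher_compress`, `train_student`, and the returned reconstruction function are stand-ins for the teacher LLM call and the student model, not APIs from the paper.

```python
# Minimal sketch of the PbT pipeline as described in the abstract.
# All helper callables are hypothetical stand-ins, not the paper's code.
from typing import Callable, List, Tuple

def paired_by_the_teacher(
    unpaired_inputs: List[str],     # e.g. articles, dialogues, paragraphs
    unpaired_outputs: List[str],    # e.g. highlights, recaps, questions
    teacher_compress: Callable[[str], str],  # teacher LLM: example -> concise IR
    train_student: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
) -> List[Tuple[str, str]]:
    # Stage 1: the teacher compresses each unpaired input into an IR,
    # and a student learns to invert the mapping (IR -> input).
    ir_to_input = [(teacher_compress(x), x) for x in unpaired_inputs]
    reconstruct_input = train_student(ir_to_input)

    # Stage 2: compress each unpaired output into an IR, then let the
    # student reconstruct a plausible input for it. The result is a
    # synthetic (input, output) pair with a real, in-domain output side.
    pairs = []
    for y in unpaired_outputs:
        x_hat = reconstruct_input(teacher_compress(y))
        pairs.append((x_hat, y))
    return pairs
```

The design point visible even in this sketch is that the output side of every synthetic pair is a real example, so a model trained downstream never has to imitate teacher-written targets.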
Related papers
- AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning [3.5047438945401717]
We release two datasets that support research on neural retrieval-ranking and instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types.
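A small sketch of the two record types the summary implies; the field names are assumptions, not the dataset's published schema.

```python
# Hypothetical record layouts for the two released datasets; field names
# are illustrative assumptions, not the actual schema.
from dataclasses import dataclass

@dataclass
class RetrievalTriplet:
    query: str
    positive_doc: str   # manually verified relevant document
    negative_doc: str   # manually verified non-relevant document

@dataclass
class InstructionExample:
    prompt: str         # Amharic instruction
    response: str       # reference response
    domain: str         # one of the covered domains / instruction types
```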
arXiv Detail & Related papers (2026-02-10T15:45:20Z)
- Expanding the Capabilities of Reinforcement Learning via Text Feedback [49.561885700139676]
We formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. Our results show that both methods consistently outperform strong baselines across benchmarks.
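The two objectives can be stubbed out to make the split concrete. Everything below is an illustrative assumption; the summary gives no implementation detail.

```python
# Illustrative stubs for the two RLTF objectives named above.
# The callables and loss shapes are assumptions, not the paper's code.
from typing import Callable

def rltf_sd_loss(policy_logprob: Callable[[str, str], float],
                 prompt: str, second_turn_generation: str) -> float:
    """Self Distillation: treat the model's own feedback-conditioned
    second-turn generation as a target for the single-turn policy, so
    the improvement survives to feedback-free inference."""
    return -policy_logprob(prompt, second_turn_generation)

def rltf_fm_loss(feedback_logprob: Callable[[str, str, str], float],
                 prompt: str, response: str, feedback: str) -> float:
    """Feedback Modeling: predict the training-time text feedback as an
    auxiliary objective alongside the main RL loss."""
    return -feedback_logprob(prompt, response, feedback)
```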
arXiv Detail & Related papers (2026-02-02T18:56:56Z)
- Synthetic bootstrapped pretraining [52.92577542049469]
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure. SBP first learns a model of relations between documents in the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound.
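A compact sketch of the loop the summary describes; the three helper callables are assumptions standing in for the relation model, the synthesizer, and the pretraining run.

```python
# Hypothetical sketch of the SBP loop: learn inter-document relations,
# synthesize a new corpus from them, then pretrain on real + synthetic.
from typing import Callable, List

def synthetic_bootstrapped_pretraining(
    corpus: List[str],
    fit_relation_model: Callable[[List[str]], Callable[[str], str]],
    pretrain: Callable[[List[str]], object],
    n_synthetic_per_doc: int = 1,
):
    related_of = fit_relation_model(corpus)  # seed doc -> related doc
    synthetic = [related_of(doc)
                 for doc in corpus
                 for _ in range(n_synthetic_per_doc)]
    return pretrain(corpus + synthetic)      # joint training on the union
```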
arXiv Detail & Related papers (2025-09-17T22:28:27Z)
- NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors [0.12499537119440242]
This paper presents our system for Track 1, Mistake Identification, in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's reasoning. Our system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided parsing to produce predictions.
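A rough sketch of that retrieve-prompt-parse flow; the retriever, prompt wording, and JSON schema are all assumptions.

```python
# Hypothetical retrieve -> prompt -> parse flow for the system above.
# Retriever, prompt template, and schema are assumptions.
import json
from typing import Callable, List, Tuple

def identify_mistake(
    exchange: str,                                          # tutor/student exchange
    retrieve: Callable[[str, int], List[Tuple[str, str]]],  # -> (example, label)
    llm: Callable[[str], str],
    k: int = 4,
) -> str:
    shots = retrieve(exchange, k)  # semantically similar labeled examples
    prompt = "\n\n".join(f"Exchange: {ex}\nLabel: {lab}" for ex, lab in shots)
    prompt += ("\n\nExchange: " + exchange +
               '\nReply with JSON only: {"label": "<your label>"}')
    return json.loads(llm(prompt))["label"]  # schema-guided parsing
```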
arXiv Detail & Related papers (2025-06-12T12:11:56Z)
- Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning [80.27561080938747]
CANOE is a framework that improves the contextual faithfulness of large language models across different downstream tasks without human annotations. Dual-GRPO is a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different tasks, even outperforming the most advanced LLMs.
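Rule-based rewards over short-form QA are easy to illustrate; the three rules below are invented for illustration only, not CANOE's actual reward definitions.

```python
# Invented example of three rule-based rewards over synthesized
# short-form QA; NOT the actual Dual-GRPO reward definitions.
def faithfulness_reward(answer: str, context: str, gold: str) -> float:
    a = answer.strip()
    r_correct = 1.0 if a == gold.strip() else 0.0    # matches synthetic gold
    r_grounded = 1.0 if a and a in context else 0.0  # stated in the context
    r_concise = 1.0 if len(a.split()) <= 10 else 0.0 # short-form constraint
    return r_correct + r_grounded + r_concise
```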
arXiv Detail & Related papers (2025-05-22T10:10:07Z)
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
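In a RAG pipeline, query-aware compression slots in between retrieval and reading. A minimal sketch, with the compressor and reader as assumed callables:

```python
# Hypothetical placement of a BRIEF-style compressor in a RAG pipeline:
# retrieved documents are distilled into one query-focused summary
# before the reader LLM sees them. Helper callables are assumptions.
from typing import Callable, List

def rag_with_compression(question: str,
                         retrieve: Callable[[str, int], List[str]],
                         compress: Callable[[str, List[str]], str],
                         reader: Callable[[str], str],
                         k: int = 10) -> str:
    docs = retrieve(question, k)
    evidence = compress(question, docs)  # query-aware, multi-hop summary
    return reader(f"Context: {evidence}\n\nQuestion: {question}")
```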
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
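The summary reports results only; per the title, the mechanism is grounding dataset synthesis in retrieved documents. A speculative sketch of that idea, with every helper assumed:

```python
# Speculative sketch of retrieval-augmented dataset synthesis: ground
# each generated example in a retrieved real document so the synthetic
# set inherits the corpus's diversity. All helpers are assumptions.
from typing import Callable, List, Tuple

def synthesize_with_retrieval(labels: List[str],
                              retrieve: Callable[[str, int], List[str]],
                              llm: Callable[[str], str],
                              k: int = 5) -> List[Tuple[str, str]]:
    examples = []
    for label in labels:
        for doc in retrieve(label, k):   # real in-domain content per class
            text = llm(f"Rewrite the document below as a '{label}' "
                       f"example:\n{doc}")
            examples.append((text, label))
    return examples
```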
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real.
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst-group accuracy over the state-of-the-art by up to 20% across three datasets.
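The two-step schedule is simple enough to write down directly; the trainer helper is an assumption.

```python
# Sketch of the FFR schedule: pretrain on group-balanced synthetic
# images, then fine-tune on real data, never mixing the two phases.
# The `train` helper is an assumed stand-in for a full training loop.
from typing import Callable, Iterable

def from_fake_to_real(model: object,
                      balanced_synthetic: Iterable,  # equal samples per group
                      real_data: Iterable,
                      train: Callable[[object, Iterable], object]) -> object:
    model = train(model, balanced_synthetic)  # step 1: bias-free features
    model = train(model, real_data)           # step 2: adapt to real images
    return model
```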
arXiv Detail & Related papers (2023-08-08T19:52:28Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
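One LETI-style iteration, as the summary describes it, amounts to building concatenated training strings and fine-tuning on them. A minimal sketch with assumed helpers:

```python
# Sketch of one iteration of the LETI loop described above: generate a
# program per instruction, attach textual feedback, fine-tune with the
# LM objective on the concatenation. Helper callables are assumptions.
from typing import Callable, List

def leti_iteration(instructions: List[str],
                   generate: Callable[[str], str],          # LM code generation
                   get_feedback: Callable[[str, str], str], # e.g. test errors
                   finetune: Callable[[List[str]], None]) -> None:
    texts = []
    for nl in instructions:
        program = generate(nl)
        feedback = get_feedback(nl, program)  # pinpoints and explains errors
        texts.append(f"{nl}\n{program}\n{feedback}")
    finetune(texts)  # standard LM objective over the concatenations
```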
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-PseudoLabel pairs and SynthesizedAudio-text pairs, we propose a complementary joint training (CJT) method.
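The two pair types can be constructed as follows; the ASR and TTS callables are assumptions standing in for the actual models.

```python
# Sketch of the two complementary pair types behind CJT: pseudo-label
# unpaired speech with ASR, synthesize audio for unpaired text with TTS,
# then train on both pair types jointly. Helpers are assumptions.
from typing import Callable, List, Tuple

def build_cjt_pairs(unpaired_speech: List[object],
                    unpaired_text: List[str],
                    asr_decode: Callable[[object], str],
                    tts_synthesize: Callable[[str], object],
                    ) -> List[Tuple[object, str]]:
    speech_pseudolabel = [(s, asr_decode(s)) for s in unpaired_speech]
    synthaudio_text = [(tts_synthesize(t), t) for t in unpaired_text]
    return speech_pseudolabel + synthaudio_text  # joint training set
```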
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- Leverage Unlabeled Data for Abstractive Speech Summarization with Self-Supervised Learning and Back-Summarization [6.465251961564605]
Supervised approaches for Neural Abstractive Summarization require large annotated corpora that are costly to build.
We present a French meeting summarization task where reports are predicted based on the automatic transcription of the meeting audio recordings.
We report large improvements compared to the previous baseline for both approaches on two evaluation sets.
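Back-summarization, per the title, mirrors back-translation: a reverse model pairs unlabeled reports with synthetic transcripts. A speculative sketch with assumed helpers:

```python
# Speculative sketch of back-summarization by analogy with
# back-translation: train a reverse (report -> transcript) model on the
# small paired set, then pair unlabeled reports with synthetic
# transcripts to enlarge the corpus. Helpers are assumptions.
from typing import Callable, List, Tuple

def back_summarize(paired: List[Tuple[str, str]],   # (transcript, report)
                   unlabeled_reports: List[str],
                   train_reverse: Callable[[List[Tuple[str, str]]],
                                           Callable[[str], str]],
                   ) -> List[Tuple[str, str]]:
    report_to_transcript = train_reverse([(r, t) for t, r in paired])
    synthetic = [(report_to_transcript(r), r) for r in unlabeled_reports]
    return paired + synthetic  # training set for the forward summarizer
```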
arXiv Detail & Related papers (2020-07-30T08:22:47Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
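The distillation target described above (the syntactic LM's marginal distribution over words in context) suggests a standard KL objective at masked positions. A minimal PyTorch sketch, with tensor shapes assumed:

```python
# Minimal sketch of word-level distribution distillation: match the
# student's masked-token distribution to the syntactic LM's marginal
# over words in context. Shapes are assumptions for illustration.
import torch
import torch.nn.functional as F

def syntactic_distillation_loss(student_logits: torch.Tensor,
                                teacher_probs: torch.Tensor) -> torch.Tensor:
    # Both tensors: (num_masked_positions, vocab_size); the teacher side
    # is already a probability distribution, the student side raw logits.
    log_q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q, teacher_probs, reduction="batchmean")
```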
arXiv Detail & Related papers (2020-05-27T16:44:01Z)