MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
- URL: http://arxiv.org/abs/2510.26345v1
- Date: Thu, 30 Oct 2025 10:52:43 GMT
- Title: MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
- Authors: Mykhailo Poliakov, Nadiya Shvai
- Abstract summary: We investigate the impact of synthetic data generation and fine-tuning techniques on the ability of large language models to recognize fallacious arguments. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines.
- Score: 2.1127261244588156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score improvement of over 35% on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available at https://github.com/mxpoliakov/MisSynth.
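The abstract describes the pipeline only at a high level. Below is a minimal sketch, not the authors' implementation, of a RAG-style synthetic-fallacy generator in the spirit of MisSynth: retrieve the scientific passages most relevant to a claim, then prompt an LLM to write a premise that supports the claim while committing a named fallacy. The fallacy labels, and the `embed` and `llm_generate` callables, are hypothetical stand-ins for whatever embedding model and LLM endpoint are actually used.

```python
# Minimal sketch (assumptions noted above) of a RAG-based synthetic-fallacy
# sample generator. Not the authors' code; illustrative only.
from dataclasses import dataclass
import numpy as np

# Illustrative subset of MISSCI-style fallacy classes (assumed labels).
FALLACY_CLASSES = [
    "False Equivalence",
    "Hasty Generalization",
    "Causal Oversimplification",
]

@dataclass
class SyntheticSample:
    claim: str
    fallacious_premise: str
    fallacy_class: str

def retrieve(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k passages most similar to the query (cosine similarity)."""
    sims = passage_vecs @ query_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]

def make_sample(claim, passages, fallacy_class, embed, llm_generate) -> SyntheticSample:
    """Generate one synthetic fallacy sample grounded in retrieved passages."""
    idx = retrieve(embed(claim), np.stack([embed(p) for p in passages]))
    context = "\n".join(passages[i] for i in idx)
    prompt = (
        f"Scientific context:\n{context}\n\n"
        f"Write a premise that supports the claim '{claim}' but commits the "
        f"fallacy '{fallacy_class}'. Return only the premise."
    )
    return SyntheticSample(claim, llm_generate(prompt), fallacy_class)
```

Samples produced this way would then be formatted as labeled classification examples and used for the lightweight fine-tuning step mentioned in the abstract (for instance with a parameter-efficient method such as LoRA), though the exact fine-tuning setup is not specified here.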
Related papers
- A Technical Exploration of Causal Inference with Hybrid LLM Synthetic Data [3.121656940390038]
Large Language Models (LLMs) offer a flexible means to generate synthetic data. Existing approaches often fail to preserve key causal parameters such as the average treatment effect (ATE).
arXiv Detail & Related papers (2025-10-31T23:34:44Z) - Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls [25.294408301653576]
Training data plays a crucial role in Large Language Model (LLM) scaling, yet high-quality data is in limited supply. We compare natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. We find that pre-training on rephrased synthetic data alone is not faster than pre-training on natural web texts.
arXiv Detail & Related papers (2025-10-02T03:24:42Z) - Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z) - FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline [71.19227942708741]
We introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis. Our FLAMES experiments provide valuable insights about the optimal balance of difficulty and diversity of synthetic data. We develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies.
arXiv Detail & Related papers (2025-08-22T16:37:40Z) - Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text [23.412546862849396]
We assume an adversary has access to some synthetic data generated by a Large Language Model (LLM). We design membership inference attacks (MIAs) that target the training data used to fine-tune the LLM that is then used to synthesize data. We find that canaries crafted for model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released.
arXiv Detail & Related papers (2025-02-19T15:30:30Z) - Why LLMs Are Bad at Synthetic Table Generation (and what to do about it) [11.266896863556124]
Synthetic data generation is integral to ML pipelines, e.g., to augment training data, replace sensitive information, and even power advanced platforms like DeepSeek. While LLMs fine-tuned for synthetic data generation are gaining traction, synthetic table generation remains under-explored compared to text and image synthesis. This paper shows that LLMs, whether used as-is or after traditional fine-tuning, are inadequate for generating synthetic tables.
arXiv Detail & Related papers (2024-06-20T17:52:29Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification [11.6055501181235]
We investigate the use of verification on synthesized data to prevent model collapse.
We show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse.
arXiv Detail & Related papers (2024-06-11T17:46:16Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)