Related papers: CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

URL: http://arxiv.org/abs/2603.00889v1
Date: Sun, 01 Mar 2026 03:23:41 GMT
Title: CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Authors: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng,
Abstract summary: CHIMERA is a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning.<n>It has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy.<n>It achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam.
Score: 44.519834940763964
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.

Related papers

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods [41.49799689399879]
We introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens.<n>The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces.<n>Our models establish new state-of-the-art results for their size class.
arXiv Detail & Related papers (2026-01-29T15:07:28Z)
Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns [34.16978953994544]
We define the foundation model's reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question.<n>We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential.<n>We show that only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025.
arXiv Detail & Related papers (2025-09-25T13:11:35Z)
Excessive Reasoning Attack on Reasoning LLMs [26.52688123765127]
In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors.<n>Our results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance.<n>Our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models.
arXiv Detail & Related papers (2025-06-17T10:16:52Z)
Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning [66.43194385702297]
Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL)<n>We propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks.
arXiv Detail & Related papers (2025-04-15T21:37:13Z)
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation [24.081573908824353]
First-order logic (FOL) reasoning is pivotal for intelligent systems.<n>Existing benchmarks often rely on extensive human annotation or handcrafted templates.<n>We propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models with the rigor and precision of symbolic provers.
arXiv Detail & Related papers (2025-02-10T15:31:54Z)
Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models [62.12031550252253]
We present Path-of-Thoughts (PoT), a novel framework designed to tackle relation reasoning.<n>PoT efficiently extracts a task-agnostic graph that identifies crucial entities, relations, and attributes within the problem context.<n>PoT identifies relevant reasoning chains within the graph corresponding to the posed question, facilitating inference of potential answers.
arXiv Detail & Related papers (2024-12-23T20:27:12Z)
Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method.<n>By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs.<n>Our experiments demonstrate that models trained on our data outperform existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs) Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts. We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.