Fugu-MT Paper Translation (Summary): DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Paper summary: DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

arxiv url: http://arxiv.org/abs/2508.12726v1
Date: Mon, 18 Aug 2025 08:49:29 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:11.088748
Title: DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Title (reference translation): DESIGNER: Design-Logic-Based Multidisciplinary Data Synthesis for LLM Reasoning
Authors: Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Yuchi Xu, Wenbo Su, Bo Zheng
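Abstract summary: We propose DESIGNER, a DESIGN-logic-guidEd Reasoning data synthesis pipeline. Its core innovation is the introduction of the Design Logic concept. By matching these design logics with disciplinary source materials, it can create reasoning questions that far surpass the difficulty and diversity of existing datasets.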
Reference score (independently calculated attention level): 20.498029847124034
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often either lack disciplinary breadth or the structural depth necessary to elicit robust reasoning behaviors. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (book corpus and web corpus) to generate multidisciplinary challenging questions. A core innovation of our approach is the introduction of a Design Logic concept, which mimics the question-creation process of human educators. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with disciplinary source materials, we are able to create reasoning questions that far surpass the difficulty and diversity of existing datasets. Based on this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: Design-Logic-Reasoning-Book (DLR-Book), containing 3.04 million challenging questions synthesized from the book corpus, and Design-Logic-Reasoning-Web (DLR-Web), with 1.66 million challenging questions from the web corpus. Our data analysis demonstrates that the questions synthesized by our method exhibit substantially greater difficulty and diversity than those in the baseline datasets. We validate the effectiveness of these datasets by conducting SFT experiments on the Qwen3-8B-Base and Qwen3-4B-Base models. The results show that our dataset significantly outperforms existing multidisciplinary datasets of the same volume. Training with the full datasets further enables the models to surpass the multidisciplinary reasoning performance of the official Qwen3-8B and Qwen3-4B models.
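Abstract (reference translation): Large language models (LLMs) have achieved remarkable success in many natural language tasks, but they still struggle with complex, multi-step reasoning. Existing reasoning datasets often lack either disciplinary breadth or the structural depth needed to elicit robust reasoning behaviors. We propose DESIGNER, a DESIGN-logic-guidEd Reasoning data synthesis pipeline. The core innovation of this approach is the introduction of a Design Logic concept that mimics the question-creation process of human educators. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with disciplinary source materials, we can create reasoning questions that far surpass the difficulty and diversity of existing datasets. Based on this pipeline, we synthesized two large-scale reasoning datasets spanning 75 disciplines: Design-Logic-Reasoning-Book (DLR-Book) and Design-Logic-Reasoning-Web (DLR-Web). Data analysis shows that the questions synthesized by our method are substantially more difficult and diverse than those in the baseline datasets. We validate the effectiveness of these datasets through SFT experiments on the Qwen3-8B-Base and Qwen3-4B-Base models. The results show that our dataset significantly outperforms existing multidisciplinary datasets of the same volume. Training with the full datasets further enables the models to surpass the multidisciplinary reasoning performance of the official Qwen3-8B and Qwen3-4B models.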
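
Illustrative sketch (not from the paper): the abstract says design logics are matched with disciplinary source materials and then used to create questions, but it does not say how the matching is done. The toy example below assumes a simple token-overlap (Jaccard) score within the same discipline, which is a stand-in and not the authors' method. All names here (DesignLogic, Document, match_logics, build_prompt) are hypothetical, and the resulting prompt would still need to be sent to an LLM of your choice.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DesignLogic:
    """Hypothetical record: an abstracted recipe for constructing a question."""
    discipline: str
    steps: str


@dataclass(frozen=True)
class Document:
    """Hypothetical record: a raw source passage from a book or web corpus."""
    discipline: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in stripped if w}


def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_logics(doc: Document, logics: list[DesignLogic], top_k: int = 1) -> list[DesignLogic]:
    """Rank design logics for a document (assumed heuristic, not the paper's).

    Logics from the document's own discipline are preferred; if there are
    none, every logic is considered.
    """
    same = [l for l in logics if l.discipline == doc.discipline]
    pool = same or list(logics)
    ranked = sorted(pool, key=lambda l: overlap(doc.text, l.steps), reverse=True)
    return ranked[:top_k]


def build_prompt(doc: Document, logic: DesignLogic) -> str:
    """Assemble a question-synthesis prompt from one document and one logic."""
    return (
        f"Discipline: {doc.discipline}\n"
        f"Design logic: {logic.steps}\n"
        f"Source material: {doc.text}\n"
        "Write one challenging multi-step reasoning question that follows "
        "the design logic and is grounded in the source material."
    )


logics = [
    DesignLogic("physics", "Describe a scenario, hide one quantity, require a conservation law"),
    DesignLogic("history", "Present two sources and ask which cause best explains the event"),
]
doc = Document("physics", "Two carts collide; momentum conservation fixes the final velocity.")
print(build_prompt(doc, match_logics(doc, logics)[0]))
```

In a real pipeline the overlap score would likely be replaced by something stronger, such as embedding similarity or an LLM-based judgment; the toy score is used here only to keep the example self-contained.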


Related papers