Fugu-MT Paper Translation (Summary): Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLMs for Text Clustering and Controllable Generation

Paper summary: Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLMs for Text Clustering and Controllable Generation

arxiv url: http://arxiv.org/abs/2509.22211v1
Date: Fri, 26 Sep 2025 11:27:22 GMT
Status: Translation complete
System update date: 2025-09-29 20:57:54.385298
Title: Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLMs for Text Clustering and Controllable Generation
Authors: Tiago Fernandes Tavares
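Abstract summary: The paper introduces Recursive Thematic Partitioning (RTP), which interactively builds a binary tree. Each node in the tree is a natural language question that semantically partitions the data, yielding a fully interpretable taxonomy. The authors show that RTP's question-driven hierarchy is more interpretable than the keyword-based topics from a strong baseline like BERTopic.
Reference score (attention score computed independently by this site): 1.3750624267664158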
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Unsupervised analysis of text corpora is challenging, especially in data-scarce domains where traditional topic models struggle. While these models offer a solution, they typically describe clusters with lists of keywords that require significant manual effort to interpret and often lack semantic coherence. To address this critical interpretability gap, we introduce Recursive Thematic Partitioning (RTP), a novel framework that leverages Large Language Models (LLMs) to interactively build a binary tree. Each node in the tree is a natural language question that semantically partitions the data, resulting in a fully interpretable taxonomy where the logic of each cluster is explicit. Our experiments demonstrate that RTP's question-driven hierarchy is more interpretable than the keyword-based topics from a strong baseline like BERTopic. Furthermore, we establish the quantitative utility of these clusters by showing they serve as powerful features in downstream classification tasks, particularly when the data's underlying themes correlate with the task labels. RTP introduces a new paradigm for data exploration, shifting the focus from statistical pattern discovery to knowledge-driven thematic analysis. Furthermore, we demonstrate that the thematic paths from the RTP tree can serve as structured, controllable prompts for generative models. This transforms our analytical framework into a powerful tool for synthesis, enabling the consistent imitation of specific characteristics discovered in the source corpus.
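The abstract describes a binary tree whose nodes are natural language questions and whose root-to-leaf paths can be reused as prompts. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a fixed list of candidate questions and a toy keyword matcher (`toy_answer`) standing in for the LLM; in the paper the LLM proposes and answers the questions interactively. All names (`Node`, `build_tree`, `leaf_paths`, `path_to_prompt`, `KEYWORDS`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple

Path = Tuple[Tuple[str, bool], ...]  # sequence of (question, answer) pairs


@dataclass
class Node:
    texts: List[str]
    question: Optional[str] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def build_tree(texts: List[str], questions: List[str],
               answer: Callable[[str, str], bool], min_size: int = 2) -> Node:
    """Recursively split texts with the first question that separates them."""
    node = Node(texts=list(texts))
    if len(texts) < min_size:
        return node
    for i, q in enumerate(questions):
        yes = [t for t in texts if answer(q, t)]
        no = [t for t in texts if not answer(q, t)]
        if yes and no:  # the question actually partitions this subset
            rest = questions[:i] + questions[i + 1:]
            node.question = q
            node.yes = build_tree(yes, rest, answer, min_size)
            node.no = build_tree(no, rest, answer, min_size)
            break
    return node


def leaf_paths(node: Node, path: Path = ()) -> Iterator[Tuple[Path, List[str]]]:
    """Yield (path, texts) for every leaf, 'yes' branches first."""
    if node.question is None:
        yield path, node.texts
        return
    yield from leaf_paths(node.yes, path + ((node.question, True),))
    yield from leaf_paths(node.no, path + ((node.question, False),))


def path_to_prompt(path: Path) -> str:
    """Turn a thematic path into a structured prompt for a generative model."""
    conditions = "; ".join(f"{q} {'yes' if a else 'no'}" for q, a in path)
    return f"Write a text satisfying: {conditions}"


# Toy stand-in for the LLM (assumption): a question is answered "yes"
# if any of its keywords occurs in the text.
KEYWORDS = {
    "Is the text about the battery?": ("battery",),
    "Is the text negative?": ("drains", "slow"),
}


def toy_answer(question: str, text: str) -> bool:
    return any(k in text.lower() for k in KEYWORDS[question])


texts = ["The battery drains fast", "Great battery life",
         "Shipping was slow", "Fast shipping, thanks"]
tree = build_tree(texts, list(KEYWORDS), toy_answer)
leaves = list(leaf_paths(tree))
for path, members in leaves:
    print(path_to_prompt(path), "->", members)
```

Each leaf's cluster is explained by the questions and answers on its path, which is the interpretability property the abstract contrasts with keyword lists. The same path, rendered by `path_to_prompt`, serves as a controllable generation prompt.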

Related papers