Fugu-MT Paper Translation (Abstract): THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science

Paper overview: THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science

arxiv url: http://arxiv.org/abs/2603.05972v1
Date: Fri, 06 Mar 2026 07:12:05 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.282562
Title: THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science
Title (reference translation): THETA, a Topic Analysis Framework Based on Textual Hybrid Embeddings, and AI Scientist Agent for Scalable Computational Social Science
Authors: Zhenke Duan, Xin Li,
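Abstract summary: This paper introduces Textual Hybrid Embedding based Topic Analysis (THETA). THETA is a novel computational paradigm and open-source tool that bridges the gap between massive data scale and rich theoretical depth. The results show that it outperforms traditional models such as LDA, ETM, and CTM.
Reference score (attention measure computed independently by this site): 5.225859530177356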
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The explosion of big social data has created a scalability trap for traditional qualitative research, as manual coding remains labor-intensive and conventional topic models often suffer from semantic thinning and a lack of domain awareness. This paper introduces Textual Hybrid Embedding based Topic Analysis (THETA), a novel computational paradigm and open-source tool designed to bridge the gap between massive data scale and rich theoretical depth. THETA moves beyond frequency-based statistics by implementing Domain-Adaptive Fine-tuning (DAFT) via LoRA on foundation embedding models, which effectively optimizes semantic vector structures within specific social contexts to capture latent meanings. To ensure epistemological rigor, we encapsulate this process into an AI Scientist Agent framework, comprising Data Steward, Modeling Analyst, and Domain Expert agents, to simulate the human-in-the-loop expert judgment and constant comparison processes central to grounded theory. Departing from purely computational models, this framework enables agents to iteratively evaluate algorithmic clusters, perform cross-topic semantic alignment, and refine raw outputs into logically consistent theoretical categories. To validate the effectiveness of THETA, we conducted experiments across six domains, including financial regulation and public health. Our results demonstrate that THETA significantly outperforms traditional models, such as LDA, ETM, and CTM, in capturing domain-specific interpretive constructs while maintaining superior coherence. By providing an interactive analysis platform, THETA democratizes advanced natural language processing for social scientists and ensures the trustworthiness and reproducibility of research findings. Code is available at https://github.com/CodeSoul-co/THETA.
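Abstract (reference translation): The explosion of large-scale social data has created a scalability trap for traditional qualitative research, because manual coding remains labor-intensive and conventional topic models suffer from semantic thinning and a lack of domain awareness. This paper introduces Textual Hybrid Embedding based Topic Analysis (THETA), a novel computational paradigm and open-source tool designed to bridge the gap between massive data scale and rich theoretical depth. THETA goes beyond frequency-based statistics by implementing Domain-Adaptive Fine-tuning (DAFT) via LoRA on foundation embedding models, which effectively optimizes semantic vector structures within specific social contexts to capture latent meanings. To ensure epistemological rigor, the authors encapsulate this process in an AI Scientist Agent framework consisting of Data Steward, Modeling Analyst, and Domain Expert agents, which simulates the human-in-the-loop expert judgment and constant comparison processes central to grounded theory. Departing from purely computational models, the framework lets agents iteratively evaluate algorithmic clusters, perform cross-topic semantic alignment, and refine raw outputs into logically consistent theoretical categories. To validate the effectiveness of THETA, experiments were conducted across six domains, including financial regulation and public health. The results indicate that THETA significantly outperforms traditional models such as LDA, ETM, and CTM in capturing domain-specific interpretive constructs while maintaining superior coherence. By providing an interactive analysis platform, THETA democratizes advanced natural language processing for social scientists and ensures the trustworthiness and reproducibility of research findings. Code is available at https://github.com/CodeSoul-co/THETA.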

Related papers