Fugu-MT Paper Translation (Abstract): Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

Paper overview: Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

arxiv url: http://arxiv.org/abs/2508.15793v1
Date: Wed, 13 Aug 2025 01:09:02 GMT
Status: Translation complete
Last updated in system: 2025-08-31 21:54:20.53156
Title: Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data
Title (reference translation): Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data
Authors: Jiacheng Liu, Mayi Xu, Qiankun Pi, Wenli Li, Ming Zhong, Yuanyuan Zhu, Mengchi Liu, Tieyun Qian
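Abstract summary: Large language models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats. This paper describes a first attempt to investigate and analyze format bias in LLMs. It identifies three future research directions for reducing format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora.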
Reference score (independently computed attention measure): 17.88854327331652
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Despite these concerns, it remains uncertain whether such format biases are systematic, which data-level factors contribute to them, and what internal mechanisms in LLMs underlie their emergence. In this paper, we make the first attempt to investigate and analyze the format bias in LLMs. To systematically investigate the aforementioned questions, we conduct a three-stage empirical study by constructing a heterogeneous data conflict scenario for the exploration of bias. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage aims to examine how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates a lightweight intervention to test its potential mitigability. Based on these investigations, we identify three future research directions to reduce format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.
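Abstract (reference translation): Large language models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats such as text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats can impair an LLM's ability to integrate heterogeneous data impartially, which may lead to reasoning errors and increased risk in downstream tasks. Despite these concerns, it remains unclear whether such format biases are systematic, which data-level factors contribute to them, and which internal mechanisms of LLMs underlie their emergence. This paper makes the first attempt to investigate and analyze format bias in LLMs. To investigate these questions systematically, we construct a heterogeneous data conflict scenario for exploring bias and conduct a three-stage empirical study. The first stage examines the presence and direction of bias across a variety of LLMs. The second stage examines how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates a lightweight intervention to test whether the bias can be mitigated. Based on these investigations, we identify three future research directions for reducing format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.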


Related papers