Fugu-MT Paper Translation (Abstract): Structured Exploration and Exploitation of Label Functions for Automated Data Annotation

Paper overview: Structured Exploration and Exploitation of Label Functions for Automated Data Annotation

arxiv url: http://arxiv.org/abs/2604.08578v1
Date: Sat, 28 Mar 2026 04:19:33 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.453746
Title: Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
Authors: Phong Lam, Ha-Linh Nguyen, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
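Abstract summary: Programmatic labeling uses label functions (LFs), i.e., rules that automatically generate weak labels for training datasets. This paper introduces EXPONA, an automated framework for programmatic labeling that balances diversity and reliability. Experimental results show that EXPONA consistently outperformed state-of-the-art automated LF generation methods.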
Reference score (independently calculated attention score): 3.780303340354419
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: High-quality labeled data is critical for training reliable machine learning and deep learning models, yet manual annotation remains costly and error-prone. Programmatic labeling addresses this challenge by using label functions (LFs), i.e., heuristic rules that automatically generate weak labels for training datasets. However, existing automated LF generation methods either rely on large language models (LLMs) to synthesize surface-level heuristics or employ model-based synthesis over hand-crafted primitives. These approaches often result in limited coverage and unreliable label quality. In this paper, we introduce EXPONA, an automated framework for programmatic labeling that formulates LF generation as a principled process balancing diversity and reliability. EXPONA systematically explores multi-level LFs, spanning surface, structural, and semantic perspectives. EXPONA further applies reliability-aware mechanisms to suppress noisy or redundant heuristics while preserving complementary signals. To evaluate EXPONA, we conducted extensive experiments on eleven classification datasets across diverse domains. Experimental results show that EXPONA consistently outperformed state-of-the-art automated LF generation methods. Specifically, EXPONA achieved nearly complete label coverage (up to 98.9%), improved weak label quality by up to 87%, and yielded downstream performance gains of up to 46% in weighted F1. These results indicate that EXPONA's combination of multi-level LF exploration and reliability-aware filtering enabled more consistent label quality and downstream performance across diverse tasks by balancing coverage and precision in the generated LF set.
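The abstract describes label functions as heuristic rules that emit weak labels, and it reports label coverage as one of its metrics. The following is a minimal Python sketch of that general idea only. It is not EXPONA's method: the three label functions, the spam/ham task, the example texts, and the majority-vote aggregation are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical label constants for an illustrative spam/ham task.
ABSTAIN = -1
HAM = 0
SPAM = 1


def lf_contains_free(text):
    # Surface-level heuristic (assumption): the keyword "free" suggests spam.
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    # Surface-level heuristic (assumption): repeated "!" suggests spam.
    return SPAM if text.count("!") >= 2 else ABSTAIN


def lf_short_greeting(text):
    # Noisy heuristic (assumption): a short message opening with a greeting is ham.
    is_greeting = text.lower().startswith(("hi", "hello"))
    return HAM if is_greeting and len(text.split()) <= 6 else ABSTAIN


LFS = [lf_contains_free, lf_many_exclamations, lf_short_greeting]


def apply_lfs(texts, lfs):
    # Build a label matrix: one row per text, one column per label function.
    return [[lf(t) for lf in lfs] for t in texts]


def majority_vote(votes):
    # Aggregate non-abstaining votes; abstain when there are none or on a tie.
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def coverage(label_matrix):
    # Fraction of examples that receive a label from at least one function.
    if not label_matrix:
        return 0.0
    covered = sum(1 for row in label_matrix if any(v != ABSTAIN for v in row))
    return covered / len(label_matrix)


texts = [
    "Win a FREE prize now!!",
    "Hello, are we still meeting today?",
    "Please review the attached report.",
    "Hi, free tickets for you",
]

label_matrix = apply_lfs(texts, LFS)
weak_labels = [majority_vote(row) for row in label_matrix]

print(weak_labels)             # [1, 0, -1, -1]
print(coverage(label_matrix))  # 0.75
```

In this toy run the third text is uncovered and the fourth draws conflicting votes, which illustrates the two problems the abstract names: limited coverage and unreliable label quality. Simple majority voting is only one possible aggregation choice and is used here for brevity.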

Related papers