Fugu-MT paper translation (abstract): Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

Paper overview: Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

arxiv url: http://arxiv.org/abs/2606.02004v1
Date: Mon, 01 Jun 2026 09:59:29 GMT
Status: translation complete
System update date: 2026-06-02 21:34:31.776251
Title: Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling
Authors: Vladimir Beskorovainyi
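Abstract summary: This paper studies the mapping of product names to consumption categories as a general, reproducible method. Labels are collected with a human-in-the-loop protocol in which annotators give a binary valid/reject judgment. A Monte-Carlo study of the labeling protocol shows that the reliability-weighted vote barely beats plain majority.
Reference score (attention score computed by this site): 0.0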
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) -- a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
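Steps (i) and (ii) of the pipeline can be illustrated with a minimal Python sketch. The abstract does not specify the matching semantics, and the paper states that no code is reproduced, so everything here is an assumption made for illustration: the `normalize`, `PhraseTrie`, and `pre_classify` names, the token-level trie, the rule that a stop-phrase vetoes its category, and the example categories and phrases.

```python
import re
import unicodedata

_END = "\0end"  # sentinel key; cannot collide with a token (tokens are alphanumeric)


def normalize(name: str) -> list[str]:
    """Step (i), assumed form: Unicode-normalize, lowercase, split into tokens."""
    text = unicodedata.normalize("NFKC", name).lower()
    return re.findall(r"[^\W_]+", text)  # runs of letters/digits


class PhraseTrie:
    """Prefix tree over tokens; each stored phrase ends in a set of category labels."""

    def __init__(self) -> None:
        self.root: dict = {}

    def add(self, phrase: str, category: str) -> None:
        node = self.root
        for tok in normalize(phrase):
            node = node.setdefault(tok, {})
        node.setdefault(_END, set()).add(category)

    def find(self, tokens: list[str]) -> set[str]:
        """Return categories of all stored phrases found as contiguous token runs."""
        found: set[str] = set()
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:]:
                node = node.get(tok)
                if node is None:
                    break
                found |= node.get(_END, set())
        return found


def pre_classify(name: str, key_trie: PhraseTrie, stop_trie: PhraseTrie) -> set[str]:
    """Step (ii), assumed rule: key-phrase hits minus categories vetoed by stop-phrases."""
    tokens = normalize(name)
    return key_trie.find(tokens) - stop_trie.find(tokens)


# Hypothetical rules for two made-up categories.
key_trie, stop_trie = PhraseTrie(), PhraseTrie()
key_trie.add("milk", "dairy")
key_trie.add("chocolate", "confectionery")
stop_trie.add("milk chocolate", "dairy")  # "milk chocolate" is not dairy

print(pre_classify("MILK 2.5% 1L", key_trie, stop_trie))             # {'dairy'}
print(pre_classify("Milk Chocolate bar 100g", key_trie, stop_trie))  # {'confectionery'}
```

In the pipeline described above, each tentative category returned by such a pre-classifier would then be passed to the per-category binary confirmation model of step (iii), which is not shown here.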
