Fugu-MT Paper Translation (Summary): A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data

Paper summary: A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data

arxiv url: http://arxiv.org/abs/2606.07128v1
Date: Fri, 05 Jun 2026 10:41:18 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.693794
Title: A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data
Authors: Zhuphua Cao
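Abstract summary: The Fabrication-risk Digit Randomness Screening model (FDRS) is a framework for detecting non-random digit-pattern irregularities in numerical research data. FDRS integrates single- and joint-decimal-digit tests, Cramer's V, entropy metrics, Kullback-Leibler divergence, digit-preference indices, progressive subsampling, and semi-supervised risk scoring.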
Reference score (independently computed attention metric): 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Raw numerical datasets remain less systematically examined in integrity screening than images, plagiarism, or summary-statistic inconsistencies. We developed the Fabrication-risk Digit Randomness Screening model (FDRS), a statistical and machine-learning framework for detecting non-random digit-pattern irregularities in numerical research data. FDRS integrates single- and joint-decimal-digit tests, Cramer's V, entropy metrics, Kullback-Leibler divergence, digit-preference indices, progressive subsampling, and semi-supervised risk scoring. It was evaluated using an instrument-derived enzymatic absorbance dataset (RawData, n=253) and a blinded manually simulated irregular dataset (ErrData, n=255). RawData showed no significant deviation in single third-decimal-digit analysis, whereas ErrData showed a significant deviation. In joint third-fourth decimal digit analysis, ErrData showed higher Cramer's V, lower normalized entropy, higher KL divergence, and a more persistent progressive-subsampling deviation signal. In internal validation, Elastic-net Logistic Regression achieved the highest AUC (0.98395) and lowest Brier score (0.048439), while Random Forest achieved the highest accuracy (0.926667) and balanced accuracy (0.935). RawData received a low ensemble risk score of 0.124627 and was classified as Grade 0; ErrData received a score of 0.740760 and was classified as Grade 3. External real-world benchmarks supported graded risk stratification: three datasets without identified public post-publication concerns were classified as Grade 0 or 1, whereas two datasets from publicly questioned or institutionally handled articles were classified as Grade 2 or 3. FDRS can prioritize raw numerical datasets for further review by integrating interpretable statistical and machine-learning features. It is an auxiliary digit-structure screening tool, not standalone evidence of fabrication or misconduct.
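The digit-level statistics named in the abstract can be illustrated with a minimal sketch. This is not the FDRS implementation: the function names, the use of a uniform reference distribution over the digits 0-9, and the choice to compute Cramer's V on the third-by-fourth decimal digit contingency table are all assumptions made for illustration. Values are handled as strings so that trailing digits are not lost to floating-point formatting.

```python
import math
from collections import Counter


def decimal_digit(value: str, position: int) -> int:
    """Return the digit at a 1-based decimal position of a numeric string."""
    _, _, frac = value.partition(".")
    if len(frac) < position:
        raise ValueError(f"{value!r} has fewer than {position} decimal places")
    return int(frac[position - 1])


def chi_square_uniform(digits):
    """Pearson chi-square statistic against a uniform distribution on 0-9."""
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def normalized_entropy(digits):
    """Shannon entropy of the digit distribution divided by log(10)."""
    n = len(digits)
    h = -sum((c / n) * math.log(c / n) for c in Counter(digits).values())
    return h / math.log(10)


def kl_from_uniform(digits):
    """Kullback-Leibler divergence (in nats) from the uniform reference."""
    n = len(digits)
    return sum((c / n) * math.log((c / n) / 0.1) for c in Counter(digits).values())


def cramers_v(pairs):
    """Cramer's V for the association between two digit positions."""
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(a for a, _ in pairs)
    col = Counter(b for _, b in pairs)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical absorbance readings, invented purely for illustration.
readings = ["0.4127", "0.3985", "0.4050", "0.4150", "0.3955", "0.4250"]
third = [decimal_digit(r, 3) for r in readings]
pairs = [(decimal_digit(r, 3), decimal_digit(r, 4)) for r in readings]
print(chi_square_uniform(third), normalized_entropy(third))
print(kl_from_uniform(third), cramers_v(pairs))
```

Under the uniform reference assumed here, the KL divergence equals log(10) minus the Shannon entropy, so lower normalized entropy and higher KL divergence are two views of the same departure from uniformity. A sample as small as the one above says nothing about data integrity; the chi-square statistic would normally be compared against a reference distribution with 9 degrees of freedom on a much larger sample.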


Related papers