Fugu-MT: arXiv Paper Translations

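This site provides Japanese translations of arXiv papers that are 30 pages or shorter and released under a Creative Commons license (CC 0, CC BY, or CC BY-SA). For papers whose body text is not under a CC license, or that are too long, only the metadata is translated. (arXiv metadata is CC 0.) The translations are licensed under CC BY-SA 4.0. Translation is performed with Fugu-Machine Translator.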

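The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from the use of this site (including all information and translations).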

These are papers with a publication date of 20241018.

Title	Authors	Abstract	Publication date / Translation date
# トレイン・アンド・コンストレイン:トピックとパラフレーズから音韻的にインフォームドされたトング・ツイスター生成 Train & Constrain: Phonologically Informed Tongue-Twister Generation from Topics and Paraphrases ( http://arxiv.org/abs/2403.13901v2 ) ライセンス: Link先を確認	Tyler Loakman, Chen Tang, Chenghua Lin,	(参考訳) 音韻学的・音声学的に根ざした言語生成の先行研究は、主に句や詩などの領域に焦点を当てている。本稿では,入力話題やフレーズとのセマンティックな整合性を維持しつつ,文法的正確性を維持しつつ,音素レベルで条件を定めなければならない言語である,英語舌ツイスターの生成に関する新たな研究について述べる。提案するTwisterListerは,人間の言語モデル(LLM)から音韻的に入力された舌ねじれ音を生成するパイプラインであり,人間の言語モデルとLLMの著者の組み合わせによる17K以上の例からなる,舌ねじれ音のアノテートデータセットであるTwistList 2.0を生成する。我々の生成パイプラインは、LLMと共に音韻的に制約された語彙を用いることで、新規な非派生的な舌ねじれの例を生成する。さらに, 音声学的知識を明示的に注入することなく, 音韻的動機付け言語が生成できる範囲を示すために, 生成されたデータセット上で訓練された小型モデルの自動的, 人為的評価結果も提示する。さらに、自動回帰言語モデルに統合可能な音素制約付きデコードモジュール(PACD)を導入し、基礎となる言語モデルを微調整することなく良質な舌ねじれを生成することを示した。また,主に音素編集距離(PED)に基づいて,音韻的に動機付けされ,舌ねじり器の独特な本質を捉えた舌ねじり器生成作業のための多種多様な自動測度を設計・実装する。 Previous work in phonologically and phonetically grounded language generation has mainly focused on domains such as puns and poetry. In this article, we present new work on the generation of English tongue twisters - a form of language that is required to be conditioned on a phoneme level to maximize sound overlap, while maintaining semantic consistency with an input topic or phrase and still being grammatically correct. We present TwisterLister, a pipeline for generating phonologically informed tongue twisters from large language models (LLMs) that we use to generate TwistList 2.0, the largest annotated dataset of tongue twisters to date, consisting of 17K+ examples from a combination of human and LLM authors. Our generation pipeline involves the use of a phonologically constrained vocabulary alongside LLM prompting to generate novel, non-derivative tongue twister examples. 
We additionally present the results of automatic and human evaluation of smaller models trained on our generated dataset to demonstrate the extent to which phonologically motivated language types can be generated without explicit injection of phonological knowledge. Additionally, we introduce a phoneme-aware constrained decoding module (PACD) that can be integrated into an autoregressive language model and demonstrate that this method generates good-quality tongue twisters both with and without fine-tuning the underlying language model. We also design and implement a range of automatic metrics for the task of tongue twister generation that are phonologically motivated and capture the unique essence of tongue twisters, primarily based on phonemic edit distance (PED).	翻訳日:2024-11-09 03:59:23 公開日:2024-10-18
# リモート光胸腺造影信号形態に基づく生体認証 Biometric Authentication Based on Enhanced Remote Photoplethysmography Signal Morphology ( http://arxiv.org/abs/2407.04127v2 ) ライセンス: Link先を確認	Zhaodong Sun, Xiaobai Li, Jukka Komulainen, Guoying Zhao,	(参考訳) 遠隔プラチスモグラフィー(Remote Photoplethysmography、rPPG)は、コンタクトセンサーから得られる接触型フォトプレチスモグラフィー(cPPG)の代替として、顔画像から心臓の信号を計測する非接触式方法である。近年の研究では、顔画像から抽出したrPPG信号の形態を人物認証に利用するために、各個人が生体認証として利用できる独自のcPPG信号形態を持っていることが示されている。顔の外観とrPPGが混在しているため、まず顔の外観を識別し、rPPG情報を保持しながら顔の外観を除去し、顔のプライバシーを保護し、rPPGのみが認証に使用されることを保証する。未同定ビデオは、rPPG信号形態を認証するためにrPPGモデルに入力される。第1の訓練段階では、粗いrPPG信号を得るために、教師なしrPPG訓練を行う。第2の訓練段階では、外部のcPPGデータセットを組み込んで、rPPG生体認証を実現し、rPPG信号形態を向上することにより、rPPG-cPPGハイブリッドトレーニングを行う。提案手法では,rPPG認証モデルのトレーニングを行うために,対象ID付き顔認識ビデオのみを必要とする。実験により, 顔画像に隠されたrPPG信号形態が生体認証に有効であることが確認された。コードはhttps://github.com/zhaodongsun/rppg_biometricsで公開されている。 Remote photoplethysmography (rPPG) is a non-contact method for measuring cardiac signals from facial videos, offering a convenient alternative to contact photoplethysmography (cPPG) obtained from contact sensors. Recent studies have shown that each individual possesses a unique cPPG signal morphology that can be utilized as a biometric identifier, which has inspired us to utilize the morphology of rPPG signals extracted from facial videos for person authentication. Since the facial appearance and rPPG are mixed in the facial videos, we first de-identify facial videos to remove facial appearance while preserving the rPPG information, which protects facial privacy and guarantees that only rPPG is used for authentication. The de-identified videos are fed into an rPPG model to get the rPPG signal morphology for authentication. In the first training stage, unsupervised rPPG training is performed to get coarse rPPG signals. In the second training stage, an rPPG-cPPG hybrid training is performed by incorporating external cPPG datasets to achieve rPPG biometric authentication and enhance rPPG signal morphology. 
Our approach needs only de-identified facial videos with subject IDs to train rPPG authentication models. The experimental results demonstrate that rPPG signal morphology hidden in facial videos can be used for biometric authentication. The code is available at https://github.com/zhaodongsun/rppg_biometrics.	翻訳日:2024-11-08 23:57:53 公開日:2024-10-18
# 地球回転による中性子の角運動量測定 Measuring the Angular Momentum of a Neutron Using Earth's Rotation ( http://arxiv.org/abs/2407.09307v3 ) ライセンス: Link先を確認	Niels Geerits, Stephan Sponar, Kyle E. Steffen, William M. Snow, Steven R. Parnell, Giacomo Mauri, Gregory N. Smith, Robert M. Dalgliesh, Victor de Haan,	(参考訳) サニャック効果として知られる地球回転と軌道角運動量(OAM)の結合は、スピンエコー干渉計を用いて生じる絡み合った中性子で観測される。機器の体系的な修正の後、測定された結合は理論の5%以内であり、不確実性は7.2%である。セットアップ中のOAMは伝播方向を横切り、波長(4A〜12.75A)と直線的にスケールするので、デバイスを機械的に回転させることなく結合を可変させることができる。したがって、系統的な誤差は以前の実験より低い。検出されたビームの逆OAMは、以前の中性子実験より5桁低い4098 +- 295 hbar A-1と一致し、サニャック効果を用いて中性子OAMを確定測定し、量子サニャック効果の観測への道を開く可能性を示す。 A coupling between Earth's rotation and orbital angular momentum (OAM), known as the Sagnac effect, is observed in entangled neutrons produced using a spin echo interferometer. After correction for instrument systematics, the measured coupling is within 5% of theory, with an uncertainty of 7.2%. The OAM in our setup is transverse to the propagation direction and scales linearly with wavelength (4 A - 12.75 A), hence the coupling can be varied without mechanically rotating the device. Therefore, the systematic error is lower than in previous experiments. The detected transverse OAM of our beam corresponds to 4098 +- 295 hbar A-1, 5 orders of magnitude lower than in previous neutron experiments, thereby demonstrating the feasibility of using the Sagnac effect to definitively measure neutron OAM and paving the way towards observations of the quantum Sagnac effect.	翻訳日:2024-11-08 22:06:29 公開日:2024-10-18
# ASTPrompter: 毒なプロンプットを識別する言語モデルの再設計 ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts ( http://arxiv.org/abs/2407.09447v2 ) ライセンス: Link先を確認	Amelia F. Hardy, Houjun Liu, Bernard Lange, Mykel J. Kochenderfer,	(参考訳) 大規模言語モデル(LLM)の自動再チームの典型的なスキームは、凍結した言語モデル(ディフェンダー)をトリガーして有害なテキストを生成するプロンプトを発見することに焦点を当てている。これはしばしば、不可知であり、起こりそうもないテキストを生成するプロンプトモデル(敵)を生み出します。本稿では,(1)凍結したディフェンダーから有毒な出力を誘導するプロンプトと(2)そのディフェンダーが得点するパープレキシティの低いプロンプトの発見を可能にする,LDMレッドチームタスクの強化学習形式を提案する。これらのケースは、ディフェンダーモデルの通常の使用中に発生する可能性が高いため、レッドチーム環境で最も重要なケースである、と我々は主張する。我々は、GPT-2、GPT-2 XL、TinyLlamaディフェンダーによる、オンラインおよび弱教師付きIdentity Preference Optimization(IPO)によるこの定式化を解決する。当社のポリシーは、これらすべてのアーキテクチャから毒性を引き起こす可能性のある(低複雑さ)プロンプトを生成することができることを実証しています。さらに,このポリシーは,高い確率で発生し,より効果的である攻撃を発生させることにより,ベースラインよりも優れていることを示す。最後に, 可能性と毒性のトレードオフについて検討した。このプロジェクトのソースコードは、https://github.com/sisl/ASTPrompter/.comで入手できる。 Typical schemes for the automated red-teaming of large language models (LLMs) focus on discovering prompts that trigger a frozen language model (the defender) to generate toxic text. This often results in the prompting model (the adversary) producing text that is unintelligible and unlikely to arise. Here, we propose a reinforcement learning formulation of the LLM red-teaming task that allows us to discover prompts that both (1) trigger toxic outputs from a frozen defender and (2) have low perplexity as scored by that defender. We argue these cases are the most pertinent in a red-teaming setting because they are likely to arise during normal use of the defender model. We solve this formulation through a novel online and weakly supervised variant of Identity Preference Optimization (IPO) on GPT-2, GPT-2 XL, and TinyLlama defenders. We demonstrate that our policy is capable of generating likely (low-perplexity) prompts that also trigger toxicity from all of these architectures. 
Furthermore, we show that this policy outperforms baselines by producing attacks that occur with higher probability and are more effective. Finally, we discuss our findings and the observed trade-offs between likelihood and toxicity. Source code for this project is available at: https://github.com/sisl/ASTPrompter/.	翻訳日:2024-11-08 22:06:29 公開日:2024-10-18
# 極端における顆粒の因果関係 Granger Causality in Extremes ( http://arxiv.org/abs/2407.09632v2 ) ライセンス: Link先を確認	Juraj Bodik, Olivier C. Pasche,	(参考訳) 本稿では,時系列における極端事象からの因果関係の同定を目的とした,極端におけるグランガー因果関係の厳密な数学的枠組みを提案する。グランガー因果関係は、時間変化変数間の方向関係を明らかにする上で重要な役割を果たす。この概念は極端かつ非常に不安定な期間に重要性を増すが、最先端の手法は主に分布の本体内の因果性に焦点を当てており、しばしば極端な出来事にのみ現れる因果的メカニズムを見落としている。本フレームワークは, 因果尾係数を利用して, 主に極端な事象から因果関係を推定するように設計されている。我々は、極端な因果関係と(古典的な)グランガー因果関係、シムズ因果関係、構造因果関係などの他の因果関係の概念の等価性を確立する。 Grangerの因果関係の他の重要な性質を極端に証明し、このフレームワークが隠れた共同創設者の存在下で特に有用であることを示す。また,データから極端にグランガー因果性が存在することを検出する新しい推論手法を提案する。提案手法はモデルフリーであり, 非線形・高次元時系列処理が可能であり, 性能, 速度の両面において, 現状の手法よりも優れており, 財務・極端気象観測におけるコヒーレントな効果を明らかにすることができた。 We introduce a rigorous mathematical framework for Granger causality in extremes, designed to identify causal links from extreme events in time series. Granger causality plays a pivotal role in uncovering directional relationships among time-varying variables. While this notion gains heightened importance during extreme and highly volatile periods, state-of-the-art methods primarily focus on causality within the body of the distribution, often overlooking causal mechanisms that manifest only during extreme events. Our framework is designed to infer causality mainly from extreme events by leveraging the causal tail coefficient. We establish equivalences between causality in extremes and other causal concepts, including (classical) Granger causality, Sims causality, and structural causality. We prove other key properties of Granger causality in extremes and show that the framework is especially helpful under the presence of hidden confounders. We also propose a novel inference method for detecting the presence of Granger causality in extremes from data. 
Our method is model-free, can handle non-linear and high-dimensional time series, outperforms current state-of-the-art methods in all considered setups, both in performance and speed, and was found to uncover coherent effects when applied to financial and extreme weather observations.	翻訳日:2024-11-08 21:54:45 公開日:2024-10-18
# ヒューマン・アウェア・パス・プランニングのための社会的コスト関数の学習 Learning Social Cost Functions for Human-Aware Path Planning ( http://arxiv.org/abs/2407.10547v2 ) ライセンス: Link先を確認	Andrea Eirale, Matteo Leonetti, Marcello Chiaberge,	(参考訳) 社会的受容を達成することは、社会ロボットナビゲーションの主要な目標の1つである。この話題は近年注目されているが、研究の大半は障害物のない軌道に沿ってロボットエージェントを駆動することに焦点を当てており、個人距離を尊重し、ナビゲーションを最適化するために将来の人間の動きを推定する計画を立てている。しかし、日常生活における社会的相互作用は、カットするよりもキューの端に立っている場合など、運動に厳密に依存しない規範によっても規定される。本稿では,一般的な社会的シナリオを認識し,従来のプランナーのコスト関数を適応させる新しい手法を提案する。このソリューションは、従来のナビゲーションの堅牢性を維持しながら、他の方法では発生しない様々なソーシャルナビゲーション行動を実行することを可能にする。我々のアプローチでは、ロボットはタスクごとに異なるモジュールを持つのではなく、単一の学習モデルで異なる社会的規範を学習することができる。概念実証として、話し合う人々の集団の相互作用空間をキューイングし、尊重するタスクについて考察するが、この方法は動きを伴わない他の人間の活動にまで拡張することができる。 Achieving social acceptance is one of the main goals of Social Robotic Navigation. Although this topic has received increasing interest in recent years, most of the research has focused on driving the robotic agent along obstacle-free trajectories, planning around estimates of future human motion to respect personal distances and optimize navigation. However, social interactions in everyday life are also dictated by norms that do not strictly depend on movement, such as when standing at the end of a queue rather than cutting it. In this paper, we propose a novel method to recognize common social scenarios and modify a traditional planner's cost function to adapt to them. This solution enables the robot to carry out different social navigation behaviors that would not arise otherwise, maintaining the robustness of traditional navigation. Our approach allows the robot to learn different social norms with a single learned model, rather than having different modules for each task. As a proof of concept, we consider the tasks of queuing and respecting the interaction spaces of groups of people talking to one another, but the method can be extended to other human activities that do not involve motion.	翻訳日:2024-11-08 21:32:38 公開日:2024-10-18
# 放射性炭素とAIを用いた筆跡解析を用いた古写本の年代推定 Dating ancient manuscripts using radiocarbon and AI-based writing style analysis ( http://arxiv.org/abs/2407.12013v2 ) ライセンス: Link先を確認	Mladen Popović, Maruf A. Dhali, Lambert Schomaker, Johannes van der Plicht, Kaare Lund Rasmussen, Jacopo La Nasa, Ilaria Degano, Maria Perla Colombini, Eibert Tigchelaar,	(参考訳) 古写本の年代決定は、思想の進化の再構築に不可欠である。デッドシースクロールにとって、これは特に重要である。しかし、ほぼ完全な年代記の欠如がタイムラインに均等に散在し、パレオグラフィー比較で利用可能な類似の書体で書かれている。本稿では,現在最先端のAIに基づく年代予測モデルであるEnochについて紹介する。 Enochは、確立された手書きスタイルの記述子を使用し、ベイズ尾根の回帰を適用している。本研究の課題は,現在の機械学習では大量のトレーニングデータを必要とするのに対して,放射性炭素年代付原稿の数は少ないことである。角線およびアログラフによる特徴ベクトルとベイジアンリッジの回帰を併用することにより,エノクは放射性炭素系年代を27.9～30.7年で予測できることを示した。その後、エノクは135点の未確認写本の日付を推定するために用いられ、標本の79パーセントがパレオグラフィーによるポストホック評価で「現実的」であるとされた。我々はその巻物の新しい年表を提示する。放射性炭素の範囲とエノクのスタイルに基づく予測は、伝統的に推定されるパレオグラフィー推定よりも古いことが多い。紀元前300-50年の範囲では、エノクの年代予測により粒度は改善された。本研究は, マルチモーダル機械学習技術の現況と一致し, 他の部分的古写本コレクションの日付予測に利用することができる。この研究は、エノクの量的、確率に基づくアプローチが、パレオグラフィーや歴史家にとっての道具となり、古代ユダヤ人の鍵となる文章を再編纂し、現在のユダヤ教とキリスト教の起源に関する議論に寄与していることを示している。 Determining the chronology of ancient handwritten manuscripts is essential for reconstructing the evolution of ideas. For the Dead Sea Scrolls, this is particularly important. However, there is an almost complete lack of date-bearing manuscripts evenly distributed across the timeline and written in similar scripts available for palaeographic comparison. Here, we present Enoch, a state-of-the-art AI-based date-prediction model, trained on the basis of new radiocarbon-dated samples of the scrolls. Enoch uses established handwriting-style descriptors and applies Bayesian ridge regression. The challenge of this study is that the number of radiocarbon-dated manuscripts is small, while current machine learning requires an abundance of training data. 
We show that by using combined angular and allographic writing style feature vectors and applying Bayesian ridge regression, Enoch could predict the radiocarbon-based dates from style, supported by leave-one-out validation, with varied MAEs of 27.9 to 30.7 years relative to the radiocarbon dating. Enoch was then used to estimate the dates of 135 unseen manuscripts, revealing that 79 per cent of the samples were considered 'realistic' upon palaeographic post-hoc evaluation. We present a new chronology of the scrolls. The radiocarbon ranges and Enoch's style-based predictions are often older than the traditionally assumed palaeographic estimates. In the range of 300-50 BCE, Enoch's date prediction provides an improved granularity. The study is in line with current developments in multimodal machine-learning techniques, and the methods can be used for date prediction in other partially-dated manuscript collections. This research shows how Enoch's quantitative, probability-based approach can be a tool for palaeographers and historians, re-dating ancient Jewish key texts and contributing to current debates on Jewish and Christian origins.	翻訳日:2024-11-08 20:59:00 公開日:2024-10-18
# 検索強化機械学習:合成と機会 Retrieval-Enhanced Machine Learning: Synthesis and Opportunities ( http://arxiv.org/abs/2407.12982v2 ) ライセンス: Link先を確認	To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, Hamed Zamani,	(参考訳) 言語モデリングの分野では、自然言語処理(NLP)分野で直面するいくつかの課題に対処するために、検索コンポーネントで拡張されたモデルが有望なソリューションとして登場した。 NLPに主眼を置いているにもかかわらず、検索・エンハンスメントのパラダイムはコンピュータビジョン、時系列予測、計算生物学など幅広い機械学習(ML)に拡張できると仮定する。そこで本研究では,このパラダイムの形式的枠組みであるRetrieval-Enhanced Machine Learning (REML)を導入し,MLの各領域の文献を,現在の文献から欠落している一貫した表記で合成する。また,多くの研究が検索コンポーネントを用いてモデルを強化する一方で,基礎的情報検索(IR)研究との連携が欠如していることが判明した。我々は、REMLフレームワークを構成する各コンポーネントを調査することで、セミナルIR研究と現代のREML研究のギャップを埋める。究極的には、この研究の目的は、様々な分野の研究者に、検索強化モデルの包括的、正式に構造化された枠組みを付与し、学際的な将来の研究を促進することである。 In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to address several challenges faced in the natural language processing (NLP) field, including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval-enhancement can be extended to a broader spectrum of machine learning (ML) such as computer vision, time series prediction, and computational biology. Therefore, this work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notations which is missing from the current literature. Also, we found that while a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between the seminal IR research and contemporary REML studies by investigating each component that comprises the REML framework. Ultimately, the goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.	
翻訳日:2024-11-08 20:25:29 公開日:2024-10-18
# 交差するワッサースタインボールによる分布的および逆ロバストなロジスティック回帰 Distributionally and Adversarially Robust Logistic Regression via Intersecting Wasserstein Balls ( http://arxiv.org/abs/2407.13625v2 ) ライセンス: Link先を確認	Aras Selvi, Eleonora Kreacic, Mohsen Ghassemi, Vamsi Potluru, Tucker Balch, Manuela Veloso,	(参考訳) 逆堅牢最適化(Adversarially robust optimization, ARO)は、テスト中に敵の攻撃に対して防御する訓練モデルのデファクトスタンダードとなっている。しかし、その頑丈さにもかかわらず、これらのモデルはしばしば過度なオーバーフィットに悩まされる。この問題を緩和するために、トレーニングにおける経験的分布を次のように置き換えるなど、いくつかの成功したアプローチが提案されている。一曖昧性集合内の最悪の場合の分布で、AROの分布的堅牢性(DR)に繋がるもの二補助的データセット(例えば、合成、外部、ドメイン外)から派生した経験分布の混合物。最初のアプローチに基づいて、ロジスティック回帰のための ARO のワッサーシュタイン DR を探索し、トラクタブル凸最適化の修正を認めることを示す。第2のアプローチを採用することで,データ生成と補助分布間のワッサーシュタイン距離を推定し,そのあいまいさを補助的データセットから構築したものと交差させることにより,DRフレームワークを強化する。提案手法は,結果の最適化問題を解析し,効率的な解を開発し,標準データセットのベンチマーク手法よりも優れていることを示す。 Adversarially robust optimization (ARO) has become the de facto standard for training models to defend against adversarial attacks during testing. However, despite their robustness, these models often suffer from severe overfitting. To mitigate this issue, several successful approaches have been proposed, including replacing the empirical distribution in training with: (i) a worst-case distribution within an ambiguity set, leading to a distributionally robust (DR) counterpart of ARO; or (ii) a mixture of the empirical distribution with one derived from an auxiliary dataset (e.g., synthetic, external, or out-of-domain). Building on the first approach, we explore the Wasserstein DR counterpart of ARO for logistic regression and show it admits a tractable convex optimization reformulation. Adopting the second approach, we enhance the DR framework by intersecting its ambiguity set with one constructed from an auxiliary dataset, which yields significant improvements when the Wasserstein distance between the data-generating and auxiliary distributions can be estimated. 
We analyze the resulting optimization problem, develop efficient solutions, and show that our method outperforms benchmark approaches on standard datasets.	翻訳日:2024-11-08 20:14:30 公開日:2024-10-18
# クエリコンテキスト信号の活用によるスポンサー検索における検索精度の向上 Improving Retrieval in Sponsored Search by Leveraging Query Context Signals ( http://arxiv.org/abs/2407.14346v2 ) ライセンス: Link先を確認	Akash Kumar Mohankumar, Gururaj K, Gagan Madan, Amit Singh,	(参考訳) ユーザクエリに関する関連する入札キーワードを正確に検索することは、Sponsored Searchでは重要だが、特に短いあいまいなクエリでは難しい。既存の高密度で生成的な検索モデルは、これらのケースにおいて、ニュアンスのあるユーザ意図をキャプチャできないことが多い。そこで本研究では,オンラインキャッシュに格納されたWeb検索結果と大規模言語モデルから得られるリッチなコンテキスト信号でクエリを増強し,クエリ理解を強化する手法を提案する。具体的には、Web検索のタイトルとスニペットを使って、現実世界の情報にクエリを接地し、GPT-4を使って、ユーザの意図を明確にしたクエリの書き直しや説明を生成する。これらの信号はFusion-in-DecoderベースのUnityアーキテクチャを通じて効率よく統合され、高密度かつ生成的な検索と従来の文脈自由モデルと同等の費用がかかる。キャッシュでコンテキストが利用できないシナリオに対処するために、推論中にコンテキスト信号なしでモデルロバスト性や性能を改善するカリキュラム学習戦略であるコンテキストグラシングを導入する。大規模なオフライン実験は、文脈認識アプローチが文脈自由モデルを大幅に上回ることを示した。さらに、160以上の国で有名な検索エンジン上でのオンラインA/Bテストでは、ユーザのエンゲージメントと収益が大幅に改善されている。 Accurately retrieving relevant bid keywords for user queries is critical in Sponsored Search but remains challenging, particularly for short, ambiguous queries. Existing dense and generative retrieval models often fail to capture nuanced user intent in these cases. To address this, we propose an approach to enhance query understanding by augmenting queries with rich contextual signals derived from web search results and large language models, stored in an online cache. Specifically, we use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations that clarify user intent. These signals are efficiently integrated through a Fusion-in-Decoder based Unity architecture, enabling both dense and generative retrieval with serving costs on par with traditional context-free models. To address scenarios where context is unavailable in the cache, we introduce context glancing, a curriculum learning strategy that improves model robustness and performance even without contextual signals during inference. 
Extensive offline experiments demonstrate that our context-aware approach substantially outperforms context-free models. Furthermore, online A/B testing on a prominent search engine across 160+ countries shows significant improvements in user engagement and revenue.	翻訳日:2024-11-08 19:38:31 公開日:2024-10-18
# 量子エンタングルメント、量子テレポーテーション、多線形多項式と幾何学 Quantum Entanglement, Quantum Teleportation, Multilinear Polynomials and Geometry ( http://arxiv.org/abs/2407.17621v3 ) ライセンス: Link先を確認	Juan M. Romero, Emiliano Montoya-Gonzalez, Oscar Velazquez-Alvarado,	(参考訳) 量子絡み合い状態は、分解できない多線型多項式と関連していることを示す。これらの多線型多項式を用いて、絡み合い状態の幾何学的表現を提案する。特に、ベル状態は3次元曲面で幾何学的に表現できる非分解可能実多重線型多項式と関連していることを示す。さらに, この枠組みでは, 量子回路を平面幾何学の幾何学的変換と見なすことができる。この現象は、物質が時空を曲がる重力と類似している。さらに、量子テレポーテーションと多線型多項式を含む演算の類似性を示す。 We show that quantum entanglement states are associated with multilinear polynomials that cannot be factored. By using these multilinear polynomials, we propose a geometric representation for entanglement states. In particular, we show that the Bell states are associated with non-factorable real multilinear polynomials, which can be represented geometrically by three-dimensional surfaces. Furthermore, in this framework, we show that a quantum circuit can be seen as a geometric transformation of plane geometry. This phenomenon is analogous to gravity, where matter curves space-time. In addition, we show an analogy between quantum teleportation and operations involving multilinear polynomials.	翻訳日:2024-11-08 15:12:19 公開日:2024-10-18
# 低エネルギー物質励起における空洞媒介相互作用の一般理論 General theory of cavity-mediated interactions between low-energy matter excitations ( http://arxiv.org/abs/2407.19478v2 ) ライセンス: Link先を確認	Carlos J. Sánchez Martínez, Frieder Lindel, Francisco J. García-Vidal, Johannes Feist,	(参考訳) 超伝導、強磁性、強磁性などの低エネルギー物質特性のキャビティ量子力学技術による操作は、これらの多体集合現象を強化する方法として提案されている。本研究では, 共振器外結合と共振器共振器共振器共振器共振器による低エネルギー物質励起と共振器共振器共振器共振器の有効相互作用について検討する。物質の全偏極密度と磁化密度を考慮した双極子近似を超越して、従来の研究を拡張した。さらに、しばしば無視される反磁性相互作用を包含し、空洞に対しては、非局所性および非相互性を持つ一般的な線形吸収媒体を検討する。この一般的なシナリオにおいても、自由度の物質間の効果的な空洞誘起相互作用は静電気的および静磁的性質であることを示す。このことは、低エネルギーの仮定が成立する物質系の空洞工学におけるマルチモード記述の必要性を裏付けるものである。本研究は, 一般的な光環境が拡張低エネルギー物質励起に与える影響を理論的に研究するための枠組みを提供する。 The manipulation of low-energy matter properties such as superconductivity, ferromagnetism and ferroelectricity via cavity quantum electrodynamics engineering has been suggested as a way to enhance these many-body collective phenomena. In this work, we investigate the effective interactions between low-energy matter excitations induced by the off-resonant coupling with cavity electromagnetic modes. We extend previous work by going beyond the dipole approximation accounting for the full polarization and magnetization densities of matter. We further include the often neglected diamagnetic interaction and, for the cavity, we consider general linear absorbing media with possibly non-local and non-reciprocal response. We demonstrate that, even in this general scenario, the effective cavity-induced interactions between the matter degrees of freedom are of electrostatic and magnetostatic nature. This confirms the necessity of a multimode description for cavity engineering of matter systems where the low-energy assumption holds. Our findings provide a theoretical framework for studying the influence of general optical environments on extended low-energy matter excitations.	翻訳日:2024-11-08 14:27:29 公開日:2024-10-18
# 局所処理によるマルコフ決定過程の実験 Experimenting on Markov Decision Processes with Local Treatments ( http://arxiv.org/abs/2407.19618v2 ) ライセンス: Link先を確認	Shuze Chen, David Simchi-Levi, Chonghuan Wang,	(参考訳) 短期的治療が短期成績に与える影響を評価するためにランダム化実験を利用することは、工業的実践における黄金の基準となっている。しかし、サービスシステムが動的かつパーソナライズされていくにつれて、介入への生涯的露出を通じて、顧客寿命価値などの長期的な累積的な成果の最大化に焦点が移りつつある。このギャップを埋めるために,マルコフ決定過程(MDP)をモデル化した力学系におけるランダム化実験について検討する。我々のゴールは、比較的短期的な観察による長期累積報酬に対する治療・制御政策の影響を評価することである。まず,一般的な治療パターンの効果を評価するための最適推論手法を開発した。さらに, 実世界の処理の多くは, 実用的効率と運用上の便宜のために微粒化され, 局所化される傾向があることを認識し, 非ターゲット状態の情報を共有することで, この局所化構造を利用する方法を提案する。我々の新しい推定器は局所的な処理構造を組み込んだより厳密な下界をマッチングしながら、一般的な処理に対する分散下界を効果的に克服する。さらに, 推定器は, 分散の大きな部分に対して, 試験アーム数の線形化を最適に行うことができる。最後に、制御アームの完全な知識と推論効率をさらに向上させる設計推定器を用いてシナリオを探索する。 Utilizing randomized experiments to evaluate the effect of short-term treatments on the short-term outcomes has been well understood and become the golden standard in industrial practice. However, as service systems become increasingly dynamical and personalized, much focus is shifting toward maximizing long-term cumulative outcomes, such as customer lifetime value, through lifetime exposure to interventions. To bridge this gap, we investigate the randomized experiments within dynamical systems modeled as Markov Decision Processes (MDPs). Our goal is to assess the impact of treatment and control policies on long-term cumulative rewards from relatively short-term observations. We first develop optimal inference techniques for assessing the effects of general treatment patterns. Furthermore, recognizing that many real-world treatments tend to be fine-grained and localized for practical efficiency and operational convenience, we then propose methods to harness this localized structure by sharing information on the non-targeted states. Our new estimator effectively overcomes the variance lower bound for general treatments while matching the more stringent lower bound incorporating the local treatment structure. 
Furthermore, our estimator can optimally achieve a linear reduction with the number of test arms for a major part of the variance. Finally, we explore scenarios with perfect knowledge of the control arm and design estimators that further improve inference efficiency.	翻訳日:2024-11-08 14:27:29 公開日:2024-10-18
# 二次帯域交差を伴う位相相転移におけるキブル・ズールクの挙動 Kibble-Zurek behavior in a topological phase transition with a quadratic band crossing ( http://arxiv.org/abs/2407.19780v2 ) ライセンス: Link先を確認	Huan Yuan, Jinyi Zhang, Shuai Chen, Xiaotian Nie,	(参考訳) Kibble-Zurek (KZ) メカニズムは、連続対称性を破る遷移でシステムを駆動する際のスケーリングの振る舞いを記述している。従来の研究では、KZ様のスケーリング挙動はQi-Wu-Zhangモデル (2D) とSu-Schrieffer-Heegerモデル (1D) のトポロジ的遷移にも関係していることが示されたが、対称性の破れはここでは存在しない。線形帯域交差を持つどちらのモデルも$\nu=1$と$z=1$を与える。線形帯域通過を超えるトポロジカル遷移において、異なる臨界指数が取得できるかどうか疑問である。本研究では,2次帯域交差を持つトポロジカル2次元チェッカーボード格子のKZ挙動について検討する。クリーンシステムにおけるベリー曲率の運動量分布の単純さと、従来のKZ記述とより直感的な類似である混乱系における領域様局所チャーンマーカー構成の実空間解析の2点から検討する。平衡では、相関長は$\nu\simeq 1/2$で分岐する。そして、トポロジカル位相遷移でゆっくりと系を焼くことで、フリーズアウト時間 $t_\mathrm{f}$ と未凍長スケール $\xi(t_\mathrm{f})$ が KZ のスケーリングを満足し、$z\simeq 2$ を検証できることが分かる。その後、他の高次帯域通過と位相相転移におけるKZ挙動を探索し、臨界指数と順序の関係を見出す。我々の結果は、KZ機構と非平衡トポロジカル相転移の理解を拡大する。 Kibble-Zurek (KZ) mechanism describes the scaling behavior when driving a system across a continuous symmetry-breaking transition. Previous studies have shown that the KZ-like scaling behavior also lies in the topological transitions in the Qi-Wu-Zhang model (2D) and the Su-Schrieffer-Heeger model (1D), although symmetry breaking does not exist here. Both models with linear band crossings give that $\nu=1$ and $z=1$. We wonder whether different critical exponents can be acquired in topological transitions beyond linear band crossing. In this work, we look into the KZ behavior in a topological 2D checkerboard lattice with a quadratic band crossing. We investigate from dual perspectives: momentum distribution of the Berry curvature in clean systems for simplicity, and real-space analysis of domain-like local Chern marker configurations in disordered systems, which is a more intuitive analog to conventional KZ description. In equilibrium, we find the correlation length diverges with a power $\nu\simeq 1/2$. 
Then, by slowly quenching the system across the topological phase transition, we find that the freeze-out time $t_\mathrm{f}$ and the unfrozen length scale $\xi(t_\mathrm{f})$ both satisfy the KZ scaling, verifying $z\simeq 2$. We subsequently explore KZ behavior in topological phase transitions with other higher-order band crossings and find the relationship between the critical exponents and the order. Our results extend the understanding of the KZ mechanism and non-equilibrium topological phase transitions.	翻訳日:2024-11-08 14:27:29 公開日:2024-10-18
# 非神経モデルにおける創発性:平均勾配外積によるモジュラー算術 Emergence in non-neural models: grokking modular arithmetic via average gradient outer product ( http://arxiv.org/abs/2407.20199v2 ) ライセンス: Link先を確認	Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin,	(参考訳) モジュラー演算タスクを解くために訓練されたニューラルネットワークは、モデルがトレーニングプロセスで100%のトレーニング精度を達成した後、テスト精度が長く改善し始める現象であるグラッキングを示す。モデル能力は相転移を通じて急激に現れます。本研究では,グルーキング現象はニューラルネットワークや勾配降下に基づく最適化に特有ではないことを示す。具体的には、一般的な機械学習モデルを用いてタスク固有の特徴学習を可能にするために、平均勾配外積(AGOP)を用いた反復アルゴリズムであるRecursive Feature Machines (RFM) を用いてモジュラー算術を学習する際に、この現象が生じることを示す。カーネルマシンと組み合わせて使用すると、RCMを繰り返すと、ランダムにほぼゼロに近いテスト精度から完全なテスト精度へ素早く移行する。この移行は、同じゼロのトレーニング損失や、初期イテレーションで一定であるテスト損失から予測することはできない。 RFMは徐々にブロック循環機能を学び、モジュラー演算を解く。 RFMの結果と並行して、モジュラー演算を解くニューラルネットワークもブロック循環の特徴を学習することを示した。さらに, ニューラルネットワークがこれらの課題から学習する一般化解として提案されるフーリエ乗算アルゴリズムの実装に, RFMがそのようなブロック循環的特徴を用いるという理論的証拠を示す。この結果から,出現はタスク関連の特徴を学習することによるものであり,ニューラルアーキテクチャや勾配降下に基づく最適化手法に特有ではないことが示唆された。さらに、我々の研究は、ニューラルネットワークにおける特徴学習の鍵となるメカニズムとしてAGOPのさらなる証拠を提供する。 Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. 
This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.	翻訳日:2024-11-08 14:16:02 公開日:2024-10-18
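The Average Gradient Outer Product named in the abstract above can be illustrated with a small sketch. It assumes the usual definition, the average over data points of J(x)^T J(x) where J is the Jacobian of the predictor, and approximates J by central finite differences. The function name `agop` and the step size `eps` are hypothetical choices for illustration. This is only the AGOP statistic, not the RFM algorithm or the authors' implementation.

```python
import numpy as np

def agop(f, X, eps=1e-5):
    """Average gradient outer product of f over the rows of X.

    f maps a (d,) vector to a (c,) vector (or a scalar); the result is (d, d).
    The Jacobian is approximated by central finite differences (an assumption
    made here for simplicity; automatic differentiation would also work).
    """
    n, d = X.shape
    total = np.zeros((d, d))
    for x in X:
        columns = []
        for j in range(d):
            step = np.zeros(d)
            step[j] = eps
            # Column j of the Jacobian: derivative of every output w.r.t. x_j
            diff = np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))
            columns.append(diff / (2.0 * eps))
        J = np.stack(columns, axis=1)  # shape (c, d)
        total += J.T @ J               # outer product of gradients, shape (d, d)
    return total / n
```

For a predictor that depends only on the first coordinate, the resulting matrix has its only non-zero entry in the top-left corner, which is the sense in which the AGOP highlights task-relevant input directions.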
# マンバのサーベイ A Survey of Mamba ( http://arxiv.org/abs/2408.01129v4 ) ライセンス: Link先を確認	Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Hui Liu, Xin Xu, Qing Li,	(参考訳) 最も代表的なDL技術の1つとして、トランスフォーマーアーキテクチャは多くの高度なモデル、特に数十億のパラメータからなる大規模言語モデル(LLM)が強化され、ディープラーニングの基盤となっている。素晴らしい成果にもかかわらず、トランスフォーマーは依然として固有の制限に直面しており、特に注意計算の2次計算の複雑さから生じる時間を要する推論である。近年、古典的状態空間モデル(SSM)からインスピレーションを得た新しいアーキテクチャであるMambaが、トランスフォーマーに匹敵するモデリング能力を提供しながら、シーケンス長に関するほぼ直線的なスケーラビリティを保ちながら、基礎モデルを構築するための有望な代替手段として登場した。このことが、様々な領域で印象的なパフォーマンスを達成するためのマンバの可能性を積極的に探究する研究を活発に進めるきっかけとなった。このような急速な進化を考えると、既存のマンバ駆動モデルを統合する体系的なレビューが不可欠であり、この新たなモデルアーキテクチャの包括的理解を提供する。そこで本研究では,近年のマンバ関連研究を詳細に調査し,マンバモデルの発展,さまざまなデータにマンバを適応させる技術,およびマンバが優れている応用の3つの主な側面について考察する。具体的には,様々な代表的な深層学習モデルの基礎知識と,Mamba-1&2の詳細について概説する。そして、AIにおけるMambaの重要性を示すために、Mambaモデルのアーキテクチャ設計、データ適応性、アプリケーションに焦点を当てた関連する研究を網羅的にレビューする。最後に,現状の限界について考察し,将来的な研究の方向性を探究し,今後の研究に深い洞察を与える。 As one of the most representative DL techniques, Transformer architecture has empowered numerous advanced models, especially the large language models (LLMs) that comprise billions of parameters, becoming a cornerstone in deep learning. Despite the impressive achievements, Transformers still face inherent limitations, particularly the time-consuming inference resulting from the quadratic computation complexity of attention calculation. Recently, a novel architecture named Mamba, drawing inspiration from classical state space models (SSMs), has emerged as a promising alternative for building foundation models, delivering comparable modeling abilities to Transformers while preserving near-linear scalability concerning sequence length. This has sparked an increasing number of studies actively exploring Mamba's potential to achieve impressive performance across diverse domains. Given such rapid evolution, there is a critical need for a systematic review that consolidates existing Mamba-empowered models, offering a comprehensive understanding of this emerging model architecture. 
In this survey, we therefore conduct an in-depth investigation of recent Mamba-associated studies, covering three main aspects: the advancements of Mamba-based models, the techniques of adapting Mamba to diverse data, and the applications where Mamba can excel. Specifically, we first review the foundational knowledge of various representative deep learning models and the details of Mamba-1&2 as preliminaries. Then, to showcase the significance of Mamba for AI, we comprehensively review the related studies focusing on Mamba models' architecture design, data adaptability, and applications. Finally, we present a discussion of current limitations and explore various promising research directions to provide deeper insights for future investigations.	翻訳日:2024-11-08 13:18:17 公開日:2024-10-18
# Signal-SGN:時間周波数ダイナミクスの学習による骨格行動認識のためのスパイキンググラフ畳み込みネットワーク Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics ( http://arxiv.org/abs/2408.01701v2 ) ライセンス: Link先を確認	Naichuan Zheng, Hailun Xia, Dapeng Liu,	(参考訳) 骨格に基づく行動認識では、グラフ畳み込みネットワーク(GCN)ベースの手法は、その複雑さと高エネルギー消費のために制限に直面している。スパイキングニューラルネットワーク(SNN)は近年、低エネルギー消費で注目を集めているが、GCNとSNNを組み合わせた既存の手法では骨格配列の時間的特性を完全に活用できず、ストレージと計算コストが増大している。この問題に対処するために、骨格配列の時間次元をスパイキング時間ステップとして利用し、特徴を離散確率信号として扱うSignal-SGN(Spiking Graph Convolutional Network)を提案する。ネットワークのコアは1Dスパイキンググラフ畳み込みネットワーク(1D-SGN)と周波数スパイキング畳み込みネットワーク(FSN)で構成されている。 SGNは単一フレーム上でグラフ畳み込みを行い、スパイクネットワーク特性を取り入れてフレーム間時間関係を捉え、FSNはFast Fourier Transform(FFT)と複雑な畳み込みを用いて時間周波数の特徴を抽出する。また,マルチスケールウェーブレット変換機能融合モジュール(MWTF)を導入し,時間信号のスペクトル特性を捉え,モデルの分類能力を向上する。本稿では,時間空間的特徴抽出モジュール(TFSM)を提案する。 NTU RGB+D、NTU RGB+D 120、およびNW-UCLAデータセットに関する多数の実験により、提案モデルは既存のSNNベースの手法を精度良く上回るだけでなく、トレーニング中の計算および記憶コストを低減できることを示した。さらに、対応するGCNベースの手法と比較して競争精度が向上し、非常に顕著である。 In skeletal-based action recognition, Graph Convolutional Networks (GCNs) based methods face limitations due to their complexity and high energy consumption. Spiking Neural Networks (SNNs) have gained attention in recent years for their low energy consumption, but existing methods combining GCNs and SNNs fail to fully utilize the temporal characteristics of skeletal sequences, leading to increased storage and computational costs. To address this issue, we propose a Signal-SGN(Spiking Graph Convolutional Network), which leverages the temporal dimension of skeletal sequences as the spiking timestep and treats features as discrete stochastic signals. The core of the network consists of a 1D Spiking Graph Convolutional Network (1D-SGN) and a Frequency Spiking Convolutional Network (FSN). 
The SGN performs graph convolution on single frames and incorporates spiking network characteristics to capture inter-frame temporal relationships, while the FSN uses Fast Fourier Transform (FFT) and complex convolution to extract temporal-frequency features. We also introduce a multi-scale wavelet transform feature fusion module(MWTF) to capture spectral features of temporal signals, enhancing the model's classification capability. We propose a pluggable temporal-frequency spatial semantic feature extraction module(TFSM) to enhance the model's ability to distinguish features without increasing inference-phase consumption. Our numerous experiments on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets demonstrate that the proposed models not only surpass existing SNN-based methods in accuracy but also reduce computational and storage costs during training. Furthermore, they achieve competitive accuracy compared to corresponding GCN-based methods, which is quite remarkable.	翻訳日:2024-11-08 13:07:08 公開日:2024-10-18
# 観察的time-to-eventデータにおける治療反応サブグループの同定 Identifying treatment response subgroups in observational time-to-event data ( http://arxiv.org/abs/2408.03463v2 ) ライセンス: Link先を確認	Vincent Jeanselme, Chang Ho Yoon, Fabian Falck, Brian Tom, Jessica Barrett,	(参考訳) 治療反応の異なる患者サブグループを特定することは、医療勧告、ガイドライン、将来の臨床試験の設計に役立つ重要な課題である。既存のサブグループ分析のアプローチは、主に治療の割り当てがランダム化されているランダム化比較試験 (Randomised Controlled Trials, RCT) に依存している。 RCTの患者コホートはコストに制約されることが多く、実際の臨床で治療を受ける可能性の高い患者の異質性を代表していない。観察研究に適用すると、サブグループ分析のアプローチは、特に治療が非ランダム化であるために大きな統計バイアスに悩まされる。本研究は、観察研究における治療反応サブグループを特定するための、新しい結果誘導型の手法を提案する。本手法では,各患者を,治療下と対照下の2つのtime-to-event(イベント発生までの時間)分布に関連付けられたサブグループに割り当てる。そのため、個別治療効果の推定と平均治療効果の推定の中間に位置づけられる。本モデルの仮定により, 治療の非ランダム化に起因する統計バイアスは, 逆傾向スコア重み付けによって簡潔に補正される。実験では, ランダム化と観察の両方の治療設定において, 結果誘導型サブグループ分析の現在の最先端手法を大きく上回る結果を得た。 Identifying patient subgroups with different treatment responses is an important task to inform medical recommendations, guidelines, and the design of future clinical trials. Existing approaches for subgroup analysis primarily rely on Randomised Controlled Trials (RCTs), in which treatment assignment is randomised. RCTs' patient cohorts are often constrained by cost, rendering them not representative of the heterogeneity of patients likely to receive treatment in real-world clinical practice. When applied to observational studies, subgroup analysis approaches suffer from significant statistical biases particularly because of the non-randomisation of treatment. Our work introduces a novel, outcome-guided method for identifying treatment response subgroups in observational studies. Our approach assigns each patient to a subgroup associated with two time-to-event distributions: one under treatment and one under control regime. It hence positions itself in between individualised and average treatment effect estimation. The assumptions of our model result in a simple correction of the statistical bias from treatment non-randomisation through inverse propensity weighting. 
In experiments, our approach significantly outperforms the current state-of-the-art method for outcome-guided subgroup analysis in both randomised and observational treatment regimes.	翻訳日:2024-11-08 12:33:46 公開日:2024-10-18
# 観察的time-to-eventデータにおける治療反応サブグループの同定 Identifying treatment response subgroups in observational time-to-event data ( http://arxiv.org/abs/2408.03463v3 ) ライセンス: Link先を確認	Vincent Jeanselme, Chang Ho Yoon, Fabian Falck, Brian Tom, Jessica Barrett,	(参考訳) 治療反応の異なる患者サブグループを特定することは、医療勧告、ガイドライン、将来の臨床試験の設計に役立つ重要な課題である。既存のサブグループ分析のアプローチは、主に治療の割り当てがランダム化されているランダム化比較試験 (Randomised Controlled Trials, RCT) に依存している。 RCTの患者コホートはコストに制約されることが多く、実際の臨床で治療を受ける可能性の高い患者の異質性を代表していない。観察研究に適用すると、サブグループ分析のアプローチは、特に治療が非ランダム化であるために大きな統計バイアスに悩まされる。本研究は、観察研究における治療反応サブグループを特定するための、新しい結果誘導型の手法を提案する。本手法では,各患者を,治療下と対照下の2つのtime-to-event(イベント発生までの時間)分布に関連付けられたサブグループに割り当てる。そのため、個別治療効果の推定と平均治療効果の推定の中間に位置づけられる。本モデルの仮定により, 治療の非ランダム化に起因する統計バイアスは, 逆傾向スコア重み付けによって簡潔に補正される。実験では, ランダム化と観察の両方の治療設定において, 結果誘導型サブグループ分析の現在の最先端手法を大きく上回る結果を得た。 Identifying patient subgroups with different treatment responses is an important task to inform medical recommendations, guidelines, and the design of future clinical trials. Existing approaches for subgroup analysis primarily rely on Randomised Controlled Trials (RCTs), in which treatment assignment is randomised. RCTs' patient cohorts are often constrained by cost, rendering them not representative of the heterogeneity of patients likely to receive treatment in real-world clinical practice. When applied to observational studies, subgroup analysis approaches suffer from significant statistical biases particularly because of the non-randomisation of treatment. Our work introduces a novel, outcome-guided method for identifying treatment response subgroups in observational studies. Our approach assigns each patient to a subgroup associated with two time-to-event distributions: one under treatment and one under control regime. It hence positions itself in between individualised and average treatment effect estimation. The assumptions of our model result in a simple correction of the statistical bias from treatment non-randomisation through inverse propensity weighting. 
In experiments, our approach significantly outperforms the current state-of-the-art method for outcome-guided subgroup analysis in both randomised and observational treatment regimes.	翻訳日:2024-11-08 12:33:46 公開日:2024-10-18
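The inverse propensity weighting mentioned in the abstract above can be sketched generically. The example assumes the propensity scores (probability of receiving treatment given covariates) are already known or estimated elsewhere, and it compares plain outcome means rather than time-to-event distributions. The function names are hypothetical, and this is not the authors' subgroup model.

```python
import numpy as np

def ipw_weights(treated, propensity):
    """Weight 1/e for treated patients and 1/(1 - e) for controls."""
    t = np.asarray(treated, dtype=float)
    e = np.asarray(propensity, dtype=float)
    return t / e + (1.0 - t) / (1.0 - e)

def ipw_mean_difference(outcome, treated, propensity):
    """Difference of weighted outcome means between treated and control groups."""
    y = np.asarray(outcome, dtype=float)
    t = np.asarray(treated, dtype=float)
    w = ipw_weights(t, propensity)
    treated_mean = np.sum(w * t * y) / np.sum(w * t)
    control_mean = np.sum(w * (1.0 - t) * y) / np.sum(w * (1.0 - t))
    return treated_mean - control_mean
```

Patients who were unlikely to receive the treatment they actually got receive larger weights, which is how the reweighted sample mimics a randomised one under the usual no-unmeasured-confounding assumption.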
# P3: LLMトレーニングにおけるデータプルーニングのためのポリシー駆動型、ペース適応型、多様性駆動型フレームワーク P3: A Policy-Driven, Pace-Adaptive, and Diversity-Promoted Framework for data pruning in LLM Training ( http://arxiv.org/abs/2408.05541v2 ) ライセンス: Link先を確認	Yingxuan Yang, Huayi Wang, Muning Wen, Xiaoyun Mo, Qiuying Peng, Jun Wang, Weinan Zhang,	(参考訳) 大規模言語モデル(LLM)の急速に進歩する分野では、モデルの可能性の最大化のために微調整中に既存のデータセットを効果的に活用することが最重要事項である。本稿では、反復データプルーニングによるタスク固有の微調整プロセスの最適化を目的とした適応型フレームワークであるP3を紹介する。 P3は,(1)静的メトリクスを適応性評価に置き換え,モデルのリアルタイムパフォーマンスに基づいてデータの難易度を動的に評価するポリシ駆動の難易度測定,(2)より困難なデータを段階的に導入し,モデル能力を向上するペース適応型選択,(3)決定点プロセス(Determinantal Point Process, DPP)を導入し,エポック間のデータの多様性を保証するための多様性向上,といった3つの要素から構成される。我々は,従来のデータプルーニング手法に対して,P3を推論シナリオであるAPPSとMATHで検証し,大幅な改善を示した。動的データ選択と利用戦略の進歩により、P3はLLMのパフォーマンス改善のために既存のデータを完全に活用する理論的なフレームワークと具体的なアプローチの両方に貢献し、多様なタスクにまたがるユーティリティを提供する。 In the rapidly advancing field of Large Language Models (LLMs), effectively leveraging existing datasets during fine-tuning to maximize the model's potential is of paramount importance. This paper introduces P3, an adaptive framework aimed at optimizing the task-specific fine-tuning process through iterative data pruning. P3 consists of three key components: (1) Policy-driven Difficulty Measurement, which dynamically assesses data difficulty based on the model's real-time performance, replacing static metrics with adaptable evaluations; (2) Pace-Adaptive Selection, leveraging self-paced learning to progressively introduce more challenging data, thereby enhancing model capability; (3) Diversity Promotion, incorporating Determinantal Point Process (DPP) to ensure data diversity across epochs, enriching the learning process. We validate P3 on the reasoning scenarios, APPS and MATH, demonstrating significant improvements over traditional data pruning methods. 
By advancing dynamic data selection and utilization strategies, P3 contributes both a theoretical framework and concrete approach to fully exploit existing data for LLMs' performance improvement, offering utility across diverse tasks.	翻訳日:2024-11-08 11:49:24 公開日:2024-10-18
# Residual-INR: 暗黙的ニューラル表現を用いた通信効率の良いオンデバイス学習 Residual-INR: Communication Efficient On-Device Learning Using Implicit Neural Representation ( http://arxiv.org/abs/2408.05617v2 ) ライセンス: Link先を確認	Hanqiu Chen, Xuebin Yao, Pradeep Subedi, Cong Hao,	(参考訳) エッジコンピューティングは、データ生成源またはその付近でデータを収集、処理する分散コンピューティングパラダイムである。エッジでのデバイス上の学習は、複数のデバイス間でリアルタイムなデータ共有と協調的な意思決定を容易にするデバイス間無線通信に依存している。これにより、エッジコンピューティングシステムの環境変化への適応性が大幅に向上する。しかし、エッジコンピューティングシステムの規模が大きくなるにつれて、無線通信の帯域が限られておりデータ転送の遅延が大きくなるため、デバイス間の通信がボトルネックになっている。本稿では、デバイス間データ伝送の削減とデバイス上での学習の高速化を目的として、暗黙的ニューラル表現(INR)を利用して画像や映像をニューラルネットワークの重みに圧縮する、フォグコンピューティングに基づく通信効率の高いデバイス上学習フレームワークであるResidual-INRを提案する。 Residual-INRは、エッジデバイスからJPEG画像を収集し、フォグノードでINR形式に圧縮し、デバイス上での学習のために再配布することで、データ転送効率を向上させる。画像全体の符号化に小型のINRを用い、残差符号化による高品質なオブジェクト領域再構成に別個のオブジェクトINRを用いることにより、オブジェクトの品質を維持しながら符号化の冗長性を低減できる。 Residual-INRは、10台のエッジデバイスのネットワーク全体でデータ転送を最大5.16倍削減するため、エッジでのデバイス上学習において有望なソリューションである。また、CPUを使わない高速なデバイス上学習を可能にし、精度を犠牲にすることなく最大2.9倍のスピードアップを達成する。私たちのコードは https://github.com/sharclab/Residual-INR で利用可能です。 Edge computing is a distributed computing paradigm that collects and processes data at or near the source of data generation. The on-device learning at edge relies on device-to-device wireless communication to facilitate real-time data sharing and collaborative decision-making among multiple devices. This significantly improves the adaptability of the edge computing system to the changing environments. However, as the scale of the edge computing system is getting larger, communication among devices is becoming the bottleneck because the limited bandwidth of wireless communication leads to large data transfer latency. To reduce the amount of device-to-device data transmission and accelerate on-device learning, in this paper, we propose Residual-INR, a fog computing-based communication-efficient on-device learning framework by utilizing implicit neural representation (INR) to compress images/videos into neural network weights. 
Residual-INR enhances data transfer efficiency by collecting JPEG images from edge devices, compressing them into INR format at the fog node, and redistributing them for on-device learning. By using a smaller INR for full image encoding and a separate object INR for high-quality object region reconstruction through residual encoding, our technique can reduce the encoding redundancy while maintaining the object quality. Residual-INR is a promising solution for edge on-device learning because it reduces data transmission by up to 5.16 x across a network of 10 edge devices. It also facilitates CPU-free accelerated on-device learning, achieving up to 2.9 x speedup without sacrificing accuracy. Our code is available at: https://github.com/sharclab/Residual-INR.	翻訳日:2024-11-08 11:49:24 公開日:2024-10-18
# 大次元カーネル密度推定器 Kernel Density Estimators in Large Dimensions ( http://arxiv.org/abs/2408.05807v3 ) ライセンス: Link先を確認	Giulio Biroli, Marc Mézard,	(参考訳) 本稿では,高次元分布$\rho(x)$に対するカーネル密度推定について検討する。従来のアプローチでは、大量のデータポイント$n$と固定次元$d$の制限に重点を置いてきた。代わりに、データポイントの数$n$$$y_i$とそれらの次元$d$が、固定比$\alpha=(\log n)/d$で成長する状態を分析する。我々の研究は、カーネルベースの密度$\hat \rho_h^{\mathcal {D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$, 帯域幅$h$: 中央極限定理(CLT)が持つ大帯域幅の古典的レジーム。帯域幅の一定の値の下に$h_{CLT}(\alpha)$ とすると、CLTが故障する。 $\hat\rho_h^{\mathcal {D}}(x)$ for a fixed $x$ from $\rho(x)$の統計は、重尾分布(アルファ安定分布)によって与えられる。特に$h_G(\alpha)$ 以下の値では、$\hat\rho_h^{\mathcal {D}}(x)$ は極値統計によって支配される。高次元多変量ガウスデータの詳細な解析を行う。本稿では,Kullback-Leibler分散に基づく帯域幅の最適しきい値が,本論文で同定された新しい統計体系に含まれることを示す。実践者が知っているように、Kernelが推定した帯域幅の減少は、スムーズな曲線から、データポイントを中心としたピークのコレクションへと変化している。本研究により, この現象は, 異なる統計特性を特徴とする相間の急激な遷移に関連し, 高次元環境下でのケルネル密度推定の新しい知見が得られた。 This paper studies Kernel Density Estimation for a high-dimensional distribution $\rho(x)$. Traditional approaches have focused on the limit of large number of data points $n$ and fixed dimension $d$. We analyze instead the regime where both the number $n$ of data points $y_i$ and their dimensionality $d$ grow with a fixed ratio $\alpha=(\log n)/d$. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density $\hat \rho_h^{\mathcal {D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$, depending on the bandwidth $h$: a classical regime for large bandwidth where the Central Limit Theorem (CLT) holds, which is akin to the one found in traditional approaches. Below a certain value of the bandwidth, $h_{CLT}(\alpha)$, we find that the CLT breaks down. The statistics of $\hat\rho_h^{\mathcal {D}}(x)$ for a fixed $x$ drawn from $\rho(x)$ is given by a heavy-tailed distribution (an alpha-stable distribution). 
In particular, below a value $h_G(\alpha)$, we find that $\hat\rho_h^{\mathcal {D}}(x)$ is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. As known by practitioners, when decreasing the bandwidth a kernel-estimated density changes from a smooth curve to a collection of peaks centred on the data points. Our findings reveal that this general phenomenon is related to sharp transitions between phases characterized by different statistical properties, and offer new insights for Kernel density estimation in high-dimensional settings.	翻訳日:2024-11-08 11:49:24 公開日:2024-10-18
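The estimator defined in the abstract above, $\hat \rho_h^{\mathcal {D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$, can be written out as a short sketch. The standard Gaussian kernel, the function names, and the helper `top_weight` are assumptions made for illustration and do not come from the paper. The computation is done in log space because $h^d$ and the kernel values underflow quickly when $d$ is large.

```python
import numpy as np

def kde_log_density(x, data, h):
    """Log of the kernel density estimate at x, with a standard Gaussian kernel.

    x: (d,) query point, data: (n, d) sample points y_i, h: bandwidth > 0.
    """
    n, d = data.shape
    sq = np.sum((data - x) ** 2, axis=1) / (h * h)        # |x - y_i|^2 / h^2
    log_k = -0.5 * sq - 0.5 * d * np.log(2.0 * np.pi)     # log K((x - y_i) / h)
    m = np.max(log_k)
    log_sum = m + np.log(np.sum(np.exp(log_k - m)))       # stable log-sum-exp
    return log_sum - np.log(n) - d * np.log(h)

def top_weight(x, data, h):
    """Fraction of the estimate contributed by the single closest data point."""
    sq = np.sum((data - x) ** 2, axis=1) / (h * h)
    w = np.exp(-0.5 * (sq - sq.min()))                    # weights relative to the closest point
    return float(w.max() / w.sum())
```

Tracking `top_weight` while shrinking `h` gives a rough view of the behaviour the abstract describes: for large bandwidth many points contribute comparably, while for small bandwidth a single point dominates the sum.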
# EasyRec: 勧告のためのシンプルで効果的な言語モデル EasyRec: Simple yet Effective Language Models for Recommendation ( http://arxiv.org/abs/2408.08821v2 ) ライセンス: Link先を確認	Xubin Ren, Chao Huang,	(参考訳) ディープニューラルネットワークは、リコメンダシステムのためのコラボレーティブフィルタリング(CF)において、ユーザとイテムのインタラクションデータから表現を学ぶための強力な技術になっている。しかし、既存の多くのメソッドは、ユニークなユーザIDとアイテムIDに大きく依存しており、十分なトレーニングデータが利用できないような現実的なゼロショット学習シナリオにおいて、うまく機能する能力を制限する。言語モデル(LM)の成功と、その強力な一般化能力に触発されて、重要な疑問が浮かび上がっている。本研究では,テキストに基づく意味理解を協調的な信号とシームレスに統合する,効果的で使いやすいアプローチであるEasyRecを提案する。 EasyRecは、コントラスト学習と協調言語モデルチューニングを組み合わせたテキストビヘイビアアライメントフレームワークを使用して、テキスト強化セマンティックスペースと協調行動情報との強いアライメントを保証する。さまざまな実世界のデータセットにわたる大規模な経験的評価は、特にテキストベースのゼロショットレコメンデーションシナリオにおいて、最先端の代替モデルと比較して、EasyRecの優れたパフォーマンスを示している。さらに、この研究は、プラグイン・アンド・プレイコンポーネントとしてEasyRecをテキスト強化協調フィルタリングフレームワークにシームレスに統合する可能性を強調し、既存のレコメンデーションシステムにより、推奨性能を高め、動的環境における進化するユーザの好みに適応することが可能になる。我々のEasyRecフレームワークの再現性を改善するために、モデル実装の詳細、ソースコード、データセットはリンクで利用可能である。 Deep neural networks have become a powerful technique for learning representations from user-item interaction data in collaborative filtering (CF) for recommender systems. However, many existing methods heavily rely on unique user and item IDs, which limits their ability to perform well in practical zero-shot learning scenarios where sufficient training data may be unavailable. Inspired by the success of language models (LMs) and their strong generalization capabilities, a crucial question arises: How can we harness the potential of language models to empower recommender systems and elevate its generalization capabilities to new heights? In this study, we propose EasyRec - an effective and easy-to-use approach that seamlessly integrates text-based semantic understanding with collaborative signals. EasyRec employs a text-behavior alignment framework, which combines contrastive learning with collaborative language model tuning, to ensure a strong alignment between the text-enhanced semantic space and the collaborative behavior information. 
Extensive empirical evaluations across diverse real-world datasets demonstrate the superior performance of EasyRec compared to state-of-the-art alternative models, particularly in the challenging text-based zero-shot recommendation scenarios. Furthermore, the study highlights the potential of seamlessly integrating EasyRec as a plug-and-play component into text-enhanced collaborative filtering frameworks, thereby empowering existing recommender systems to elevate their recommendation performance and adapt to the evolving user preferences in dynamic environments. For better result reproducibility of our EasyRec framework, the model implementation details, source code, and datasets are available at the link: https://github.com/HKUDS/EasyRec.	翻訳日:2024-11-08 07:18:07 公開日:2024-10-18
# EasyRec: 勧告のためのシンプルで効果的な言語モデル EasyRec: Simple yet Effective Language Models for Recommendation ( http://arxiv.org/abs/2408.08821v3 ) ライセンス: Link先を確認	Xubin Ren, Chao Huang,	(参考訳) ディープニューラルネットワークは、リコメンダシステムのためのコラボレーティブフィルタリング(CF)において、ユーザとイテムのインタラクションデータから表現を学ぶための強力な技術になっている。しかし、既存の多くのメソッドは、ユニークなユーザIDとアイテムIDに大きく依存しており、十分なトレーニングデータが利用できないような現実的なゼロショット学習シナリオにおいて、うまく機能する能力を制限する。言語モデル(LM)の成功と、その強力な一般化能力に触発されて、重要な疑問が浮かび上がっている。本研究では,テキストに基づく意味理解を協調的な信号とシームレスに統合する,効果的で使いやすいアプローチであるEasyRecを提案する。 EasyRecは、コントラスト学習と協調言語モデルチューニングを組み合わせたテキストビヘイビアアライメントフレームワークを使用して、テキスト強化セマンティックスペースと協調行動情報との強いアライメントを保証する。さまざまな実世界のデータセットにわたる大規模な経験的評価は、特にテキストベースのゼロショットレコメンデーションシナリオにおいて、最先端の代替モデルと比較して、EasyRecの優れたパフォーマンスを示している。さらに、この研究は、プラグイン・アンド・プレイコンポーネントとしてEasyRecをテキスト強化協調フィルタリングフレームワークにシームレスに統合する可能性を強調し、既存のレコメンデーションシステムにより、推奨性能を高め、動的環境における進化するユーザの好みに適応することが可能になる。我々のEasyRecフレームワークの再現性を改善するために、モデル実装の詳細、ソースコード、データセットはリンクで利用可能である。 Deep neural networks have become a powerful technique for learning representations from user-item interaction data in collaborative filtering (CF) for recommender systems. However, many existing methods heavily rely on unique user and item IDs, which limits their ability to perform well in practical zero-shot learning scenarios where sufficient training data may be unavailable. Inspired by the success of language models (LMs) and their strong generalization capabilities, a crucial question arises: How can we harness the potential of language models to empower recommender systems and elevate its generalization capabilities to new heights? In this study, we propose EasyRec - an effective and easy-to-use approach that seamlessly integrates text-based semantic understanding with collaborative signals. EasyRec employs a text-behavior alignment framework, which combines contrastive learning with collaborative language model tuning, to ensure a strong alignment between the text-enhanced semantic space and the collaborative behavior information. 
Extensive empirical evaluations across diverse real-world datasets demonstrate the superior performance of EasyRec compared to state-of-the-art alternative models, particularly in the challenging text-based zero-shot recommendation scenarios. Furthermore, the study highlights the potential of seamlessly integrating EasyRec as a plug-and-play component into text-enhanced collaborative filtering frameworks, thereby empowering existing recommender systems to elevate their recommendation performance and adapt to the evolving user preferences in dynamic environments. For better result reproducibility of our EasyRec framework, the model implementation details, source code, and datasets are available at the link: https://github.com/HKUDS/EasyRec.	翻訳日:2024-11-08 07:18:07 公開日:2024-10-18
# SparseGPTのよりタイトな計算量解析 A Tighter Complexity Analysis of SparseGPT ( http://arxiv.org/abs/2408.12151v2 ) ライセンス: Link先を確認	Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song,	(参考訳) 本研究では, SparseGPT [Frantar, Alistarh ICML 2023] の実行時間の解析を, 任意の $a \in [0,1]$ に対して $O(d^{3})$ から $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ に改善した。ここで $\omega$ は行列乗算の指数である。特に、現在の $\omega \approx 2.371$ [Alman, Duan, Williams, Xu, Xu, Zhou 2024] の場合、実行時間は $O(d^{2.53})$ となる。この実行時間は, [Deng, Song, Weinstein 2022; Brand, Song, Zhou ICML 2024] のような反復メンテナンス問題における遅延更新動作の解析によるものである。 In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ for any $a \in [0, 1]$, where $\omega$ is the exponent of matrix multiplication. In particular, for the current $\omega \approx 2.371$ [Alman, Duan, Williams, Xu, Xu, Zhou 2024], our running time boils down to $O(d^{2.53})$. This running time is due to the analysis of the lazy update behavior in iterative maintenance problems such as [Deng, Song, Weinstein 2022; Brand, Song, Zhou ICML 2024].	翻訳日:2024-11-08 05:49:00 公開日:2024-10-18
# 量子レインボー符号 Quantum Rainbow Codes ( http://arxiv.org/abs/2408.13130v2 ) ライセンス: Link先を確認	Thomas R. Scruby, Arthur Pesah, Mark Webster,	(参考訳) カラー符号とピン符号を一般化した新しい量子誤り訂正符号のクラスであるレインボー符号を導入する。レインボー符号は、$0$-単体の有効な$(D+1)$-彩色を許容する任意の$D$次元単体複体上で定義することができる。本稿では, これらの単体複体がハイパーグラフ積を介して得られた鎖複体から導出される場合について詳細に検討し, これらの符号をドメイン壁で結合されたカラー符号の集合として再解釈することにより, 距離と符号化量子ビット数が増大する符号族, および$T$と$T^\dag$のトランスバーサルな適用によって実装される論理非クリフォードゲートが得られることを示す。これらの技法をZhu et al. (arXiv:2310.16982) の準双曲カラー符号と組み合わせることで、トランスバーサルな非クリフォードゲートとパラメータ $[\![n,O(n),O(\log(n))]\!]$ を持つ符号族が得られ、これによりマジック状態の収率パラメータ $\gamma = \log_d(n/k)$ を任意に小さくすることができる。$\gamma \rightarrow 0$ を達成する他の最近の構成とは対照的に、我々の符号は量子ビット上でネイティブに定義されており、LDPC であり、論理非クリフォードゲートは単一量子ビットの(エンタングルさせるのではない)物理演算で実装できるが、漸近的に良い符号ではない。 We introduce rainbow codes, a novel class of quantum error correcting codes generalising colour codes and pin codes. Rainbow codes can be defined on any $D$-dimensional simplicial complex that admits a valid $(D+1)$-colouring of its $0$-simplices. We study in detail the case where these simplicial complexes are derived from chain complexes obtained via the hypergraph product and, by reinterpreting these codes as collections of colour codes joined at domain walls, show that we can obtain code families with growing distance and number of encoded qubits as well as logical non-Clifford gates implemented by transversal application of $T$ and $T^\dag$. By combining these techniques with the quasi-hyperbolic colour codes of Zhu et al. (arXiv:2310.16982) we obtain families of codes with transversal non-Clifford gates and parameters $[\![n,O(n),O(\log(n))]\!]$ which allow the magic-state yield parameter $\gamma = \log_d(n/k)$ to be made arbitrarily small. In contrast to other recent constructions that achieve $\gamma \rightarrow 0$, our codes are natively defined on qubits, are LDPC, and have logical non-Clifford gates implementable by single-qubit (rather than entangling) physical operations, but are not asymptotically good.	翻訳日:2024-11-08 05:26:28 公開日:2024-10-18
# SciLitLLM:科学文献理解のためのLLMの適応方法 SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding ( http://arxiv.org/abs/2408.15545v3 ) ライセンス: Link先を確認	Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai,	(参考訳) 科学文献の理解は、対象とする情報を抽出し、洞察を得るために不可欠であり、科学的発見を大きく前進させる。 LLM(Large Language Models)の顕著な成功にもかかわらず、LLMは主に(1)科学的知識の欠如と(2)専門的な科学的タスクに不慣れであることから、科学文献理解において課題に直面している。本研究では,科学文献理解に特化したLLMを開発するために,継続事前学習(CPT)と教師付き微調整(SFT)を統合したハイブリッド戦略を提案し,科学的ドメイン知識を注入すると同時に,ドメイン固有のタスクに対する指示追従能力を高める。このプロセスでは、(1)高品質なCPTコーパスの構築と(2)多様なSFT命令の生成という2つの重要な課題を特定した。我々は、PDFテキスト抽出、パース内容の誤り訂正、品質フィルタリング、合成命令生成などを含む綿密なパイプラインを通じてこれらの課題に対処する。この戦略を応用して、科学文献理解に特化したLLMのスイートであるSciLitLLMを提示する。これらのモデルは科学文献理解ベンチマークにおいて有望な性能を示す。我々の貢献は3つある。 (1) CPT と SFT を統合して LLM を科学文献理解に適応させる効果的なフレームワークを提示する。このフレームワークは他の領域にも容易に適用できる。 (2) 多様で高品質な科学的命令を生成するためのLLMベースの合成法を提案し, 表現の少ない科学領域における教師付き微調整のための新しい命令セットであるSciLitInsを構築する。 (3) SciLitLLMは,科学文献理解ベンチマークにおいて有望な性能向上を実現している。 Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. 
Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.	翻訳日:2024-11-08 04:30:58 公開日:2024-10-18
# MedDet:効率的な頚椎椎間板ヘルニア検出のための生成的対側蒸留法 MedDet: Generative Adversarial Distillation for Efficient Cervical Disc Herniation Detection ( http://arxiv.org/abs/2409.00204v2 ) ライセンス: Link先を確認	Zeyu Zhang, Nengmin Yi, Shengbo Tan, Ying Cai, Yi Yang, Lei Xu, Qingtai Li, Zhang Yi, Daji Ergu, Yang Zhao,	(参考訳) 頚椎椎間板ヘルニア(Cervical disc herniation, CDH)は、筋骨格障害の1つである。医用画像の自動検出の進歩にもかかわらず、これらの手法の現実的な応用を妨げる2つの大きな課題がある。第一に、計算の複雑さとリソース要求は、リアルタイムアプリケーションにとって大きなギャップを生じさせる。第二に、MRIのノイズは特徴抽出を歪ませることで既存の手法の有効性を低下させる。まず, モデル圧縮と効率向上のために, マルチ教師による単一学習知識の蒸留を活用するMedDetを導入した。さらに、MRIのノイズ耐性を改善するために、2階のnmODEをカスタマイズする。最後に,CDH-1848データセットの総合的な実験を行い,従来の手法と比較して最大5%のmAP改善を実現した。提案手法は,約67.8%のパラメータを,36.9%のFLOPを教師モデルと比較し,推論速度を5倍以上に向上させる。これらの進歩はCDH自動検出の性能と効率を大幅に向上させ、将来的な臨床応用の可能性を示している。プロジェクトのWebサイト https://steve-zeyu-zhang.github.io/MedDet Cervical disc herniation (CDH) is a prevalent musculoskeletal disorder that significantly impacts health and requires labor-intensive analysis from experts. Despite advancements in automated detection of medical imaging, two significant challenges hinder the real-world application of these methods. First, the computational complexity and resource demands present a significant gap for real-time application. Second, noise in MRI reduces the effectiveness of existing methods by distorting feature extraction. To address these challenges, we propose three key contributions: Firstly, we introduced MedDet, which leverages the multi-teacher single-student knowledge distillation for model compression and efficiency, meanwhile integrating generative adversarial training to enhance performance. Additionally, we customize the second-order nmODE to improve the model's resistance to noise in MRI. Lastly, we conducted comprehensive experiments on the CDH-1848 dataset, achieving up to a 5% improvement in mAP compared to previous methods. 
Our approach also delivers over 5 times faster inference speed, with approximately 67.8% reduction in parameters and 36.9% reduction in FLOPs compared to the teacher model. These advancements significantly enhance the performance and efficiency of automated CDH detection, demonstrating promising potential for future application in clinical practice. See project website https://steve-zeyu-zhang.github.io/MedDet	翻訳日:2024-11-08 03:46:25 公開日:2024-10-18
# GraphInsight: グラフ構造理解のための大規模言語モデルのロック解除 GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding ( http://arxiv.org/abs/2409.03258v2 ) ライセンス: Link先を確認	Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou,	(参考訳) 大規模言語モデル(LLM)はグラフ処理の可能性を実証しているが、グラフサイズが大きくなるにつれてグラフ記述シーケンスのプロンプトを通じてグラフィカル構造情報の理解に苦慮している。この課題は「位置バイアス」と呼ばれるグラフ記述配列の異なる位置におけるLLMの不均一メモリ性能に起因する。そこで我々は,マクロおよびマイクロレベルのグラフィカル情報に対するLLMの理解を改善するための新しいフレームワークであるGraphInsightを提案する。 GraphInsightには2つの重要な戦略がある。 1)LLMがより強力なメモリ性能を示す位置に重要なグラフィカル情報を配置し、 2)検索拡張生成(RAG)にインスパイアされた,メモリ性能の低い領域に対する軽量な外部知識ベースの検討。さらに、GraphInsightは、これらの2つの戦略を多段階推論を必要とする複合グラフタスクのLLMエージェントプロセスに統合することを検討している。幅広い評価タスクを持つベンチマークに関する広範な実証研究により、GraphInsightは様々な大きさのグラフ構造の理解において、他のすべてのグラフ記述手法(例えば、プロンプト技術や並べ替え戦略)を著しく上回っていることが示されている。 Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as "positional biases". To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. 
Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.	翻訳日:2024-11-07 23:23:02 公開日:2024-10-18
# 自然言語のプランニングによりコード生成のためのLLM検索が改善 Planning In Natural Language Improves LLM Search For Code Generation ( http://arxiv.org/abs/2409.03733v2 ) ライセンス: Link先を確認	Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang,	(参考訳) 学習時の計算量のスケーリングは大規模言語モデル(LLM)に顕著な改善をもたらしたが、推論時の計算量のスケーリングでは、まだ同様の利得が得られていない。我々は、中核的な欠落成分は多様なLLM出力の欠如であり、モデルが非常によく似ているが誤った生成結果を繰り返しサンプリングするため、非効率な探索につながると仮定する。この多様性の欠如は、問題を解くための候補計画を自然言語で探索することによって緩和できることを実証的に示す。この知見に基づいて,HumanEval+,MBPP+,LiveCodeBench(競技プログラミングのための汚染のないベンチマーク)にまたがる強力な結果を示す新しい検索アルゴリズムであるPlanSearchを提案する。 PlanSearchは、問題に関するさまざまな観察結果を生成し、これらの観測結果を使用して、問題を解決するための計画を構築します。 PlanSearchは、コードソリューションを直接ではなく自然言語で探索することによって、ベースライン検索法よりもはるかに多様な潜在的なソリューションを探索する。Claude 3.5 Sonnet上でPlanSearchを使用することで、LiveCodeBenchで最先端のpass@200 = 77.0%を達成し、検索なしで最高のスコア(pass@1 = 41.4%)と標準繰り返しサンプリング(pass@200 = 60.6%)の両方を上回ります。最後に、分析したすべてのモデル、検索アルゴリズム、およびベンチマークにおいて、生成したアイデアに対する多様性の直接的な関数として検索による性能向上を正確に予測できることを示す。コードは https://github.com/scaleapi/plansearch にある。 While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PlanSearch, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PlanSearch generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. 
By searching over plans in natural language rather than directly over code solutions, PlanSearch explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PlanSearch on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas. Code can be found at https://github.com/scaleapi/plansearch.	翻訳日:2024-11-07 23:11:54 公開日:2024-10-18
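The pass@1 and pass@200 figures above are sampling-based metrics. As a minimal sketch, assuming the standard unbiased pass@k estimator commonly used for code-generation benchmarks (the abstract does not say which estimator the authors used), the metric can be computed from `n` generated samples of which `c` pass the tests. The function name `pass_at_k` is illustrative, not taken from the PlanSearch repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Uses 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k = 1` this reduces to the fraction of correct samples, which is why pass@1 can be read as plain per-sample accuracy.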
# 訓練されたエージェント探索によるインタラクティブな生成環境の学習 Learning Generative Interactive Environments By Trained Agent Exploration ( http://arxiv.org/abs/2409.06445v2 ) ライセンス: Link先を確認	Naser Kazemi, Nedko Savov, Danda Paudel, Luc Van Gool,	(参考訳) 世界モデルは、複雑な環境のルールと行動の解釈とシミュレートにおいて、ますます重要になっている。最近のモデルであるGenieは、視覚的に多様な環境からの学習に優れていますが、コストのかかる人手で収集されたデータに依存しています。ランダムエージェントを用いる代替手法は、環境を探索するには限定的すぎることを観察する。データ生成に強化学習に基づくエージェントを用いてモデルを改善することを提案する。このアプローチは、さまざまなシナリオや環境内の現実的なアクションに対して、モデルを適応し、適切に実行する能力を高める多様なデータセットを生成する。本稿では、Genieをベースにした実装であるGenieReduxモデルを最初にリリースする。また,エージェントから容易に得られる行動を利用して,検証時の行動予測の不確実性を取り除く派生モデルであるGenieRedux-Gも導入する。 Coinrun ケーススタディの再現を含む評価の結果,GenieRedux-G は訓練されたエージェント探索を用いて優れた視覚的忠実度と制御性が得られることが示された。提案されたアプローチは再現可能で、スケーラブルで、新しいタイプの環境に適応できる。私たちのコードベースはhttps://github.com/insait-institute/GenieRedux で公開されています。 World models are increasingly pivotal in interpreting and simulating the rules and actions of complex environments. Genie, a recent model, excels at learning from visually diverse environments but relies on costly human-collected data. We observe that their alternative method of using random agents is too limited to explore the environment. We propose to improve the model by employing reinforcement learning based agents for data generation. This approach produces diverse datasets that enhance the model's ability to adapt and perform well across various scenarios and realistic actions within the environment. In this paper, we first release the model GenieRedux - an implementation based on Genie. Additionally, we introduce GenieRedux-G, a variant that uses the agent's readily available actions to factor out action prediction uncertainty during validation. Our evaluation, including a replication of the Coinrun case study, shows that GenieRedux-G achieves superior visual fidelity and controllability using the trained agent exploration. The proposed approach is reproducible, scalable and adaptable to new types of environments. Our codebase is available at https://github.com/insait-institute/GenieRedux .	翻訳日:2024-11-07 22:16:23 公開日:2024-10-18
# LED:夜間の光深度推定 LED: Light Enhanced Depth Estimation at Night ( http://arxiv.org/abs/2409.08031v2 ) ライセンス: Link先を確認	Simon de Moreau, Yasser Almehio, Andrei Bursuc, Hafid El-Idrissi, Bogdan Stanciulescu, Fabien Moutarde,	(参考訳) 夜間カメラによる深度推定は、特に安全なナビゲーションを確保するために正確な深度認識が不可欠である自律運転アプリケーションにおいて、非常に困難な作業である。夜間における知覚システムの信頼性向上を目指しており、日中のデータで訓練されたモデルは、正確なLiDARセンサーがなければ、しばしば失敗する。本研究は,高精細ヘッドライトによって投影されるパターンを活用することで,低照度環境における奥行き推定を大幅に改善する,新しいコスト効率のアプローチであるLight Enhanced Depth(LED)を紹介する。 LEDは、複数の深度推定アーキテクチャ(エンコーダ-デコーダ、Adabins、DepthFormer)において、合成データセットと実際のデータセットの両方において、大幅なパフォーマンス向上をもたらします。さらに,照明領域を越えた性能向上は,シーン理解の全体的向上を示す。最後に、我々はNighttime Synthetic Drive Datasetをリリースした。Nighttime Synthetic Drive Datasetは、49,990の注釈付き画像からなる、新しい合成的で写真リアルなナイトタイムデータセットである。 Nighttime camera-based depth estimation is a highly challenging task, especially for autonomous driving applications, where accurate depth perception is essential for ensuring safe navigation. We aim to improve the reliability of perception systems at night time, where models trained on daytime data often fail in the absence of precise but costly LiDAR sensors. In this work, we introduce Light Enhanced Depth (LED), a novel cost-effective approach that significantly improves depth estimation in low-light environments by harnessing a pattern projected by high definition headlights available in modern vehicles. LED leads to significant performance boosts across multiple depth-estimation architectures (encoder-decoder, Adabins, DepthFormer) both on synthetic and real datasets. Furthermore, increased performances beyond illuminated areas reveal a holistic enhancement in scene understanding. Finally, we release the Nighttime Synthetic Drive Dataset, a new synthetic and photo-realistic nighttime dataset, which comprises 49,990 comprehensively annotated images.	翻訳日:2024-11-07 21:31:36 公開日:2024-10-18
# 大規模言語モデルに基づく生成誤差補正:音声認識、話者タグ付け、感情認識の課題と基礎 Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition ( http://arxiv.org/abs/2409.09785v3 ) ライセンス: Link先を確認	Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke,	(参考訳) 生成AI技術の最近の進歩を踏まえると、大きな言語モデル(LLM)が、凍結した事前訓練された自動音声認識(ASR)モデルからテキストデコード結果を用いて、音響モデリングタスクをどのように強化できるかが重要な疑問である。音声処理における言語モデリングの新機能を探るため,生成音声の書き起こし誤り訂正(GenSEC)の課題について紹介する。この課題は、ASR後の3つの言語モデリングタスクから成っている。 (i)ASR後の転写補正 (二)話者タグ付け、及び (三)感情認識。これらのタスクは、オープンな事前訓練された言語モデルやエージェントベースのAPIを利用することで、音声ベースのインターフェースを扱う将来のLLMベースのエージェントのエミュレートを目的としている。また,ベースライン評価から得られた知見や,今後の評価設計における教訓についても論じる。 Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.	翻訳日:2024-11-07 20:46:36 公開日:2024-10-18
# CSKV:長期シナリオにおけるKVキャッシュのための訓練効率の良いチャネルスライキング CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios ( http://arxiv.org/abs/2409.10593v3 ) ライセンス: Link先を確認	Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang,	(参考訳) 大きな言語モデル(LLM)は、長いコンテキストタスクを処理するために広く採用されている。しかしながら、キー値(KV)キャッシュの大きなメモリオーバーヘッドは、長期コンテキストシナリオにおいて大きな課題を生じさせる。既存のトレーニング不要なKVキャッシュ圧縮手法は、圧縮限界のある量子化とトークンプルーニングに重点を置いており、過度なスパーシリティによってパフォーマンスが著しく低下する可能性がある。他の手法はKVオーバーヘッドが少ないが、かなりのトレーニングオーバーヘッドを必要とする新しいアーキテクチャを設計する。上記の2つの欠点に対処するため、チャネル次元の冗長性をさらに検討し、少ないトレーニングコストでアーキテクチャレベルの設計を適用する。そこで我々は,KVキャッシュ圧縮のための訓練効率の高いチャネルシンキング手法であるCSKVを紹介した:(1)KVキャッシュの特異値分布をまず解析し,チャネル次元に沿った大きな冗長性と圧縮ポテンシャルを明らかにする。そこで本研究では,鍵層と値層を低階分解し,低次元特徴を記憶する手法を提案する。 2) モデル性能を維持するため,ウィンドウベースフル精度KVキャッシュと低精度圧縮KVキャッシュを含む分岐KVキャッシュを導入する。 (3) トレーニングコストを削減するため, 圧縮KVキャッシュの階層的再構成損失を最小限に抑える。大規模な実験により、CSKVはKVキャッシュのメモリオーバーヘッドを80%削減し、モデルの長期コンテキスト能力を維持できることが示された。さらに,本手法を量子化とシームレスに組み合わせることで,メモリオーバーヘッドをさらに低減し,最大95%の圧縮比が得られることを示す。コードはhttps://github.com/wln20/CSKVで入手できる。 Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. 
Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%. Code is available at https://github.com/wln20/CSKV.	翻訳日:2024-11-07 20:24:11 公開日:2024-10-18
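To make the memory figures above concrete, here is a rough accounting sketch of a bi-branch KV cache. It rests on assumptions that are not in the abstract: the most recent `window` tokens are held only in the full-width branch, all older tokens are held only as channel-shrunk features, and both branches use the same bytes per value (so the additional quantization step is not modelled). All function and parameter names are hypothetical.

```python
def kv_cache_bytes(num_layers: int, seq_len: int, kv_channels: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values (hence the factor 2) for all layers."""
    return 2 * num_layers * seq_len * kv_channels * bytes_per_value


def shrunk_kv_cache_bytes(num_layers: int, seq_len: int, kv_channels: int,
                          keep_ratio: float, window: int,
                          bytes_per_value: int = 2) -> int:
    """Bytes for a hypothetical bi-branch cache.

    Assumption: the last `window` tokens stay at full channel width,
    older tokens are stored with round(kv_channels * keep_ratio) channels.
    """
    recent = min(window, seq_len)          # tokens kept at full width
    older = seq_len - recent               # tokens kept as low-dimension features
    low_dim = round(kv_channels * keep_ratio)
    full_branch = kv_cache_bytes(num_layers, recent, kv_channels, bytes_per_value)
    shrunk_branch = kv_cache_bytes(num_layers, older, low_dim, bytes_per_value)
    return full_branch + shrunk_branch


# Example: 32 layers, 32k-token context, 4096 KV channels, keep 20% of channels.
baseline = kv_cache_bytes(32, 32768, 4096)
compressed = shrunk_kv_cache_bytes(32, 32768, 4096, keep_ratio=0.2, window=32)
saving = 1 - compressed / baseline         # fraction of cache memory saved
```

Under these assumptions, keeping about one fifth of the channels saves close to 80% of the cache for long contexts, because the full-width window becomes negligible as the sequence grows. For sequences shorter than the window there is no saving at all.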
# Fact, Fetch, Reason : Retrieval-Augmented Generation の統一評価 Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation ( http://arxiv.org/abs/2409.12941v1 ) ライセンス: Link先を確認	Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui,	(参考訳) 大規模言語モデル(LLM)は、様々な認知タスクにおいて、大幅なパフォーマンス向上を示している。新たなアプリケーションは、LLMを使用して検索強化世代(RAG)機能を強化している。これらのシステムでは、ユーザクエリを理解し、関連する情報を検索し、一貫性と正確な応答を合成するためにLLMが必要である。このようなシステムの現実的な展開が増加する中、包括的評価が重要となる。そこで本研究では,FRAMES (Factuality, Retrieval, And reasoning Measurement Set) を提案する。以前の作業では、これらの機能を分離して評価するためのデータセットとベンチマークが提供されていたが、FRAMESは、エンドツーエンドのRAGシナリオにおけるLLMパフォーマンスのより明確な図を提供する統一されたフレームワークを提供している。私たちのデータセットは、複数のソースからの情報の統合を必要とする、挑戦的なマルチホップ質問で構成されています。本稿では,最先端のLLMでもこの課題に対処し,0.40の精度で検索を行なわないことを示す。提案した多段階探索パイプラインでは精度が大幅に向上し,0.66(>50%)の精度が得られた。我々は、評価ギャップを埋め、より堅牢で有能なRAGシステムの開発を支援することを願っている。 Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. 
We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.	翻訳日:2024-11-07 12:48:01 公開日:2024-10-18
# Fact, Fetch, Reason : Retrieval-Augmented Generation の統一評価 Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation ( http://arxiv.org/abs/2409.12941v2 ) ライセンス: Link先を確認	Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui,	(参考訳) 大規模言語モデル(LLM)は、様々な認知タスクにおいて、大幅なパフォーマンス向上を示している。新たなアプリケーションは、LLMを使用して検索強化世代(RAG)機能を強化している。これらのシステムでは、ユーザクエリを理解し、関連する情報を検索し、一貫性と正確な応答を合成するためにLLMが必要である。このようなシステムの現実的な展開が増加する中、包括的評価が重要となる。そこで本研究では,FRAMES (Factuality, Retrieval, And reasoning Measurement Set) を提案する。以前の作業では、これらの機能を分離して評価するためのデータセットとベンチマークが提供されていたが、FRAMESは、エンドツーエンドのRAGシナリオにおけるLLMパフォーマンスのより明確な図を提供する統一されたフレームワークを提供している。私たちのデータセットは、複数のソースからの情報の統合を必要とする、挑戦的なマルチホップ質問で構成されています。本稿では,最先端のLLMでもこの課題に対処し,0.40の精度で検索を行なわないことを示す。提案した多段階探索パイプラインでは精度が大幅に向上し,0.66(>50%)の精度が得られた。我々は、評価ギャップを埋め、より堅牢で有能なRAGシステムの開発を支援することを願っている。 Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. 
We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.	翻訳日:2024-11-07 12:48:01 公開日:2024-10-18
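The ">50% improvement" quoted above can be checked from the two accuracies given in the abstract. Reading it as a relative improvement is an assumption, since the abstract does not say which kind is meant:

```latex
\[
\frac{0.66 - 0.40}{0.40} \;=\; \frac{0.26}{0.40} \;=\; 0.65
\]
```

That is a 65% relative gain (26 points in absolute accuracy), which is consistent with the stated ">50%".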
# Video-XL:時間スケールのビデオ理解のための超長尺ビジョン言語モデル Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding ( http://arxiv.org/abs/2409.14485v2 ) ライセンス: Link先を確認	Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao,	(参考訳) 現在のMLLM(Multi-modal Large Language Models)は、ビデオ理解における有望な結果を示しているが、非常に長いビデオの処理は今も進行中の課題である。通常、MLLMはLLMの最大コンテキスト長を超える何千ものトークンを扱うのに苦労し、トークン集約による視覚的明瞭度の低下を経験する。もう一つの課題は、大量のビデオトークンから生じる高い計算コストである。これらの課題に対処するために,時間スケールの効率的な映像理解を目的とした超長尺視覚言語モデルであるVideo-XLを提案する。具体的には、LLMを効果的な視覚コンデンサとして適用できることを論じ、視覚コンテキストを高度にコンパクトな形式に凝縮する視覚コンテキストラテント要約を導入する。広範にわたる実験により,限られた画像データで学習したにもかかわらず,一般的な長尺ビデオ理解ベンチマークで有望な結果が得られた。さらに、Video-XLは効率と有効性の有望なバランスを実現し、単一の80GB GPU上で1024フレームを処理しながら、Needle-in-a-Haystack評価においてほぼ100%の精度を達成している。我々は、ビデオ要約、監視異常検出、広告配置識別などの長尺ビデオアプリケーションにとって、Video-XLが貴重なツールになることを期待している。 Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of tokens that exceed the maximum context length of LLMs, and they experience reduced visual clarity due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and introduce Visual Context Latent Summarization, which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks, despite being trained on limited image data. Moreover, Video-XL strikes a promising balance between efficiency and effectiveness, processing 1024 frames on a single 80GB GPU while achieving nearly 100% accuracy in the Needle-in-a-Haystack evaluation. 
We envision Video-XL becoming a valuable tool for long video applications such as video summarization, surveillance anomaly detection, and Ad placement identification.	翻訳日:2024-11-06 22:30:40 公開日:2024-10-18
# Video-XL:時間スケールのビデオ理解のための超長尺ビジョン言語モデル Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding ( http://arxiv.org/abs/2409.14485v3 ) ライセンス: Link先を確認	Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao,	(参考訳) 現在のMLLM(Multi-modal Large Language Models)は、ビデオ理解における有望な結果を示しているが、非常に長いビデオの処理は今も進行中の課題である。一般的にMLLMは、最大コンテキスト長を超える数千の視覚トークンを扱うのに苦労し、トークン集約による情報減衰に悩まされる。もう一つの課題は、大量のビデオトークンから生じる高い計算コストである。これらの課題に対処するために,時間スケールの効率的な映像理解を目的とした超長尺視覚言語モデルであるVideo-XLを提案する。具体的には、LLMを効果的なビジュアルコンデンサとして適用できることを論じ、視覚コンテキストを高度にコンパクトな形式に凝縮する視覚コンテキストラテント要約を提案する。広範にわたる実験により,我々のモデルは,一般的な長尺ビデオ理解ベンチマークにおいて有望な結果が得られることを示した。例えば、Video-XLはVNBench上の現在の最先端の手法を精度で10%近く上回る。さらに、Video-XLは効率と有効性の優れたバランスを示し、単一の80GB GPU上で2048フレームを処理すると同時に、Needle-in-a-Haystack評価において95%近い精度を実現している。 Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of visual tokens that exceed the maximum context length, and they suffer from information decay due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and propose Visual Context Latent Summarization which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks. For example, Video-XL outperforms the current state-of-the-art method on VNBench by nearly 10% in accuracy. Moreover, Video-XL presents an impressive balance between efficiency and effectiveness, processing 2048 frames on a single 80GB GPU while achieving nearly 95% accuracy in the Needle-in-a-Haystack evaluation.	翻訳日:2024-11-06 22:30:40 公開日:2024-10-18
# 大規模言語モデルにおける性・人種・年齢バイアスの評価:職業・犯罪シナリオの比較分析 Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios ( http://arxiv.org/abs/2409.14583v1 ) ライセンス: Link先を確認	Vishal Mirza, Rahul Kulkarni, Aakanksha Jadhav,	(参考訳) LLM(Large Language Models)の最近の進歩は注目されているが、様々な制約のため、企業での広範な導入はまだ限られている。本稿では, LLM におけるバイアスがユーザビリティ, 信頼性, 公平性に与える影響について検討する。研究者はバイアスを軽減するための戦略を開発しており、例えば、デバイアス層、WinogenderやWinobiasのような特別な参照データセット、人間からのフィードバックによる強化学習(RLHF)などがある。これらの技術は最新のLLMに統合されている。本研究は,2024年に公開された4つのLLM(Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, GPT-4o)を対象に,職業シナリオにおける性別バイアスと,犯罪シナリオにおける性別・年齢・人種バイアスを評価する。 LLMは、様々な職業において、男性よりも女性キャラクターが頻繁に描かれており、米国のBLSデータから37%の偏差が示されている。犯罪シナリオでは、FBIのデータからの偏差は性別が54%、人種が28%、年齢が17%である。我々は、性別と人種的偏見を減らす努力が、しばしば1つのサブクラスを過大評価し、問題を悪化させる可能性がある結果をもたらすことを観察する。これらの結果は、現在のバイアス緩和技術の限界を強調し、より効果的なアプローチの必要性を強調している。 Recent advancements in Large Language Models (LLMs) have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs, a crucial issue affecting their usability, reliability, and fairness. Researchers are developing strategies to mitigate bias, including debiasing layers, specialized reference datasets like Winogender and Winobias, and reinforcement learning with human feedback (RLHF). These techniques have been integrated into the latest LLMs. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. We observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating the issue. 
These results highlight the limitations of current bias mitigation techniques and underscore the need for more effective approaches.	翻訳日:2024-11-06 22:08:18 公開日:2024-10-18
# 大規模言語モデルにおける性・人種・年齢バイアスの評価:職業・犯罪シナリオの比較分析 Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios ( http://arxiv.org/abs/2409.14583v2 ) ライセンス: Link先を確認	Vishal Mirza, Rahul Kulkarni, Aakanksha Jadhav,	(参考訳) LLM(Large Language Models)の最近の進歩は注目されているが、様々な制約のため、企業での広範な導入はまだ限られている。本稿では, LLM におけるバイアスがユーザビリティ, 信頼性, 公平性に与える影響について検討する。研究者はバイアスを軽減するための戦略を開発しており、例えば、デバイアス層、WinogenderやWinobiasのような特別な参照データセット、人間からのフィードバックによる強化学習(RLHF)などがある。これらの技術は最新のLLMに統合されている。本研究は,2024年に公開された4つのLLM(Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, GPT-4o)を対象に,職業シナリオにおける性別バイアスと,犯罪シナリオにおける性別・年齢・人種バイアスを評価する。 LLMは、様々な職業において、男性よりも女性キャラクターが頻繁に描かれており、米国のBLSデータから37%の偏差が示されている。犯罪シナリオでは、FBIのデータからの偏差は性別が54%、人種が28%、年齢が17%である。我々は、性別と人種的偏見を減らす努力が、しばしば1つのサブクラスを過大評価し、問題を悪化させる可能性がある結果をもたらすことを観察する。これらの結果は、現在のバイアス緩和技術の限界を強調し、より効果的なアプローチの必要性を強調している。 Recent advancements in Large Language Models (LLMs) have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs, a crucial issue affecting their usability, reliability, and fairness. Researchers are developing strategies to mitigate bias, including debiasing layers, specialized reference datasets like Winogender and Winobias, and reinforcement learning with human feedback (RLHF). These techniques have been integrated into the latest LLMs. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. We observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating the issue. 
These results highlight the limitations of current bias mitigation techniques and underscore the need for more effective approaches.	翻訳日:2024-11-06 22:08:18 公開日:2024-10-18
# 耳を聴く耳:多モーダル大言語モデルを用いた音の象徴実験 With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models ( http://arxiv.org/abs/2409.14917v2 ) ライセンス: Link先を確認	Tyler Loakman, Yucheng Li, Chenghua Lin,	(参考訳) 近年,Large Language Models (LLMs) とVision Language Models (VLMs) は,精神言語学的な現象を実験する実験において,人間の代替手段としての能力を示している。しかし,視覚やテキストのモダリティにのみアクセス可能なモデルが,正書法や画像のみからの抽象的推論を通じて,暗黙的に音による現象を理解することができるのか,という疑問がある。そこで本研究では,VLM と LLM の音のシンボリズム(すなわち音と概念の非任意リンクの認識)を実証する能力と,オープンかつクローズドなマルチモーダルモデルの言語とヴィジュアルモジュールのインタープレイを通じて「聴く」能力について分析する。我々は,古典的キキ・ブーバとミル・マールの形状と等級記号課題を再現し,言語的象徴性の人間の判断をLLMと比較するなど,複数の実験を行った。以上の結果から, VLMは人体ラベルとの一致のレベルが異なることが示され, サイリコ実験において, VLMと人体ラベルの対応に必要となるタスク情報がより多く必要となる可能性が示唆された。さらに, マグニチュード・シンボリズムは, VLMが形状シンボリズムよりも識別しやすいパターンであり, 言語的象徴性の理解がモデルサイズに大きく依存していることも確認した。 Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated aptitude as potential substitutes for human participants in experiments testing psycholinguistic phenomena. However, an understudied question is to what extent models that only have access to vision and text modalities are able to implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone. To investigate this, we analyse the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise a non-arbitrary link between sounds and concepts) as well as their ability to "hear" via the interplay of the language and vision modules of open and closed-source multimodal models. We perform multiple experiments, including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks, and comparing human judgements of linguistic iconicity with that of LLMs. Our results show that VLMs demonstrate varying levels of agreement with human labels, and more task information may be required for VLMs versus their human counterparts for in silico experimentation. 
We additionally see through higher maximum agreement levels that Magnitude Symbolism is an easier pattern for VLMs to identify than Shape Symbolism, and that an understanding of linguistic iconicity is highly dependent on model size.	翻訳日:2024-11-06 20:39:08 公開日:2024-10-18
# CNNに基づくBi-GRUモデルを用いた英語攻撃テキストの検出 English offensive text detection using CNN based Bi-GRU model ( http://arxiv.org/abs/2409.15652v2 ) ライセンス: Link先を確認	Tonmoy Roy, Md Robiul Islam, Asif Ahammad Miazee, Anika Antara, Al Amin, Sunjim Hossain,	(参考訳) ここ数年、ソーシャルメディアの利用者数は大幅に増加した。人々はよくソーシャルプラットフォームを通じて自分の考えを共有し、これはヘイトコンテンツの増加につながる。この仮想コミュニティでは、個人が自分の見解を共有し、感情を表現し、写真、ビデオ、ブログなどを投稿する。 FacebookやTwitterのようなソーシャルネットワークサイトは、ワンクリックで大量のコンテンツを共有できるプラットフォームを提供している。しかし、これらのプラットフォームはアップロードされたコンテンツに制限を課していない。この問題を解決するために、不適切なコンテンツを分割するためには、新しいアイデアが実装されなければならない。プロセスを自動化するために多くの研究がなされている。本稿では,テキストが攻撃的であるか否かを分類する新しいBi-GRU-CNNモデルを提案する。 Bi-GRUモデルとCNNモデルの組み合わせは、既存のモデルよりも優れている。 Over the years, the number of users of social media has increased drastically. People frequently share their thoughts through social platforms, and this leads to an increase in hate content. In this virtual community, individuals share their views, express their feelings, and post photos, videos, blogs, and more. Social networking sites like Facebook and Twitter provide platforms to share vast amounts of content with a single click. However, these platforms do not impose restrictions on the uploaded content, which may include abusive language and explicit images unsuitable for social media. To resolve this issue, a new idea must be implemented to divide the inappropriate content. Numerous studies have been done to automate the process. In this paper, we propose a new Bi-GRU-CNN model to classify whether the text is offensive or not. The combination of the Bi-GRU and CNN models outperforms the existing model.	翻訳日:2024-11-06 19:32:29 公開日:2024-10-18
# CNNに基づくBi-GRUモデルを用いた英語攻撃テキストの検出 English offensive text detection using CNN based Bi-GRU model ( http://arxiv.org/abs/2409.15652v3 ) ライセンス: Link先を確認	Tonmoy Roy, Md Robiul Islam, Asif Ahammad Miazee, Anika Antara, Al Amin, Sunjim Hossain,	(参考訳) ここ数年、ソーシャルメディアの利用者数は大幅に増加した。人々はよくソーシャルプラットフォームを通じて自分の考えを共有し、これはヘイトコンテンツの増加につながる。この仮想コミュニティでは、個人が自分の見解を共有し、感情を表現し、写真、ビデオ、ブログなどを投稿する。 FacebookやTwitterのようなソーシャルネットワークサイトは、ワンクリックで大量のコンテンツを共有できるプラットフォームを提供している。しかし、これらのプラットフォームはアップロードされたコンテンツに制限を課していない。この問題を解決するために、不適切なコンテンツを分割するためには、新しいアイデアが実装されなければならない。プロセスを自動化するために多くの研究がなされている。本稿では,テキストが攻撃的であるか否かを分類する新しいBi-GRU-CNNモデルを提案する。 Bi-GRUモデルとCNNモデルの組み合わせは、既存のモデルよりも優れている。 Over the years, the number of users of social media has increased drastically. People frequently share their thoughts through social platforms, and this leads to an increase in hate content. In this virtual community, individuals share their views, express their feelings, and post photos, videos, blogs, and more. Social networking sites like Facebook and Twitter provide platforms to share vast amounts of content with a single click. However, these platforms do not impose restrictions on the uploaded content, which may include abusive language and explicit images unsuitable for social media. To resolve this issue, a new idea must be implemented to divide the inappropriate content. Numerous studies have been done to automate the process. In this paper, we propose a new Bi-GRU-CNN model to classify whether the text is offensive or not. The combination of the Bi-GRU and CNN models outperforms the existing model.	翻訳日:2024-11-06 19:32:29 公開日:2024-10-18
# オープンソースソフトウェアにおけるProtestwareに対する開発者の反応: color.js と es5.ext のケース Developer Reactions to Protestware in Open Source Software: The cases of color.js and es5.ext ( http://arxiv.org/abs/2409.15674v2 ) ライセンス: Link先を確認	Youmei Fan, Dong Wang, Supatsara Wattanakriengkrai, Hathaichanok Damrongsiri, Christoph Treude, Hideaki Hata, Raula Gaikovina Kula,	(参考訳) メンテナが政治的または経済的な立場を表明するために自分の成果物を自ら破壊することへの懸念が高まっており、これは「プロテストウェア」と呼ばれる行為である。我々の目的は,このような攻撃に関する議論やコミュニティの受け取り方,開発者がタイムリーに攻撃に反応するかどうかを理解することである。そこで我々は,2つの有名なプロテストウェア事例,すなわち colors.js と es5-ext について検討した。結果として、プロテストウェアの議論はGitHubプラットフォームでより速く広まり、セキュリティ上の脆弱性はソーシャルメディアでより速く広まることが示されている。プロテストウェアの議論の分類を確立させることで、スタンスを表現し、技術的な緩和指示を提供する投稿を特定できる。 684件のプロテストウェア関連投稿にテーマ分析を適用し,議論中の5つの主要なテーマを同定した。i. 拡散と反応, ii. スタンス, iii. 評判, iv. コミュニケーションスタイル, v. 権利と倫理である。この作業は、プロテストウェアをめぐる議論の微妙な状況を明らかにし、研究者と開発者の両方に、開発者の政治的あるいは社会的行動と、オープンソースコミュニティの集合的幸福との間の健全なバランスを維持するための洞察を提供する。 There is growing concern about maintainers self-sabotaging their work in order to take political or economic stances, a practice referred to as "protestware". Our objective is to understand the discourse around discussions on such an attack, how it is received by the community, and whether developers respond to the attack in a timely manner. We study two notable protestware cases, i.e., colors.js and es5-ext. Results indicate that protestware discussions are spread more quickly on the GitHub platform, while security vulnerabilities are faster on social media. By establishing a taxonomy of protestware discussions, we identify posts that express stances and provide technical mitigation instructions. We applied a thematic analysis to 684 protestware related posts to identify five major themes during the discussions: i. disseminate and response, ii. stance, iii. reputation, iv. communicative styles, v. rights and ethics. 
This work sheds light on the nuanced landscape of protestware discussions, offering insights for both researchers and developers into maintaining a healthy balance between the political or social actions of developers and the collective well-being of the open-source community.	翻訳日:2024-11-06 19:32:29 公開日:2024-10-18
# Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns ( http://arxiv.org/abs/2409.15820v2 ) License: see link	Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Ting Liu, Bing Qin,	LLMs' performance on complex tasks is still unsatisfactory. A key issue is that LLMs presently learn in a data-driven schema, while the instructions about these complex tasks are both scarce and hard to collect or construct. On the contrary, a prominent phenomenon is that LLMs can learn rather fast on simpler tasks with adequate prior knowledge captured during the pretraining stage. Thus, if the prerequisite and mechanism of such rapid generalization could be elucidated, it could enhance the efficiency and effectiveness of the LLM's ability to learn complex tasks. In this paper, we therefore employ a gradient-based method to dissect how the SFT process adapts LLMs to downstream tasks from the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples. Based on these insights, experiments are conducted to actually enhance the efficiency and effectiveness of SFT.	Translated: 2024-11-06 19:21:13 Published: 2024-10-18
# A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation ( http://arxiv.org/abs/2409.15911v2 ) License: see link	Xiaoqian Liu, Yangfan Du, Jianjin Wang, Yuan Ge, Chen Xu, Tong Xiao, Guocheng Chen, Jingbo Zhu,	Simultaneous Speech Translation (SimulST) involves generating target language text while continuously processing streaming speech input, presenting significant real-time challenges. Multi-task learning is often employed to enhance SimulST performance but introduces optimization conflicts between primary and auxiliary tasks, potentially compromising overall efficiency. The existing model-level conflict resolution methods are not well-suited for this task, exacerbating inefficiencies and leading to high GPU memory consumption. To address these challenges, we propose a Modular Gradient Conflict Mitigation (MGCM) strategy that detects conflicts at a finer-grained modular level and resolves them utilizing gradient projection. Experimental results demonstrate that MGCM significantly improves SimulST performance, particularly under medium and high latency conditions, achieving a 0.68 BLEU score gain in offline tasks. Additionally, MGCM reduces GPU memory consumption by over 95% compared to other conflict mitigation methods, establishing it as a robust solution for SimulST tasks.	Translated: 2024-11-06 19:21:13 Published: 2024-10-18
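The MGCM entry above describes detecting gradient conflicts per module and resolving them with gradient projection. A minimal sketch of that idea follows, assuming a PCGrad-style projection rule (remove the component of the primary-task gradient that points against the auxiliary-task gradient). The abstract does not spell out MGCM's exact rule, so the rule itself, the function names `project_conflicting` and `mitigate_per_module`, and the module names in the usage note are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equally sized gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflicting(g_main, g_aux):
    """Return g_main, projected off g_aux only when the two conflict.

    A conflict is assumed to mean a negative inner product
    (an illustrative, PCGrad-style criterion).
    """
    d = dot(g_main, g_aux)
    if d >= 0:
        # No conflict (or a zero auxiliary gradient): keep the gradient as is.
        return list(g_main)
    scale = d / dot(g_aux, g_aux)
    return [a - scale * b for a, b in zip(g_main, g_aux)]


def mitigate_per_module(main_grads, aux_grads):
    """Apply the conflict check independently to each module's gradient.

    Both arguments map a module name to that module's flattened gradient.
    """
    return {
        name: project_conflicting(g, aux_grads[name])
        for name, g in main_grads.items()
    }
```

With a hypothetical `encoder` module whose primary gradient is `[1.0, 0.0]` and auxiliary gradient is `[-1.0, 1.0]`, the two conflict and the projected gradient is orthogonal to the auxiliary one. A module whose gradients already agree is left untouched, which is the point of working at module granularity rather than on the whole model.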
# Wide-field microwave magnetic field imaging with nitrogen-vacancy centers in diamond ( http://arxiv.org/abs/2409.16528v2 ) License: see link	Luca Basso, Pauli Kehayias, Jacob Henshaw, Gajadhar Joshi, Michael P. Lilly, Matthew B. Jordan, Andrew M. Mounce,	Non-invasive imaging of microwave (MW) magnetic fields with microscale lateral resolution is pivotal for various applications, such as MW technologies and integrated circuit failure analysis. Diamond nitrogen-vacancy (NV) center magnetometry has emerged as an ideal tool, offering $\mu$m-scale resolution, millimeter-scale field of view, high sensitivity, and non-invasive imaging compatible with diverse samples. However, up until now, it has been predominantly used for imaging of static or low-frequency magnetic fields or, concerning MW field imaging, to directly characterize the same microwave device used to drive the NV spin transitions. In this work we leverage an NV center ensemble in diamond for wide-field imaging of MW magnetic fields generated by a test device, employing a differential measurement protocol. The microscope is equipped with a MW loop to induce Rabi oscillations between NV spin states, and the MW field from the device-under-test (DUT) is measured through local deviations in the Rabi frequency. This differential protocol yields magnetic field maps of a 2.57 GHz MW field with a sensitivity of $\sim$ 9 $\mu$T Hz$^{-1/2}$ for a total measurement duration of $T = 357$ s, covering a $340\times340$ $\mu$m$^2$ field of view with a $\mu$m-scale spatial resolution and a DUT input power dynamic range of 30 dB. This work demonstrates a novel NV magnetometry protocol, based on differential Rabi frequency measurement, that extends NV wide-field imaging capabilities to imaging of weak MW magnetic fields that would be difficult to measure directly through standard NV Rabi magnetometry.	Translated: 2024-11-06 17:30:16 Published: 2024-10-18
# Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification ( http://arxiv.org/abs/2409.17777v2 ) License: see link	Raja Kumar, Raghav Singhal, Pranamya Kulkarni, Deval Mehta, Kshitij Jadhav,	Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.	Translated: 2024-11-06 16:00:56 Published: 2024-10-18
# Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores ( http://arxiv.org/abs/2409.17870v2 ) License: see link	Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang,	Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate our approach's effectiveness, with up to $2.4\times$ speedup in matrix multiplication compared to NVIDIA's CUTLASS. When integrated into LLMs, we achieve up to $6.7\times$ inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.	Translated: 2024-11-06 16:00:56 Published: 2024-10-18
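The entry above mentions a matrix multiplication scheme that decomposes and recovers matrices at the bit level. The sketch below illustrates the general bit-plane idea only, under loud assumptions: it uses plain unsigned integers rather than the paper's bipolar-INT format, pure Python lists rather than GPU Tensor Cores, and hypothetical function names (`bit_planes`, `bitwise_matmul`). Each operand is split into 1-bit matrices, the 1-bit products are computed separately, and the result is recovered by shifting each partial product by the combined bit position.

```python
def matmul(a, b):
    """Plain integer matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def bit_planes(m, bits):
    """Split a matrix of unsigned ints into `bits` binary (0/1) matrices.

    Plane k holds bit k of every entry, so m == sum(plane_k << k).
    """
    return [[[(v >> k) & 1 for v in row] for row in m] for k in range(bits)]


def bitwise_matmul(a, b, bits_a, bits_b):
    """Multiply a and b by combining products of their 1-bit planes."""
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i, plane_a in enumerate(bit_planes(a, bits_a)):
        for j, plane_b in enumerate(bit_planes(b, bits_b)):
            partial = matmul(plane_a, plane_b)  # a 1-bit x 1-bit product
            for r in range(rows):
                for c in range(cols):
                    # Recover the contribution of bit i of a and bit j of b.
                    out[r][c] += partial[r][c] << (i + j)
    return out
```

Because the operand widths `bits_a` and `bits_b` are free parameters, the same routine covers any precision pair (for example 3-bit by 2-bit), which is the flexibility the abstract refers to. On real hardware the inner 1-bit products would map to native low-precision instructions instead of Python loops.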
# Show and Guide: Instructional-Plan Grounded Vision and Language Model ( http://arxiv.org/abs/2409.19074v1 ) License: see link	Diogo Glória-Silva, David Semedo, João Magalhães,	Guiding users through complex procedural plans is an inherently multimodal task in which having visually illustrated plan steps is crucial to delivering effective plan guidance. However, existing works on plan-following language models (LMs) often are not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to multimodal instructional-plans semantic layers, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.	Translated: 2024-11-06 04:30:57 Published: 2024-10-18
# 2D-TPE: Two-Dimensional Positional Encoding Enhances Table Understanding for Large Language Models ( http://arxiv.org/abs/2409.19700v1 ) License: see link	Jia-Nan Li, Jian Guan, Wei Wu, Zhengtao Yu, Rui Yan,	Tables are ubiquitous across various domains for concisely representing structured information. Empowering large language models (LLMs) to reason over tabular data represents an actively explored direction. However, since typical LLMs only support one-dimensional (1D) inputs, existing methods often flatten the two-dimensional (2D) table structure into a sequence of tokens, which can severely disrupt the spatial relationships and result in an inevitable loss of vital contextual information. In this paper, we first empirically demonstrate the detrimental impact of such flattening operations on the performance of LLMs in capturing the spatial information of tables through two elaborate proxy tasks. Subsequently, we introduce a simple yet effective positional encoding method, termed "2D-TPE" (Two-Dimensional Table Positional Encoding), to address this challenge. 2D-TPE enables each attention head to dynamically select a permutation order of tokens within the context for attending to them, where each permutation represents a distinct traversal mode for the table, such as column-wise or row-wise traversal. 2D-TPE effectively mitigates the risk of losing essential spatial information while preserving computational efficiency, thus better preserving the table structure. Extensive experiments across five benchmarks demonstrate that 2D-TPE outperforms strong baselines, underscoring the importance of preserving the table structure for accurate table comprehension. Comprehensive analysis further reveals the substantially better scalability of 2D-TPE to large tables than baselines.	Translated: 2024-11-05 21:29:26 Published: 2024-10-18
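The 2D-TPE entry above speaks of permutation orders that correspond to row-wise or column-wise traversal of a table. As a rough illustration of what those two orders look like, the sketch below assigns each cell of a small table its position index under both traversals. It works at cell level rather than token level and says nothing about how attention heads choose between the orders; the function name `traversal_positions` is hypothetical.

```python
def traversal_positions(n_rows, n_cols):
    """Return per-cell position indices for two traversal modes.

    row_wise[r][c] is the index of cell (r, c) when the table is read
    row by row; col_wise[r][c] is its index when read column by column.
    """
    row_wise = [[r * n_cols + c for c in range(n_cols)]
                for r in range(n_rows)]
    col_wise = [[c * n_rows + r for c in range(n_cols)]
                for r in range(n_rows)]
    return row_wise, col_wise
```

For a 2 by 3 table, the two cells of the first column are three positions apart in the row-wise order but adjacent in the column-wise order. That difference is the kind of spatial relationship a single flattened order cannot express on its own.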
# Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models ( http://arxiv.org/abs/2410.00131v1 ) License: see link	Ji Liu, Jiaxiang Ren, Ruoming Jin, Zijie Zhang, Yang Zhou, Patrick Valduriez, Dejing Dou,	As a promising paradigm to collaboratively train models with decentralized data, Federated Learning (FL) can be exploited to fine-tune Large Language Models (LLMs). Because LLMs are very large and the scale of the training data increases significantly, fine-tuning incurs tremendous computation and communication costs. The training data is generally non-Independent and Identically Distributed (non-IID), which requires adaptive data processing within each device. Although Low Rank Adaptation (LoRA) can significantly reduce the scale of parameters to update in the fine-tuning process, it still takes unaffordable time to transfer the low-rank parameters of all the layers in LLMs. In this paper, we propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods, i.e., adaptive federated curriculum learning and efficient sparse parameter update. First, we propose a Fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process. Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA so as to improve the efficiency of the FL fine-tuning process. Extensive experimental results based on 10 datasets demonstrate that FibecFed yields excellent performance (up to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61% faster) compared with 17 baseline approaches.	Translated: 2024-11-05 14:40:28 Published: 2024-10-18
# Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning ( http://arxiv.org/abs/2410.02052v1 ) License: see link	Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu,	Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS-generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.	Translated: 2024-11-04 09:15:24 Published: 2024-10-18
# Teaching AI Agents to Search with Reflective-MCTS and Exploratory Learning ( http://arxiv.org/abs/2410.02052v2 ) License: see linked page	Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu	Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we present Reflective Monte Carlo Tree Search (R-MCTS) and Exploratory Learning to build o1-like models for agentic applications. We first introduce R-MCTS, a novel test-time algorithm designed to enhance the ability of AI agents to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the experience gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. After Exploratory Learning, GPT-4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R-MCTS's performance while using significantly less compute. Notably, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.	Translated: 2024-11-04 09:15:24 Published: 2024-10-18
# ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning ( http://arxiv.org/abs/2410.02052v3 ) License: see linked page	Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu	Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon tasks. To address these limitations, we present ExACT, an approach to combine test-time search and self-learning to build o1-like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance AI agents' ability to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate for reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge and experience gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. After Exploratory Learning, GPT-4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R-MCTS's performance while using significantly less compute. Notably, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' capabilities for agentic applications via test-time search and self-learning.	Translated: 2024-11-04 09:15:24 Published: 2024-10-18
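The R-MCTS entries above build on standard Monte Carlo Tree Search. As a rough illustration of the baseline being extended, here is a minimal sketch of the usual UCT selection and backpropagation steps. It is not the authors' algorithm: the abstract gives no formulas, contrastive reflection and multi-agent debate are not modelled, and the function names, the node dictionary layout, and the exploration constant `c` are all assumptions made for this example.

```python
import math


def uct_score(child_value_sum, child_visits, parent_visits, c=1.41):
    """Standard UCT score: mean value plus an exploration bonus."""
    if child_visits == 0:
        return float("inf")  # unvisited children are tried first
    mean = child_value_sum / child_visits
    bonus = c * math.sqrt(math.log(parent_visits) / child_visits)
    return mean + bonus


def select_child(children, parent_visits, c=1.41):
    """Return the index of the (value_sum, visits) pair with the highest UCT score."""
    scores = [uct_score(v, n, parent_visits, c) for v, n in children]
    return max(range(len(children)), key=scores.__getitem__)


def backpropagate(path, reward):
    """Add one visit and the reward to every node dict on the path (in place)."""
    for node in path:
        node["visits"] += 1
        node["value_sum"] += reward
```

In a search agent of the kind described, the reward passed to `backpropagate` would come from some state evaluator; per the abstract, R-MCTS obtains it through multi-agent debate, which this sketch leaves out.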
# DomainLynx: Leveraging Large Language Models for Enhanced Domain Squatting Detection ( http://arxiv.org/abs/2410.02095v1 ) License: see linked page	Daiki Chiba, Hiroki Nakano, Takashi Koide	Domain squatting poses a significant threat to Internet security, with attackers employing increasingly sophisticated techniques. This study introduces DomainLynx, an innovative compound AI system leveraging Large Language Models (LLMs) for enhanced domain squatting detection. Unlike existing methods focusing on predefined patterns for top-ranked domains, DomainLynx excels in identifying novel squatting techniques and protecting less prominent brands. The system's architecture integrates advanced data processing, intelligent domain pairing, and LLM-powered threat assessment. Crucially, DomainLynx incorporates specialized components that mitigate LLM hallucinations, ensuring reliable and context-aware detection. This approach enables efficient analysis of vast security data from diverse sources, including Certificate Transparency logs, Passive DNS records, and zone files. Evaluated on a curated dataset of 1,649 squatting domains, DomainLynx achieved 94.7% accuracy using Llama-3-70B. In a month-long real-world test, it detected 34,359 squatting domains from 2.09 million new domains, outperforming baseline methods by 2.5 times. This research advances Internet security by providing a versatile, accurate, and adaptable tool for combating evolving domain squatting threats. DomainLynx's approach paves the way for more robust, AI-driven cybersecurity solutions, enhancing protection for a broader range of online entities and contributing to a safer digital ecosystem.	Translated: 2024-11-04 08:55:37 Published: 2024-10-18
# DomainLynx: Leveraging Large Language Models for Enhanced Domain Squatting Detection ( http://arxiv.org/abs/2410.02095v2 ) License: see linked page	Daiki Chiba, Hiroki Nakano, Takashi Koide	Domain squatting poses a significant threat to Internet security, with attackers employing increasingly sophisticated techniques. This study introduces DomainLynx, an innovative compound AI system leveraging Large Language Models (LLMs) for enhanced domain squatting detection. Unlike existing methods focusing on predefined patterns for top-ranked domains, DomainLynx excels in identifying novel squatting techniques and protecting less prominent brands. The system's architecture integrates advanced data processing, intelligent domain pairing, and LLM-powered threat assessment. Crucially, DomainLynx incorporates specialized components that mitigate LLM hallucinations, ensuring reliable and context-aware detection. This approach enables efficient analysis of vast security data from diverse sources, including Certificate Transparency logs, Passive DNS records, and zone files. Evaluated on a curated dataset of 1,649 squatting domains, DomainLynx achieved 94.7% accuracy using Llama-3-70B. In a month-long real-world test, it detected 34,359 squatting domains from 2.09 million new domains, outperforming baseline methods by 2.5 times. This research advances Internet security by providing a versatile, accurate, and adaptable tool for combating evolving domain squatting threats. DomainLynx's approach paves the way for more robust, AI-driven cybersecurity solutions, enhancing protection for a broader range of online entities and contributing to a safer digital ecosystem.	Translated: 2024-11-04 08:55:37 Published: 2024-10-18
# DomainDynamics: Lifecycle-Aware Risk Timeline Construction for Domain Names ( http://arxiv.org/abs/2410.02096v1 ) License: see linked page	Daiki Chiba, Hiroki Nakano, Takashi Koide	The persistent threat posed by malicious domain names in cyber-attacks underscores the urgent need for effective detection mechanisms. Traditional machine learning methods, while capable of identifying such domains, often suffer from high false positive and false negative rates due to their extensive reliance on historical data. Conventional approaches often overlook the dynamic nature of domain names, the purposes and ownership of which may evolve, potentially rendering risk assessments outdated or irrelevant. To address these shortcomings, we introduce DomainDynamics, a novel system designed to predict domain name risks by considering their lifecycle stages. DomainDynamics constructs a timeline for each domain, evaluating the characteristics of each domain at various points in time to make informed, temporal risk determinations. In an evaluation experiment involving over 85,000 actual malicious domains from malware and phishing incidents, DomainDynamics demonstrated a significant improvement in detection rates, achieving an 82.58% detection rate with a low false positive rate of 0.41%. This performance surpasses that of previous studies and commercial services, improving detection capability substantially.	Translated: 2024-11-04 08:55:37 Published: 2024-10-18
# DomainDynamics: Lifecycle-Aware Risk Timeline Construction for Domain Names ( http://arxiv.org/abs/2410.02096v2 ) License: see linked page	Daiki Chiba, Hiroki Nakano, Takashi Koide	The persistent threat posed by malicious domain names in cyber-attacks underscores the urgent need for effective detection mechanisms. Traditional machine learning methods, while capable of identifying such domains, often suffer from high false positive and false negative rates due to their extensive reliance on historical data. Conventional approaches often overlook the dynamic nature of domain names, the purposes and ownership of which may evolve, potentially rendering risk assessments outdated or irrelevant. To address these shortcomings, we introduce DomainDynamics, a novel system designed to predict domain name risks by considering their lifecycle stages. DomainDynamics constructs a timeline for each domain, evaluating the characteristics of each domain at various points in time to make informed, temporal risk determinations. In an evaluation experiment involving over 85,000 actual malicious domains from malware and phishing incidents, DomainDynamics demonstrated a significant improvement in detection rates, achieving an 82.58% detection rate with a low false positive rate of 0.41%. This performance surpasses that of previous studies and commercial services, improving detection capability substantially.	Translated: 2024-11-04 08:55:37 Published: 2024-10-18
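The DomainDynamics abstract describes building a per-domain timeline and assessing risk at several points in time. The idea can be sketched roughly as below, under heavy assumptions: the abstract names no features or model, so the `(date, features)` observation format, the `score` callback, and the `on_blocklist` field in the usage example are hypothetical and only illustrate scoring each date with the information available up to that date.

```python
from datetime import date


def risk_timeline(observations, score):
    """Build a (date, risk) timeline for one domain.

    observations: list of (date, features) pairs in any order (hypothetical format).
    score: callable mapping the history of feature dicts seen so far to a risk value.
    Each point is scored only on observations dated up to and including it.
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    timeline = []
    for i, (day, _) in enumerate(ordered):
        history = [features for _, features in ordered[: i + 1]]
        timeline.append((day, score(history)))
    return timeline
```

A toy use, with a made-up scorer that looks only at the latest observation: `risk_timeline([(date(2024, 3, 1), {"on_blocklist": True}), (date(2024, 1, 1), {"on_blocklist": False})], lambda h: 1.0 if h[-1]["on_blocklist"] else 0.0)`.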
# DomainHarvester: Harvesting Infrequently Visited Yet Trustworthy Domain Names ( http://arxiv.org/abs/2410.02097v1 ) License: see linked page	Daiki Chiba, Hiroki Nakano, Takashi Koide	In cybersecurity, allow lists play a crucial role in distinguishing safe websites from potential threats. Conventional methods for compiling allow lists, focusing heavily on website popularity, often overlook infrequently visited legitimate domains. This paper introduces DomainHarvester, a system aimed at generating allow lists that include trustworthy yet infrequently visited domains. By adopting an innovative bottom-up methodology that leverages the web's hyperlink structure, DomainHarvester identifies legitimate yet underrepresented domains. The system uses seed URLs to gather domain names, employing machine learning with a Transformer-based approach to assess their trustworthiness. DomainHarvester has developed two distinct allow lists: one with a global focus and another emphasizing local relevance. Compared to six existing top lists, DomainHarvester's allow lists show minimal overlaps, 4% globally and 0.1% locally, while significantly reducing the risk of including malicious domains, thereby enhancing security. The contributions of this research are substantial, illuminating the overlooked aspect of trustworthy yet underrepresented domains and introducing DomainHarvester, a system that goes beyond traditional popularity-based metrics. Our methodology enhances the inclusivity and precision of allow lists, offering significant advantages to users and businesses worldwide, especially in non-English speaking regions.	Translated: 2024-11-04 08:55:37 Published: 2024-10-18
# DomainHarvester: Harvesting Infrequently Visited Yet Trustworthy Domain Names ( http://arxiv.org/abs/2410.02097v2 ) License: see linked page	Daiki Chiba, Hiroki Nakano, Takashi Koide	In cybersecurity, allow lists play a crucial role in distinguishing safe websites from potential threats. Conventional methods for compiling allow lists, focusing heavily on website popularity, often overlook infrequently visited legitimate domains. This paper introduces DomainHarvester, a system aimed at generating allow lists that include trustworthy yet infrequently visited domains. By adopting an innovative bottom-up methodology that leverages the web's hyperlink structure, DomainHarvester identifies legitimate yet underrepresented domains. The system uses seed URLs to gather domain names, employing machine learning with a Transformer-based approach to assess their trustworthiness. DomainHarvester has developed two distinct allow lists: one with a global focus and another emphasizing local relevance. Compared to six existing top lists, DomainHarvester's allow lists show minimal overlaps, 4% globally and 0.1% locally, while significantly reducing the risk of including malicious domains, thereby enhancing security. The contributions of this research are substantial, illuminating the overlooked aspect of trustworthy yet underrepresented domains and introducing DomainHarvester, a system that goes beyond traditional popularity-based metrics. Our methodology enhances the inclusivity and precision of allow lists, offering significant advantages to users and businesses worldwide, especially in non-English speaking regions.	Translated: 2024-11-04 08:55:37 Published: 2024-10-18
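The DomainHarvester abstract reports overlaps of 4% (global) and 0.1% (local) between its allow lists and existing top lists. One plausible way to compute such a figure is sketched below. The abstract does not say which list serves as the denominator, so dividing by the size of the allow list is an assumption, as is the function name.

```python
def overlap_ratio(allow_list, top_list):
    """Fraction of allow-list domains that also appear in the top list.

    Assumption: the allow list is the denominator; the source does not specify.
    Duplicate entries are counted once.
    """
    allow = set(allow_list)
    if not allow:
        return 0.0
    return len(allow & set(top_list)) / len(allow)
```

For example, `overlap_ratio(["a.example", "b.example", "c.example", "d.example"], ["a.example", "z.example"])` compares four allow-listed domains against a two-entry top list with one domain in common.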
# Contextual Document Embeddings ( http://arxiv.org/abs/2410.02525v1 ) License: see linked page	John X. Morris, Alexander M. Rush	Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.	Translated: 2024-11-04 02:41:38 Published: 2024-10-18
# Contextual Document Embeddings ( http://arxiv.org/abs/2410.02525v2 ) License: see linked page	John X. Morris, Alexander M. Rush	Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.	Translated: 2024-11-04 02:41:38 Published: 2024-10-18
# Contextual Document Embeddings ( http://arxiv.org/abs/2410.02525v3 ) License: see linked page	John X. Morris, Alexander M. Rush	Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.	Translated: 2024-11-04 02:41:38 Published: 2024-10-18
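The Contextual Document Embeddings entries above contrast their approach with the usual biencoder training objective. For orientation, here is a minimal pure-Python sketch of that usual objective, an in-batch contrastive (InfoNCE-style) loss. It is not the paper's contextual loss: as we read the abstract, the proposed variant changes which documents share a batch (neighbors), and that step is not shown. The function names and the temperature value are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(queries, docs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    docs[i] is the positive for queries[i]; every other document in the
    batch acts as a negative for that query.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        m = max(logits)  # subtract the max before exponentiating, for stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)
```

With unit vectors, a batch whose queries align with their own documents yields a loss near zero, while swapping the documents yields a large loss. Because the negatives are whatever else is in the batch, batch composition directly shapes what the encoder learns, which is the lever the abstract describes pulling.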
# Action Selection Learning for Multi-label Multi-view Action Recognition ( http://arxiv.org/abs/2410.03302v1 ) License: see linked page	Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide	Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame-level. This study focuses on real-world scenarios where cameras are distributed to capture a wide-range area with only weak labels available at the video-level. We propose the method named MultiASL (Multi-view Action Selection Learning), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder to extract spatial and temporal features from multi-viewpoint videos. Action Selection Learning is employed at the frame-level, using pseudo ground-truth obtained from weak labels at the video-level, to identify the most relevant frames for action recognition. Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods.	Translated: 2024-11-02 23:08:51 Published: 2024-10-18
# Action Selection Learning for Multi-label Multi-view Action Recognition ( http://arxiv.org/abs/2410.03302v2 ) License: see linked page	Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide	Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame-level. This study focuses on real-world scenarios where cameras are distributed to capture a wide-range area with only weak labels available at the video-level. We propose the method named MultiASL (Multi-view Action Selection Learning), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder to extract spatial and temporal features from multi-viewpoint videos. Action Selection Learning is employed at the frame-level, using pseudo ground-truth obtained from weak labels at the video-level, to identify the most relevant frames for action recognition. Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods.	Translated: 2024-11-02 23:08:51 Published: 2024-10-18
# Action Selection Learning for Multi-label Multi-view Action Recognition ( http://arxiv.org/abs/2410.03302v3 ) License: see linked page	Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide	Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame-level. This study focuses on real-world scenarios where cameras are distributed to capture a wide-range area with only weak labels available at the video-level. We propose the method named Multi-view Action Selection Learning (MultiASL), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder to extract spatial and temporal features from multi-viewpoint videos. Action Selection Learning is employed at the frame-level, using pseudo ground-truth obtained from weak labels at the video-level, to identify the most relevant frames for action recognition. Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods. The source code is available at https://github.com/thanhhff/MultiASL/.	Translated: 2024-11-02 23:08:51 Published: 2024-10-18
# Practical Light Clients for Committee-Based Blockchains ( http://arxiv.org/abs/2410.03347v1 ) License: see linked page	Frederik Armknecht, Ghassan Karame, Malcom Mohamed, Christiane Weis	Light clients are gaining increasing attention in the literature since they obviate the need for users to set up dedicated blockchain full nodes. While the literature features a number of light client instantiations, most light client protocols optimize for long offline phases and implicitly assume that the block headers to be verified are signed by highly dynamic validators. In this paper, we show that (i) most light clients are rarely offline for more than a week, and (ii) validators are unlikely to drastically change in most permissioned blockchains and in a number of permissionless blockchains, such as Cosmos and Polkadot. Motivated by these findings, we propose a novel practical system that optimizes for such realistic assumptions and achieves minimal communication and computational costs for light clients when compared to existing protocols. By means of a prototype implementation of our solution, we show that our protocol achieves a reduction by up to $90\times$ and $40000\times$ (respectively) in end-to-end latency and up to $1000\times$ and $10000\times$ (respectively) smaller proof size when compared to two state-of-the-art light client instantiations from the literature.	Translated: 2024-11-02 22:48:52 Published: 2024-10-18
# 委員会ベースのブロックチェーンのための実践的ライトクライアント Practical Light Clients for Committee-Based Blockchains ( http://arxiv.org/abs/2410.03347v2 ) ライセンス: Link先を確認	Frederik Armknecht, Ghassan Karame, Malcom Mohamed, Christiane Weis,	(参考訳) ユーザが専用のブロックチェーンフルノードをセットアップする必要がなくなるため、ライトクライアントは文献において注目を集めている。文献には多数のライトクライアントの実装例があるが、ほとんどのライトクライアントプロトコルは、長いオフライン期間に最適化されており、検証すべきブロックヘッダが非常に動的なバリデータによって署名されていることを暗黙的に仮定している。本稿では、(i)ほとんどのライトクライアントが1週間以上オフラインになることは稀であり、(ii)ほとんどの許可型ブロックチェーンや、CosmosやPolkadotといった多くの非許可型ブロックチェーンでは、バリデータが大幅に変更される可能性は低いことを示す。これらの知見に基づき、このような現実的な仮定に最適化し、既存のプロトコルと比較してライトクライアントの通信コストと計算コストを最小限に抑える、新しい実用的なシステムを提案する。我々のソリューションのプロトタイプ実装により、文献にある2つの最先端ライトクライアント実装と比較して、エンドツーエンドのレイテンシをそれぞれ最大$90$倍と$40000$倍削減し、証明サイズをそれぞれ最大$1000$倍と$10000$倍小さくできることを示す。 Light clients are gaining increasing attention in the literature since they obviate the need for users to set up dedicated blockchain full nodes. While the literature features a number of light client instantiations, most light client protocols optimize for long offline phases and implicitly assume that the block headers to be verified are signed by highly dynamic validators. In this paper, we show that (i) most light clients are rarely offline for more than a week, and (ii) validators are unlikely to drastically change in most permissioned blockchains and in a number of permissionless blockchains, such as Cosmos and Polkadot. Motivated by these findings, we propose a novel practical system that optimizes for such realistic assumptions and achieves minimal communication and computational costs for light clients when compared to existing protocols. By means of a prototype implementation of our solution, we show that our protocol achieves a reduction by up to $90$ and $40000\times$ (respectively) in end-to-end latency and up to $1000$ and $10000\times$ (respectively) smaller proof size when compared to two state-of-the-art light client instantiations from the literature.	翻訳日:2024-11-02 22:48:52 公開日:2024-10-18
# すべての拡散モデルのアクティベーションが識別的特徴として評価されたわけではない Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features ( http://arxiv.org/abs/2410.03558v3 ) ライセンス: Link先を確認	Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Qingming Huang,	(参考訳) 拡散モデルは当初、画像生成のために設計された。近年の研究では、それらのバックボーンの内部シグナルであるアクティベーションが、セマンティックセグメンテーションのような様々な識別タスクの高密度な特徴としても機能することが示されている。多数のアクティベーションが存在するため、小さいが効果的な部分集合を選択することは根本的な問題となる。この目的のために、この分野の初期の研究は、アクティベーションの識別能力を大規模に定量的に比較した。しかし、アテンションスコアの計算に使用されるクエリやキーなど、多くの潜在的なアクティベーションが評価されていないことが判明した。さらに、近年の拡散アーキテクチャの進歩は、組み込みViTモジュール内のものなど、多くの新しいアクティベーションをもたらしている。これらを合わせると、アクティベーションの選択は未解決でありながら見過ごされている。この問題に対処するため、本論文では、はるかに広い範囲のアクティベーションを評価し、さらなる一歩を踏み出す。アクティベーションの大幅な増加を考えると、全面的な定量的比較はもはや現実的ではない。その代わり、これらのアクティベーションの性質を理解し、簡単な定性的評価によって、明確に劣るアクティベーションを事前に除外できるようにすることを目指す。注意深い解析の結果、拡散モデル間で普遍的な3つの性質を発見し、これにより本研究は特定のモデルに限定されないものとなる。その上で、複数の一般的な拡散モデルに対する効果的な特徴選択手法を提示する。最後に、複数の識別タスクにわたる実験により、SOTAの競合手法に対する提案手法の優位性を検証した。コードは https://github.com/Darkbblue/generic-diffusion-feature で公開されている。 Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. 
Considering the significant increase in activations, a full-scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at https://github.com/Darkbblue/generic-diffusion-feature.	翻訳日:2024-11-02 21:39:44 公開日:2024-10-18
# すべての拡散モデルのアクティベーションが識別的特徴として評価されたわけではない Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features ( http://arxiv.org/abs/2410.03558v1 ) ライセンス: Link先を確認	Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Qingming Huang,	(参考訳) 拡散モデルは当初、画像生成のために設計された。近年の研究では、それらのバックボーンの内部シグナルであるアクティベーションが、セマンティックセグメンテーションのような様々な識別タスクの高密度な特徴としても機能することが示されている。多数のアクティベーションが存在するため、小さいが効果的な部分集合を選択することは根本的な問題となる。この目的のために、この分野の初期の研究は、アクティベーションの識別能力を大規模に定量的に比較した。しかし、アテンションスコアの計算に使用されるクエリやキーなど、多くの潜在的なアクティベーションが評価されていないことが判明した。さらに、近年の拡散アーキテクチャの進歩は、組み込みViTモジュール内のものなど、多くの新しいアクティベーションをもたらしている。これらを合わせると、アクティベーションの選択は未解決でありながら見過ごされている。この問題に対処するため、本論文では、はるかに広い範囲のアクティベーションを評価し、さらなる一歩を踏み出す。アクティベーションの大幅な増加を考えると、全面的な定量的比較はもはや現実的ではない。その代わり、これらのアクティベーションの性質を理解し、簡単な定性的評価によって、明確に劣るアクティベーションを事前に除外できるようにすることを目指す。注意深い解析の結果、拡散モデル間で普遍的な3つの性質を発見し、これにより本研究は特定のモデルに限定されないものとなる。その上で、複数の一般的な拡散モデルに対する効果的な特徴選択手法を提示する。最後に、複数の識別タスクにわたる実験により、SOTAの競合手法に対する提案手法の優位性を検証した。コードは https://github.com/Darkbblue/generic-diffusion-feature で公開されている。 Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. 
Considering the significant increase in activations, a full-scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at https://github.com/Darkbblue/generic-diffusion-feature.	翻訳日:2024-11-02 21:29:56 公開日:2024-10-18
# すべての拡散モデルのアクティベーションが識別的特徴として評価されたわけではない Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features ( http://arxiv.org/abs/2410.03558v2 ) ライセンス: Link先を確認	Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Qingming Huang,	(参考訳) 拡散モデルは当初、画像生成のために設計された。近年の研究では、それらのバックボーンの内部シグナルであるアクティベーションが、セマンティックセグメンテーションのような様々な識別タスクの高密度な特徴としても機能することが示されている。多数のアクティベーションが存在するため、小さいが効果的な部分集合を選択することは根本的な問題となる。この目的のために、この分野の初期の研究は、アクティベーションの識別能力を大規模に定量的に比較した。しかし、アテンションスコアの計算に使用されるクエリやキーなど、多くの潜在的なアクティベーションが評価されていないことが判明した。さらに、近年の拡散アーキテクチャの進歩は、組み込みViTモジュール内のものなど、多くの新しいアクティベーションをもたらしている。これらを合わせると、アクティベーションの選択は未解決でありながら見過ごされている。この問題に対処するため、本論文では、はるかに広い範囲のアクティベーションを評価し、さらなる一歩を踏み出す。アクティベーションの大幅な増加を考えると、全面的な定量的比較はもはや現実的ではない。その代わり、これらのアクティベーションの性質を理解し、簡単な定性的評価によって、明確に劣るアクティベーションを事前に除外できるようにすることを目指す。注意深い解析の結果、拡散モデル間で普遍的な3つの性質を発見し、これにより本研究は特定のモデルに限定されないものとなる。その上で、複数の一般的な拡散モデルに対する効果的な特徴選択手法を提示する。最後に、複数の識別タスクにわたる実験により、SOTAの競合手法に対する提案手法の優位性を検証した。コードは https://github.com/Darkbblue/generic-diffusion-feature で公開されている。 Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. 
Considering the significant increase in activations, a full-scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at https://github.com/Darkbblue/generic-diffusion-feature.	翻訳日:2024-11-02 21:29:56 公開日:2024-10-18
# BlockFound: 異常検出のためのカスタムブロックチェーン基盤モデル BlockFound: Customized blockchain foundation model for anomaly detection ( http://arxiv.org/abs/2410.04039v1 ) ライセンス: Link先を確認	Jiahao Yu, Xian Wu, Hao Liu, Wenbo Guo, Xinyu Xing,	(参考訳) 異常なブロックチェーントランザクションの検出のための、カスタマイズされた基盤モデルであるBlockFoundを提案する。ルールベースのシステムに依存したり、既製の大規模言語モデルを直接適用したりする既存の手法とは異なり、BlockFoundは、ブロックチェーントランザクションの独特なデータ構造をモデル化するための、一連のカスタマイズされた設計を導入している。第1に、ブロックチェーントランザクションは、ブロックチェーン固有のトークン、テキスト、数値を含むマルチモーダルなデータである。我々は、これらのマルチモーダル入力を処理するためにモジュール化されたトークナイザを設計し、異なるモダリティ間で情報のバランスをとる。第2に、より長いシーケンスを扱うために、RoPE埋め込みとFlashAttentionを用いた、事前学習のためのカスタマイズされたマスク言語学習機構を設計する。基盤モデルを訓練した後、さらに異常検出のための新しい検出手法を設計する。EthereumとSolanaのトランザクションに関する広範な評価は、偽陽性率を低く保ちながら、異常検出におけるBlockFoundの卓越した能力を示している。注目すべきことに、BlockFoundはSolana上の異常なトランザクションを高精度に検出できた唯一の手法であり、他のすべての手法の検出再現率は非常に低いかゼロであった。本研究は、ブロックチェーンのための新しい基盤モデルを提供するだけでなく、ブロックチェーンデータにLLMを適用するための新しいベンチマークも設定する。 We propose BlockFound, a customized foundation model for anomaly blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf large language models, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized mask language learning mechanism for pretraining with RoPE embedding and FlashAttention for handling longer sequences. After training the foundation model, we further design a novel detection method for anomaly detection. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound's exceptional capability in anomaly detection while maintaining a low false positive rate. 
Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieved very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs in blockchain data.	翻訳日:2024-11-02 14:30:41 公開日:2024-10-18
# BlockFound: 異常検出のためのカスタムブロックチェーン基盤モデル BlockFound: Customized blockchain foundation model for anomaly detection ( http://arxiv.org/abs/2410.04039v2 ) ライセンス: Link先を確認	Jiahao Yu, Xian Wu, Hao Liu, Wenbo Guo, Xinyu Xing,	(参考訳) 異常なブロックチェーントランザクションの検出のための、カスタマイズされた基盤モデルであるBlockFoundを提案する。ルールベースのシステムに依存したり、既製の大規模言語モデルを直接適用したりする既存の手法とは異なり、BlockFoundは、ブロックチェーントランザクションの独特なデータ構造をモデル化するための、一連のカスタマイズされた設計を導入している。第1に、ブロックチェーントランザクションは、ブロックチェーン固有のトークン、テキスト、数値を含むマルチモーダルなデータである。我々は、これらのマルチモーダル入力を処理するためにモジュール化されたトークナイザを設計し、異なるモダリティ間で情報のバランスをとる。第2に、より長いシーケンスを扱うために、RoPE埋め込みとFlashAttentionを用いた、事前学習のためのカスタマイズされたマスク言語学習機構を設計する。基盤モデルを訓練した後、さらに異常検出のための新しい検出手法を設計する。EthereumとSolanaのトランザクションに関する広範な評価は、偽陽性率を低く保ちながら、異常検出におけるBlockFoundの卓越した能力を示している。注目すべきことに、BlockFoundはSolana上の異常なトランザクションを高精度に検出できた唯一の手法であり、他のすべての手法の検出再現率は非常に低いかゼロであった。本研究は、ブロックチェーンのための新しい基盤モデルを提供するだけでなく、ブロックチェーンデータにLLMを適用するための新しいベンチマークも設定する。 We propose BlockFound, a customized foundation model for anomaly blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf large language models, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized mask language learning mechanism for pretraining with RoPE embedding and FlashAttention for handling longer sequences. After training the foundation model, we further design a novel detection method for anomaly detection. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound's exceptional capability in anomaly detection while maintaining a low false positive rate. 
Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieved very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs in blockchain data.	翻訳日:2024-11-02 14:30:41 公開日:2024-10-18
# BlockFound: 異常検出のためのカスタムブロックチェーン基盤モデル BlockFound: Customized blockchain foundation model for anomaly detection ( http://arxiv.org/abs/2410.04039v3 ) ライセンス: Link先を確認	Jiahao Yu, Xian Wu, Hao Liu, Wenbo Guo, Xinyu Xing,	(参考訳) 異常なブロックチェーントランザクションの検出のための、カスタマイズされた基盤モデルであるBlockFoundを提案する。ルールベースのシステムに依存したり、既製の大規模言語モデルを直接適用したりする既存の手法とは異なり、BlockFoundは、ブロックチェーントランザクションの独特なデータ構造をモデル化するための、一連のカスタマイズされた設計を導入している。第1に、ブロックチェーントランザクションは、ブロックチェーン固有のトークン、テキスト、数値を含むマルチモーダルなデータである。我々は、これらのマルチモーダル入力を処理するためにモジュール化されたトークナイザを設計し、異なるモダリティ間で情報のバランスをとる。第2に、より長いシーケンスを扱うために、RoPE埋め込みとFlashAttentionを用いた、事前学習のためのカスタマイズされたマスク言語学習機構を設計する。基盤モデルを訓練した後、さらに異常検出のための新しい検出手法を設計する。EthereumとSolanaのトランザクションに関する広範な評価は、偽陽性率を低く保ちながら、異常検出におけるBlockFoundの卓越した能力を示している。注目すべきことに、BlockFoundはSolana上の異常なトランザクションを高精度に検出できた唯一の手法であり、他のすべての手法の検出再現率は非常に低いかゼロであった。本研究は、ブロックチェーンのための新しい基盤モデルを提供するだけでなく、ブロックチェーンデータにLLMを適用するための新しいベンチマークも設定する。 We propose BlockFound, a customized foundation model for anomaly blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf large language models, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized mask language learning mechanism for pretraining with RoPE embedding and FlashAttention for handling longer sequences. After training the foundation model, we further design a novel detection method for anomaly detection. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound's exceptional capability in anomaly detection while maintaining a low false positive rate. 
Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieved very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs in blockchain data.	翻訳日:2024-11-02 14:30:41 公開日:2024-10-18
# 超多段階: 困難な長文コンテキスト課題の背後にある真実 Hyper-multi-step: The Truth Behind Difficult Long-context Tasks ( http://arxiv.org/abs/2410.04422v1 ) ライセンス: Link先を確認	Yijiong Yu,	(参考訳) 広範なコンテキストウィンドウを特徴とする長文コンテキスト言語モデル(LCLM: Long-context Language Model)は、ますます人気が高まっている。一方、多くの長文コンテキストベンチマークは、最も先進的なLCLMでさえ完遂に苦労する課題を提示している。しかし、様々な困難な長文コンテキスト課題の根本的な原因が研究されることはほとんどなかった。このギャップを埋めるために、我々は実験を行い、その難しさが主に2つの基本的な問題、すなわち複数の項目の同時検索を必要とする「マルチマッチング検索」と、検索基準の中で論理的判断を必要とする「論理ベース検索」に起因することを示す。これら2つの問題は一見単純だが、本質的に超多段階(解決に多数のステップを要する)であることが示されるため、実際にはLCLMの能力を超えている。この発見は、LLMがより高度な長文コンテキスト課題に苦労する理由を説明しうるものであり、解決策を再考するためのより正確な視点を提供する。 Long-context language models (LCLMs), characterized by their extensive context window, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of various challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments to indicate their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within retrieval criteria. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding could explain why LLMs struggle with more advanced long-context tasks, providing a more accurate perspective for rethinking solutions for them.	翻訳日:2024-11-02 07:51:01 公開日:2024-10-18
# 超多段階: 困難な長文コンテキスト課題の背後にある真実 Hyper-multi-step: The Truth Behind Difficult Long-context Tasks ( http://arxiv.org/abs/2410.04422v2 ) ライセンス: Link先を確認	Yijiong Yu,	(参考訳) 広範なコンテキストウィンドウを特徴とする長文コンテキスト言語モデル(LCLM: Long-context Language Model)は、ますます人気が高まっている。一方、多くの長文コンテキストベンチマークは、最も先進的なLCLMでさえ完遂に苦労する課題を提示している。しかし、様々な困難な長文コンテキスト課題の根本的な原因が研究されることはほとんどなかった。このギャップを埋めるために、我々は実験を行い、その難しさが主に2つの基本的な問題、すなわち複数の項目の同時検索を必要とする「マルチマッチング検索」と、検索基準の中で論理的判断を必要とする「論理ベース検索」に起因することを示す。これら2つの問題は一見単純だが、本質的に超多段階(解決に多数のステップを要する)であることが示されるため、実際にはLCLMの能力を超えている。この発見は、LLMがより高度な長文コンテキスト課題に苦労する理由を説明しうるものであり、解決策を再考するためのより正確な視点を提供する。 Long-context language models (LCLMs), characterized by their extensive context window, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of various challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments to indicate their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within retrieval criteria. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding could explain why LLMs struggle with more advanced long-context tasks, providing a more accurate perspective for rethinking solutions for them.	翻訳日:2024-11-02 07:51:01 公開日:2024-10-18
# 超多段階: 困難な長文コンテキスト課題の背後にある真実 Hyper-multi-step: The Truth Behind Difficult Long-context Tasks ( http://arxiv.org/abs/2410.04422v3 ) ライセンス: Link先を確認	Yijiong Yu, Ma Xiufa, Fang Jianwei, Zhi Xu, Su Guangyao, Wang Jiancheng, Yongfeng Huang, Zhixiao Qi, Wei Wang, Weifeng Liu, Ran Chen, Ji Pei,	(参考訳) 広範なコンテキストウィンドウを特徴とする長文コンテキスト言語モデル(LCLM: Long-context Language Model)は、ますます人気が高まっている。一方、多くの長文コンテキストベンチマークは、最も先進的なLCLMでさえ完遂に苦労する課題を提示している。しかし、様々な困難な長文コンテキスト課題の根本的な原因が研究されることはほとんどなかった。このギャップを埋めるために、我々は実験を行い、その難しさが主に2つの基本的な問題、すなわち複数の項目の同時検索を必要とする「マルチマッチング検索」と、検索基準の中で論理的判断を必要とする「論理ベース検索」に起因することを示す。これら2つの問題は一見単純だが、本質的に超多段階(解決に多数のステップを要する)であることが示されるため、実際にはLCLMの能力を超えている。この発見は、LLMがより高度な長文コンテキスト課題に苦労する理由を説明しうるものであり、解決策を再考するためのより正確な視点を提供する。 Long-context language models (LCLMs), characterized by their extensive context window, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of various challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments to indicate their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within retrieval criteria. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding could explain why LLMs struggle with more advanced long-context tasks, providing a more accurate perspective for rethinking solutions for them.	翻訳日:2024-11-02 07:51:01 公開日:2024-10-18
# 超多段階: 困難な長文コンテキスト課題の背後にある真実 Hyper-multi-step: The Truth Behind Difficult Long-context Tasks ( http://arxiv.org/abs/2410.04422v4 ) ライセンス: Link先を確認	Yijiong Yu, Ma Xiufa, Fang Jianwei, Zhi Xu, Su Guangyao, Wang Jiancheng, Yongfeng Huang, Zhixiao Qi, Wei Wang, Weifeng Liu, Ran Chen, Ji Pei,	(参考訳) 広範なコンテキストウィンドウを特徴とする長文コンテキスト言語モデル(LCLM: Long-context Language Model)は、ますます人気が高まっている。一方、多くの長文コンテキストベンチマークは、最も先進的なLCLMでさえ完遂に苦労する課題を提示している。しかし、様々な困難な長文コンテキスト課題の根本的な原因が研究されることはほとんどなかった。このギャップを埋めるために、我々は実験を行い、その難しさが主に2つの基本的な問題、すなわち複数の項目の同時検索を必要とする「マルチマッチング検索」と、検索基準の中で論理的判断を必要とする「論理ベース検索」に起因することを示す。これら2つの問題は一見単純だが、本質的に超多段階(解決に多数のステップを要する)であることが示されるため、実際にはLCLMの能力を超えている。この発見は、LLMがより高度な長文コンテキスト課題に苦労する理由を説明しうるものであり、解決策を再考するためのより正確な視点を提供する。 Long-context language models (LCLMs), characterized by their extensive context window, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of various challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments to indicate their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within retrieval criteria. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding could explain why LLMs struggle with more advanced long-context tasks, providing a more accurate perspective for rethinking solutions for them.	翻訳日:2024-11-02 07:51:01 公開日:2024-10-18
# Gödel Agent: 再帰的自己改善のための自己参照エージェントフレームワーク Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement ( http://arxiv.org/abs/2410.04444v1 ) ライセンス: Link先を確認	Xunjian Yin, Xinyi Wang, Liangming Pan, Xiaojun Wan, William Yang Wang,	(参考訳) 大規模言語モデル(LLM)の急速な進歩により、さまざまなタスクにわたるAI駆動エージェントの能力が大幅に向上した。しかし、固定パイプラインアルゴリズムに基づくものであれ、事前定義されたメタラーニングフレームワークに基づくものであれ、既存のエージェントシステムは、人間が設計したコンポーネントの制約によりエージェント設計空間全体を探索できないため、大域的に最適なエージェント設計を見逃す可能性がある。本稿では、Gödel machineに着想を得た自己進化型フレームワークであるGödel Agentを紹介する。これにより、エージェントは事前定義されたルーチンや固定された最適化アルゴリズムに依存せずに、再帰的に自己改善できる。Gödel AgentはLLMを活用し、プロンプトで与えられる高レベルな目的のみに導かれて、自身のロジックと振る舞いを動的に変更する。数学的推論および複雑なエージェントタスクの実験結果は、Gödel Agentの実装が継続的な自己改善を実現し、性能、効率、汎化性において手作業で設計されたエージェントを上回ることを示している。 The rapid advancement of large language models (LLMs) has significantly enhanced the capabilities of AI-driven agents across various tasks. However, existing agentic systems, whether based on fixed pipeline algorithms or pre-defined meta-learning frameworks, cannot search the whole agent design space due to the restriction of human-designed components, and thus might miss the globally optimal agent design. In this paper, we introduce Gödel Agent, a self-evolving framework inspired by the Gödel machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms. Gödel Agent leverages LLMs to dynamically modify its own logic and behavior, guided solely by high-level objectives through prompting. Experimental results on mathematical reasoning and complex agent tasks demonstrate that implementation of Gödel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.	翻訳日:2024-11-02 07:25:54 公開日:2024-10-18
# Gödel Agent: 再帰的自己改善のための自己参照エージェントフレームワーク Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement ( http://arxiv.org/abs/2410.04444v2 ) ライセンス: Link先を確認	Xunjian Yin, Xinyi Wang, Liangming Pan, Xiaojun Wan, William Yang Wang,	(参考訳) 大規模言語モデル(LLM)の急速な進歩により、さまざまなタスクにわたるAI駆動エージェントの能力が大幅に向上した。しかし、固定パイプラインアルゴリズムに基づくものであれ、事前定義されたメタラーニングフレームワークに基づくものであれ、既存のエージェントシステムは、人間が設計したコンポーネントの制約によりエージェント設計空間全体を探索できないため、大域的に最適なエージェント設計を見逃す可能性がある。本稿では、Gödel machineに着想を得た自己進化型フレームワークであるGödel Agentを紹介する。これにより、エージェントは事前定義されたルーチンや固定された最適化アルゴリズムに依存せずに、再帰的に自己改善できる。Gödel AgentはLLMを活用し、プロンプトで与えられる高レベルな目的のみに導かれて、自身のロジックと振る舞いを動的に変更する。数学的推論および複雑なエージェントタスクの実験結果は、Gödel Agentの実装が継続的な自己改善を実現し、性能、効率、汎化性において手作業で設計されたエージェントを上回ることを示している。 The rapid advancement of large language models (LLMs) has significantly enhanced the capabilities of AI-driven agents across various tasks. However, existing agentic systems, whether based on fixed pipeline algorithms or pre-defined meta-learning frameworks, cannot search the whole agent design space due to the restriction of human-designed components, and thus might miss the globally optimal agent design. In this paper, we introduce Gödel Agent, a self-evolving framework inspired by the Gödel machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms. Gödel Agent leverages LLMs to dynamically modify its own logic and behavior, guided solely by high-level objectives through prompting. Experimental results on mathematical reasoning and complex agent tasks demonstrate that implementation of Gödel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.	翻訳日:2024-11-02 07:25:54 公開日:2024-10-18
# $\textbf{Only-IF}$: 命令の多様性が一般化に与える決定的な効果を明らかにする $\textbf{Only-IF}$:Revealing the Decisive Effect of Instruction Diversity on Generalization ( http://arxiv.org/abs/2410.04717v1 ) ライセンス: Link先を確認	Dylan Zhang, Justin Wang, Francois Charton,	(参考訳) 大規模言語モデル(LLM)が様々なタスクにまたがって効果的であるためには、命令を理解し、正確に従うことが不可欠である。本研究では、モデルが未知の命令に一般化することを可能にする主要な要因を厳密に検討し、インストラクションチューニングのためのデータ収集の指針となる洞察を提供する。チューリング完全なマルコフアルゴリズムに着想を得た制御実験を通して、そのような一般化は、訓練データが意味領域にわたって十分に多様化されている場合に$\textbf{のみ現れる}$ことを示す。また、限られた領域内での多様化だけでは、堅牢な一般化を保証できないことも明らかになった。対照的に、領域をまたいだデータの多様化は、限られたデータ予算の下でも、モデルの適応性を大幅に向上させる。さらに、$\textit{$\textbf{specialist}$}$モデルと$\textit{$\textbf{generalist}$}$モデルの微調整を含む現実世界のシナリオに分析を拡張する。どちらの場合も、1) データサイズを一定に保ちながら既存のデータセットの多様性を高めることで、より良い性能が得られること、2) データをスケールアップする際には、類似データの量を単に増やすよりも、命令の意味を多様化する方が効果的であることを示す。本研究は、特にスペシャリストとジェネラリストの両方のシナリオで訓練データを拡張してモデル性能を最適化する際の、データセットの収集・整理に重要な洞察を提供する。データの多様化を慎重に検討することが鍵であることを示す。すなわち、コアドメインを超えるデータでスペシャリストモデルを訓練すると性能が大幅に向上し、ジェネラリストモデルは、幅広いアプリケーションにわたる全体的な命令追従能力を高める多様なデータ混合の恩恵を受ける。我々の結果は、戦略的多様化の重要な役割を浮き彫りにし、データ品質向上のための明確なガイドラインを提供する。 Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization $\textbf{only emerges}$ when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. 
We further extend our analysis to real-world scenarios, including fine-tuning of $\textit{$\textbf{specialist}$}$ and $\textit{$\textbf{generalist}$}$ models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.	翻訳日:2024-11-02 02:27:38 公開日:2024-10-18
# $\textbf{Only-IF}$: 命令の多様性が一般化に与える決定的な効果を明らかにする $\textbf{Only-IF}$:Revealing the Decisive Effect of Instruction Diversity on Generalization ( http://arxiv.org/abs/2410.04717v2 ) ライセンス: Link先を確認	Dylan Zhang, Justin Wang, Francois Charton,	(参考訳) 大規模言語モデル(LLM)が様々なタスクにまたがって効果的であるためには、命令を理解し、正確に従うことが不可欠である。本研究では、モデルが未知の命令に一般化することを可能にする主要な要因を厳密に検討し、インストラクションチューニングのためのデータ収集の指針となる洞察を提供する。チューリング完全なマルコフアルゴリズムに着想を得た制御実験を通して、そのような一般化は、訓練データが意味領域にわたって十分に多様化されている場合に$\textbf{のみ現れる}$ことを示す。また、限られた領域内での多様化だけでは、堅牢な一般化を保証できないことも明らかになった。対照的に、領域をまたいだデータの多様化は、限られたデータ予算の下でも、モデルの適応性を大幅に向上させる。さらに、$\textit{$\textbf{specialist}$}$モデルと$\textit{$\textbf{generalist}$}$モデルの微調整を含む現実世界のシナリオに分析を拡張する。どちらの場合も、1) データサイズを一定に保ちながら既存のデータセットの多様性を高めることで、より良い性能が得られること、2) データをスケールアップする際には、類似データの量を単に増やすよりも、命令の意味を多様化する方が効果的であることを示す。本研究は、特にスペシャリストとジェネラリストの両方のシナリオで訓練データを拡張してモデル性能を最適化する際の、データセットの収集・整理に重要な洞察を提供する。データの多様化を慎重に検討することが鍵であることを示す。すなわち、コアドメインを超えるデータでスペシャリストモデルを訓練すると性能が大幅に向上し、ジェネラリストモデルは、幅広いアプリケーションにわたる全体的な命令追従能力を高める多様なデータ混合の恩恵を受ける。我々の結果は、戦略的多様化の重要な役割を浮き彫りにし、データ品質向上のための明確なガイドラインを提供する。 Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization $\textbf{only emerges}$ when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. 
We further extend our analysis to real-world scenarios, including fine-tuning of $\textit{$\textbf{specialist}$}$ and $\textit{$\textbf{generalist}$}$ models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.	翻訳日:2024-11-02 02:27:38 公開日:2024-10-18
# $\textbf{Only-IF}$:Revealing the Decisive Effect of Instruction Diversity on Generalization $\textbf{Only-IF}$:Revealing the Decisive Effect of Instruction Diversity on Generalization ( http://arxiv.org/abs/2410.04717v3 ) ライセンス: Link先を確認	Dylan Zhang, Justin Wang, Francois Charton,	(参考訳) 大規模言語モデル(LLM)が様々なタスクにまたがって効果的であるためには、命令を理解し、正確に従うことが不可欠である。本研究では,モデルが未知の命令に一般化するための鍵となる要素を厳格に検討し,インストラクションチューニングのためのデータの収集をガイドするための洞察を提供する。チューリング完全マルコフアルゴリズムにインスパイアされた制御実験を通して、トレーニングデータがセマンティックドメイン間で十分に多様化されている場合にのみ、そのような一般化が現れる($\textbf{only emerges}$)ことを示す。また、限られた領域内での多様化は、堅牢な一般化を保証するのに失敗することが明らかとなった。対照的に、制約付きデータ予算の下でも、ドメイン間のデータの多様化はモデルの適応性を著しく向上させる。分析をさらに現実世界のシナリオに拡張し、$\textit{$\textbf{specialist}$}$と$\textit{$\textbf{generalist}$}$モデルの微調整を含む。どちらの場合も、以下を実証する。 1)データサイズを一定に保ちながら、確立したデータセットの多様性を高めることで、より良いパフォーマンスを実現できる。 2) データのスケールアップにおいて, 命令の意味を多様化することは, 類似データの量を増やすことよりも効果的である。我々の研究は、特にモデルパフォーマンスを最適化する際に、専門家とジェネラリストの両方のシナリオでトレーニングデータを拡張する際に、データセットの照合に重要な洞察を提供する。コアドメインを超えるデータでスペシャリストモデルを訓練するとパフォーマンスが大幅に向上する一方、ジェネラリストモデルは、幅広いアプリケーションにまたがる全体的な命令追従能力を向上する多様なデータミックスの恩恵を受ける。以上の結果から, 戦略的多様化の重要な役割を強調し, データ品質向上のための明確なガイドラインを提供する。 Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization $\textbf{only emerges}$ when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. 
We further extend our analysis to real-world scenarios, including fine-tuning of $\textit{$\textbf{specialist}$}$ and $\textit{$\textbf{generalist}$}$ models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.	翻訳日:2024-11-02 02:27:38 公開日:2024-10-18
# PredFormer: トランスフォーマーは効果的な時空間予測学習者である PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners ( http://arxiv.org/abs/2410.04733v1 ) ライセンス: Link先を確認	Yujin Tang, Lu Qi, Fei Xie, Xiangtai Li, Chao Ma, Ming-Hsuan Yang,	(参考訳) 時空間予測学習法は一般的に2つのカテゴリに分類される: 並列化と性能の課題に直面するリカレントベースアプローチと、エンコーダデコーダアーキテクチャとして畳み込みニューラルネットワーク(CNN)を用いるリカレントフリー手法である。これらの手法は強い帰納バイアスの恩恵を受けるが、スケーラビリティと一般化を犠牲にすることが多い。本稿では、時空間予測学習のための純粋なトランスフォーマーベースのフレームワークであるPredFormerを提案する。Vision Transformer (ViT) の設計に動機づけられたPredFormerは、フル、ファクタ化、インターリーブされた時空間注意を含む3Dの注意機構を包括的に分析したうえで、慎重に設計されたGated Transformerブロックを活用する。 PredFormerは、リカレントフリーでトランスフォーマーベースの設計なので、シンプルで効率的で、従来の方法よりも大幅にパフォーマンスが良い。合成および実世界のデータセットに関する大規模な実験は、PredFormerが最先端のパフォーマンスを達成することを実証している。Moving MNIST では、PredFormer は SimVP と比較して 51.3% の MSE 削減を実現している。 TaxiBJ の場合、MSE は 33.1% 減少し、FPS は 533 から 2364 に増加した。さらにWeatherBenchでは、MSEを11.1%削減し、FPSを196から404に強化している。これらの性能は精度と効率の両方で向上し、現実世界のアプリケーションにPredFormerの可能性を実証している。ソースコードは https://github.com/yyyujintang/PredFormer で公開される。 Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full, factorized, and interleaved spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, outperforming previous methods by large margins. 
Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at https://github.com/yyyujintang/PredFormer.	翻訳日:2024-11-02 02:17:53 公開日:2024-10-18
# PredFormer: トランスフォーマーは効果的な時空間予測学習者である PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners ( http://arxiv.org/abs/2410.04733v2 ) ライセンス: Link先を確認	Yujin Tang, Lu Qi, Fei Xie, Xiangtai Li, Chao Ma, Ming-Hsuan Yang,	(参考訳) 時空間予測学習法は一般的に2つのカテゴリに分類される: 並列化と性能の課題に直面するリカレントベースアプローチと、エンコーダデコーダアーキテクチャとして畳み込みニューラルネットワーク(CNN)を用いるリカレントフリー手法である。これらの手法は強い帰納バイアスの恩恵を受けるが、スケーラビリティと一般化を犠牲にすることが多い。本稿では、時空間予測学習のための純粋なトランスフォーマーベースのフレームワークであるPredFormerを提案する。Vision Transformer (ViT) の設計に動機づけられたPredFormer は、フル、ファクタ化、インターリーブされた時空間注意を含む3Dアテンションメカニズムを包括的に分析したうえで、慎重に設計されたGated Transformer ブロックを活用する。 PredFormerは、リカレントフリーでトランスフォーマーベースの設計なので、シンプルで効率的で、従来の方法よりも大幅にパフォーマンスが良い。合成および実世界のデータセットに関する大規模な実験は、PredFormerが最先端のパフォーマンスを達成することを実証している。Moving MNIST では、PredFormer は SimVP と比較して 51.3% の MSE 削減を実現している。 TaxiBJ の場合、MSE は 33.1% 減少し、FPS は 533 から 2364 に増加した。さらにWeatherBenchでは、MSEを11.1%削減し、FPSを196から404に強化している。これらの性能は精度と効率の両方で向上し、現実世界のアプリケーションにPredFormerの可能性を実証している。ソースコードは https://github.com/yyyujintang/PredFormer で公開される。 Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full, factorized, and interleaved spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, outperforming previous methods by large margins. 
Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at https://github.com/yyyujintang/PredFormer .	翻訳日:2024-11-02 02:17:53 公開日:2024-10-18
# セグメンテーションモデルの有効性について:サーベイ On Efficient Variants of Segment Anything Model: A Survey ( http://arxiv.org/abs/2410.04960v1 ) ライセンス: Link先を確認	Xiaorui Sun, Jun Liu, Heng Tao Shen, Xiaofeng Zhu, Ping Hu,	(参考訳) Segment Anything Model (SAM) は画像分割タスクの基本モデルであり、多様なアプリケーションにまたがる強力な一般化で知られている。しかし、その素晴らしいパフォーマンスには、計算とリソースの要求が大幅に伴うため、モバイルデバイスのようなリソースに制限された環境でのデプロイが困難になる。これを解決するために、精度を犠牲にすることなく効率を高めるために様々なSAM変種が提案されている。この調査は、これらの効率的なSAM変種に関する最初の包括的なレビューを提供する。私たちはこの研究の動機を探ることから始めます。次に,SAMにおけるコア技術とモデル加速度について述べる。これに続いて、様々な加速度戦略を詳細に分析し、アプローチによって分類する。最後に、これらの手法を統一的かつ広範囲に評価し、その効率と精度を代表ベンチマークで評価し、全体的な性能をはっきりと比較する。 The Segment Anything Model (SAM) is a foundational model for image segmentation tasks, known for its strong generalization across diverse applications. However, its impressive performance comes with significant computational and resource demands, making it challenging to deploy in resource-limited environments such as mobile devices. To address this, a variety of SAM variants have been proposed to enhance efficiency without sacrificing accuracy. This survey provides the first comprehensive review of these efficient SAM variants. We begin by exploring the motivations driving this research. We then present core techniques used in SAM and model acceleration. This is followed by an in-depth analysis of various acceleration strategies, categorized by approach. Finally, we offer a unified and extensive evaluation of these methods, assessing their efficiency and accuracy on representative benchmarks, and providing a clear comparison of their overall performance.	翻訳日:2024-11-02 01:07:35 公開日:2024-10-18
# セグメンテーションモデルの有効性について:サーベイ On Efficient Variants of Segment Anything Model: A Survey ( http://arxiv.org/abs/2410.04960v2 ) ライセンス: Link先を確認	Xiaorui Sun, Jun Liu, Heng Tao Shen, Xiaofeng Zhu, Ping Hu,	(参考訳) Segment Anything Model (SAM) は画像分割タスクの基本モデルであり、多様なアプリケーションにまたがる強力な一般化で知られている。しかし、その素晴らしいパフォーマンスには、計算とリソースの要求が大幅に伴うため、エッジデバイスのようなリソースに制限された環境でのデプロイが困難になる。これを解決するために、精度を保ちながら効率を高めるために様々なSAM変種が提案されている。この調査は、これらの効率的なSAM変種に関する最初の包括的なレビューを提供する。私たちはこの研究の動機を探ることから始めます。次に,SAMにおけるコア技術とモデル加速度について述べる。これに続いて、SAM加速戦略の詳細な調査、アプローチによる分類、今後の研究方向性の議論が続く。最後に、これらの手法を様々なハードウェアで統一的かつ広範囲に評価し、その効率と精度を代表ベンチマークで評価し、全体的な性能を比較した。 The Segment Anything Model (SAM) is a foundational model for image segmentation tasks, known for its strong generalization across diverse applications. However, its impressive performance comes with significant computational and resource demands, making it challenging to deploy in resource-limited environments such as edge devices. To address this, a variety of SAM variants have been proposed to enhance efficiency while keeping accuracy. This survey provides the first comprehensive review of these efficient SAM variants. We begin by exploring the motivations driving this research. We then present core techniques used in SAM and model acceleration. This is followed by a detailed exploration of SAM acceleration strategies, categorized by approach, and a discussion of several future research directions. Finally, we offer a unified and extensive evaluation of these methods across various hardware, assessing their efficiency and accuracy on representative benchmarks, and providing a clear comparison of their overall performance.	翻訳日:2024-11-02 01:07:35 公開日:2024-10-18
# 知識編集下でのマルチホップ事実想起のためのLocate-then-edit Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing ( http://arxiv.org/abs/2410.06331v1 ) ライセンス: Link先を確認	Zhuoran Zhang, Yongxiang Li, Zijian Kan, Keyuan Cheng, Lijie Hu, Di Wang,	(参考訳) locate-then-editパラダイムは、Large Language Models (LLMs)における知識編集(KE)に大きな可能性を示している。従来の手法はシングルホップのファクトリコールタスクではうまく機能するが、新しく編集された知識を含むマルチホップのファクトリコールタスクには一貫して苦労する。本稿では,機械論的解釈可能性のツールを活用し,初期の層に依存するシングルホップタスクとは異なり,マルチホップタスクではLLMがより深いMLP層から暗黙的な主語知識を取得する傾向があることを明らかにする。この区別は、現在の手法が主に浅い層の編集に注力し、より深い層を変更しないために、マルチホップクエリでの性能が低いことを説明する。そこで我々は,浅いMLP層と深いMLP層の両方を編集する新しいlocate-then-edit型のKE手法 IFMET を提案する。 IFMETはマルチホップ編集プロンプトと補足セットを使用して、異なる推論段階における知識の発見と修正を行っている。実験結果から,IFMETはマルチホップのファクトリコールタスクの性能を著しく向上し,従来のlocate-then-edit手法の限界を克服できることが示された。 The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, unlike single-hop tasks, which rely on earlier layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. IFMET employs multi-hop editing prompts and supplementary sets to locate and modify knowledge across different reasoning stages. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, effectively overcoming the limitations of previous locate-then-edit methods.	翻訳日:2024-11-01 06:29:16 公開日:2024-10-18
# 知識編集下でのマルチホップ事実想起のためのLocate-then-edit Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing ( http://arxiv.org/abs/2410.06331v2 ) ライセンス: Link先を確認	Zhuoran Zhang, Yongxiang Li, Zijian Kan, Keyuan Cheng, Lijie Hu, Di Wang,	(参考訳) locate-then-editパラダイムは、Large Language Models (LLMs)における知識編集(KE)に大きな可能性を示している。従来の手法はシングルホップのファクトリコールタスクではうまく機能するが、新しく編集された知識を含むマルチホップのファクトリコールタスクには一貫して苦労する。本稿では,機械論的解釈可能性のツールを活用し,初期の層に依存するシングルホップタスクとは異なり,マルチホップタスクではLLMがより深いMLP層から暗黙的な主語知識を取得する傾向があることを明らかにする。この区別は、現在の手法が主に浅い層の編集に注力し、より深い層を変更しないために、マルチホップクエリでの性能が低いことを説明する。そこで我々は,浅いMLP層と深いMLP層の両方を編集する新しいlocate-then-edit型のKE手法 IFMET を提案する。 IFMETはマルチホップ編集プロンプトと補足セットを使用して、異なる推論段階における知識の発見と修正を行っている。実験結果から,IFMETはマルチホップのファクトリコールタスクの性能を著しく向上し,従来のlocate-then-edit手法の限界を克服できることが示された。 The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, unlike single-hop tasks, which rely on earlier layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. IFMET employs multi-hop editing prompts and supplementary sets to locate and modify knowledge across different reasoning stages. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, effectively overcoming the limitations of previous locate-then-edit methods.	翻訳日:2024-11-01 06:19:07 公開日:2024-10-18
# どれだけ難しいのか?MITREの攻撃キャンペーンを、攻撃木とcATMロジックで定量化する How hard can it be? Quantifying MITRE attack campaigns with attack trees and cATM logic ( http://arxiv.org/abs/2410.06692v1 ) ライセンス: Link先を確認	Stefano M. Nicoletti, Milan Lopuhaä-Zwakenberg, Mariëlle Stoelinga, Fabio Massacci, Carlos E. Budde,	(参考訳) サイバー脅威の状況は、一日でさらに複雑になる。 Advanced Persistent Threatsは、サイバーセキュリティの実践者が守らなければならない攻撃を組織的に行う。このような組織化された攻撃の例としては、Dream Job、Wocao、WannaCry、SolarWinds Compromiseなどがある。どのリスクが最も脅かされているか、どのキャンペーンが守られるかを評価するには、サイバーセキュリティの専門家に適切なツールボックスを設ける必要がある。特に、彼らはできるはずです。 (a)野生に記録された各攻撃キャンペーンの確率値 b)これらの価値を確実かつ透過的に運用し、キャンペーン間で定量的比較を行う。これにより、セキュリティの専門家は、透明性と説明責任のある定量的にインフォームドされた意思決定を実行できるようになる。本稿では,(1)MITREナレッジベースにおけるデータ駆動方式による攻撃キャンペーンの可能性の定量化,(2)MITREインテリジェンスデータの自動モデリング手法の導入による攻撃キャンペーンの可能性を考察する。さらに,cATM形式論理に基づく比較を行うための計算フレームワークを提案し,これをオープンソースのPythonツールに実装する。最後に、我々のアプローチは、すべてのMITREキャンペーンの可能性を定量化し、WocaoとDream JobのMITREキャンペーン -- 提案されたアプローチで生成された -- を、伝統的に構築された攻撃ツリーモデルに対して比較することで検証し、我々の方法論がモデリングの取り組みにおいて大幅に軽量であり、それでもすべての量的関連データをキャプチャできることを示す。 The landscape of cyber threats grows more complex by the day. Advanced Persistent Threats carry out systematic attack campaigns against which cybersecurity practitioners must defend. Examples of such organized attacks are operations Dream Job, Wocao, WannaCry or the SolarWinds Compromise. To evaluate which risks are most threatening, and which campaigns to prioritize against when defending, cybersecurity experts must be equipped with the right toolbox. In particular, they must be able to (a) obtain likelihood values for each attack campaign recorded in the wild and (b) reliably and transparently operationalize these values to carry out quantitative comparisons among campaigns. This will allow security experts to perform quantitatively-informed decision making that is transparent and accountable. 
In this paper we construct such a framework by: (1) quantifying the likelihood of attack campaigns via data-driven procedures on the MITRE knowledge base and (2) introducing a methodology for automatic modelling of MITRE intelligence data: this is complete in the sense that it captures any attack campaign via template attack tree models. (3) We further propose a computational framework to carry out these comparisons based on the cATM formal logic, and implement this into an open-source Python tool. Finally, we validate our approach by quantifying the likelihood of all MITRE campaigns, and comparing the likelihood of the Wocao and Dream Job MITRE campaigns -- generated with our proposed approach -- against "ad hoc" traditionally-built attack tree models, demonstrating how our methodology is substantially lighter in modelling effort, and still capable of capturing all the quantitatively relevant data.	翻訳日:2024-11-01 04:10:03 公開日:2024-10-18
# どれだけ難しいのか?MITREの攻撃キャンペーンを、攻撃木とcATMロジックで定量化する How hard can it be? Quantifying MITRE attack campaigns with attack trees and cATM logic ( http://arxiv.org/abs/2410.06692v2 ) ライセンス: Link先を確認	Stefano M. Nicoletti, Milan Lopuhaä-Zwakenberg, Mariëlle Stoelinga, Fabio Massacci, Carlos E. Budde,	(参考訳) サイバー脅威の状況は、一日でさらに複雑になる。 Advanced Persistent Threatsは、サイバーセキュリティの実践者が守らなければならない攻撃を組織的に行う。このような組織化された攻撃の例としては、Dream Job、Wocao、WannaCry、SolarWinds Compromiseなどがある。どのリスクが最も脅かされているか、どのキャンペーンが守られるかを評価するには、サイバーセキュリティの専門家に適切なツールボックスを設ける必要がある。特に、彼らはできるはずです。 (a)野生に記録された各攻撃キャンペーンの確率値 b)これらの価値を確実かつ透過的に運用し、キャンペーン間で定量的比較を行う。これにより、セキュリティの専門家は、透明性と説明責任のある定量的にインフォームドされた意思決定を実行できるようになる。本稿では,(1)MITREナレッジベースにおけるデータ駆動方式による攻撃キャンペーンの可能性の定量化,(2)MITREインテリジェンスデータの自動モデリング手法の導入による攻撃キャンペーンの可能性を考察する。さらに,cATM形式論理に基づく比較を行うための計算フレームワークを提案し,これをオープンソースのPythonツールに実装する。最後に、我々のアプローチは、すべてのMITREキャンペーンの可能性を定量化し、WocaoとDream JobのMITREキャンペーン -- 提案されたアプローチで生成された -- を、伝統的に構築された攻撃ツリーモデルに対して比較することで検証し、我々の方法論がモデリングの取り組みにおいて大幅に軽量であり、それでもすべての量的関連データをキャプチャできることを示す。 The landscape of cyber threats grows more complex by the day. Advanced Persistent Threats carry out systematic attack campaigns against which cybersecurity practitioners must defend. Examples of such organized attacks are operations Dream Job, Wocao, WannaCry or the SolarWinds Compromise. To evaluate which risks are most threatening, and which campaigns to prioritize against when defending, cybersecurity experts must be equipped with the right toolbox. In particular, they must be able to (a) obtain likelihood values for each attack campaign recorded in the wild and (b) reliably and transparently operationalize these values to carry out quantitative comparisons among campaigns. This will allow security experts to perform quantitatively-informed decision making that is transparent and accountable. 
In this paper we construct such a framework by: (1) quantifying the likelihood of attack campaigns via data-driven procedures on the MITRE knowledge base and (2) introducing a methodology for automatic modelling of MITRE intelligence data: this is complete in the sense that it captures any attack campaign via template attack tree models. (3) We further propose a computational framework to carry out these comparisons based on the cATM formal logic, and implement this into an open-source Python tool. Finally, we validate our approach by quantifying the likelihood of all MITRE campaigns, and comparing the likelihood of the Wocao and Dream Job MITRE campaigns -- generated with our proposed approach -- against "ad hoc" traditionally-built attack tree models, demonstrating how our methodology is substantially lighter in modelling effort, and still capable of capturing all the quantitatively relevant data.	翻訳日:2024-11-01 04:10:03 公開日:2024-10-18
# キューディットのためのシンメトリー強化反断熱量子アルゴリズム Symmetry-enhanced Counterdiabatic Quantum Algorithm for Qudits ( http://arxiv.org/abs/2410.06710v1 ) ライセンス: Link先を確認	Alberto Bottarelli, Mikel Garcia de Andoin, Pranav Chandarana, Koushik Paul, Xi Chen, Mikel Sanz, Philipp Hauke,	(参考訳) 量子ビットベースの変分量子アルゴリズムは近年急速に発展してきたが、まだいくつかの課題に直面している。そこで本研究では,量子ビットの代わりにキューディットを用いた対称性強化型のディジタル化反断熱量子アルゴリズムを提案する。このアプローチは従来の変分回路と比較して3種類の圧縮を提供する。第一に、回路深さの圧縮は反断熱プロトコルによって達成される。第二に、問題に関する情報は量子ビットをキューディットに置き換えることで圧縮され、問題のより効率的な表現が可能となる。最後に、システムの対称性を利用することでパラメータの数を削減できる。グラフベースの最適化問題Max-3-Cutと高絡み合った状態準備であるqutrit W状態に対処することで、このアプローチを説明する。数値的な結果から,検討したすべてのケースにおいて,より低い回路深度とより少ない測定オーバーヘッドで,より良い収束が得られることが示された。この研究により、浅い変分量子回路の設計がより良くなり、短期的なキューディットデバイスにおける実装の実現可能性が改善される。 Qubit-based variational quantum algorithms have undergone rapid development in recent years but still face several challenges. In this context, we propose a symmetry-enhanced digitized counterdiabatic quantum algorithm utilizing qudits instead of qubits. This approach offers three types of compression compared with conventional variational circuits. First, compression in the circuit depth is achieved by counterdiabatic protocols. Second, information about the problem is compressed by replacing qubits with qudits, allowing for a more efficient representation of the problem. Lastly, the number of parameters is reduced by employing the symmetries of the system. We illustrate this approach by tackling a graph-based optimization problem Max-3-Cut and a highly-entangled state preparation, the qutrit W state. As our numerical results show, we achieve a better convergence with a lower circuit depth and less measurement overhead in all the cases considered. This work leads to a better design of shallow variational quantum circuits, improving the feasibility of their implementation on near-term qudit devices.	翻訳日:2024-11-01 04:10:03 公開日:2024-10-18
# キューディットのためのシンメトリー強化反断熱量子アルゴリズム Symmetry-enhanced Counterdiabatic Quantum Algorithm for Qudits ( http://arxiv.org/abs/2410.06710v2 ) ライセンス: Link先を確認	Alberto Bottarelli, Mikel Garcia de Andoin, Pranav Chandarana, Koushik Paul, Xi Chen, Mikel Sanz, Philipp Hauke,	(参考訳) 量子ビットベースの変分量子アルゴリズムは近年急速に発展してきたが、まだいくつかの課題に直面している。そこで本研究では,量子ビットの代わりにキューディットを用いた対称性強化型のディジタル化反断熱量子アルゴリズムを提案する。このアプローチは従来の変分回路と比較して3種類の圧縮を提供する。第一に、回路深さの圧縮は反断熱プロトコルによって達成される。第二に、問題に関する情報は量子ビットをキューディットに置き換えることで圧縮され、問題のより効率的な表現が可能となる。最後に、システムの対称性を利用することでパラメータの数を削減できる。グラフベースの最適化問題Max-3-Cutと高絡み合った状態準備であるqutrit W状態に対処することで、このアプローチを説明する。数値的な結果から,検討したすべてのケースにおいて,より低い回路深度とより少ない測定オーバーヘッドで,より良い収束が得られることが示された。この研究により、浅い変分量子回路の設計がより良くなり、短期的なキューディットデバイスにおける実装の実現可能性が改善される。 Qubit-based variational quantum algorithms have undergone rapid development in recent years but still face several challenges. In this context, we propose a symmetry-enhanced digitized counterdiabatic quantum algorithm utilizing qudits instead of qubits. This approach offers three types of compression compared with conventional variational circuits. First, compression in the circuit depth is achieved by counterdiabatic protocols. Second, information about the problem is compressed by replacing qubits with qudits, allowing for a more efficient representation of the problem. Lastly, the number of parameters is reduced by employing the symmetries of the system. We illustrate this approach by tackling a graph-based optimization problem Max-3-Cut and a highly-entangled state preparation, the qutrit W state. As our numerical results show, we achieve a better convergence with a lower circuit depth and less measurement overhead in all the cases considered. This work leads to a better design of shallow variational quantum circuits, improving the feasibility of their implementation on near-term qudit devices.	翻訳日:2024-11-01 04:10:03 公開日:2024-10-18
# Suppress Content Shift: オフザシェルフ生成技術による拡散特性の改善 Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques ( http://arxiv.org/abs/2410.06719v1 ) ライセンス: Link先を確認	Benyuan Meng, Qianqian Xu, Zitai Wang, Zhiyong Yang, Xiaochun Cao, Qingming Huang,	(参考訳) 拡散モデルは強力な生成モデルであり、この能力は識別にも適用できる。事前訓練された拡散モデルの内的活性化は、識別的タスク、すなわち拡散機能の特徴として機能する。拡散の特徴は、コンテンツシフトと呼ばれる隠れた、普遍的な現象によって妨げられていることがわかりました。具体的に言うと、ある物体の正確な形状など、特徴と入力画像の間には内容の違いがある。本稿では,拡散モデルに固有の特徴として内容変化の原因を見いだし,拡散特性におけるこの現象の広範な存在を示唆する。さらに実験的研究は、コンテンツシフトが視覚的に知覚できない場合でも、その負の影響は無視できないことを示唆している。そこで本研究では,拡散特性の全体的な品質を高めるため,コンテンツシフトを抑制することを提案する。具体的には、ノイズの多い入力からイメージを復元する過程で、コンテンツシフトは情報ドリフトと関連し、オフザシェルフ生成技術がコンテンツシフト抑制のツールになる可能性を指摘した。さらに,本手法の有効性を効果的に評価し,提案手法の実装を行うための実用的なガイドラインであるGATEを提案する。単純さにもかかわらず、提案手法は様々なタスクやデータセットにおいて優れた結果をもたらし、拡散機能のための汎用的なブースターとしての可能性を検証している。私たちのコードはhttps://github.com/Darkbblue/diffusion-content-shiftで利用可能です。 Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. 
Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at https://github.com/Darkbblue/diffusion-content-shift.	翻訳日:2024-11-01 04:00:11 公開日:2024-10-18
# Suppress Content Shift: オフザシェルフ生成技術による拡散特性の改善 Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques ( http://arxiv.org/abs/2410.06719v2 ) ライセンス: Link先を確認	Benyuan Meng, Qianqian Xu, Zitai Wang, Zhiyong Yang, Xiaochun Cao, Qingming Huang,	(参考訳) 拡散モデルは強力な生成モデルであり、この能力は識別にも適用できる。事前訓練された拡散モデルの内的活性化は、識別的タスク、すなわち拡散機能の特徴として機能する。拡散の特徴は、コンテンツシフトと呼ばれる隠れた、普遍的な現象によって妨げられていることがわかりました。具体的に言うと、ある物体の正確な形状など、特徴と入力画像の間には内容の違いがある。本稿では,拡散モデルに固有の特徴として内容変化の原因を見いだし,拡散特性におけるこの現象の広範な存在を示唆する。さらに実験的研究は、コンテンツシフトが視覚的に知覚できない場合でも、その負の影響は無視できないことを示唆している。そこで本研究では,拡散特性の全体的な品質を高めるため,コンテンツシフトを抑制することを提案する。具体的には、ノイズの多い入力からイメージを復元する過程で、コンテンツシフトは情報ドリフトと関連し、オフザシェルフ生成技術がコンテンツシフト抑制のツールになる可能性を指摘した。さらに,本手法の有効性を効果的に評価し,提案手法の実装を行うための実用的なガイドラインであるGATEを提案する。単純さにもかかわらず、提案手法は様々なタスクやデータセットにおいて優れた結果をもたらし、拡散機能のための汎用的なブースターとしての可能性を検証している。私たちのコードはhttps://github.com/Darkbblue/diffusion-content-shiftで利用可能です。 Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. 
Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at https://github.com/Darkbblue/diffusion-content-shift.	翻訳日:2024-11-01 04:00:11 公開日:2024-10-18
# Suppress Content Shift: オフザシェルフ生成技術による拡散特性の改善 Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques ( http://arxiv.org/abs/2410.06719v3 ) ライセンス: Link先を確認	Benyuan Meng, Qianqian Xu, Zitai Wang, Zhiyong Yang, Xiaochun Cao, Qingming Huang,	(参考訳) 拡散モデルは強力な生成モデルであり、この能力は識別にも適用できる。事前訓練された拡散モデルの内的活性化は、識別的タスク、すなわち拡散機能の特徴として機能する。拡散の特徴は、コンテンツシフトと呼ばれる隠れた、普遍的な現象によって妨げられていることがわかりました。具体的に言うと、ある物体の正確な形状など、特徴と入力画像の間には内容の違いがある。本稿では,拡散モデルに固有の特徴として内容変化の原因を見いだし,拡散特性におけるこの現象の広範な存在を示唆する。さらに実験的研究は、コンテンツシフトが視覚的に知覚できない場合でも、その負の影響は無視できないことを示唆している。そこで本研究では,拡散特性の全体的な品質を高めるため,コンテンツシフトを抑制することを提案する。具体的には、ノイズの多い入力からイメージを復元する過程で、コンテンツシフトは情報ドリフトと関連し、オフザシェルフ生成技術がコンテンツシフト抑制のツールになる可能性を指摘した。さらに,本手法の有効性を効果的に評価し,提案手法の実装を行うための実用的なガイドラインであるGATEを提案する。単純さにもかかわらず、提案手法は様々なタスクやデータセットにおいて優れた結果をもたらし、拡散機能のための汎用的なブースターとしての可能性を検証している。私たちのコードはhttps://github.com/Darkbblue/diffusion-content-shiftで利用可能です。 Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. 
Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at https://github.com/Darkbblue/diffusion-content-shift.	翻訳日:2024-11-01 04:00:11 公開日:2024-10-18
# 核融合に基づくフォトニック量子コンピューティングスキームの量子エミッタへの応用 Tailoring fusion-based photonic quantum computing schemes to quantum emitters ( http://arxiv.org/abs/2410.06784v1 ) ライセンス: Link先を確認	Ming Lai Chan, Thomas J. Bell, Love A. Pettersson, Susan X. Chen, Patrick Yard, Anders Søndberg Sørensen, Stefano Paesani,	(参考訳) 核融合に基づく量子計算は、核融合ゲートによって小さなフォトニック資源状態が同時に絡み合って測定される、有望な量子計算モデルである。リソース状態は量子エミッタによって決定的に生成され、融合は浅い線形光学回路のみを必要とする。本稿では,量子エミッタの能力とノイズモデルに合わせた融合型アーキテクチャを提案する。本研究では,光子損失の8%,エミッタの光子識別率の4%,スピンノイズの閾値が一般的なスピン光子界面のコヒーレンス時間よりかなり低い値で,物理誤差機構に対する高い耐性が得られることを示す。我々の構成と分析は、量子エミッタを用いた耐故障性アプリケーションを対象としたフォトニック量子ハードウェアの開発のためのガイドラインを提供する。 Fusion-based quantum computation is a promising quantum computing model where small-sized photonic resource states are simultaneously entangled and measured by fusion gates. Such operations can be readily implemented with scalable photonic hardware: resource states can be deterministically generated by quantum emitters and fusions require only shallow linear-optical circuits. Here, we propose fusion-based architectures tailored to the capabilities and noise models in quantum emitters. We show that high tolerance to dominant physical error mechanisms can be achieved, with fault-tolerance thresholds of 8% for photon loss, 4% for photon distinguishability between emitters, and spin noise thresholds well below coherence times for typical spin-photon interfaces. Our construction and analysis provide guidelines for the development of photonic quantum hardware targeting fault-tolerant applications with quantum emitters.	翻訳日:2024-11-01 03:40:32 公開日:2024-10-18
# 核融合に基づくフォトニック量子コンピューティングスキームの量子エミッタへの応用 Tailoring fusion-based photonic quantum computing schemes to quantum emitters ( http://arxiv.org/abs/2410.06784v2 ) ライセンス: Link先を確認	Ming Lai Chan, Thomas J. Bell, Love A. Pettersson, Susan X. Chen, Patrick Yard, Anders Søndberg Sørensen, Stefano Paesani,	(参考訳) 核融合に基づく量子計算は、核融合ゲートによって小さなフォトニック資源状態が同時に絡み合って測定される、有望な量子計算モデルである。リソース状態は量子エミッタによって決定的に生成され、融合は浅い線形光学回路のみを必要とする。本稿では,量子エミッタの能力とノイズモデルに合わせた融合型アーキテクチャを提案する。光子損失の8%,エミッタの光子識別能の4%,スピンノイズの閾値が典型的なスピン光子界面のメモリ誘起誤差よりもはるかに高い値を示す。我々の構成と分析は、量子エミッタを用いた耐故障性アプリケーションを対象としたフォトニック量子ハードウェアの開発のためのガイドラインを提供する。 Fusion-based quantum computation is a promising quantum computing model where small-sized photonic resource states are simultaneously entangled and measured by fusion gates. Such operations can be readily implemented with scalable photonic hardware: resource states can be deterministically generated by quantum emitters and fusions require only shallow linear-optical circuits. Here, we propose fusion-based architectures tailored to the capabilities and noise models in quantum emitters. We show that high tolerance to dominant physical error mechanisms can be achieved, with fault-tolerance thresholds of 8% for photon loss, 4% for photon distinguishability between emitters, and spin noise thresholds well above memory-induced errors for typical spin-photon interfaces. Our construction and analysis provide guidelines for the development of photonic quantum hardware targeting fault-tolerant applications with quantum emitters.	翻訳日:2024-11-01 03:40:32 公開日:2024-10-18
# 深部畳み込みニューラルネットワークを用いた音声分類のためのスペクトル・リズム特性 Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks ( http://arxiv.org/abs/2410.06927v1 ) ライセンス: Link先を確認	Friedrich Wolf-Monheim,	(参考訳) 畳み込みニューラルネットワーク(CNN)はコンピュータビジョンで広く使われている。パターンを認識するために従来のデジタル画像材料だけでなく、時間領域のデジタルオーディオ信号から抽出したスペクトルやリズムの特徴を表すデジタル画像からの特徴抽出にも使用できる。深層畳み込みニューラルネットワークを用いた音声分類性能の観点から,メルスケールスペクトル,メル周波数ケプストラム係数(MFCC),サイクリックテンモグラム,短時間フーリエ変換(STFT)クロマグラム,定数Q変換(CQT)クロマグラム,クロマエネルギー正規化統計(CENS)クロマグラムなどのスペクトル・リズム特徴表現について検討した。深層CNNを用いた音声分類作業において,メルスケールスペクトルとメル周波数ケプストラム係数(MFCC)は,他のスペクトル・リズム特性よりも有意に優れていた。実験はESC-50データセットと2,000のラベル付き環境オーディオ記録を用いて行われた。 Convolutional neural networks (CNNs) are widely used in computer vision. They can be used not only for conventional digital image material to recognize patterns, but also for feature extraction from digital imagery representing spectral and rhythm features extracted from time-domain digital audio signals for the acoustic classification of sounds. Different spectral and rhythm feature representations like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCCs), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams are investigated in terms of the audio classification performance using a deep convolutional neural network. It can be clearly shown that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCCs) perform significantly better than the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs. The experiments were carried out with the aid of the ESC-50 dataset with 2,000 labeled environmental audio recordings.	翻訳日:2024-10-31 23:37:21 公開日:2024-10-18
# 深部畳み込みニューラルネットワークを用いた音声分類のためのスペクトル・リズム特性 Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks ( http://arxiv.org/abs/2410.06927v2 ) ライセンス: Link先を確認	Friedrich Wolf-Monheim,	(参考訳) 畳み込みニューラルネットワーク(CNN)はコンピュータビジョンで広く使われている。パターンを認識するために従来のデジタル画像材料だけでなく、時間領域のデジタルオーディオ信号から抽出したスペクトルやリズムの特徴を表すデジタル画像からの特徴抽出にも使用できる。深層畳み込みニューラルネットワークを用いた音声分類性能の観点から,メルスケールスペクトル,メル周波数ケプストラム係数(MFCC),サイクリックテンモグラム,短時間フーリエ変換(STFT)クロマグラム,定数Q変換(CQT)クロマグラム,クロマエネルギー正規化統計(CENS)クロマグラムなどのスペクトル・リズム特徴表現について検討した。深層CNNを用いた音声分類作業において,メルスケールスペクトルとメル周波数ケプストラム係数(MFCC)は,他のスペクトル・リズム特性よりも有意に高い性能を示した。実験はESC-50データセットと2,000のラベル付き環境オーディオ記録を用いて行われた。 Convolutional neural networks (CNNs) are widely used in computer vision. They can be used not only for conventional digital image material to recognize patterns, but also for feature extraction from digital imagery representing spectral and rhythm features extracted from time-domain digital audio signals for the acoustic classification of sounds. Different spectral and rhythm feature representations like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCCs), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams are investigated in terms of the audio classification performance using a deep convolutional neural network. It can be clearly shown that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCCs) perform significantly better than the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs. The experiments were carried out with the aid of the ESC-50 dataset with 2,000 labeled environmental audio recordings.	翻訳日:2024-10-31 23:37:21 公開日:2024-10-18
# コード切り替え再検討:カラーコードを用いた低オーバーヘッドマジック状態準備 Code switching revisited: low-overhead magic state preparation using color codes ( http://arxiv.org/abs/2410.07327v1 ) ライセンス: Link先を確認	Lucas Daguerre, Isaac H. Kim,	(参考訳) 本稿では,3Dカラーコードを用いた2Dカラーコード上に高忠実度マジック状態を作成するプロトコルを提案する。私たちの方法は、既知のコード切替プロトコルを変更します。 (i)2D符号と3D符号の間の横断ゲート (ii)フラグに基づくポストセレクションの適切な使用。これらの修正がマジック状態の忠実度を著しく向上させることを示す。例えば、均一な回路レベルのノイズ$10^{-3}$を条件として、コードスイッチングプロトコルは、論理的不整合が$4.6\times 10^{-5}\pm 1.6 \times 10^{-5}$(誤り訂正論理的状態トモグラフィで定量化)の距離$3$の2Dカラーコードで符号化されたマジックステートを出力する。より最近提案されたポストセレクションアプローチと併用して、多項式から外挿する場合に同じ符号に対して$5.1 \times 10^{-7}$に改善される。 We propose a protocol to prepare a high-fidelity magic state on a 2D color code using a 3D color code. Our method modifies the known code switching protocol with (i) a transversal gate between the 2D and the 3D code and (ii) a judicious use of flag-based post-selection. We numerically demonstrate that these modifications lead to a significant improvement in the fidelity of the magic state. For instance, subjected to a uniform circuit-level noise of $10^{-3}$, our code switching protocol yields a magic state encoded in the distance-$3$ 2D color code with a logical infidelity of $4.6\times 10^{-5}\pm 1.6 \times 10^{-5}$ (quantified by an error-corrected logical state tomography) with an acceptance rate of $84\%$. Used in conjunction with a more recently proposed post-selection approach, the infidelity improves to $5.1 \times 10^{-7}$ for the same code when extrapolating from the polynomial fits.	翻訳日:2024-10-31 20:56:57 公開日:2024-10-18
# コード切り替え再検討:カラーコードを用いた低オーバーヘッドマジック状態準備 Code switching revisited: low-overhead magic state preparation using color codes ( http://arxiv.org/abs/2410.07327v2 ) ライセンス: Link先を確認	Lucas Daguerre, Isaac H. Kim,	(参考訳) 本稿では,3Dカラーコードを用いた2Dカラーコード上に高忠実度マジック状態を作成するプロトコルを提案する。私たちの方法は、既知のコード切替プロトコルを変更します。 (i)2D符号と3D符号の間の横断ゲート (ii)フラグに基づくポストセレクションの適切な使用。これらの修正がマジック状態の忠実度を著しく向上させることを示す。例えば、均一な回路レベルのノイズ$10^{-3}$を条件として、コードスイッチングプロトコルは、論理的不整合が$4.6\times 10^{-5}\pm 1.6 \times 10^{-5}$(誤り訂正論理的状態トモグラフィで定量化)の距離$3$の2Dカラーコードで符号化されたマジックステートを出力する。より最近提案されたポストセレクションのアプローチと併用して、多項式フィッティングからの外挿は、同じコードに対して$5.1 \times 10^{-7}$への忠実度の改善を示唆している。最後に,非クリフォード$T$ゲートを効果的に組み込んだ拡張安定化器シミュレータに類似した新しいシミュレーション手法を提案する。 We propose a protocol to prepare a high-fidelity magic state on a 2D color code using a 3D color code. Our method modifies the known code switching protocol with (i) a transversal gate between the 2D and the 3D code and (ii) a judicious use of flag-based post-selection. We numerically demonstrate that these modifications lead to a significant improvement in the fidelity of the magic state. For instance, subjected to a uniform circuit-level noise of $10^{-3}$, our code switching protocol yields a magic state encoded in the distance-$3$ 2D color code with a logical infidelity of $4.6\times 10^{-5}\pm 1.6 \times 10^{-5}$ (quantified by an error-corrected logical state tomography) with an acceptance rate of $84\%$. Used in conjunction with a more recently proposed post-selection approach, extrapolation from a polynomial fit suggests a fidelity improvement to $5.1 \times 10^{-7}$ for the same code. Finally, we also present a novel simulation technique akin to an extended stabilizer simulator which effectively incorporates the non-Clifford $T$-gate.	翻訳日:2024-10-31 20:56:57 公開日:2024-10-18
# 多言語ASR評価における文字誤り率の回避 Advocating Character Error Rate for Multilingual ASR Evaluation ( http://arxiv.org/abs/2410.07400v1 ) ライセンス: Link先を確認	Thennal D K, Jesin James, Deepa P Gopinath, Muhammed Ashraf K,	(参考訳) 音声認識システム(ASR)は、伝統的に英語のデータセットを用いて評価され、単語誤り率(WER)が主要な指標となっている。 WERの単純さと解釈の容易さは、特に英語において広く採用されている。しかし、ASRシステムが多言語に拡張するにつれて、WERは様々な方法で失敗し、特に形態学的に複雑な言語や明確な単語境界を持たない言語では失敗する。本研究は,WERの限界を評価指標として記述し,多言語ASR評価における主指標として文字誤り率(CER)を提唱する。我々は、CERがWERが直面している多くの課題を回避し、書き込みシステム全体にわたってより一貫性を示すことを示す。我々は,マラヤラム,英語,アラビア語の3言語でASR転写の人為的評価を行い,形態学的特徴を明瞭に示すことによって提案を裏付ける。 CERは、英語においても、WERよりも人間の判断と密接に関連していることが示される。さらなる研究を容易にするため、今後のASRメトリクスのベンチマークのための人体評価データセットをリリースする。以上の結果から,多言語ASR評価においてCERを優先的に,少なくとも補足すべきであることが示唆された。 Automatic speech recognition (ASR) systems have traditionally been evaluated using English datasets, with the word error rate (WER) serving as the predominant metric. WER's simplicity and ease of interpretation have contributed to its widespread adoption, particularly for English. However, as ASR systems expand to multilingual contexts, WER fails in various ways, particularly with morphologically complex languages or those without clear word boundaries. Our work documents the limitations of WER as an evaluation metric and advocates for the character error rate (CER) as the primary metric in multilingual ASR evaluation. We show that CER avoids many of the challenges WER faces and exhibits greater consistency across writing systems. We support our proposition by conducting human evaluations of ASR transcriptions in three languages: Malayalam, English, and Arabic, which exhibit distinct morphological characteristics. We show that CER correlates more closely with human judgments than WER, even for English. To facilitate further research, we release our human evaluation dataset for future benchmarking of ASR metrics. 
Our findings suggest that CER should be prioritized, or at least supplemented, in multilingual ASR evaluations to account for the varying linguistic characteristics of different languages.	翻訳日:2024-10-31 20:47:00 公開日:2024-10-18
# 多言語ASR評価における文字誤り率の回避 Advocating Character Error Rate for Multilingual ASR Evaluation ( http://arxiv.org/abs/2410.07400v2 ) ライセンス: Link先を確認	Thennal D K, Jesin James, Deepa P Gopinath, Muhammed Ashraf K,	(参考訳) 音声認識システム(ASR)は、伝統的に英語のデータセットを用いて評価され、単語誤り率(WER)が主要な指標となっている。 WERの単純さと解釈の容易さは、特に英語において広く採用されている。しかし、ASRシステムが多言語に拡張するにつれて、WERは様々な方法で失敗し、特に形態学的に複雑な言語や明確な単語境界を持たない言語では失敗する。本研究は,WERの限界を評価指標として記述し,多言語ASR評価における主指標として文字誤り率(CER)を提唱する。我々は、CERがWERが直面している多くの課題を回避し、書き込みシステム全体にわたってより一貫性を示すことを示す。我々は,マラヤラム,英語,アラビア語の3言語でASR転写の人為的評価を行い,形態学的特徴を明瞭に示すことによって提案を裏付ける。 CERは、英語においても、WERよりも人間の判断と密接に関連していることが示される。さらなる研究を容易にするため、今後のASRメトリクスのベンチマークのための人体評価データセットをリリースする。以上の結果から,多言語ASR評価においてCERを優先的に,少なくとも補足すべきであることが示唆された。 Automatic speech recognition (ASR) systems have traditionally been evaluated using English datasets, with the word error rate (WER) serving as the predominant metric. WER's simplicity and ease of interpretation have contributed to its widespread adoption, particularly for English. However, as ASR systems expand to multilingual contexts, WER fails in various ways, particularly with morphologically complex languages or those without clear word boundaries. Our work documents the limitations of WER as an evaluation metric and advocates for the character error rate (CER) as the primary metric in multilingual ASR evaluation. We show that CER avoids many of the challenges WER faces and exhibits greater consistency across writing systems. We support our proposition by conducting human evaluations of ASR transcriptions in three languages: Malayalam, English, and Arabic, which exhibit distinct morphological characteristics. We show that CER correlates more closely with human judgments than WER, even for English. To facilitate further research, we release our human evaluation dataset for future benchmarking of ASR metrics. 
Our findings suggest that CER should be prioritized, or at least supplemented, in multilingual ASR evaluations to account for the varying linguistic characteristics of different languages.	翻訳日:2024-10-31 20:47:00 公開日:2024-10-18
# ギャラクシー以上の専門家たち:生物学的にインスパイアされた固定されたルーティングを持つ条件付きオーバーラップの専門家たち More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing ( http://arxiv.org/abs/2410.08003v1 ) ライセンス: Link先を確認	Sagi Shaier, Francisco Pereira, Katharina von der Wense, Lawrence E Hunter, Matt Jones,	(参考訳) 生物学的ニューラルネットワークの進化により、モジュール性とスパースコーディングの両方が生まれ、エネルギー使用効率が向上し、寿命におけるタスクの多様性が堅牢になった。対照的に、標準的なニューラルネットワークは密集した非特殊化アーキテクチャに依存しており、すべてのモデルパラメータを同時に更新して複数のタスクを学習することで、表現干渉につながる。現在のスパースニューラルネットワークアプローチはこの問題を軽減することを目的としているが、しばしば制限のような制限によって妨げられる。 1) 表現の崩壊を引き起こす訓練可能なゲーティング関数 2 重複しない専門家が冗長な計算と学習の遅さをもたらすこと。 3) 明示的な入力やタスクIDに依存して、柔軟性とスケーラビリティに大きな制約を課します。本稿では,重なり合う専門家の指数的な数でモジュラー・スパースアーキテクチャを誘導することにより,これらの課題に対処する一般的なディープラーニング手法であるComET(Conditionally Overlapping Mixture of ExperTs)を提案する。 COMETは、Sparse Mixture of Expertsで使用されるトレーニング可能なゲーティング関数を、個々の入力表現に適用された固定された生物学的にインスパイアされたランダムプロジェクションに置き換える。この設計により、専門家の重複度は入力の類似度に依存するため、類似した入力がより多くのパラメータを共有する傾向がある。これにより、ポジティブな知識伝達が促進され、学習が早くなり、一般化が向上する。本稿では,画像分類,言語モデリング,回帰といったタスクにおけるCOMETの有効性を,いくつかの人気のあるディープラーニングアーキテクチャを用いて実証する。 The evolution of biological neural systems has led to both modularity and sparse coding, which enables efficiency in energy usage, and robustness across the diversity of tasks in the lifespan. In contrast, standard neural networks rely on dense, non-specialized architectures, where all model parameters are simultaneously updated to learn multiple tasks, leading to representation interference. Current sparse neural network approaches aim to alleviate this issue, but are often hindered by limitations such as 1) trainable gating functions that cause representation collapse; 2) non-overlapping experts that result in redundant computation and slow learning; and 3) reliance on explicit input or task IDs that impose significant constraints on flexibility and scalability. 
In this paper we propose Conditionally Overlapping Mixture of ExperTs (COMET), a general deep learning method that addresses these challenges by inducing a modular, sparse architecture with an exponential number of overlapping experts. COMET replaces the trainable gating function used in Sparse Mixture of Experts with a fixed, biologically inspired random projection applied to individual input representations. This design causes the degree of expert overlap to depend on input similarity, so that similar inputs tend to share more parameters. This facilitates positive knowledge transfer, resulting in faster learning and improved generalization. We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression, using several popular deep learning architectures.	翻訳日:2024-10-31 06:05:02 公開日:2024-10-18
# ギャラクシー以上の専門家たち:生物学的にインスパイアされた固定されたルーティングを持つ条件付きオーバーラップの専門家たち More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing ( http://arxiv.org/abs/2410.08003v2 ) ライセンス: Link先を確認	Sagi Shaier, Francisco Pereira, Katharina von der Wense, Lawrence E Hunter, Matt Jones,	(参考訳) 生物学的ニューラルネットワークの進化により、モジュール性とスパースコーディングの両方が生まれ、エネルギー使用効率が向上し、寿命におけるタスクの多様性が堅牢になった。対照的に、標準的なニューラルネットワークは密集した非特殊化アーキテクチャに依存しており、すべてのモデルパラメータを同時に更新して複数のタスクを学習することで、表現干渉につながる。現在のスパースニューラルネットワークアプローチはこの問題を軽減することを目的としているが、しばしば制限のような制限によって妨げられる。 1) 表現の崩壊を引き起こす訓練可能なゲーティング関数 2 重複しない専門家が冗長な計算と学習の遅さをもたらすこと。 3) 明示的な入力やタスクIDに依存して、柔軟性とスケーラビリティに大きな制約を課します。本稿では,重なり合う専門家の指数的な数でモジュラー・スパースアーキテクチャを誘導することにより,これらの課題に対処する一般的なディープラーニング手法であるComET(Conditionally Overlapping Mixture of ExperTs)を提案する。 COMETは、Sparse Mixture of Expertsで使用されるトレーニング可能なゲーティング関数を、個々の入力表現に適用された固定された生物学的にインスパイアされたランダムプロジェクションに置き換える。この設計により、専門家の重複度は入力の類似度に依存するため、類似した入力がより多くのパラメータを共有する傾向がある。これにより、ポジティブな知識伝達が促進され、学習が早くなり、一般化が向上する。本稿では,画像分類,言語モデリング,回帰といったタスクにおけるCOMETの有効性を,いくつかの人気のあるディープラーニングアーキテクチャを用いて実証する。 The evolution of biological neural systems has led to both modularity and sparse coding, which enables efficiency in energy usage, and robustness across the diversity of tasks in the lifespan. In contrast, standard neural networks rely on dense, non-specialized architectures, where all model parameters are simultaneously updated to learn multiple tasks, leading to representation interference. Current sparse neural network approaches aim to alleviate this issue, but are often hindered by limitations such as 1) trainable gating functions that cause representation collapse; 2) non-overlapping experts that result in redundant computation and slow learning; and 3) reliance on explicit input or task IDs that impose significant constraints on flexibility and scalability. 
In this paper we propose Conditionally Overlapping Mixture of ExperTs (COMET), a general deep learning method that addresses these challenges by inducing a modular, sparse architecture with an exponential number of overlapping experts. COMET replaces the trainable gating function used in Sparse Mixture of Experts with a fixed, biologically inspired random projection applied to individual input representations. This design causes the degree of expert overlap to depend on input similarity, so that similar inputs tend to share more parameters. This facilitates positive knowledge transfer, resulting in faster learning and improved generalization. We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression, using several popular deep learning architectures.	翻訳日:2024-10-31 06:05:02 公開日:2024-10-18
# 弱監視ポイントクラウドセマンティックセマンティックセグメンテーションのための配電誘導ネットワーク Distribution Guidance Network for Weakly Supervised Point Cloud Semantic Segmentation ( http://arxiv.org/abs/2410.08091v1 ) ライセンス: Link先を確認	Zhiyi Pan, Wei Gao, Shan Liu, Ge Li,	(参考訳) 完全教師付き手法に固有の高密度アノテーションへの依存を緩和するにもかかわらず、弱い教師付きポイントクラウドセマンティックセグメンテーションは、不十分な監視信号に悩まされる。この課題に対応するために、弱い監督下で特徴空間を規制することで補助的制約を付与する新しい視点を導入する。最初の調査では、どの分布が特徴空間を正確に特徴付けるかを特定し、その後、この優先順位を利用して弱教師付き埋め込みのアライメントを導出する。具体的には、いくつかの共通分布候補のうち、von Mises-Fisher分布(moVMF)の混合の優越性を解析する。そこで我々は,弱教師付き学習部と分散アライメント部から構成される分散誘導ネットワーク(DGNet)を開発した。弱教師付き学習ブランチから導かれる信頼性の高いクラスタリング初期化を利用して、分散アライメントブランチは、moVMFとネットワークのパラメータを交互に更新し、moVMFで定義された潜在空間との整合性を確保する。集中的な実験は、分布選択とネットワーク設計の合理性と有効性を検証する。その結果、DGNetは複数のデータセットと様々な弱い教師付き設定の下で最先端のパフォーマンスを達成する。 Despite alleviating the dependence on dense annotations inherent to fully supervised methods, weakly supervised point cloud semantic segmentation suffers from inadequate supervision signals. In response to this challenge, we introduce a novel perspective that imparts auxiliary constraints by regulating the feature space under weak supervision. Our initial investigation identifies which distributions accurately characterize the feature space, subsequently leveraging this priori to guide the alignment of the weakly supervised embeddings. Specifically, we analyze the superiority of the mixture of von Mises-Fisher distributions (moVMF) among several common distribution candidates. Accordingly, we develop a Distribution Guidance Network (DGNet), which comprises a weakly supervised learning branch and a distribution alignment branch. Leveraging reliable clustering initialization derived from the weakly supervised learning branch, the distribution alignment branch alternately updates the parameters of the moVMF and the network, ensuring alignment with the moVMF-defined latent space. Extensive experiments validate the rationality and effectiveness of our distribution choice and network design. 
Consequently, DGNet achieves state-of-the-art performance under multiple datasets and various weakly supervised settings.	翻訳日:2024-10-31 05:25:16 公開日:2024-10-18
# 弱監視ポイントクラウドセマンティックセマンティックセグメンテーションのための配電誘導ネットワーク Distribution Guidance Network for Weakly Supervised Point Cloud Semantic Segmentation ( http://arxiv.org/abs/2410.08091v2 ) ライセンス: Link先を確認	Zhiyi Pan, Wei Gao, Shan Liu, Ge Li,	(参考訳) 完全教師付き手法に固有の高密度アノテーションへの依存を緩和するにもかかわらず、弱い教師付きポイントクラウドセマンティックセグメンテーションは、不十分な監視信号に悩まされる。この課題に対応するために、弱い監督下で特徴空間を規制することで補助的制約を付与する新しい視点を導入する。最初の調査では、どの分布が特徴空間を正確に特徴付けるかを特定し、その後、この優先順位を利用して弱教師付き埋め込みのアライメントを導出する。具体的には、いくつかの共通分布候補のうち、von Mises-Fisher分布(moVMF)の混合の優越性を解析する。そこで我々は,弱教師付き学習部と分散アライメント部から構成される分散誘導ネットワーク(DGNet)を開発した。弱教師付き学習ブランチから導かれる信頼性の高いクラスタリング初期化を利用して、分散アライメントブランチは、moVMFとネットワークのパラメータを交互に更新し、moVMFで定義された潜在空間との整合性を確保する。集中的な実験は、分布選択とネットワーク設計の合理性と有効性を検証する。その結果、DGNetは複数のデータセットと様々な弱い教師付き設定の下で最先端のパフォーマンスを達成する。 Despite alleviating the dependence on dense annotations inherent to fully supervised methods, weakly supervised point cloud semantic segmentation suffers from inadequate supervision signals. In response to this challenge, we introduce a novel perspective that imparts auxiliary constraints by regulating the feature space under weak supervision. Our initial investigation identifies which distributions accurately characterize the feature space, subsequently leveraging this priori to guide the alignment of the weakly supervised embeddings. Specifically, we analyze the superiority of the mixture of von Mises-Fisher distributions (moVMF) among several common distribution candidates. Accordingly, we develop a Distribution Guidance Network (DGNet), which comprises a weakly supervised learning branch and a distribution alignment branch. Leveraging reliable clustering initialization derived from the weakly supervised learning branch, the distribution alignment branch alternately updates the parameters of the moVMF and the network, ensuring alignment with the moVMF-defined latent space. Extensive experiments validate the rationality and effectiveness of our distribution choice and network design. 
Consequently, DGNet achieves state-of-the-art performance under multiple datasets and various weakly supervised settings.	翻訳日:2024-10-31 05:25:16 公開日:2024-10-18
# IncEventGS:1つのイベントカメラからポスフリーのガウシアンスプレイティング IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera ( http://arxiv.org/abs/2410.08107v1 ) ライセンス: Link先を確認	Jian Huang, Chengrui Dong, Peidong Liu,	(参考訳) 近年, フレームベースカメラ(例えばRGB, RGB-Dカメラ)では, 暗黙の神経表現と3Dガウススプラッティング(3D-GS)が顕著な進歩を遂げている。フレームベースのカメラと比較して、新しいタイプのバイオインスパイアされた視覚センサ、すなわちイベントカメラは、高時間分解能、高ダイナミックレンジ、低消費電力、低レイテンシの利点を実証している。ユニークな非同期かつ不規則なデータキャプチャプロセスのため、イベントカメラにニューラル表現や3Dガウススプラッティングを適用するための限られた作業が提案されている。本研究では,1つのイベントカメラを用いたインクリメンタルな3次元ガウス分割再構成アルゴリズムであるIncEventGSを提案する。 IncEventGS における従来の SLAM パイプラインの追跡とマッピングのパラダイムを生かして,3D シーンの表現を漸進的に復元する。受信したイベントストリームから、トラッカーは、事前に再構成された3D-GSシーン表現に基づいて、最初に初期カメラモーションを推定する。そして、マッパーは、トラッカーから予め推定された動き軌跡に基づいて、3Dシーン表現とカメラモーションの両方を共同で洗練する。実験結果から,IncEventGSは従来のNeRF法やそれに関連するベースラインに比べて優れた性能を示すことが示された。さらに,本手法は,カメラモーション推定の点から,最先端のイベント・オドメトリー法よりも優れた性能を実現することができる。コードは https://github.com/wu-cvgl/IncEventGS で公開されている。 Implicit neural representation and explicit 3D Gaussian Splatting (3D-GS) for novel view synthesis have achieved remarkable progress with frame-based cameras (e.g., RGB and RGB-D cameras) recently. Compared to frame-based cameras, a novel type of bio-inspired visual sensor, i.e., the event camera, has demonstrated advantages in high temporal resolution, high dynamic range, low power consumption and low latency. Due to its unique asynchronous and irregular data capturing process, limited work has been proposed to apply neural representation or 3D Gaussian splatting for an event camera. In this work, we present IncEventGS, an incremental 3D Gaussian Splatting reconstruction algorithm with a single event camera. To recover the 3D scene representation incrementally, we exploit the tracking and mapping paradigm of conventional SLAM pipelines for IncEventGS. Given the incoming event stream, the tracker first estimates an initial camera motion based on the previously reconstructed 3D-GS scene representation. 
The mapper then jointly refines both the 3D scene representation and camera motion based on the previously estimated motion trajectory from the tracker. The experimental results demonstrate that IncEventGS delivers superior performance compared to prior NeRF-based methods and other related baselines, even without ground-truth camera poses. Furthermore, our method can also deliver better performance compared to state-of-the-art event visual odometry methods in terms of camera motion estimation. Code is publicly available at: https://github.com/wu-cvgl/IncEventGS.	翻訳日:2024-10-31 05:25:16 公開日:2024-10-18
# IncEventGS:1つのイベントカメラからポスフリーのガウシアンスプレイティング IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera ( http://arxiv.org/abs/2410.08107v2 ) ライセンス: Link先を確認	Jian Huang, Chengrui Dong, Peidong Liu,	(参考訳) 近年, フレームベースカメラ(例えばRGB, RGB-Dカメラ)では, 暗黙の神経表現と3Dガウススプラッティング(3D-GS)が顕著な進歩を遂げている。フレームベースのカメラと比較して、新しいタイプのバイオインスパイアされた視覚センサ、すなわちイベントカメラは、高時間分解能、高ダイナミックレンジ、低消費電力、低レイテンシの利点を実証している。ユニークな非同期かつ不規則なデータキャプチャプロセスのため、イベントカメラにニューラル表現や3Dガウススプラッティングを適用するための限られた作業が提案されている。本研究では,1つのイベントカメラを用いたインクリメンタルな3次元ガウス分割再構成アルゴリズムであるIncEventGSを提案する。 IncEventGS における従来の SLAM パイプラインの追跡とマッピングのパラダイムを生かして,3D シーンの表現を漸進的に復元する。受信したイベントストリームから、トラッカーは、事前に再構成された3D-GSシーン表現に基づいて、最初に初期カメラモーションを推定する。そして、マッパーは、トラッカーから予め推定された動き軌跡に基づいて、3Dシーン表現とカメラモーションの両方を共同で洗練する。実験結果から,IncEventGSは従来のNeRF法やそれに関連するベースラインに比べて優れた性能を示すことが示された。さらに,本手法は,カメラモーション推定の点から,最先端のイベント・オドメトリー法よりも優れた性能を実現することができる。コードは https://github.com/wu-cvgl/IncEventGS で公開されている。 Implicit neural representation and explicit 3D Gaussian Splatting (3D-GS) for novel view synthesis have achieved remarkable progress with frame-based cameras (e.g., RGB and RGB-D cameras) recently. Compared to frame-based cameras, a novel type of bio-inspired visual sensor, i.e., the event camera, has demonstrated advantages in high temporal resolution, high dynamic range, low power consumption and low latency. Due to its unique asynchronous and irregular data capturing process, limited work has been proposed to apply neural representation or 3D Gaussian splatting for an event camera. In this work, we present IncEventGS, an incremental 3D Gaussian Splatting reconstruction algorithm with a single event camera. To recover the 3D scene representation incrementally, we exploit the tracking and mapping paradigm of conventional SLAM pipelines for IncEventGS. Given the incoming event stream, the tracker first estimates an initial camera motion based on the previously reconstructed 3D-GS scene representation. 
The mapper then jointly refines both the 3D scene representation and camera motion based on the previously estimated motion trajectory from the tracker. The experimental results demonstrate that IncEventGS delivers superior performance compared to prior NeRF-based methods and other related baselines, even without ground-truth camera poses. Furthermore, our method can also deliver better performance compared to state-of-the-art event visual odometry methods in terms of camera motion estimation. Code is publicly available at: https://github.com/wu-cvgl/IncEventGS.	翻訳日:2024-10-31 05:25:16 公開日:2024-10-18
# 連続制御における遅い決定周波数の克服:モデルに基づくモデル自由制御の系列強化学習 Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control ( http://arxiv.org/abs/2410.08979v1 ) ライセンス: Link先を確認	Devdhar Patel, Hava Siegelmann,	(参考訳) 強化学習(RL)は急速に人間レベルの制御能力を超えつつある。しかし、最先端のRLアルゴリズムは人間の能力よりもはるかに高速な時間ステップと反応時間を必要とすることが多く、これは現実の環境では非現実的であり、通常は特殊なハードウェアを必要とする。このようなスピードは現実世界では達成が困難であり、しばしば特別なハードウェアを必要とする。本稿では、与えられた入力状態に対するアクションのシーケンスを生成するために設計されたRLアルゴリズムであるSequence Reinforcement Learning (SRL)を紹介し、より低い決定周波数での効果的な制御を可能にする。 SRLは、異なる時間スケールで動作するモデルとアクタークリティカルアーキテクチャの両方を利用することで、アクションシーケンスを学習する際の課題に対処する。本研究では,このモデルを用いて原始的行動間の中間状態を推定し,各行動に対する学習信号を提供する「時間的リコール」機構を提案する。トレーニングが完了すると、アクターはモデルとは独立してアクションシーケンスを生成し、より遅い周波数でモデルフリー制御を達成する。我々はSRLを一連の連続制御タスクで評価し、最先端のアルゴリズムに匹敵する性能を実現するとともに、アクターサンプルの複雑さを著しく低減することを示した。そこで,周波数平均スコア(FAS)測定基準を導入する。その結果,SRL は従来の RL アルゴリズムよりも FAS の方が優れており,可変決定周波数を必要とするアプリケーションに特に適していることがわかった。さらに、SRLとモデルベースオンラインプランニングを比較し、SRLは、オンラインプランナーが計画に使用するトレーニングで同じモデルを活用しながら、優れたFASを達成することを示す。 Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. Such speeds are difficult to achieve in the real world and often require specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a "temporal recall" mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. 
Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning.	翻訳日:2024-10-30 20:46:27 公開日:2024-10-18
# 連続制御における遅い決定周波数の克服:モデルに基づくモデル自由制御の系列強化学習 Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control ( http://arxiv.org/abs/2410.08979v2 ) ライセンス: Link先を確認	Devdhar Patel, Hava Siegelmann,	(参考訳) 強化学習(RL)は急速に人間レベルの制御能力を超えつつある。しかし、最先端のRLアルゴリズムは人間の能力よりもはるかに高速な時間ステップと反応時間を必要とすることが多く、これは現実の環境では非現実的であり、通常は特殊なハードウェアを必要とする。このようなスピードは現実世界では達成が困難であり、しばしば特別なハードウェアを必要とする。本稿では、与えられた入力状態に対するアクションのシーケンスを生成するために設計されたRLアルゴリズムであるSequence Reinforcement Learning (SRL)を紹介し、より低い決定周波数での効果的な制御を可能にする。 SRLは、異なる時間スケールで動作するモデルとアクタークリティカルアーキテクチャの両方を利用することで、アクションシーケンスを学習する際の課題に対処する。本研究では,このモデルを用いて原始的行動間の中間状態を推定し,各行動に対する学習信号を提供する「時間的リコール」機構を提案する。トレーニングが完了すると、アクターはモデルとは独立してアクションシーケンスを生成し、より遅い周波数でモデルフリー制御を達成する。我々はSRLを一連の連続制御タスクで評価し、最先端のアルゴリズムに匹敵する性能を実現するとともに、アクターサンプルの複雑さを著しく低減することを示した。そこで,周波数平均スコア(FAS)測定基準を導入する。その結果,SRL は従来の RL アルゴリズムよりも FAS の方が優れており,可変決定周波数を必要とするアプリケーションに特に適していることがわかった。さらに、SRLとモデルベースオンラインプランニングを比較し、SRLは、オンラインプランナーが計画に使用するトレーニングで同じモデルを活用しながら、優れたFASを達成することを示す。 Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. Such speeds are difficult to achieve in the real world and often require specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a "temporal recall" mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. 
Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning.	翻訳日:2024-10-30 20:46:27 公開日:2024-10-18
# 観測可能な測定誘起遷移 Observable Measurement-Induced Transitions ( http://arxiv.org/abs/2410.09353v1 ) ライセンス: Link先を確認	Aleksei Khindanov, Igor L. Aleiner, Lara Faoro, Lev B. Ioffe,	(参考訳) 量子力学の主要な仮定の1つは、測定が量子コヒーレンスを破壊する(波動関数の崩壊)ことである。近年、多体系では希薄な局所測定を行ってもシステム全体にわたってある程度のコヒーレンスが保たれることが判明した。測定密度が大きくなると、システムの異なる部分間のエンタングルメントの解消を特徴とする相転移が発生する。残念ながら、この遷移は、多体波動関数の指数的にコストのかかるフルトモグラフィーや、オラクル古典計算機上のシミュレーションとの比較を必要とするため、巨視的な系に対して実験的に観測することは不可能である。本研究では、量子ダイナミクスを逆転できる場合に実験的に観測可能な、別の測定誘起相転移の発見を報告する。この相転移の一方の側では、ヒルベルト空間の一部に符号化された量子情報が、時間反転後に完全に回復される。もう一方の側では、全ての量子情報が損なわれる。この遷移は、何度も繰り返される同一のブロックからなるプロセスにおいて、同じ測定結果を観測する確率の挙動の変化としても現れる。各ブロックでは、ユニタリ発展に続いて測定が行われる。遷移の一方の側では、確率は繰り返し回数とともに指数関数的に減少し、もう一方の側では繰り返し回数が増えるにつれて一定値に近づく。本研究では、現実的な量子回路の数値シミュレーションと、有効ランダム行列理論モデルを用いた解析計算により、提案した相転移の存在を確認する。 One of the main postulates of quantum mechanics is that measurements destroy quantum coherence (wave function collapse). Recently it was discovered that in a many-body system dilute local measurements still preserve some coherence across the entire system. As the measurement density is increased, a phase transition occurs that is characterized by the disentanglement of different parts of the system. Unfortunately, this transition is impossible to observe experimentally for macroscopic systems because it requires an exponentially costly full tomography of the many-body wave function or a comparison with the simulation on an oracle classical computer. In this work we report the discovery of another measurement-induced phase transition that can be observed experimentally if quantum dynamics can be reversed. On one side of this phase transition the quantum information encoded in some part of the Hilbert space is fully recovered after the time inversion. On the other side, all quantum information is corrupted. This transition also manifests itself as the change in the behavior of the probability to observe the same measurement outcome in the process that consists of identical blocks repeated many times. 
In each block the unitary evolution is followed by the measurement. On one side of the transition the probability decreases exponentially with the number of repetitions; on the other it tends to a constant as the number of repetitions is increased. We confirm the existence of the proposed phase transition through numerical simulations of realistic quantum circuits and analytical calculations using an effective random-matrix theory model.	翻訳日:2024-10-30 14:53:51 公開日:2024-10-18
# 観測可能な測定誘起遷移 Observable Measurement-Induced Transitions ( http://arxiv.org/abs/2410.09353v2 ) ライセンス: Link先を確認	Aleksei Khindanov, Igor L. Aleiner, Lara Faoro, Lev B. Ioffe,	(参考訳) 量子力学の主要な仮定の1つは、測定が量子コヒーレンスを破壊する(波動関数の崩壊)ことである。近年、多体系では希薄な局所測定を行ってもシステム全体にわたってある程度のコヒーレンスが保たれることが判明した。測定密度が大きくなると、システムの異なる部分間のエンタングルメントの解消を特徴とする相転移が発生する。残念ながら、この遷移は、多体波動関数の指数的にコストのかかるフルトモグラフィーや、オラクル古典計算機上のシミュレーションとの比較を必要とするため、巨視的な系に対して実験的に観測することは不可能である。本研究では、量子ダイナミクスを逆転できる場合に実験的に観測可能な、別の測定誘起相転移の発見を報告する。この相転移の一方の側では、ヒルベルト空間の一部に符号化された量子情報が、時間反転後に完全に回復される。もう一方の側では、全ての量子情報が損なわれる。この遷移は、何度も繰り返される同一のブロックからなるプロセスにおいて、同じ測定結果を観測する確率の挙動の変化としても現れる。各ブロックでは、ユニタリ発展に続いて測定が行われる。遷移の一方の側では、確率は繰り返し回数とともに指数関数的に減少し、もう一方の側では繰り返し回数が増えるにつれて一定値に近づく。本研究では、現実的な量子回路の数値シミュレーションと、有効ランダム行列理論モデルを用いた解析計算により、提案した相転移の存在を確認する。 One of the main postulates of quantum mechanics is that measurements destroy quantum coherence (wave function collapse). Recently it was discovered that in a many-body system dilute local measurements still preserve some coherence across the entire system. As the measurement density is increased, a phase transition occurs that is characterized by the disentanglement of different parts of the system. Unfortunately, this transition is impossible to observe experimentally for macroscopic systems because it requires an exponentially costly full tomography of the many-body wave function or a comparison with the simulation on an oracle classical computer. In this work we report the discovery of another measurement-induced phase transition that can be observed experimentally if quantum dynamics can be reversed. On one side of this phase transition the quantum information encoded in some part of the Hilbert space is fully recovered after the time inversion. On the other side, all quantum information is corrupted. This transition also manifests itself as the change in the behavior of the probability to observe the same measurement outcome in the process that consists of identical blocks repeated many times. 
In each block the unitary evolution is followed by the measurement. On one side of the transition the probability decreases exponentially with the number of repetitions; on the other it tends to a constant as the number of repetitions is increased. We confirm the existence of the proposed phase transition through numerical simulations of realistic quantum circuits and analytical calculations using an effective random-matrix theory model.	翻訳日:2024-10-30 14:53:51 公開日:2024-10-18
# 擬似物理情報に導かれた3次元マグネトテルリック深層学習インバージョン 3-D Magnetotelluric Deep Learning Inversion Guided by Pseudo-Physical Information ( http://arxiv.org/abs/2410.09388v1 ) ライセンス: Link先を確認	Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang, Islam Fadel,	(参考訳) 近年、データ駆動と物理駆動を組み合わせたマグネトテルリック(MT)深層学習(DL)インバージョン法が注目されている。ニューラルネットワーク(NN)を用いて観測データ(またはフォワードモデリングデータ)を比抵抗モデルにマッピングする際、インバージョンで得た比抵抗のフォワードモデリング応答の誤差(損失)項を組み込むと、電磁場伝播に関する物理情報が導入され、インバージョン精度が大幅に向上する。大規模3次元MTデータに対するデータ・物理二重駆動型のMT深層学習インバージョンを効率よく実現するために、DLフォワードモデリングネットワークを用いて損失のこの部分を計算することを提案する。この手法は、NNによるフォワードモデリングのシミュレーションを通じて擬似物理情報を導入し、インバージョンネットワークの適合をさらに導く。具体的には、まずフォワードモデリングネットワークを固定のフォワードモデリング演算子として事前訓練し、次にそれらをインバージョンネットワークの訓練に転送・統合し、最終的に多項損失(multinomial loss)を最小化してインバージョンネットワークを最適化する。理論実験の結果、DLフォワードモデリングには多少のシミュレーション誤差があるものの、導入した擬似物理情報はインバージョン精度を向上させ、訓練中の過学習問題を大幅に軽減することがわかった。さらに、データへのマスキングとノイズ付加を行う新しい入力モードを提案し、3次元MTインバージョンにおけるフィールドデータ環境を模擬することで、手法を実用に向けてより柔軟かつ効果的にする。 Magnetotelluric deep learning (DL) inversion methods based on joint data-driven and physics-driven approaches have become a hot topic in recent years. When mapping observation data (or forward modeling data) to the resistivity model using neural networks (NNs), incorporating the error (loss) term of the inversion resistivity's forward modeling response--which introduces physical information about electromagnetic field propagation--can significantly enhance the inversion accuracy. To efficiently achieve data-physical dual-driven MT deep learning inversion for large-scale 3-D MT data, we propose using DL forward modeling networks to compute this portion of the loss. This approach introduces pseudo-physical information through the forward modeling of NN simulation, further guiding the inversion network fitting. Specifically, we first pre-train the forward modeling networks as fixed forward modeling operators, then transfer and integrate them into the inversion network training, and finally optimize the inversion network by minimizing the multinomial loss. 
Theoretical experimental results indicate that despite some simulation errors in DL forward modeling, the introduced pseudo-physical information still enhances inversion accuracy and significantly mitigates the overfitting problem during training. Additionally, we propose a new input mode that involves masking and adding noise to the data, simulating the field data environment of 3-D MT inversion, thereby making the method more flexible and effective for practical applications.	翻訳日:2024-10-30 14:44:04 公開日:2024-10-18
# 擬似物理情報に導かれた3次元マグネトテルリック深層学習インバージョン 3-D Magnetotelluric Deep Learning Inversion Guided by Pseudo-Physical Information ( http://arxiv.org/abs/2410.09388v2 ) ライセンス: Link先を確認	Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang, Islam Fadel,	(参考訳) 近年、データ駆動と物理駆動を組み合わせたマグネトテルリック(MT)深層学習(DL)インバージョン法が注目されている。ニューラルネットワーク(NN)を用いて観測データ(またはフォワードモデリングデータ)を比抵抗モデルにマッピングする際、インバージョンで得た比抵抗のフォワードモデリング応答の誤差(損失)項を組み込むと、電磁場伝播に関する物理情報が導入され、インバージョン精度が大幅に向上する。大規模3次元MTデータに対するデータ・物理二重駆動型のMT深層学習インバージョンを効率よく実現するために、DLフォワードモデリングネットワークを用いて損失のこの部分を計算することを提案する。この手法は、NNによるフォワードモデリングのシミュレーションを通じて擬似物理情報を導入し、インバージョンネットワークの適合をさらに導く。具体的には、まずフォワードモデリングネットワークを固定のフォワードモデリング演算子として事前訓練し、次にそれらをインバージョンネットワークの訓練に転送・統合し、最終的に多項損失(multinomial loss)を最小化してインバージョンネットワークを最適化する。理論実験の結果、DLフォワードモデリングには多少のシミュレーション誤差があるものの、導入した擬似物理情報はインバージョン精度を向上させ、訓練中の過学習問題を大幅に軽減することがわかった。さらに、データへのマスキングとノイズ付加を行う新しい入力モードを提案し、3次元MTインバージョンにおけるフィールドデータ環境を模擬することで、手法を実用に向けてより柔軟かつ効果的にする。 Magnetotelluric deep learning (DL) inversion methods based on joint data-driven and physics-driven approaches have become a hot topic in recent years. When mapping observation data (or forward modeling data) to the resistivity model using neural networks (NNs), incorporating the error (loss) term of the inversion resistivity's forward modeling response--which introduces physical information about electromagnetic field propagation--can significantly enhance the inversion accuracy. To efficiently achieve data-physical dual-driven MT deep learning inversion for large-scale 3-D MT data, we propose using DL forward modeling networks to compute this portion of the loss. This approach introduces pseudo-physical information through the forward modeling of NN simulation, further guiding the inversion network fitting. Specifically, we first pre-train the forward modeling networks as fixed forward modeling operators, then transfer and integrate them into the inversion network training, and finally optimize the inversion network by minimizing the multinomial loss. 
Theoretical experimental results indicate that despite some simulation errors in DL forward modeling, the introduced pseudo-physical information still enhances inversion accuracy and significantly mitigates the overfitting problem during training. Additionally, we propose a new input mode that involves masking and adding noise to the data, simulating the field data environment of 3-D MT inversion, thereby making the method more flexible and effective for practical applications.	翻訳日:2024-10-30 14:44:04 公開日:2024-10-18
# VLFeedback: 大規模視覚言語モデルのアライメントのための大規模AIフィードバックデータセット VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment ( http://arxiv.org/abs/2410.09421v1 ) ライセンス: Link先を確認	Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu,	(参考訳) 大規模視覚言語モデル(LVLM)が急速に進化するにつれて、これらのモデルをアライメントするための高品質で多様なデータへの需要がますます重要になっている。しかし、人間の監督によるこのようなデータの作成は、コストが高く時間もかかる。本稿では、LVLMのアライメントに向けて監督をスケールさせる手段としてのAIフィードバックの有効性を検討する。VLFeedbackは、82K件以上のマルチモーダル命令と、人間のアノテーションなしに市販のモデルが生成した包括的な根拠(rationale)からなる、最初の大規模視覚言語フィードバックデータセットである。視覚言語アライメントにおけるAIフィードバックの有効性を評価するために、VLFeedback上で直接選好最適化により微調整したLVLMであるSilkieを訓練する。Silkieは、有用性、視覚的忠実性、安全性の指標において優れた性能を示す。知覚タスクと認知タスクでベースモデルをそれぞれ6.9%、9.5%上回り、MMHal-Benchでは幻覚の問題を減らし、レッドチーミング攻撃に対する耐性も向上している。さらに、我々の分析は、AIフィードバックの利点、特により包括的な改善をもたらす選好の多様性を育む点を裏付けている。私たちのデータセット、トレーニングコード、モデルはhttps://vlf-silkie.github.ioで公開されています。 As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. 
Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at https://vlf-silkie.github.io.	翻訳日:2024-10-30 14:24:23 公開日:2024-10-18
# VLFeedback: 大規模視覚言語モデルのアライメントのための大規模AIフィードバックデータセット VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment ( http://arxiv.org/abs/2410.09421v2 ) ライセンス: Link先を確認	Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu,	(参考訳) 大規模視覚言語モデル(LVLM)が急速に進化するにつれて、これらのモデルをアライメントするための高品質で多様なデータへの需要がますます重要になっている。しかし、人間の監督によるこのようなデータの作成は、コストが高く時間もかかる。本稿では、LVLMのアライメントに向けて監督をスケールさせる手段としてのAIフィードバックの有効性を検討する。VLFeedbackは、82K件以上のマルチモーダル命令と、人間のアノテーションなしに市販のモデルが生成した包括的な根拠(rationale)からなる、最初の大規模視覚言語フィードバックデータセットである。視覚言語アライメントにおけるAIフィードバックの有効性を評価するために、VLFeedback上で直接選好最適化により微調整したLVLMであるSilkieを訓練する。Silkieは、有用性、視覚的忠実性、安全性の指標において優れた性能を示す。知覚タスクと認知タスクでベースモデルをそれぞれ6.9%、9.5%上回り、MMHal-Benchでは幻覚の問題を減らし、レッドチーミング攻撃に対する耐性も向上している。さらに、我々の分析は、AIフィードバックの利点、特により包括的な改善をもたらす選好の多様性を育む点を裏付けている。私たちのデータセット、トレーニングコード、モデルはhttps://vlf-silkie.github.ioで公開されています。 As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. 
Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at https://vlf-silkie.github.io.	翻訳日:2024-10-30 14:24:23 公開日:2024-10-18
# Timeseria: オブジェクト指向時系列処理ライブラリ Timeseria: an object-oriented time series processing library ( http://arxiv.org/abs/2410.09567v1 ) ライセンス: Link先を確認	Stefano Alberto Russo, Giuliano Taffoni, Luca Bortolussi,	(参考訳) TimeseriaはPythonで実装されたオブジェクト指向の時系列処理ライブラリで、時系列データの操作と、その上での統計モデルや機械学習モデルの構築を容易にすることを目的としている。一般的なデータ分析フレームワークとは異なり、明確に定義された再利用可能な論理ユニット(オブジェクト)から構築されており、これらは高い一貫性を確保するために容易に組み合わせることができる。このアプローチにより、Timeseriaは、データ損失、非一様なサンプリングレート、集約データと時点観測の違い、タイムゾーン、夏時間など、しばしば過小評価されるいくつかの非自明な問題に設計段階から対処できる。Timeseriaには、包括的な基本データ構造一式、一般的なデータ操作、データ再構成・予測・異常検出のための拡張可能なモデルが含まれている。また、数百万のデータポイントでも処理できる強力なプロットエンジンを統合している。 Timeseria is an object-oriented time series processing library implemented in Python, which aims at making it easier to manipulate time series data and to build statistical and machine learning models on top of it. Unlike common data analysis frameworks, it builds up from well-defined and reusable logical units (objects), which can be easily combined together in order to ensure a high level of consistency. Thanks to this approach, Timeseria can address by design several non-trivial issues often underestimated, such as handling data losses, non-uniform sampling rates, differences between aggregated data and punctual observations, time zones, daylight saving times, and more. Timeseria comes with a comprehensive set of base data structures, common data manipulation operations, and extensible models for data reconstruction, forecasting and anomaly detection. It also integrates a powerful plotting engine capable of handling even millions of data points.	翻訳日:2024-10-30 13:45:15 公開日:2024-10-18
# Timeseria: オブジェクト指向時系列処理ライブラリ Timeseria: an object-oriented time series processing library ( http://arxiv.org/abs/2410.09567v2 ) ライセンス: Link先を確認	Stefano Alberto Russo, Giuliano Taffoni, Luca Bortolussi,	(参考訳) TimeseriaはPythonで実装されたオブジェクト指向の時系列処理ライブラリで、時系列データの操作と、その上での統計モデルや機械学習モデルの構築を容易にすることを目的としている。一般的なデータ分析フレームワークとは異なり、明確に定義された再利用可能な論理ユニット(オブジェクト)から構築されており、これらは高い一貫性を確保するために容易に組み合わせることができる。このアプローチにより、Timeseriaは、データ損失、非一様なサンプリングレート、集約データと時点観測の違い、タイムゾーン、夏時間など、しばしば過小評価されるいくつかの非自明な問題に設計段階から対処できる。Timeseriaには、包括的な基本データ構造一式、一般的なデータ操作、データ再構成・予測・異常検出のための拡張可能なモデルが含まれている。また、数百万のデータポイントでも処理できる強力なプロットエンジンを統合している。 Timeseria is an object-oriented time series processing library implemented in Python, which aims at making it easier to manipulate time series data and to build statistical and machine learning models on top of it. Unlike common data analysis frameworks, it builds up from well-defined and reusable logical units (objects), which can be easily combined together in order to ensure a high level of consistency. Thanks to this approach, Timeseria can address by design several non-trivial issues often underestimated, such as handling data losses, non-uniform sampling rates, differences between aggregated data and punctual observations, time zones, daylight saving times, and more. Timeseria comes with a comprehensive set of base data structures, common data manipulation operations, and extensible models for data reconstruction, forecasting and anomaly detection. It also integrates a powerful plotting engine capable of handling even millions of data points.	翻訳日:2024-10-30 13:45:15 公開日:2024-10-18
# 大規模言語モデルは地理空間コードを生成することができるか? Can Large Language Models Generate Geospatial Code? ( http://arxiv.org/abs/2410.09738v1 ) ライセンス: Link先を確認	Shuyang Hou, Shen Zhangxiao, Liang Jianyuan, Zhao Anqi, Gui Zhipeng, Li Rui, Huayi Wu,	(参考訳) 時空間データ処理と地理空間モデリングの需要が高まる中、地理空間コード生成の自動化は生産性に欠かせないものとなっている。大規模言語モデル(LLM)はコード生成において有望であるが、ドメイン固有の知識ギャップや「コーディング幻覚」といった課題に直面している。本稿では、LLMの地理空間コード生成能力を「認知と記憶」「理解と解釈」「革新と創造」の3つの次元、8つの能力レベルにわたって評価するフレームワークであるGeoCode-Eval(GCE)を紹介する。ベンチマークデータセットであるGeoCode-Benchは、5,000問の多肢選択問題、1,500問の穴埋め問題、1,500問の正誤問題、およびコードの要約・生成・補完・修正を対象とする1,000件の主観タスクで構成される。GeoCode-Benchを用いて、3つの商用クローズドソースLLM、4つのオープンソース汎用LLM、14の特化型コード生成モデルを評価した。また、few-shot学習とzero-shot学習、思考の連鎖(Chain of Thought)推論、多ラウンド多数決(multi-round majority voting)の実験を行い、地理空間コード生成への影響を計測した。さらに、Google Earth Engine関連のJavaScriptを用いてCode LLaMA-7Bモデルを微調整してGEECode-GPTを作成し、主観タスクで評価した。その結果、事前学習データセットと命令データセットの構築がコード生成を大幅に改善することが示され、特定ドメインでLLMを最適化するための洞察が得られた。 With the growing demand for spatiotemporal data processing and geospatial modeling, automating geospatial code generation has become essential for productivity. Large language models (LLMs) show promise in code generation but face challenges like domain-specific knowledge gaps and "coding hallucinations." This paper introduces GeoCode-Eval (GCE), a framework for assessing LLMs' ability to generate geospatial code across three dimensions: "Cognition and Memory," "Comprehension and Interpretation," and "Innovation and Creation," distributed across eight capability levels. We developed a benchmark dataset, GeoCode-Bench, consisting of 5,000 multiple-choice, 1,500 fill-in-the-blank, 1,500 true/false questions, and 1,000 subjective tasks covering code summarization, generation, completion, and correction. Using GeoCode-Bench, we evaluated three commercial closed-source LLMs, four open-source general-purpose LLMs, and 14 specialized code generation models. 
We also conducted experiments on few-shot and zero-shot learning, Chain of Thought reasoning, and multi-round majority voting to measure their impact on geospatial code generation. Additionally, we fine-tuned the Code LLaMA-7B model using Google Earth Engine-related JavaScript, creating GEECode-GPT, and evaluated it on subjective tasks. Results show that constructing pre-training and instruction datasets significantly improves code generation, offering insights for optimizing LLMs in specific domains.	翻訳日:2024-10-30 05:12:47 公開日:2024-10-18
# 大規模言語モデルは地理空間コードを生成することができるか? Can Large Language Models Generate Geospatial Code? ( http://arxiv.org/abs/2410.09738v2 ) ライセンス: Link先を確認	Shuyang Hou, Zhangxiao Shen, Jianyuan Liang, Anqi Zhao, Zhipeng Gui, Rui Li, Huayi Wu,	(参考訳) 時空間データ処理と地理空間モデリングの需要が高まる中、地理空間コード生成の自動化は生産性に欠かせないものとなっている。大規模言語モデル(LLM)はコード生成において有望であるが、ドメイン固有の知識ギャップや「コーディング幻覚」といった課題に直面している。本稿では、LLMの地理空間コード生成能力を「認知と記憶」「理解と解釈」「革新と創造」の3つの次元、8つの能力レベルにわたって評価するフレームワークであるGeoCode-Eval(GCE)を紹介する。ベンチマークデータセットであるGeoCode-Benchは、5,000問の多肢選択問題、1,500問の穴埋め問題、1,500問の正誤問題、およびコードの要約・生成・補完・修正を対象とする1,000件の主観タスクで構成される。GeoCode-Benchを用いて、3つの商用クローズドソースLLM、4つのオープンソース汎用LLM、14の特化型コード生成モデルを評価した。また、few-shot学習とzero-shot学習、思考の連鎖(Chain of Thought)推論、多ラウンド多数決(multi-round majority voting)の実験を行い、地理空間コード生成への影響を計測した。さらに、Google Earth Engine関連のJavaScriptを用いてCode LLaMA-7Bモデルを微調整してGEECode-GPTを作成し、主観タスクで評価した。その結果、事前学習データセットと命令データセットの構築がコード生成を大幅に改善することが示され、特定ドメインでLLMを最適化するための洞察が得られた。 With the growing demand for spatiotemporal data processing and geospatial modeling, automating geospatial code generation has become essential for productivity. Large language models (LLMs) show promise in code generation but face challenges like domain-specific knowledge gaps and "coding hallucinations." This paper introduces GeoCode-Eval (GCE), a framework for assessing LLMs' ability to generate geospatial code across three dimensions: "Cognition and Memory," "Comprehension and Interpretation," and "Innovation and Creation," distributed across eight capability levels. We developed a benchmark dataset, GeoCode-Bench, consisting of 5,000 multiple-choice, 1,500 fill-in-the-blank, 1,500 true/false questions, and 1,000 subjective tasks covering code summarization, generation, completion, and correction. Using GeoCode-Bench, we evaluated three commercial closed-source LLMs, four open-source general-purpose LLMs, and 14 specialized code generation models. 
We also conducted experiments on few-shot and zero-shot learning, Chain of Thought reasoning, and multi-round majority voting to measure their impact on geospatial code generation. Additionally, we fine-tuned the Code LLaMA-7B model using Google Earth Engine-related JavaScript, creating GEECode-GPT, and evaluated it on subjective tasks. Results show that constructing pre-training and instruction datasets significantly improves code generation, offering insights for optimizing LLMs in specific domains.	翻訳日:2024-10-30 05:12:47 公開日:2024-10-18
# BlackDAN: 大規模言語モデルの効果的かつ文脈的ジェイルブレイクのためのブラックボックス多目的アプローチ BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models ( http://arxiv.org/abs/2410.09804v1 ) ライセンス: Link先を確認	Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, Chengwei Pan, Lei Sha, Minlie Huang,	(参考訳) 大規模言語モデル(LLM)は様々なタスクで優れた能力を示す一方で、脆弱性を悪用してセキュリティ対策をバイパスし有害な出力を生成させるジェイルブレイク攻撃のような潜在的なセキュリティリスクに直面している。既存のジェイルブレイク戦略は主に攻撃成功率(ASR)の最大化に重点を置いており、クエリに対するジェイルブレイク応答の関連性やステルス性の度合いなど、他の重要な要素をしばしば無視している。このように単一の目的に焦点を絞ると、文脈的関連性を欠いたり、容易に見破られたりする非効果的な攻撃につながる可能性がある。本研究では、多目的最適化を備えた革新的なブラックボックス攻撃フレームワークであるBlackDANを紹介する。BlackDANは、文脈的関連性を維持し検出可能性を最小限に抑えつつ、ジェイルブレイクを効果的に促進する高品質なプロンプトの生成を目指す。BlackDANは多目的進化アルゴリズム(MOEA)、特にNSGA-IIアルゴリズムを活用して、ASR、ステルス性、意味的関連性を含む複数の目的にわたってジェイルブレイクを最適化する。BlackDANは、突然変異、交叉、パレート支配などのメカニズムを統合することで、ジェイルブレイクを生成するための透明で解釈可能なプロセスを提供する。さらに、このフレームワークはユーザの好みに基づくカスタマイズを可能にし、有害性、関連性、その他の要因のバランスをとったプロンプトを選択できる。実験の結果、BlackDANは従来の単目的手法を上回り、様々なLLMおよびマルチモーダルLLMにわたってより高い成功率と向上した堅牢性を示し、同時にジェイルブレイク応答の関連性が高く検出されにくいことが確認された。 While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. 
BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.	翻訳日:2024-10-30 04:52:52 公開日:2024-10-18
# BlackDAN: 大規模言語モデルの効果的かつ文脈的ジェイルブレイクのためのブラックボックス多目的アプローチ BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models ( http://arxiv.org/abs/2410.09804v2 ) ライセンス: Link先を確認	Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, Chengwei Pan, Lei Sha, Minlie Huang,	(参考訳) 大規模言語モデル(LLM)は様々なタスクで優れた能力を示す一方で、脆弱性を悪用してセキュリティ対策をバイパスし有害な出力を生成させるジェイルブレイク攻撃のような潜在的なセキュリティリスクに直面している。既存のジェイルブレイク戦略は主に攻撃成功率(ASR)の最大化に重点を置いており、クエリに対するジェイルブレイク応答の関連性やステルス性の度合いなど、他の重要な要素をしばしば無視している。このように単一の目的に焦点を絞ると、文脈的関連性を欠いたり、容易に見破られたりする非効果的な攻撃につながる可能性がある。本研究では、多目的最適化を備えた革新的なブラックボックス攻撃フレームワークであるBlackDANを紹介する。BlackDANは、文脈的関連性を維持し検出可能性を最小限に抑えつつ、ジェイルブレイクを効果的に促進する高品質なプロンプトの生成を目指す。BlackDANは多目的進化アルゴリズム(MOEA)、特にNSGA-IIアルゴリズムを活用して、ASR、ステルス性、意味的関連性を含む複数の目的にわたってジェイルブレイクを最適化する。BlackDANは、突然変異、交叉、パレート支配などのメカニズムを統合することで、ジェイルブレイクを生成するための透明で解釈可能なプロセスを提供する。さらに、このフレームワークはユーザの好みに基づくカスタマイズを可能にし、有害性、関連性、その他の要因のバランスをとったプロンプトを選択できる。実験の結果、BlackDANは従来の単目的手法を上回り、様々なLLMおよびマルチモーダルLLMにわたってより高い成功率と向上した堅牢性を示し、同時にジェイルブレイク応答の関連性が高く検出されにくいことが確認された。 While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. 
BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.	翻訳日:2024-10-30 04:52:52 公開日:2024-10-18
# 多項式時間における線形注意の学習 Learning Linear Attention in Polynomial Time ( http://arxiv.org/abs/2410.10101v1 ) ライセンス: Link先を確認	Morris Yau, Ekin Akyurek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas,	(参考訳) 従来の研究では、ブール回路やチューリングマシンを模擬する際のTransformerモデルの計算表現力が調べられてきた。しかし、これらのシミュレータを観測データから学習できるかどうかは未解決の問題であった。本研究は、線形注意を持つ単層Transformerに対して、多項式時間学習可能性に関する最初の結果(具体的には強いアグノスティックPAC学習)を与えることで、このギャップに対処する。線形注意は、適切に定義されたRKHSにおける線形予測器とみなせることを示す。その結果、任意の線形Transformerを学習する問題は、拡張された特徴空間において通常の線形予測器を学習する問題に変換でき、そのような予測器はマルチヘッド線形Transformerに変換し直すことができる。汎化については、すべての経験的リスク最小化器が、データを生成した線形Transformerと(自明な対称性を除いて)等価になるようなトレーニングデータセットを効率的に特定する方法を示し、これにより学習したモデルがすべての入力に対して正しく汎化することを保証する。最後に、線形注意で表現可能であり、したがって多項式時間で学習可能な計算の例として、連想記憶、有限オートマトン、多項式で抑えられた計算履歴を持つ万能チューリングマシン(UTM)のクラスを示す。ランダムな線形注意ネットワークの学習、キーと値の対応付け、有限オートマトンの実行の学習という3つのタスクで理論的知見を実証的に検証した。本研究は、Transformerの理論的表現力と学習可能性の間の重要なギャップを埋め、柔軟で汎用的な計算モデルが効率的に学習可能であることを示す。 Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. 
Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.	翻訳日:2024-10-30 03:04:18 公開日:2024-10-18
# 多項式時間における線形注意の学習 Learning Linear Attention in Polynomial Time ( http://arxiv.org/abs/2410.10101v2 ) ライセンス: Link先を確認	Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas,	(参考訳) 従来の研究では、ブール回路やチューリングマシンを模擬する際のTransformerモデルの計算表現力が調べられてきた。しかし、これらのシミュレータを観測データから学習できるかどうかは未解決の問題であった。本研究は、線形注意を持つ単層Transformerに対して、多項式時間学習可能性に関する最初の結果(具体的には強いアグノスティックPAC学習)を与えることで、このギャップに対処する。線形注意は、適切に定義されたRKHSにおける線形予測器とみなせることを示す。その結果、任意の線形Transformerを学習する問題は、拡張された特徴空間において通常の線形予測器を学習する問題に変換でき、そのような予測器はマルチヘッド線形Transformerに変換し直すことができる。汎化については、すべての経験的リスク最小化器が、データを生成した線形Transformerと(自明な対称性を除いて)等価になるようなトレーニングデータセットを効率的に特定する方法を示し、これにより学習したモデルがすべての入力に対して正しく汎化することを保証する。最後に、線形注意で表現可能であり、したがって多項式時間で学習可能な計算の例として、連想記憶、有限オートマトン、多項式で抑えられた計算履歴を持つ万能チューリングマシン(UTM)のクラスを示す。ランダムな線形注意ネットワークの学習、キーと値の対応付け、有限オートマトンの実行の学習という3つのタスクで理論的知見を実証的に検証した。本研究は、Transformerの理論的表現力と学習可能性の間の重要なギャップを埋め、柔軟で汎用的な計算モデルが効率的に学習可能であることを示す。 Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. 
Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.	翻訳日:2024-10-30 03:04:18 公開日:2024-10-18
# X-Fi:マルチモーダルヒューマンセンシングのためのモダリティ不変基礎モデル X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing ( http://arxiv.org/abs/2410.10167v1 ) ライセンス: Link先を確認	Xinyan Chen, Jianfei Yang,	(参考訳) さまざまなセンサーと高度なディープラーニング技術を使って人体情報を正確に捉え、解釈するヒューマンセンシングは、公共のセキュリティやロボティクスといった分野に大きな影響を与えている。しかし、現在の人間の感覚は、主にカメラやLiDARのような、それぞれ独自の強みと限界を持つモダリティに依存している。さらに、既存のマルチモーダル融合ソリューションは、通常、固定されたモーダルの組み合わせのために設計されており、様々なシナリオに対してモーダルが加えられたり取り除かれたりする際には、広範囲なリトレーニングを必要とする。本稿では、この問題に対処するため、すべてのモダリティ(X-Fi)に対するモダリティ不変基盤モデルを提案する。 X-Fiは、変圧器構造を利用して可変入力サイズを調整し、マルチモーダル統合中にモダリティ固有の特徴を保存する新しい「X-フュージョン」機構を組み込むことで、追加のトレーニングなしで、センサモダリティの独立的または複合的使用を可能にする。このアプローチは適応性を向上するだけでなく、モダリティを越えた補完的な特徴の学習を促進する。 MM-FiとXRF55のデータセットを6つの異なるモードで組み合わせた実験により,ヒトのポーズ推定(HPE)とヒトの活動認識(HAR)タスクにおいて,X-Fiが最先端のパフォーマンスを達成することを示した。この結果から,提案モデルでは広範囲の人体検知アプリケーションを効率的にサポートでき,最終的にはスケーラブルでマルチモーダルなセンシング技術の進化に寄与することが示唆された。 Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. 
Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.	翻訳日:2024-10-30 02:34:41 公開日:2024-10-18
# X-Fi:マルチモーダルヒューマンセンシングのためのモダリティ不変基礎モデル X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing ( http://arxiv.org/abs/2410.10167v2 ) ライセンス: Link先を確認	Xinyan Chen, Jianfei Yang,	(参考訳) さまざまなセンサーと高度なディープラーニング技術を使って人体情報を正確に捉え、解釈するヒューマンセンシングは、公共のセキュリティやロボティクスといった分野に大きな影響を与えている。しかし、現在の人間の感覚は、主にカメラやLiDARのような、それぞれ独自の強みと限界を持つモダリティに依存している。さらに、既存のマルチモーダル融合ソリューションは、通常、固定されたモーダルの組み合わせのために設計されており、様々なシナリオに対してモーダルが加えられたり取り除かれたりする際には、広範囲なリトレーニングを必要とする。本稿では、この問題に対処するため、すべてのモダリティ(X-Fi)に対するモダリティ不変基盤モデルを提案する。 X-Fiは、変圧器構造を利用して可変入力サイズを調整し、マルチモーダル統合中にモダリティ固有の特徴を保存する新しい「X-フュージョン」機構を組み込むことで、追加のトレーニングなしで、センサモダリティの独立的または複合的使用を可能にする。このアプローチは適応性を向上するだけでなく、モダリティを越えた補完的な特徴の学習を促進する。 MM-FiとXRF55のデータセットを6つの異なるモードで組み合わせた実験により,ヒトのポーズ推定(HPE)とヒトの活動認識(HAR)タスクにおいて,X-Fiが最先端のパフォーマンスを達成することを示した。この結果から,提案モデルでは広範囲の人体検知アプリケーションを効率的にサポートでき,最終的にはスケーラブルでマルチモーダルなセンシング技術の進化に寄与することが示唆された。 Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. 
Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.	翻訳日:2024-10-30 02:34:41 公開日:2024-10-18
# テキスト・画像合成における意味的変動の評価:因果的視点 Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective ( http://arxiv.org/abs/2410.10291v1 ) ライセンス: Link先を確認	Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu,	(参考訳) 人間の指示の正確な解釈と可視化は、テキスト・トゥ・イメージ(T2I)合成に不可欠である。しかし、現在のモデルは、単語の順序の変化から意味的なバリエーションを捉えるのに苦労しており、既存の評価は、テキストと画像の類似性のような間接的な指標に依存して、これらの課題を確実に評価することができない。これはしばしば、頻繁な単語の組み合わせに焦点をあてることで、複雑な言語パターンや一般的でない言語パターンのパフォーマンスが低下する。これらの欠陥に対処するために、SemVarEffectとSemVarBenchというベンチマークと呼ばれる新しいメトリクスを提案し、T2I合成における入力のセマンティックなバリエーションと出力の因果性を評価する。意味的変異は2種類の言語置換によって達成されるが、予測可能なリテラル変異は避けられる。実験の結果、CagView-3-PlusとIdeogram 2のスコアは0.2/1となった。対象関係の意味的変動は属性よりも理解されにくく、0.07/1と0.17-0.19/1と評価される。 UNetやTransformersの相互モーダルアライメントはセマンティックなバリエーションを扱う上で重要な役割を担っていることがわかった。本研究は,T2I合成コミュニティによるヒューマンインストラクション理解の探索を促進する効果的な評価枠組みを確立する。 Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. This often obscures poor performance on complex or uncommon linguistic patterns by the focus on frequent word combinations. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that the CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. 
We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.	翻訳日:2024-10-29 22:34:36 公開日:2024-10-18
# テキスト・画像合成における意味的変動の評価:因果的視点 Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective ( http://arxiv.org/abs/2410.10291v2 ) ライセンス: Link先を確認	Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu,	(参考訳) 人間の指示の正確な解釈と可視化は、テキスト・トゥ・イメージ(T2I)合成に不可欠である。しかし、現在のモデルは、単語の順序の変化から意味的なバリエーションを捉えるのに苦労しており、既存の評価は、テキストと画像の類似性のような間接的な指標に依存して、これらの課題を確実に評価することができない。これはしばしば、頻繁な単語の組み合わせに焦点をあてることで、複雑な言語パターンや一般的でない言語パターンのパフォーマンスが低下する。これらの欠陥に対処するために、SemVarEffectとSemVarBenchというベンチマークと呼ばれる新しいメトリクスを提案し、T2I合成における入力のセマンティックなバリエーションと出力の因果性を評価する。意味的変異は2種類の言語置換によって達成されるが、予測可能なリテラル変異は避けられる。実験の結果、CagView-3-PlusとIdeogram 2のスコアは0.2/1となった。対象関係の意味的変動は属性よりも理解されにくく、0.07/1と0.17-0.19/1と評価される。 UNetやTransformersの相互モーダルアライメントはセマンティックなバリエーションを扱う上で重要な役割を担っていることがわかった。本研究は,T2I合成コミュニティによるヒューマンインストラクション理解の探索を促進する効果的な評価枠組みを確立する。私たちのベンチマークとコードはhttps://github.com/zhuxiangru/SemVarBench で公開されています。 Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. This often obscures poor performance on complex or uncommon linguistic patterns by the focus on frequent word combinations. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that the CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. 
We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding. Our benchmark and code are available at https://github.com/zhuxiangru/SemVarBench .	翻訳日:2024-10-29 22:34:36 公開日:2024-10-18
# SensorBench: コーディングベースのセンサ処理におけるLLMのベンチマーク SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing ( http://arxiv.org/abs/2410.10741v1 ) ライセンス: Link先を確認	Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava,	(参考訳) センサデータの効果的な処理、解釈、管理は、サイバー物理システムの重要な構成要素として現れている。伝統的に、センサデータを処理するには、信号処理ツールに深い理論的知識と熟練が必要である。しかし,近年の研究では,Large Language Models (LLMs) が知覚データの処理に有望な能力を持っていることが示されており,センシングシステム開発における副操縦士としての可能性も示唆されている。この可能性を探るため、定量化のための総合的なベンチマークであるSensorBenchを構築した。このベンチマークでは、さまざまなタスクのためのさまざまな現実世界のセンサーデータセットが組み込まれている。以上の結果から,LLMは単純なタスクにはかなりの習熟度を示す一方で,パラメータ選択による構成タスクの処理において,工学的専門家と比較して固有の課題に直面していることが明らかとなった。さらに,センサ処理の4つのプロンプト戦略について検討し,48%のタスクにおいて,自己検証が他のすべてのベースラインより優れていることを示す。本研究は,LLMに基づくセンサ処理コンパロへの道筋をたどって,総合的なベンチマークと今後の発展に向けた分析を促すものである。 Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.	翻訳日:2024-10-29 19:55:21 公開日:2024-10-18
# SensorBench: コーディングベースのセンサ処理におけるLLMのベンチマーク SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing ( http://arxiv.org/abs/2410.10741v2 ) ライセンス: Link先を確認	Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava,	(参考訳) センサデータの効果的な処理、解釈、管理は、サイバー物理システムの重要な構成要素として現れている。伝統的に、センサデータを処理するには、信号処理ツールに深い理論的知識と熟練が必要である。しかし,近年の研究では,Large Language Models (LLMs) が知覚データの処理に有望な能力を持っていることが示されており,センシングシステム開発における副操縦士としての可能性も示唆されている。この可能性を探るため、定量化のための総合的なベンチマークであるSensorBenchを構築した。このベンチマークでは、さまざまなタスクのためのさまざまな現実世界のセンサーデータセットが組み込まれている。以上の結果から,LLMは単純なタスクにはかなりの習熟度を示す一方で,パラメータ選択による構成タスクの処理において,工学的専門家と比較して固有の課題に直面していることが明らかとなった。さらに,センサ処理の4つのプロンプト戦略について検討し,48%のタスクにおいて,自己検証が他のすべてのベースラインより優れていることを示す。本研究は,LLMに基づくセンサ処理コンパロへの道筋をたどって,総合的なベンチマークと今後の発展に向けた分析を促すものである。 Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.	翻訳日:2024-10-29 19:55:21 公開日:2024-10-18

# Train & Constrain: トピックと言い換えからの音韻情報に基づく早口言葉生成

Train & Constrain: Phonologically Informed Tongue-Twister Generation from Topics and Paraphrases ( http://arxiv.org/abs/2403.13901v2 )

ライセンス: Link先を確認

Tyler Loakman, Chen Tang, Chenghua Lin,

(参考訳) 音韻論的・音声学的な根拠に基づく言語生成の先行研究は、主にダジャレ(pun)や詩などの領域に焦点を当ててきた。本稿では,英語の早口言葉(tongue twister)の生成に関する新たな研究について述べる。早口言葉は,入力されたトピックやフレーズとの意味的な整合性と文法的な正しさを保ちつつ,音の重なりを最大化するように音素レベルで条件付けする必要がある言語形式である。我々は,大規模言語モデル(LLM)から音韻情報に基づく早口言葉を生成するパイプラインTwisterListerを提案し,これを用いてTwistList 2.0を構築する。TwistList 2.0は,人間とLLMの著者による17K以上の例からなる,これまでで最大のアノテーション付き早口言葉データセットである。我々の生成パイプラインは、音韻的に制約された語彙をLLMへのプロンプトと併用することで、新規で非派生的な早口言葉の例を生成する。さらに, 音韻論的知識を明示的に注入することなく音韻的に動機付けられた言語形式をどの程度生成できるかを示すために, 生成したデータセットで訓練した小規模モデルの自動評価および人手評価の結果も提示する。加えて、自己回帰言語モデルに統合可能な音素認識型制約付きデコードモジュール(PACD)を導入し、基礎となる言語モデルの微調整の有無にかかわらず良質な早口言葉を生成できることを示す。また,音韻的に動機付けられ,早口言葉に固有の本質を捉える多様な自動評価指標を,主に音素編集距離(PED)に基づいて設計・実装する。

Previous work in phonologically and phonetically grounded language generation has mainly focused on domains such as puns and poetry. In this article, we present new work on the generation of English tongue twisters - a form of language that is required to be conditioned on a phoneme level to maximize sound overlap, while maintaining semantic consistency with an input topic or phrase and still being grammatically correct. We present TwisterLister, a pipeline for generating phonologically informed tongue twisters from large language models (LLMs) that we use to generate TwistList 2.0, the largest annotated dataset of tongue twisters to date, consisting of 17K+ examples from a combination of human and LLM authors. Our generation pipeline involves the use of a phonologically constrained vocabulary alongside LLM prompting to generate novel, non-derivative tongue twister examples. We additionally present the results of automatic and human evaluation of smaller models trained on our generated dataset to demonstrate the extent to which phonologically motivated language types can be generated without explicit injection of phonological knowledge. Additionally, we introduce a phoneme-aware constrained decoding module (PACD) that can be integrated into an autoregressive language model and demonstrate that this method generates good quality tongue twisters both with and without fine-tuning the underlying language model. We also design and implement a range of automatic metrics for the task of tongue twister generation that are phonologically motivated and capture the unique essence of tongue twisters, primarily based on phonemic edit distance (PED).

翻訳日:2024-11-09 03:59:23 公開日:2024-10-18
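
The abstract above names phonemic edit distance (PED) as the basis for its automatic metrics. As a rough illustration only, the sketch below computes a plain Levenshtein distance over phoneme symbols. The function name `edit_distance` and the ARPAbet-style transcriptions are assumptions made for this example, and the paper's actual PED-based metrics may be defined differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # substitute pa -> pb
        prev = curr
    return prev[-1]


# Hypothetical ARPAbet-style transcriptions (assumed, not taken from the paper).
she = ["SH", "IY"]
sea = ["S", "IY"]
sells = ["S", "EH", "L", "Z"]
shells = ["SH", "EH", "L", "Z"]

print(edit_distance(she, sea))       # words differing in a single phoneme
print(edit_distance(sells, shells))
```

A small distance between neighbouring words indicates high sound overlap, which is the property a tongue twister is meant to maximize.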

# 強調されたリモート光電容積脈波信号形態に基づく生体認証

Biometric Authentication Based on Enhanced Remote Photoplethysmography Signal Morphology ( http://arxiv.org/abs/2407.04127v2 )

ライセンス: Link先を確認

Zhaodong Sun, Xiaobai Li, Jukka Komulainen, Guoying Zhao,

(参考訳) リモート光電容積脈波(Remote Photoplethysmography、rPPG)は、顔映像から心拍信号を計測する非接触の手法であり、接触センサーから得られる接触型光電容積脈波(cPPG)の便利な代替手段となる。近年の研究では、各個人が生体識別子として利用できる固有のcPPG信号形態を持つことが示されており、これに着想を得て、顔映像から抽出したrPPG信号の形態を人物認証に利用する。顔映像には顔の外観とrPPGが混在しているため、まず顔映像を匿名化(de-identify)し、rPPG情報を保持したまま顔の外観を除去する。これにより顔のプライバシーが保護され、rPPGのみが認証に使用されることが保証される。匿名化された映像はrPPGモデルに入力され、認証用のrPPG信号形態が得られる。第1の訓練段階では、粗いrPPG信号を得るために教師なしrPPG訓練を行う。第2の訓練段階では、外部のcPPGデータセットを取り入れたrPPG-cPPGハイブリッド訓練を行い、rPPG生体認証を実現するとともにrPPG信号形態を強調する。提案手法では,rPPG認証モデルの訓練に,被験者ID付きの匿名化顔映像のみを必要とする。実験により, 顔映像に隠されたrPPG信号形態が生体認証に利用できることが確認された。コードはhttps://github.com/zhaodongsun/rppg_biometricsで公開されている。

Remote photoplethysmography (rPPG) is a non-contact method for measuring cardiac signals from facial videos, offering a convenient alternative to contact photoplethysmography (cPPG) obtained from contact sensors. Recent studies have shown that each individual possesses a unique cPPG signal morphology that can be utilized as a biometric identifier, which has inspired us to utilize the morphology of rPPG signals extracted from facial videos for person authentication. Since the facial appearance and rPPG are mixed in the facial videos, we first de-identify facial videos to remove facial appearance while preserving the rPPG information, which protects facial privacy and guarantees that only rPPG is used for authentication. The de-identified videos are fed into an rPPG model to get the rPPG signal morphology for authentication. In the first training stage, unsupervised rPPG training is performed to get coarse rPPG signals. In the second training stage, an rPPG-cPPG hybrid training is performed by incorporating external cPPG datasets to achieve rPPG biometric authentication and enhance rPPG signal morphology. Our approach needs only de-identified facial videos with subject IDs to train rPPG authentication models. The experimental results demonstrate that rPPG signal morphology hidden in facial videos can be used for biometric authentication. The code is available at https://github.com/zhaodongsun/rppg_biometrics.

翻訳日:2024-11-08 23:57:53 公開日:2024-10-18

# 地球回転による中性子の角運動量測定

Measuring the Angular Momentum of a Neutron Using Earth's Rotation ( http://arxiv.org/abs/2407.09307v3 )

ライセンス: Link先を確認

Niels Geerits, Stephan Sponar, Kyle E. Steffen, William M. Snow, Steven R. Parnell, Giacomo Mauri, Gregory N. Smith, Robert M. Dalgliesh, Victor de Haan,

(参考訳) サニャック効果として知られる地球の自転と軌道角運動量(OAM)の結合が、スピンエコー干渉計を用いて生成されたエンタングルした中性子で観測された。装置の系統誤差を補正した後、測定された結合は理論値の5%以内であり、不確かさは7.2%である。本セットアップにおけるOAMは伝播方向に対して横向きであり、波長(4 Å〜12.75 Å)に対して線形にスケールするため、装置を機械的に回転させることなく結合を変化させることができる。したがって、系統誤差は以前の実験より小さい。検出されたビームの横方向OAMは4098 ± 295 ħ Å⁻¹に相当し、以前の中性子実験より5桁小さい。これにより、サニャック効果を用いて中性子OAMを確定的に測定できる可能性が示され、量子サニャック効果の観測への道が開かれる。

A coupling between Earth's rotation and orbital angular momentum (OAM), known as the Sagnac effect, is observed in entangled neutrons produced using a spin echo interferometer. After correction for instrument systematics, the measured coupling is within 5% of theory, with an uncertainty of 7.2%. The OAM in our setup is transverse to the propagation direction and scales linearly with wavelength (4 Å - 12.75 Å), hence the coupling can be varied without mechanically rotating the device. Therefore, the systematic error is lower than in previous experiments. The detected transverse OAM of our beam corresponds to 4098 ± 295 ħ Å⁻¹, 5 orders of magnitude lower than in previous neutron experiments, thereby demonstrating the feasibility of using the Sagnac effect to definitively measure neutron OAM and paving the way towards observations of the quantum Sagnac effect.

翻訳日:2024-11-08 22:06:29 公開日:2024-10-18

# ASTPrompter: 起こりやすい有害プロンプトを特定するための弱教師付き自動言語モデルレッドチーミング

ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts ( http://arxiv.org/abs/2407.09447v2 )

ライセンス: Link先を確認

Amelia F. Hardy, Houjun Liu, Bernard Lange, Mykel J. Kochenderfer,

(参考訳) 大規模言語モデル(LLM)の自動レッドチーミングの典型的な手法は、凍結した言語モデル(ディフェンダー)に有害なテキストを生成させるプロンプトの発見に焦点を当てている。その結果、プロンプトを生成するモデル(敵対者)は、理解不能で実際には起こりそうもないテキストを生成することが多い。本稿では,(1)凍結したディフェンダーから有害な出力を引き出し,かつ(2)そのディフェンダーが評価するパープレキシティが低いプロンプトの発見を可能にする,LLMレッドチーミングタスクの強化学習による定式化を提案する。これらのケースは、ディフェンダーモデルの通常の使用中に発生する可能性が高いため、レッドチーミングにおいて最も重要であると我々は主張する。我々は、GPT-2、GPT-2 XL、TinyLlamaをディフェンダーとして、Identity Preference Optimization(IPO)の新しいオンラインかつ弱教師付きの変種によりこの定式化を解く。我々のポリシーが、これらすべてのアーキテクチャから有害な出力を引き出す、起こりやすい(低パープレキシティの)プロンプトを生成できることを実証する。さらに,このポリシーは,より高い確率で発生し,かつより効果的な攻撃を生成することで,ベースラインを上回ることを示す。最後に, 得られた知見と,尤度と有害性の間に観測されたトレードオフについて議論する。このプロジェクトのソースコードは https://github.com/sisl/ASTPrompter/ で入手できる。

Typical schemes for the automated red-teaming of large language models (LLMs) focus on discovering prompts that trigger a frozen language model (the defender) to generate toxic text. This often results in the prompting model (the adversary) producing text that is unintelligible and unlikely to arise. Here, we propose a reinforcement learning formulation of the LLM red-teaming task that allows us to discover prompts that both (1) trigger toxic outputs from a frozen defender and (2) have low perplexity as scored by that defender. We argue these cases are the most pertinent in a red-teaming setting because they are likely to arise during normal use of the defender model. We solve this formulation through a novel online and weakly supervised variant of Identity Preference Optimization (IPO) on GPT-2, GPT-2 XL, and TinyLlama defenders. We demonstrate that our policy is capable of generating likely (low-perplexity) prompts that also trigger toxicity from all of these architectures. Furthermore, we show that this policy outperforms baselines by producing attacks that occur with higher probability and are more effective. Finally, we discuss our findings and the observed trade-offs between likelihood and toxicity. Source code for this project is available at: https://github.com/sisl/ASTPrompter/.

翻訳日:2024-11-08 22:06:29 公開日:2024-10-18
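
The abstract above asks for prompts with "low perplexity as scored by that defender". As a minimal sketch of what that quantity is, the function below computes perplexity from per-token natural-log probabilities using the standard definition exp(-mean(log p)). The function name and the example probabilities are assumptions for illustration; how the paper obtains and uses the defender's scores is not shown here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities a scoring model might assign to two prompts.
likely_prompt = [math.log(0.5)] * 4     # every token gets probability 0.5
unlikely_prompt = [math.log(0.01)] * 4  # every token gets probability 0.01

print(perplexity(likely_prompt))
print(perplexity(unlikely_prompt))
```

A prompt whose tokens receive higher probability under the scoring model has lower perplexity, which is the sense in which the abstract calls such prompts "likely".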

# 極値におけるグレンジャー因果性

Granger Causality in Extremes ( http://arxiv.org/abs/2407.09632v2 )

ライセンス: Link先を確認

Juraj Bodik, Olivier C. Pasche,

(参考訳) 本稿では,時系列における極端事象から因果関係を同定することを目的とした,極値におけるグレンジャー因果性の厳密な数学的枠組みを提案する。グレンジャー因果性は、時間変化する変数間の方向性のある関係を明らかにする上で重要な役割を果たす。この概念は極端で変動の激しい期間に一層重要になるが、最先端の手法は主に分布の本体部分における因果性に焦点を当てており、極端事象の際にのみ現れる因果メカニズムをしばしば見落としている。本フレームワークは, 因果裾係数(causal tail coefficient)を利用して, 主に極端事象から因果関係を推定するように設計されている。我々は、極値における因果性と、(古典的な)グレンジャー因果性、シムズ因果性、構造的因果性などの他の因果概念との等価性を確立する。極値におけるグレンジャー因果性の他の重要な性質も証明し、このフレームワークが隠れた交絡因子の存在下で特に有用であることを示す。また,データから極値におけるグレンジャー因果性の存在を検出する新しい推論手法を提案する。提案手法はモデルフリーで, 非線形・高次元の時系列を扱うことができ, 検討したすべての設定において性能と速度の両面で現在の最先端手法を上回り, 金融データおよび極端気象の観測に適用した際に整合的な効果を明らかにした。

We introduce a rigorous mathematical framework for Granger causality in extremes, designed to identify causal links from extreme events in time series. Granger causality plays a pivotal role in uncovering directional relationships among time-varying variables. While this notion gains heightened importance during extreme and highly volatile periods, state-of-the-art methods primarily focus on causality within the body of the distribution, often overlooking causal mechanisms that manifest only during extreme events. Our framework is designed to infer causality mainly from extreme events by leveraging the causal tail coefficient. We establish equivalences between causality in extremes and other causal concepts, including (classical) Granger causality, Sims causality, and structural causality. We prove other key properties of Granger causality in extremes and show that the framework is especially helpful under the presence of hidden confounders. We also propose a novel inference method for detecting the presence of Granger causality in extremes from data. Our method is model-free, can handle non-linear and high-dimensional time series, outperforms current state-of-the-art methods in all considered setups, both in performance and speed, and was found to uncover coherent effects when applied to financial and extreme weather observations.

翻訳日:2024-11-08 21:54:45 公開日:2024-10-18

# ヒューマン・アウェア・パス・プランニングのための社会的コスト関数の学習

Learning Social Cost Functions for Human-Aware Path Planning ( http://arxiv.org/abs/2407.10547v2 )

ライセンス: Link先を確認

Andrea Eirale, Matteo Leonetti, Marcello Chiaberge,

(参考訳) 社会的受容を達成することは、ソーシャルロボットナビゲーションの主要な目標の1つである。この話題は近年関心が高まっているが、研究の大半は、ロボットエージェントを障害物のない軌道に沿って走行させ、人間の将来の動きの推定に基づいて個人距離を尊重しつつナビゲーションを最適化するよう計画することに焦点を当ててきた。しかし、日常生活における社会的相互作用は、列に割り込まずに最後尾に並ぶ場合のように、動きに厳密には依存しない規範によっても規定される。本稿では,一般的な社会的シナリオを認識し,それに適応するように従来型プランナーのコスト関数を修正する新しい手法を提案する。このソリューションにより、ロボットは従来のナビゲーションの堅牢性を維持しながら、他の方法では生じない様々なソーシャルナビゲーション行動を実行できる。我々のアプローチでは、タスクごとに異なるモジュールを持つのではなく、単一の学習モデルで異なる社会的規範を学習することができる。概念実証として、列に並ぶタスクと、会話中の人々の集団の相互作用空間を尊重するタスクを考察するが、この方法は動きを伴わない他の人間の活動にも拡張できる。

Achieving social acceptance is one of the main goals of Social Robotic Navigation. Although this topic has received increasing interest in recent years, most of the research has focused on driving the robotic agent along obstacle-free trajectories, planning around estimates of future human motion to respect personal distances and optimize navigation. However, social interactions in everyday life are also dictated by norms that do not strictly depend on movement, such as when standing at the end of a queue rather than cutting it. In this paper, we propose a novel method to recognize common social scenarios and modify a traditional planner's cost function to adapt to them. This solution enables the robot to carry out different social navigation behaviors that would not arise otherwise, maintaining the robustness of traditional navigation. Our approach allows the robot to learn different social norms with a single learned model, rather than having different modules for each task. As a proof of concept, we consider the tasks of queuing and respecting the interaction spaces of groups of people talking to one another, but the method can be extended to other human activities that do not involve motion.

翻訳日:2024-11-08 21:32:38 公開日:2024-10-18

# 放射性炭素とAIを用いた筆跡解析を用いた古写本の年代推定

Dating ancient manuscripts using radiocarbon and AI-based writing style analysis ( http://arxiv.org/abs/2407.12013v2 )

ライセンス: Link先を確認

Mladen Popović, Maruf A. Dhali, Lambert Schomaker, Johannes van der Plicht, Kaare Lund Rasmussen, Jacopo La Nasa, Ilaria Degano, Maria Perla Colombini, Eibert Tigchelaar,

(参考訳) 古代の手書き写本の年代決定は、思想の進化を再構築する上で不可欠である。死海文書にとって、これは特に重要である。しかし、古書体学的な比較に利用できる、年代が記され、時間軸上に均等に分布し、類似の書体で書かれた写本はほぼ皆無である。本稿では,死海文書の新たな放射性炭素年代測定試料に基づいて訓練された,最先端のAIに基づく年代予測モデルEnochを提示する。 Enochは、確立された筆跡スタイル記述子を使用し、ベイズリッジ回帰を適用する。本研究の課題は,現在の機械学習が大量の訓練データを必要とするのに対して,放射性炭素年代測定済みの写本の数が少ないことである。角度特徴とアログラフ特徴を組み合わせた書体特徴ベクトルにベイズリッジ回帰を適用することにより,Enochは書体から放射性炭素年代を予測でき,leave-one-out検証において放射性炭素年代に対する平均絶対誤差(MAE)は27.9～30.7年であった。その後、Enochは135点の未知の写本の年代推定に用いられ、古書体学的な事後評価により標本の79パーセントが「現実的」と判断された。我々は死海文書の新しい年表を提示する。放射性炭素年代の範囲とEnochの書体に基づく予測は、従来想定されてきた古書体学的推定よりも古いことが多い。紀元前300〜50年の範囲では、Enochの年代予測により粒度が改善される。本研究は, マルチモーダル機械学習技術の現在の発展に沿うものであり,その手法は部分的にしか年代が判明していない他の写本コレクションの年代予測にも利用できる。この研究は、Enochの定量的で確率に基づくアプローチが、古書体学者や歴史家にとっての道具となり、古代ユダヤの重要文献の年代を見直し、ユダヤ教とキリスト教の起源に関する現在の議論に寄与しうることを示している。

Determining the chronology of ancient handwritten manuscripts is essential for reconstructing the evolution of ideas. For the Dead Sea Scrolls, this is particularly important. However, there is an almost complete lack of date-bearing manuscripts evenly distributed across the timeline and written in similar scripts available for palaeographic comparison. Here, we present Enoch, a state-of-the-art AI-based date-prediction model, trained on the basis of new radiocarbon-dated samples of the scrolls. Enoch uses established handwriting-style descriptors and applies Bayesian ridge regression. The challenge of this study is that the number of radiocarbon-dated manuscripts is small, while current machine learning requires an abundance of training data. We show that by using combined angular and allographic writing style feature vectors and applying Bayesian ridge regression, Enoch could predict the radiocarbon-based dates from style, supported by leave-one-out validation, with varied MAEs of 27.9 to 30.7 years relative to the radiocarbon dating. Enoch was then used to estimate the dates of 135 unseen manuscripts, revealing that 79 per cent of the samples were considered 'realistic' upon palaeographic post-hoc evaluation. We present a new chronology of the scrolls. The radiocarbon ranges and Enoch's style-based predictions are often older than the traditionally assumed palaeographic estimates. In the range of 300-50 BCE, Enoch's date prediction provides an improved granularity. The study is in line with current developments in multimodal machine-learning techniques, and the methods can be used for date prediction in other partially-dated manuscript collections. This research shows how Enoch's quantitative, probability-based approach can be a tool for palaeographers and historians, re-dating ancient Jewish key texts and contributing to current debates on Jewish and Christian origins.

翻訳日:2024-11-08 20:59:00 公開日:2024-10-18

# 検索強化機械学習:合成と機会

Retrieval-Enhanced Machine Learning: Synthesis and Opportunities ( http://arxiv.org/abs/2407.12982v2 )

ライセンス: Link先を確認

To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, Hamed Zamani,

(参考訳) 言語モデリングの分野では、検索コンポーネントで拡張されたモデルが、知識グラウンディング、解釈可能性、スケーラビリティなど、自然言語処理(NLP)分野が直面するいくつかの課題に対処する有望なソリューションとして登場した。 NLPに主眼が置かれてきたものの、検索強化のパラダイムはコンピュータビジョン、時系列予測、計算生物学など、より幅広い機械学習(ML)に拡張できると我々は考える。そこで本研究では,MLの各領域の文献を,現在の文献に欠けている一貫した表記で統合することにより,このパラダイムの形式的枠組みであるRetrieval-Enhanced Machine Learning (REML)を導入する。また,多くの研究が検索コンポーネントを用いてモデルを強化している一方で,基礎的な情報検索(IR)研究との統合が欠けていることがわかった。我々は、REMLフレームワークを構成する各コンポーネントを調査することで、先駆的なIR研究と現代のREML研究の間のギャップを埋める。最終的に、この研究の目的は、様々な分野の研究者に、検索強化モデルの包括的で形式的に構造化された枠組みを提供し、学際的な将来の研究を促進することである。

In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to address several challenges faced in the natural language processing (NLP) field, including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval-enhancement can be extended to a broader spectrum of machine learning (ML) such as computer vision, time series prediction, and computational biology. Therefore, this work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notation, which is missing from the current literature. Also, we found that while a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between the seminal IR research and contemporary REML studies by investigating each component that comprises the REML framework. Ultimately, the goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.

翻訳日:2024-11-08 20:25:29 公開日:2024-10-18

# 交差するワッサースタインボールによる分布的および逆ロバストなロジスティック回帰

Distributionally and Adversarially Robust Logistic Regression via Intersecting Wasserstein Balls ( http://arxiv.org/abs/2407.13625v2 )

ライセンス: Link先を確認

Aras Selvi, Eleonora Kreacic, Mohsen Ghassemi, Vamsi Potluru, Tucker Balch, Manuela Veloso,

(参考訳) 敵対的ロバスト最適化(Adversarially robust optimization, ARO)は、テスト時の敵対的攻撃に対して防御するモデルを訓練するためのデファクトスタンダードとなっている。しかし、その頑健さにもかかわらず、これらのモデルはしばしば深刻な過学習に悩まされる。この問題を緩和するために、訓練における経験分布を次のいずれかに置き換えるなど、いくつかの成功したアプローチが提案されている: (i) 曖昧性集合内の最悪ケースの分布(これはAROの分布的ロバスト(DR)版につながる)、または (ii) 経験分布と補助データセット(例えば、合成、外部、ドメイン外)から得られる分布との混合。第1のアプローチに基づいて、ロジスティック回帰に対するAROのワッサースタインDR版を検討し、それが扱いやすい凸最適化への再定式化を許すことを示す。第2のアプローチを採用して、DRフレームワークの曖昧性集合を補助データセットから構築した曖昧性集合と交差させることでDRフレームワークを強化する。これは、データ生成分布と補助分布の間のワッサースタイン距離が推定できる場合に大きな改善をもたらす。得られた最適化問題を解析し、効率的な解法を開発し、提案手法が標準データセットにおいてベンチマーク手法を上回ることを示す。

Adversarially robust optimization (ARO) has become the de facto standard for training models to defend against adversarial attacks during testing. However, despite their robustness, these models often suffer from severe overfitting. To mitigate this issue, several successful approaches have been proposed, including replacing the empirical distribution in training with: (i) a worst-case distribution within an ambiguity set, leading to a distributionally robust (DR) counterpart of ARO; or (ii) a mixture of the empirical distribution with one derived from an auxiliary dataset (e.g., synthetic, external, or out-of-domain). Building on the first approach, we explore the Wasserstein DR counterpart of ARO for logistic regression and show it admits a tractable convex optimization reformulation. Adopting the second approach, we enhance the DR framework by intersecting its ambiguity set with one constructed from an auxiliary dataset, which yields significant improvements when the Wasserstein distance between the data-generating and auxiliary distributions can be estimated. We analyze the resulting optimization problem, develop efficient solutions, and show that our method outperforms benchmark approaches on standard datasets.

翻訳日:2024-11-08 20:14:30 公開日:2024-10-18

# クエリコンテキスト信号の活用によるスポンサー検索における検索精度の向上

Improving Retrieval in Sponsored Search by Leveraging Query Context Signals ( http://arxiv.org/abs/2407.14346v2 )

ライセンス: Link先を確認

Akash Kumar Mohankumar, Gururaj K, Gagan Madan, Amit Singh,

(参考訳) ユーザクエリに関する関連する入札キーワードを正確に検索することは、Sponsored Searchでは重要だが、特に短いあいまいなクエリでは難しい。既存の高密度で生成的な検索モデルは、これらのケースにおいて、ニュアンスのあるユーザ意図をキャプチャできないことが多い。そこで本研究では,オンラインキャッシュに格納されたWeb検索結果と大規模言語モデルから得られるリッチなコンテキスト信号でクエリを増強し,クエリ理解を強化する手法を提案する。具体的には、Web検索のタイトルとスニペットを使って、現実世界の情報にクエリを接地し、GPT-4を使って、ユーザの意図を明確にしたクエリの書き直しや説明を生成する。これらの信号はFusion-in-DecoderベースのUnityアーキテクチャを通じて効率よく統合され、高密度かつ生成的な検索と従来の文脈自由モデルと同等の費用がかかる。キャッシュでコンテキストが利用できないシナリオに対処するために、推論中にコンテキスト信号なしでモデルロバスト性や性能を改善するカリキュラム学習戦略であるコンテキストグラシングを導入する。大規模なオフライン実験は、文脈認識アプローチが文脈自由モデルを大幅に上回ることを示した。さらに、160以上の国で有名な検索エンジン上でのオンラインA/Bテストでは、ユーザのエンゲージメントと収益が大幅に改善されている。

Accurately retrieving relevant bid keywords for user queries is critical in Sponsored Search but remains challenging, particularly for short, ambiguous queries. Existing dense and generative retrieval models often fail to capture nuanced user intent in these cases. To address this, we propose an approach to enhance query understanding by augmenting queries with rich contextual signals derived from web search results and large language models, stored in an online cache. Specifically, we use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations that clarify user intent. These signals are efficiently integrated through a Fusion-in-Decoder based Unity architecture, enabling both dense and generative retrieval with serving costs on par with traditional context-free models. To address scenarios where context is unavailable in the cache, we introduce context glancing, a curriculum learning strategy that improves model robustness and performance even without contextual signals during inference. Extensive offline experiments demonstrate that our context-aware approach substantially outperforms context-free models. Furthermore, online A/B testing on a prominent search engine across 160+ countries shows significant improvements in user engagement and revenue.

翻訳日:2024-11-08 19:38:31 公開日:2024-10-18

# 量子エンタングルメント、量子テレポーテーション、多線形多項式と幾何学

Quantum Entanglement, Quantum Teleportation, Multilinear Polynomials and Geometry ( http://arxiv.org/abs/2407.17621v3 )

ライセンス: Link先を確認

Juan M. Romero, Emiliano Montoya-Gonzalez, Oscar Velazquez-Alvarado,

(参考訳) 量子絡み合い状態は、分解できない多線型多項式と関連していることを示す。これらの多線型多項式を用いて、絡み合い状態の幾何学的表現を提案する。特に、ベル状態は3次元曲面で幾何学的に表現できる非分解可能実多重線型多項式と関連していることを示す。さらに, この枠組みでは, 量子回路を平面幾何学の幾何学的変換と見なすことができる。この現象は、物質が時空を曲がる重力と類似している。さらに、量子テレポーテーションと多線型多項式を含む演算の類似性を示す。

We show that entangled quantum states are associated with multilinear polynomials that cannot be factored. Using these multilinear polynomials, we propose a geometric representation for entangled states. In particular, we show that the Bell states are associated with non-factorable real multilinear polynomials, which can be represented geometrically by three-dimensional surfaces. Furthermore, in this framework, we show that a quantum circuit can be seen as a geometric transformation of plane geometry. This phenomenon is analogous to gravity, where matter curves space-time. In addition, we show an analogy between quantum teleportation and operations involving multilinear polynomials.

翻訳日:2024-11-08 15:12:19 公開日:2024-10-18

# 低エネルギー物質励起における空洞媒介相互作用の一般理論

General theory of cavity-mediated interactions between low-energy matter excitations ( http://arxiv.org/abs/2407.19478v2 )

ライセンス: Link先を確認

Carlos J. Sánchez Martínez, Frieder Lindel, Francisco J. García-Vidal, Johannes Feist,

(参考訳) 超伝導、強磁性、強誘電性などの低エネルギー物質特性を共振器量子電磁力学の工学的設計によって操作することは、これらの多体集団現象を増強する方法として提案されている。本研究では、共振器の電磁モードとの非共鳴結合によって誘起される低エネルギー物質励起間の有効相互作用について検討する。物質の全分極密度と磁化密度を考慮することで双極子近似を超え、従来の研究を拡張する。さらに、しばしば無視される反磁性相互作用を取り入れ、共振器については、非局所的かつ非相反的な応答を持ちうる一般的な線形吸収媒質を考える。この一般的なシナリオにおいても、物質の自由度の間の共振器誘起の有効相互作用は静電的および静磁的な性質を持つことを示す。このことは、低エネルギーの仮定が成り立つ物質系の共振器工学において、マルチモード記述が必要であることを裏付ける。本研究は、一般的な光学環境が空間的に広がった低エネルギー物質励起に与える影響を研究するための理論的枠組みを提供する。

The manipulation of low-energy matter properties such as superconductivity, ferromagnetism and ferroelectricity via cavity quantum electrodynamics engineering has been suggested as a way to enhance these many-body collective phenomena. In this work, we investigate the effective interactions between low-energy matter excitations induced by the off-resonant coupling with cavity electromagnetic modes. We extend previous work by going beyond the dipole approximation accounting for the full polarization and magnetization densities of matter. We further include the often neglected diamagnetic interaction and, for the cavity, we consider general linear absorbing media with possibly non-local and non-reciprocal response. We demonstrate that, even in this general scenario, the effective cavity-induced interactions between the matter degrees of freedom are of electrostatic and magnetostatic nature. This confirms the necessity of a multimode description for cavity engineering of matter systems where the low-energy assumption holds. Our findings provide a theoretical framework for studying the influence of general optical environments on extended low-energy matter excitations.

翻訳日:2024-11-08 14:27:29 公開日:2024-10-18

# 局所処理によるマルコフ決定過程の実験

Experimenting on Markov Decision Processes with Local Treatments ( http://arxiv.org/abs/2407.19618v2 )

ライセンス: Link先を確認

Shuze Chen, David Simchi-Levi, Chonghuan Wang,

(参考訳) 短期的治療が短期成績に与える影響を評価するためにランダム化実験を利用することは、工業的実践における黄金の基準となっている。しかし、サービスシステムが動的かつパーソナライズされていくにつれて、介入への生涯的露出を通じて、顧客寿命価値などの長期的な累積的な成果の最大化に焦点が移りつつある。このギャップを埋めるために,マルコフ決定過程(MDP)をモデル化した力学系におけるランダム化実験について検討する。我々のゴールは、比較的短期的な観察による長期累積報酬に対する治療・制御政策の影響を評価することである。まず,一般的な治療パターンの効果を評価するための最適推論手法を開発した。さらに, 実世界の処理の多くは, 実用的効率と運用上の便宜のために微粒化され, 局所化される傾向があることを認識し, 非ターゲット状態の情報を共有することで, この局所化構造を利用する方法を提案する。我々の新しい推定器は局所的な処理構造を組み込んだより厳密な下界をマッチングしながら、一般的な処理に対する分散下界を効果的に克服する。さらに, 推定器は, 分散の大きな部分に対して, 試験アーム数の線形化を最適に行うことができる。最後に、制御アームの完全な知識と推論効率をさらに向上させる設計推定器を用いてシナリオを探索する。

Using randomized experiments to evaluate the effect of short-term treatments on short-term outcomes is well understood and has become the gold standard in industrial practice. However, as service systems become increasingly dynamic and personalized, much focus is shifting toward maximizing long-term cumulative outcomes, such as customer lifetime value, through lifetime exposure to interventions. To bridge this gap, we investigate randomized experiments within dynamical systems modeled as Markov Decision Processes (MDPs). Our goal is to assess the impact of treatment and control policies on long-term cumulative rewards from relatively short-term observations. We first develop optimal inference techniques for assessing the effects of general treatment patterns. Furthermore, recognizing that many real-world treatments tend to be fine-grained and localized for practical efficiency and operational convenience, we then propose methods to harness this localized structure by sharing information on the non-targeted states. Our new estimator effectively overcomes the variance lower bound for general treatments while matching the more stringent lower bound incorporating the local treatment structure. Furthermore, our estimator can optimally achieve a linear reduction with the number of test arms for a major part of the variance. Finally, we explore scenarios with perfect knowledge of the control arm and design estimators that further improve inference efficiency.

翻訳日:2024-11-08 14:27:29 公開日:2024-10-18

# 二次帯域交差を伴う位相相転移におけるキブル・ズールクの挙動

Kibble-Zurek behavior in a topological phase transition with a quadratic band crossing ( http://arxiv.org/abs/2407.19780v2 )

ライセンス: Link先を確認

Huan Yuan, Jinyi Zhang, Shuai Chen, Xiaotian Nie,

(参考訳) Kibble-Zurek (KZ) メカニズムは、連続対称性を破る遷移でシステムを駆動する際のスケーリングの振る舞いを記述している。従来の研究では、KZ様のスケーリング挙動はQi-Wu-Zhangモデル (2D) とSu-Schrieffer-Heegerモデル (1D) のトポロジ的遷移にも関係していることが示されたが、対称性の破れはここでは存在しない。線形帯域交差を持つどちらのモデルも$\nu=1$と$z=1$を与える。線形帯域通過を超えるトポロジカル遷移において、異なる臨界指数が取得できるかどうか疑問である。本研究では,2次帯域交差を持つトポロジカル2次元チェッカーボード格子のKZ挙動について検討する。クリーンシステムにおけるベリー曲率の運動量分布の単純さと、従来のKZ記述とより直感的な類似である混乱系における領域様局所チャーンマーカー構成の実空間解析の2点から検討する。平衡では、相関長は$\nu\simeq 1/2$で分岐する。そして、トポロジカル位相遷移でゆっくりと系を焼くことで、フリーズアウト時間 $t_\mathrm{f}$ と未凍長スケール $\xi(t_\mathrm{f})$ が KZ のスケーリングを満足し、$z\simeq 2$ を検証できることが分かる。その後、他の高次帯域通過と位相相転移におけるKZ挙動を探索し、臨界指数と順序の関係を見出す。我々の結果は、KZ機構と非平衡トポロジカル相転移の理解を拡大する。

The Kibble-Zurek (KZ) mechanism describes the scaling behavior when a system is driven across a continuous symmetry-breaking transition. Previous studies have shown that KZ-like scaling behavior also appears in the topological transitions of the Qi-Wu-Zhang model (2D) and the Su-Schrieffer-Heeger model (1D), although no symmetry breaking occurs there. Both models, which have linear band crossings, give $\nu=1$ and $z=1$. We ask whether different critical exponents can be obtained in topological transitions beyond linear band crossings. In this work, we look into the KZ behavior in a topological 2D checkerboard lattice with a quadratic band crossing. We investigate it from two perspectives: the momentum distribution of the Berry curvature in clean systems, for simplicity, and a real-space analysis of domain-like local Chern marker configurations in disordered systems, which is a more intuitive analog of the conventional KZ description. In equilibrium, we find that the correlation length diverges with a power $\nu\simeq 1/2$. Then, by slowly quenching the system across the topological phase transition, we find that the freeze-out time $t_\mathrm{f}$ and the unfrozen length scale $\xi(t_\mathrm{f})$ both satisfy the KZ scaling, verifying $z\simeq 2$. We subsequently explore KZ behavior in topological phase transitions with other higher-order band crossings and find the relationship between the critical exponents and the order. Our results extend the understanding of the KZ mechanism and non-equilibrium topological phase transitions.
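
As a hedged worked example of the exponents quoted in the abstract above: assuming the standard Kibble-Zurek argument for a linear ramp of duration $\tau_Q$ (the symbol $\tau_Q$ and the linear-ramp assumption are illustrative and are not stated in the abstract), the system freezes when its relaxation time $\propto |t/\tau_Q|^{-\nu z}$ equals the time remaining to the transition, which gives

$$
t_\mathrm{f} \propto \tau_Q^{\frac{\nu z}{1+\nu z}}, \qquad \xi(t_\mathrm{f}) \propto \left(\frac{t_\mathrm{f}}{\tau_Q}\right)^{-\nu} \propto \tau_Q^{\frac{\nu}{1+\nu z}} .
$$

Under this assumption, the linear-crossing values $\nu=1$, $z=1$ give $t_\mathrm{f} \propto \tau_Q^{1/2}$ and $\xi(t_\mathrm{f}) \propto \tau_Q^{1/2}$, while the quadratic-crossing values $\nu\simeq 1/2$, $z\simeq 2$ give $t_\mathrm{f} \propto \tau_Q^{1/2}$ and $\xi(t_\mathrm{f}) \propto \tau_Q^{1/4}$. In both cases $\nu z = 1$, so in this sketch the two cases differ in the length-scale exponent but not in the freeze-out-time exponent.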

翻訳日:2024-11-08 14:27:29 公開日:2024-10-18

# 非ニューラルモデルにおける創発:平均勾配外積によるモジュラー算術のグロッキング

Emergence in non-neural models: grokking modular arithmetic via average gradient outer product ( http://arxiv.org/abs/2407.20199v2 )

ライセンス: Link先を確認

Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin,

(参考訳) モジュラー算術タスクを解くように訓練されたニューラルネットワークは、モデルが100%の訓練精度を達成してからずっと後にテスト精度が改善し始める現象であるグロッキング(grokking)を示す。これは、モデルの能力が相転移を通じて急激に現れる「創発」の例としてしばしば取り上げられる。本研究では、グロッキング現象はニューラルネットワークにも勾配降下に基づく最適化にも特有ではないことを示す。具体的には、平均勾配外積(AGOP)を用いて一般的な機械学習モデルでタスク固有の特徴学習を可能にする反復アルゴリズムであるRecursive Feature Machines (RFM) でモジュラー算術を学習する際に、この現象が生じることを示す。カーネルマシンと組み合わせて使用すると、RFMを反復することで、ランダムでほぼゼロのテスト精度から完全なテスト精度へと急速に移行する。この移行は、恒等的にゼロである訓練損失からも、初期の反復で一定のままであるテスト損失からも予測できない。代わりに、この移行は特徴学習によって完全に決定されることを示す。RFMは徐々にブロック巡回的な特徴を学習し、モジュラー算術を解く。RFMの結果と並行して、モジュラー算術を解くニューラルネットワークもブロック巡回的な特徴を学習することを示す。さらに、先行研究がこれらのタスクでニューラルネットワークが学習する汎化解として提案したフーリエ乗算アルゴリズムを実装するために、RFMがそのようなブロック巡回的特徴を用いるという理論的証拠を示す。この結果は、創発が純粋にタスク関連の特徴の学習から生じうるものであり、ニューラルアーキテクチャや勾配降下に基づく最適化手法に特有ではないことを示している。さらに、本研究は、ニューラルネットワークにおける特徴学習の鍵となるメカニズムとしてのAGOPのさらなる証拠を提供する。

Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.

翻訳日:2024-11-08 14:16:02 公開日:2024-10-18

# マンバのサーベイ

A Survey of Mamba ( http://arxiv.org/abs/2408.01129v4 )

ライセンス: Link先を確認

Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Hui Liu, Xin Xu, Qing Li,

(参考訳) 最も代表的なDL技術の1つとして、トランスフォーマーアーキテクチャは多くの高度なモデル、特に数十億のパラメータからなる大規模言語モデル(LLM)が強化され、ディープラーニングの基盤となっている。素晴らしい成果にもかかわらず、トランスフォーマーは依然として固有の制限に直面しており、特に注意計算の2次計算の複雑さから生じる時間を要する推論である。近年、古典的状態空間モデル(SSM)からインスピレーションを得た新しいアーキテクチャであるMambaが、トランスフォーマーに匹敵するモデリング能力を提供しながら、シーケンス長に関するほぼ直線的なスケーラビリティを保ちながら、基礎モデルを構築するための有望な代替手段として登場した。このことが、様々な領域で印象的なパフォーマンスを達成するためのマンバの可能性を積極的に探究する研究を活発に進めるきっかけとなった。このような急速な進化を考えると、既存のマンバ駆動モデルを統合する体系的なレビューが不可欠であり、この新たなモデルアーキテクチャの包括的理解を提供する。そこで本研究では,近年のマンバ関連研究を詳細に調査し,マンバモデルの発展,さまざまなデータにマンバを適応させる技術,およびマンバが優れている応用の3つの主な側面について考察する。具体的には,様々な代表的な深層学習モデルの基礎知識と,Mamba-1&2の詳細について概説する。そして、AIにおけるMambaの重要性を示すために、Mambaモデルのアーキテクチャ設計、データ適応性、アプリケーションに焦点を当てた関連する研究を網羅的にレビューする。最後に,現状の限界について考察し,将来的な研究の方向性を探究し,今後の研究に深い洞察を与える。

As one of the most representative DL techniques, Transformer architecture has empowered numerous advanced models, especially the large language models (LLMs) that comprise billions of parameters, becoming a cornerstone in deep learning. Despite the impressive achievements, Transformers still face inherent limitations, particularly the time-consuming inference resulting from the quadratic computation complexity of attention calculation. Recently, a novel architecture named Mamba, drawing inspiration from classical state space models (SSMs), has emerged as a promising alternative for building foundation models, delivering comparable modeling abilities to Transformers while preserving near-linear scalability concerning sequence length. This has sparked an increasing number of studies actively exploring Mamba's potential to achieve impressive performance across diverse domains. Given such rapid evolution, there is a critical need for a systematic review that consolidates existing Mamba-empowered models, offering a comprehensive understanding of this emerging model architecture. In this survey, we therefore conduct an in-depth investigation of recent Mamba-associated studies, covering three main aspects: the advancements of Mamba-based models, the techniques of adapting Mamba to diverse data, and the applications where Mamba can excel. Specifically, we first review the foundational knowledge of various representative deep learning models and the details of Mamba-1&2 as preliminaries. Then, to showcase the significance of Mamba for AI, we comprehensively review the related studies focusing on Mamba models' architecture design, data adaptability, and applications. Finally, we present a discussion of current limitations and explore various promising research directions to provide deeper insights for future investigations.

翻訳日:2024-11-08 13:18:17 公開日:2024-10-18

# Signal-SGN:時間周波数ダイナミクスの学習による骨格行動認識のためのスパイキンググラフ畳み込みネットワーク

Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics ( http://arxiv.org/abs/2408.01701v2 )

ライセンス: Link先を確認

Naichuan Zheng, Hailun Xia, Dapeng Liu,

(参考訳) 骨格に基づく行動認識では、グラフ畳み込みネットワーク(GCN)ベースの手法は、その複雑さと高エネルギー消費のために制限に直面している。スパイキングニューラルネットワーク(SNN)は近年、低エネルギー消費で注目を集めているが、GCNとSNNを組み合わせた既存の手法では骨格配列の時間的特性を完全に活用できず、ストレージと計算コストが増大している。この問題に対処するために、骨格配列の時間次元をスパイキング時間ステップとして利用し、特徴を離散確率信号として扱うSignal-SGN(Spiking Graph Convolutional Network)を提案する。ネットワークのコアは1Dスパイキンググラフ畳み込みネットワーク(1D-SGN)と周波数スパイキング畳み込みネットワーク(FSN)で構成されている。 SGNは単一フレーム上でグラフ畳み込みを行い、スパイクネットワーク特性を取り入れてフレーム間時間関係を捉え、FSNはFast Fourier Transform(FFT)と複雑な畳み込みを用いて時間周波数の特徴を抽出する。また,マルチスケールウェーブレット変換機能融合モジュール(MWTF)を導入し,時間信号のスペクトル特性を捉え,モデルの分類能力を向上する。本稿では,時間空間的特徴抽出モジュール(TFSM)を提案する。 NTU RGB+D、NTU RGB+D 120、およびNW-UCLAデータセットに関する多数の実験により、提案モデルは既存のSNNベースの手法を精度良く上回るだけでなく、トレーニング中の計算および記憶コストを低減できることを示した。さらに、対応するGCNベースの手法と比較して競争精度が向上し、非常に顕著である。

In skeleton-based action recognition, methods based on Graph Convolutional Networks (GCNs) face limitations due to their complexity and high energy consumption. Spiking Neural Networks (SNNs) have gained attention in recent years for their low energy consumption, but existing methods combining GCNs and SNNs fail to fully utilize the temporal characteristics of skeletal sequences, leading to increased storage and computational costs. To address this issue, we propose Signal-SGN (Spiking Graph Convolutional Network), which leverages the temporal dimension of skeletal sequences as the spiking timestep and treats features as discrete stochastic signals. The core of the network consists of a 1D Spiking Graph Convolutional Network (1D-SGN) and a Frequency Spiking Convolutional Network (FSN). The SGN performs graph convolution on single frames and incorporates spiking network characteristics to capture inter-frame temporal relationships, while the FSN uses the Fast Fourier Transform (FFT) and complex convolution to extract temporal-frequency features. We also introduce a multi-scale wavelet transform feature fusion module (MWTF) to capture spectral features of temporal signals, enhancing the model's classification capability. We further propose a pluggable temporal-frequency spatial semantic feature extraction module (TFSM) to enhance the model's ability to distinguish features without increasing inference-phase consumption. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets demonstrate that the proposed models not only surpass existing SNN-based methods in accuracy but also reduce computational and storage costs during training. Furthermore, they achieve accuracy competitive with corresponding GCN-based methods.

翻訳日:2024-11-08 13:07:08 公開日:2024-10-18

# 観察的な生存時間(time-to-event)データにおける治療反応サブグループの同定

Identifying treatment response subgroups in observational time-to-event data ( http://arxiv.org/abs/2408.03463v2 )

ライセンス: Link先を確認

Vincent Jeanselme, Chang Ho Yoon, Fabian Falck, Brian Tom, Jessica Barrett,

(参考訳) 治療反応の異なる患者サブグループを特定することは、医療上の推奨、ガイドライン、将来の臨床試験の設計に資する重要な課題である。既存のサブグループ分析のアプローチは、主に治療の割り当てがランダム化されたランダム化比較試験 (Randomised Controlled Trials, RCT) に依存している。RCTの患者コホートはコストに制約されることが多く、実際の臨床で治療を受ける可能性の高い患者の異質性を代表していない。観察研究に適用すると、サブグループ分析のアプローチは、特に治療が非ランダム化であるために、大きな統計的バイアスに悩まされる。本研究は、観察研究における治療反応サブグループを特定するための、新しい結果誘導型の手法を提案する。本手法は、各患者を2つの生存時間(time-to-event)分布、すなわち治療下の分布と対照下の分布に関連付けられたサブグループに割り当てる。したがって、個別化治療効果の推定と平均治療効果の推定の中間に位置づけられる。本モデルの仮定により、治療の非ランダム化に起因する統計的バイアスを、逆傾向スコア重み付け(inverse propensity weighting)によって簡単に補正できる。実験では、ランダム化と観察の両方の治療設定において、結果誘導型サブグループ分析の現在の最先端手法を大きく上回る結果を得た。

Identifying patient subgroups with different treatment responses is an important task to inform medical recommendations, guidelines, and the design of future clinical trials. Existing approaches for subgroup analysis primarily rely on Randomised Controlled Trials (RCTs), in which treatment assignment is randomised. RCTs' patient cohorts are often constrained by cost, rendering them not representative of the heterogeneity of patients likely to receive treatment in real-world clinical practice. When applied to observational studies, subgroup analysis approaches suffer from significant statistical biases particularly because of the non-randomisation of treatment. Our work introduces a novel, outcome-guided method for identifying treatment response subgroups in observational studies. Our approach assigns each patient to a subgroup associated with two time-to-event distributions: one under treatment and one under control regime. It hence positions itself in between individualised and average treatment effect estimation. The assumptions of our model result in a simple correction of the statistical bias from treatment non-randomisation through inverse propensity weighting. In experiments, our approach significantly outperforms the current state-of-the-art method for outcome-guided subgroup analysis in both randomised and observational treatment regimes.
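
The inverse propensity weighting mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic Horvitz-Thompson-style estimator for a scalar outcome, not the paper's time-to-event subgroup model; the function name `ipw_ate` and its arguments are hypothetical, and the propensities are assumed to be known and strictly between 0 and 1.

```python
def ipw_ate(outcomes, treated, propensity):
    """Inverse-propensity-weighted estimate of E[Y(1)] - E[Y(0)].

    outcomes:   observed scalar outcome per patient
    treated:    1 if the patient received treatment, else 0
    propensity: assumed-known probability of treatment per patient, in (0, 1)
    """
    n = len(outcomes)
    # Treated patients are up-weighted by 1 / e, controls by 1 / (1 - e),
    # so each arm is reweighted to resemble the full cohort.
    mean_treated = sum(
        y * t / e for y, t, e in zip(outcomes, treated, propensity)
    ) / n
    mean_control = sum(
        y * (1 - t) / (1 - e) for y, t, e in zip(outcomes, treated, propensity)
    ) / n
    return mean_treated - mean_control
```

When every propensity equals 0.5 and the two arms have equal size, as in a balanced randomised trial, the weights are identical for all patients and the estimate reduces to the plain difference in arm means.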

翻訳日:2024-11-08 12:33:46 公開日:2024-10-18

# 観察的な生存時間(time-to-event)データにおける治療反応サブグループの同定

Identifying treatment response subgroups in observational time-to-event data ( http://arxiv.org/abs/2408.03463v3 )

ライセンス: Link先を確認

Vincent Jeanselme, Chang Ho Yoon, Fabian Falck, Brian Tom, Jessica Barrett,

翻訳日:2024-11-08 12:33:46 公開日:2024-10-18

# P3: LLMトレーニングにおけるデータプルーニングのためのポリシー駆動型、ペース適応型、多様性駆動型フレームワーク

P3: A Policy-Driven, Pace-Adaptive, and Diversity-Promoted Framework for data pruning in LLM Training ( http://arxiv.org/abs/2408.05541v2 )

ライセンス: Link先を確認

Yingxuan Yang, Huayi Wang, Muning Wen, Xiaoyun Mo, Qiuying Peng, Jun Wang, Weinan Zhang,

(参考訳) 大規模言語モデル(LLM)の急速に進歩する分野では、モデルの可能性の最大化のために微調整中に既存のデータセットを効果的に活用することが最重要事項である。本稿では、反復データプルーニングによるタスク固有の微調整プロセスの最適化を目的とした適応型フレームワークであるP3を紹介する。 P3は,(1)静的メトリクスを適応性評価に置き換え,モデルのリアルタイムパフォーマンスに基づいてデータの難易度を動的に評価するポリシ駆動の難易度測定,(2)より困難なデータを段階的に導入し,モデル能力を向上するペース適応型選択,(3)決定点プロセス(Determinantal Point Process, DPP)を導入し,エポック間のデータの多様性を保証するための多様性向上,といった3つの要素から構成される。我々は,従来のデータプルーニング手法に対して,P3を推論シナリオであるAPPSとMATHで検証し,大幅な改善を示した。動的データ選択と利用戦略の進歩により、P3はLLMのパフォーマンス改善のために既存のデータを完全に活用する理論的なフレームワークと具体的なアプローチの両方に貢献し、多様なタスクにまたがるユーティリティを提供する。

In the rapidly advancing field of Large Language Models (LLMs), effectively leveraging existing datasets during fine-tuning to maximize the model's potential is of paramount importance. This paper introduces P3, an adaptive framework aimed at optimizing the task-specific fine-tuning process through iterative data pruning. P3 consists of three key components: (1) Policy-driven Difficulty Measurement, which dynamically assesses data difficulty based on the model's real-time performance, replacing static metrics with adaptable evaluations; (2) Pace-Adaptive Selection, leveraging self-paced learning to progressively introduce more challenging data, thereby enhancing model capability; (3) Diversity Promotion, incorporating Determinantal Point Process (DPP) to ensure data diversity across epochs, enriching the learning process. We validate P3 on the reasoning scenarios, APPS and MATH, demonstrating significant improvements over traditional data pruning methods. By advancing dynamic data selection and utilization strategies, P3 contributes both a theoretical framework and concrete approach to fully exploit existing data for LLMs' performance improvement, offering utility across diverse tasks.

翻訳日:2024-11-08 11:49:24 公開日:2024-10-18

# Residual-INR: 暗黙的ニューラル表現を用いた通信効率の良いオンデバイス学習

Residual-INR: Communication Efficient On-Device Learning Using Implicit Neural Representation ( http://arxiv.org/abs/2408.05617v2 )

ライセンス: Link先を確認

Hanqiu Chen, Xuebin Yao, Pradeep Subedi, Cong Hao,

(参考訳) エッジコンピューティングは、データ生成源またはその近くでデータを収集、処理する分散コンピューティングパラダイムである。エッジでのデバイス上の学習は、複数のデバイス間でリアルタイムなデータ共有と協調的な意思決定を容易にするデバイス間無線通信に依存している。これにより、エッジコンピューティングシステムの環境変化への適応性が大幅に向上する。しかし、エッジコンピューティングシステムの規模が大きくなるにつれて、無線通信の帯域が限られているためにデータ転送の遅延が大きくなり、デバイス間の通信がボトルネックになっている。本稿では、デバイス間データ伝送の削減とデバイス上での学習の高速化を目的として、暗黙的ニューラル表現(INR)を利用して画像や映像をニューラルネットワークの重みに圧縮する、フォグコンピューティングに基づく通信効率の高いデバイス上学習フレームワークであるResidual-INRを提案する。Residual-INRは、エッジデバイスからJPEG画像を収集し、フォグノードでINR形式に圧縮し、デバイス上での学習のために再配布することで、データ転送効率を向上させる。画像全体の符号化に小型のINRを用い、残差符号化による高品質なオブジェクト領域の再構成に別個のオブジェクトINRを用いることで、オブジェクトの品質を維持しながら符号化の冗長性を低減できる。Residual-INRは、10台のエッジデバイスからなるネットワークにおいてデータ伝送を最大5.16倍削減するため、エッジでのデバイス上学習の有望なソリューションである。また、CPUを介さない高速なデバイス上学習を可能にし、精度を犠牲にすることなく最大2.9倍の高速化を達成する。コードは https://github.com/sharclab/Residual-INR で利用可能である。

Edge computing is a distributed computing paradigm that collects and processes data at or near the source of data generation. On-device learning at the edge relies on device-to-device wireless communication to facilitate real-time data sharing and collaborative decision-making among multiple devices. This significantly improves the adaptability of the edge computing system to changing environments. However, as the scale of the edge computing system grows, communication among devices is becoming the bottleneck because the limited bandwidth of wireless communication leads to large data transfer latency. To reduce the amount of device-to-device data transmission and accelerate on-device learning, in this paper, we propose Residual-INR, a fog computing-based communication-efficient on-device learning framework that utilizes implicit neural representation (INR) to compress images/videos into neural network weights. Residual-INR enhances data transfer efficiency by collecting JPEG images from edge devices, compressing them into INR format at the fog node, and redistributing them for on-device learning. By using a smaller INR for full image encoding and a separate object INR for high-quality object region reconstruction through residual encoding, our technique can reduce the encoding redundancy while maintaining the object quality. Residual-INR is a promising solution for edge on-device learning because it reduces data transmission by up to 5.16x across a network of 10 edge devices. It also facilitates CPU-free accelerated on-device learning, achieving up to 2.9x speedup without sacrificing accuracy. Our code is available at: https://github.com/sharclab/Residual-INR.

翻訳日:2024-11-08 11:49:24 公開日:2024-10-18

# 大次元カーネル密度推定器

Kernel Density Estimators in Large Dimensions ( http://arxiv.org/abs/2408.05807v3 )

ライセンス: Link先を確認

Giulio Biroli, Marc Mézard,

(参考訳) 本稿では、高次元分布 $\rho(x)$ に対するカーネル密度推定について検討する。従来のアプローチは、次元 $d$ を固定しデータ点の数 $n$ を大きくする極限に重点を置いてきた。代わりに本稿では、データ点 $y_i$ の数 $n$ とその次元 $d$ の両方が、固定比 $\alpha=(\log n)/d$ で増大する領域を解析する。本研究は、カーネルに基づく密度推定 $\hat \rho_h^{\mathcal {D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$ に対して、帯域幅 $h$ に応じた3つの異なる統計的領域を明らかにする。帯域幅が大きい古典的領域では中心極限定理(CLT)が成り立ち、これは従来のアプローチで見られるものと類似している。帯域幅がある値 $h_{CLT}(\alpha)$ を下回ると、CLTは破綻する。$\rho(x)$ から抽出した固定の $x$ に対する $\hat\rho_h^{\mathcal {D}}(x)$ の統計は、裾の重い分布(アルファ安定分布)によって与えられる。特に、ある値 $h_G(\alpha)$ を下回ると、$\hat\rho_h^{\mathcal {D}}(x)$ は極値統計に支配される。すなわち、データベース中の少数の点だけが重要となり、密度推定量への支配的な寄与を与える。高次元多変量ガウスデータについて詳細な解析を行う。Kullback-Leiblerダイバージェンスに基づく最適な帯域幅のしきい値が、本稿で同定された新しい統計的領域に含まれることを示す。実務家に知られているように、帯域幅を小さくすると、カーネル推定された密度は滑らかな曲線から、データ点を中心とするピークの集まりへと変化する。本研究は、この一般的な現象が、異なる統計的性質によって特徴づけられる相の間の鋭い遷移に関連していることを明らかにし、高次元設定におけるカーネル密度推定に新たな知見を与える。

This paper studies Kernel Density Estimation for a high-dimensional distribution $\rho(x)$. Traditional approaches have focused on the limit of a large number of data points $n$ and fixed dimension $d$. We analyze instead the regime where both the number $n$ of data points $y_i$ and their dimensionality $d$ grow with a fixed ratio $\alpha=(\log n)/d$. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density $\hat \rho_h^{\mathcal {D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$, depending on the bandwidth $h$. For large bandwidth there is a classical regime where the Central Limit Theorem (CLT) holds, which is akin to the one found in traditional approaches. Below a certain value of the bandwidth, $h_{CLT}(\alpha)$, we find that the CLT breaks down. The statistics of $\hat\rho_h^{\mathcal {D}}(x)$ for a fixed $x$ drawn from $\rho(x)$ is given by a heavy-tailed distribution (an alpha-stable distribution). In particular, below a value $h_G(\alpha)$, we find that $\hat\rho_h^{\mathcal {D}}(x)$ is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. As known by practitioners, when decreasing the bandwidth a kernel-estimated density changes from a smooth curve to a collection of peaks centred on the data points. Our findings reveal that this general phenomenon is related to sharp transitions between phases characterized by different statistical properties, and offer new insights for Kernel density estimation in high-dimensional settings.
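
The estimator $\hat \rho_h^{\mathcal {D}}(x)$ in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, assuming a standard Gaussian kernel $K$ (the abstract does not fix the kernel); the names `gaussian_kernel`, `kde` and `largest_term_share` are hypothetical. The last helper reports how much of the sum comes from the single closest data point, which is one simple way to see the small-bandwidth behaviour where only a few points dominate.

```python
import math
import random


def gaussian_kernel(u):
    """Standard Gaussian kernel on R^d (an assumed choice of K)."""
    d = len(u)
    return math.exp(-0.5 * sum(c * c for c in u)) / (2.0 * math.pi) ** (d / 2.0)


def kde(x, data, h):
    """Kernel density estimate (1 / (n h^d)) * sum_i K((x - y_i) / h)."""
    n, d = len(data), len(x)
    total = sum(
        gaussian_kernel([(a - b) / h for a, b in zip(x, y)]) for y in data
    )
    return total / (n * h ** d)


def largest_term_share(x, data, h):
    """Fraction of the kernel sum contributed by its single largest term.

    Computed from log-weights so that tiny terms cannot cause a 0/0.
    """
    logs = [
        -0.5 * sum(((a - b) / h) ** 2 for a, b in zip(x, y)) for y in data
    ]
    top = max(logs)
    return 1.0 / sum(math.exp(value - top) for value in logs)


if __name__ == "__main__":
    rng = random.Random(0)
    d, n = 20, 2000
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for h in (4.0, 1.0, 0.25):
        share = largest_term_share(x, data, h)
        print(f"h={h}: kde={kde(x, data, h):.3e}, largest-term share={share:.3f}")
```

For a fixed query point and data set, the largest-term share can only grow as $h$ shrinks, so scanning $h$ downward shows the sum moving from many comparable terms toward a single dominant one. The values of `d`, `n` and `h` in the demo are arbitrary and are not taken from the paper.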

翻訳日:2024-11-08 11:49:24 公開日:2024-10-18

# EasyRec: 勧告のためのシンプルで効果的な言語モデル

EasyRec: Simple yet Effective Language Models for Recommendation ( http://arxiv.org/abs/2408.08821v2 )

ライセンス: Link先を確認

Xubin Ren, Chao Huang,

(参考訳) ディープニューラルネットワークは、リコメンダシステムのためのコラボレーティブフィルタリング(CF)において、ユーザとイテムのインタラクションデータから表現を学ぶための強力な技術になっている。しかし、既存の多くのメソッドは、ユニークなユーザIDとアイテムIDに大きく依存しており、十分なトレーニングデータが利用できないような現実的なゼロショット学習シナリオにおいて、うまく機能する能力を制限する。言語モデル(LM)の成功と、その強力な一般化能力に触発されて、重要な疑問が浮かび上がっている。本研究では,テキストに基づく意味理解を協調的な信号とシームレスに統合する,効果的で使いやすいアプローチであるEasyRecを提案する。 EasyRecは、コントラスト学習と協調言語モデルチューニングを組み合わせたテキストビヘイビアアライメントフレームワークを使用して、テキスト強化セマンティックスペースと協調行動情報との強いアライメントを保証する。さまざまな実世界のデータセットにわたる大規模な経験的評価は、特にテキストベースのゼロショットレコメンデーションシナリオにおいて、最先端の代替モデルと比較して、EasyRecの優れたパフォーマンスを示している。さらに、この研究は、プラグイン・アンド・プレイコンポーネントとしてEasyRecをテキスト強化協調フィルタリングフレームワークにシームレスに統合する可能性を強調し、既存のレコメンデーションシステムにより、推奨性能を高め、動的環境における進化するユーザの好みに適応することが可能になる。我々のEasyRecフレームワークの再現性を改善するために、モデル実装の詳細、ソースコード、データセットはリンクで利用可能である。

Deep neural networks have become a powerful technique for learning representations from user-item interaction data in collaborative filtering (CF) for recommender systems. However, many existing methods heavily rely on unique user and item IDs, which limits their ability to perform well in practical zero-shot learning scenarios where sufficient training data may be unavailable. Inspired by the success of language models (LMs) and their strong generalization capabilities, a crucial question arises: How can we harness the potential of language models to empower recommender systems and elevate its generalization capabilities to new heights? In this study, we propose EasyRec - an effective and easy-to-use approach that seamlessly integrates text-based semantic understanding with collaborative signals. EasyRec employs a text-behavior alignment framework, which combines contrastive learning with collaborative language model tuning, to ensure a strong alignment between the text-enhanced semantic space and the collaborative behavior information. Extensive empirical evaluations across diverse real-world datasets demonstrate the superior performance of EasyRec compared to state-of-the-art alternative models, particularly in the challenging text-based zero-shot recommendation scenarios. Furthermore, the study highlights the potential of seamlessly integrating EasyRec as a plug-and-play component into text-enhanced collaborative filtering frameworks, thereby empowering existing recommender systems to elevate their recommendation performance and adapt to the evolving user preferences in dynamic environments. For better result reproducibility of our EasyRec framework, the model implementation details, source code, and datasets are available at the link: https://github.com/HKUDS/EasyRec.

翻訳日:2024-11-08 07:18:07 公開日:2024-10-18

# EasyRec: 勧告のためのシンプルで効果的な言語モデル

EasyRec: Simple yet Effective Language Models for Recommendation ( http://arxiv.org/abs/2408.08821v3 )

ライセンス: Link先を確認

Xubin Ren, Chao Huang,

翻訳日:2024-11-08 07:18:07 公開日:2024-10-18

# SparseGPTのよりタイトな計算量解析

A Tighter Complexity Analysis of SparseGPT ( http://arxiv.org/abs/2408.12151v2 )

ライセンス: Link先を確認

Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song,

(参考訳) 本研究では、SparseGPT [Frantar, Alistarh ICML 2023] の実行時間の解析を、任意の $a \in [0,1]$ に対して $O(d^{3})$ から $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ に改善した。ここで $\omega$ は行列乗算の指数である。特に、現在の $\omega \approx 2.371$ [Alman, Duan, Williams, Xu, Xu, Zhou 2024] の場合、実行時間は $O(d^{2.53})$ に帰着する。この実行時間は、[Deng, Song, Weinstein 2022; Brand, Song, Zhou ICML 2024] のような反復的メンテナンス問題における遅延更新の挙動の解析によって得られる。

In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ for any $a \in [0, 1]$, where $\omega$ is the exponent of matrix multiplication. In particular, for the current $\omega \approx 2.371$ [Alman, Duan, Williams, Xu, Xu, Zhou 2024], our running time boils down to $O(d^{2.53})$. This running time is due to the analysis of the lazy update behavior in iterative maintenance problems such as [Deng, Song, Weinstein 2022; Brand, Song, Zhou ICML 2024].
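To see why an intermediate value of $a$ is what helps, it may be useful to evaluate the bound at the two endpoints. The sketch below assumes the standard reading of $\omega(1,1,a)$ as the exponent for multiplying a $d \times d$ matrix by a $d \times d^{a}$ matrix, so that $\omega(1,1,0) = 2$ and $\omega(1,1,1) = \omega$. The abstract does not spell this definition out.

```latex
% Endpoints of T(a) = d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a}
\begin{align*}
a = 0:&\quad d^{\omega} + d^{2+o(1)} + d^{1+2-0} = O(d^{3}),\\
a = 1:&\quad d^{\omega} + d^{3+o(1)} + d^{1+\omega-1} = O(d^{3+o(1)}).
\end{align*}
```

Both endpoints are essentially cubic, so any gain must come from an interior $a$ that balances the second and third terms. If the middle term $d^{2+a+o(1)}$ is the dominant one at the optimum, the stated $O(d^{2.53})$ would correspond to $a \approx 0.53$. That value is an inference from the abstract, not a figure quoted from the paper.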

翻訳日:2024-11-08 05:49:00 公開日:2024-10-18

# 量子レインボー符号

Quantum Rainbow Codes ( http://arxiv.org/abs/2408.13130v2 )

ライセンス: Link先を確認

Thomas R. Scruby, Arthur Pesah, Mark Webster,

(参考訳) 色符号とピン符号を一般化した新しい量子誤り訂正符号のクラスであるレインボー符号を導入する。レインボー符号は、$0$-単体の有効な $(D+1)$-彩色を許容する任意の $D$ 次元単体複体上で定義できる。本稿では、これらの単体複体がハイパーグラフ積を介して得られる鎖複体から導出される場合を詳細に検討し、これらの符号をドメイン壁で結合された色符号の集合として再解釈することで、距離と符号化量子ビット数が増大する符号族、および $T$ と $T^\dag$ のトランスバーサルな適用によって実装される論理的非クリフォードゲートが得られることを示す。これらの技法をZhu et al. (arXiv:2310.16982) の準双曲色符号と組み合わせることで、トランスバーサルな非クリフォードゲートとパラメータ $[\![n,O(n),O(\log(n))]\!]$ を持つ符号族が得られ、マジック状態の収率パラメータ $\gamma = \log_d(n/k)$ を任意に小さくできる。$\gamma \rightarrow 0$ を達成する他の最近の構成とは対照的に、我々の符号は量子ビット上でネイティブに定義され、LDPCであり、論理的非クリフォードゲートを(エンタングリングではなく)単一量子ビットの物理演算で実装できるが、漸近的に良い符号ではない。

We introduce rainbow codes, a novel class of quantum error correcting codes generalising colour codes and pin codes. Rainbow codes can be defined on any $D$-dimensional simplicial complex that admits a valid $(D+1)$-colouring of its $0$-simplices. We study in detail the case where these simplicial complexes are derived from chain complexes obtained via the hypergraph product and, by reinterpreting these codes as collections of colour codes joined at domain walls, show that we can obtain code families with growing distance and number of encoded qubits as well as logical non-Clifford gates implemented by transversal application of $T$ and $T^\dag$. By combining these techniques with the quasi-hyperbolic colour codes of Zhu et al. (arXiv:2310.16982) we obtain families of codes with transversal non-Clifford gates and parameters $[\![n,O(n),O(\log(n))]\!]$ which allow the magic-state yield parameter $\gamma = \log_d(n/k)$ to be made arbitrarily small. In contrast to other recent constructions that achieve $\gamma \rightarrow 0$ our codes are natively defined on qubits, are LDPC, and have logical non-Clifford gates implementable by single-qubit (rather than entangling) physical operations, but are not asymptotically good.
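The claim that $\gamma$ can be made arbitrarily small follows from the stated parameters by a short calculation. The sketch below assumes the $O(\cdot)$ entries are tight, that is $k \ge c_1 n$ and $d \ge c_2 \log n$ for constants $c_1, c_2 > 0$. These constants are introduced here only for illustration.

```latex
\begin{align*}
\gamma = \log_d\!\left(\frac{n}{k}\right)
       = \frac{\ln(n/k)}{\ln d}
       \le \frac{\ln(1/c_1)}{\ln(c_2 \log n)}
       \xrightarrow[n \to \infty]{} 0 .
\end{align*}
```

The numerator stays bounded because the rate $k/n$ is constant, while the denominator grows like $\ln \ln n$. The convergence is therefore slow, but $\gamma$ does fall below any fixed threshold for large enough $n$.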

翻訳日:2024-11-08 05:26:28 公開日:2024-10-18

# SciLitLLM:科学文献理解のためのLLMの適応方法

SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding ( http://arxiv.org/abs/2408.15545v3 )

ライセンス: Link先を確認

Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai,

(参考訳) 科学文献の理解は、対象とする情報を抽出し洞察を得るために不可欠であり、科学的発見を大きく前進させる。大規模言語モデル(LLM)の顕著な成功にもかかわらず、LLMは科学文献理解において課題に直面している。その主な原因は、(1) 科学的知識の欠如と、(2) 専門的な科学的タスクへの不慣れである。本研究では、科学文献理解に特化したLLMを開発するために、継続事前学習(CPT)と教師付き微調整(SFT)を統合したハイブリッド戦略を提案し、科学的ドメイン知識を注入すると同時に、ドメイン固有タスクに対する指示追従能力を高める。この過程で、(1) 高品質なCPTコーパスの構築と、(2) 多様なSFT指示の生成という2つの主要な課題を特定する。我々は、PDFテキスト抽出、パース内容の誤り訂正、品質フィルタリング、合成指示生成を含む綿密なパイプラインを通じてこれらの課題に対処する。この戦略を適用して、科学文献理解に特化したLLM群であるSciLitLLMを提示する。これらのモデルは科学文献理解ベンチマークにおいて有望な性能を示す。我々の貢献は3つある。 (1) CPTとSFTを統合してLLMを科学文献理解に適応させる効果的なフレームワークを提示する。このフレームワークは他の領域にも容易に適用できる。 (2) 多様で高品質な科学的指示を生成するLLMベースの合成法を提案し、表現の少ない科学領域における教師付き微調整のための新しい指示セットであるSciLitInsを構築した。 (3) SciLitLLMは、科学文献理解ベンチマークにおいて有望な性能向上を達成した。

Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.

翻訳日:2024-11-08 04:30:58 公開日:2024-10-18

# MedDet:効率的な頚椎椎間板ヘルニア検出のための敵対的生成蒸留

MedDet: Generative Adversarial Distillation for Efficient Cervical Disc Herniation Detection ( http://arxiv.org/abs/2409.00204v2 )

ライセンス: Link先を確認

Zeyu Zhang, Nengmin Yi, Shengbo Tan, Ying Cai, Yi Yang, Lei Xu, Qingtai Li, Zhang Yi, Daji Ergu, Yang Zhao,

(参考訳) 頚椎椎間板ヘルニア(Cervical disc herniation, CDH)は、健康に大きな影響を与え、専門家による労力のかかる分析を必要とする、有病率の高い筋骨格障害である。医用画像の自動検出の進歩にもかかわらず、2つの大きな課題がこれらの手法の実世界での応用を妨げている。第一に、計算の複雑さとリソース要求が、リアルタイム応用との間に大きなギャップを生む。第二に、MRIのノイズが特徴抽出を歪ませ、既存の手法の有効性を低下させる。これらの課題に対処するため、3つの主要な貢献を提案する。まず、モデル圧縮と効率化のためにマルチ教師・単一生徒の知識蒸留を活用し、同時に性能向上のために敵対的生成学習を統合したMedDetを導入する。さらに、MRIのノイズに対するモデルの耐性を改善するために、2階のnmODEをカスタマイズする。最後に、CDH-1848データセットで包括的な実験を行い、従来の手法と比較して最大5%のmAP改善を達成した。提案手法はまた、教師モデルと比較してパラメータを約67.8%、FLOPsを36.9%削減し、推論速度を5倍以上に向上させる。これらの進歩はCDH自動検出の性能と効率を大幅に向上させ、将来の臨床応用への有望な可能性を示している。プロジェクトのWebサイト: https://steve-zeyu-zhang.github.io/MedDet

Cervical disc herniation (CDH) is a prevalent musculoskeletal disorder that significantly impacts health and requires labor-intensive analysis from experts. Despite advancements in automated detection of medical imaging, two significant challenges hinder the real-world application of these methods. First, the computational complexity and resource demands present a significant gap for real-time application. Second, noise in MRI reduces the effectiveness of existing methods by distorting feature extraction. To address these challenges, we propose three key contributions: Firstly, we introduced MedDet, which leverages the multi-teacher single-student knowledge distillation for model compression and efficiency, meanwhile integrating generative adversarial training to enhance performance. Additionally, we customize the second-order nmODE to improve the model's resistance to noise in MRI. Lastly, we conducted comprehensive experiments on the CDH-1848 dataset, achieving up to a 5% improvement in mAP compared to previous methods. Our approach also delivers over 5 times faster inference speed, with approximately 67.8% reduction in parameters and 36.9% reduction in FLOPs compared to the teacher model. These advancements significantly enhance the performance and efficiency of automated CDH detection, demonstrating promising potential for future application in clinical practice. See project website https://steve-zeyu-zhang.github.io/MedDet

翻訳日:2024-11-08 03:46:25 公開日:2024-10-18

# GraphInsight: グラフ構造理解のための大規模言語モデルの洞察を引き出す

GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding ( http://arxiv.org/abs/2409.03258v2 )

ライセンス: Link先を確認

Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou,

(参考訳) 大規模言語モデル(LLM)はグラフ処理の可能性を示しているが、特にグラフサイズが大きくなるにつれて、グラフ記述シーケンスのプロンプトを通じてグラフ構造情報を理解することに苦労する。我々はこの課題を、グラフ記述シーケンス内の位置によってLLMの記憶性能が不均一になること、すなわち「位置バイアス」に起因するものと考える。そこで我々は、マクロおよびミクロレベルのグラフ情報に対するLLMの理解を改善する新しいフレームワークであるGraphInsightを提案する。GraphInsightは2つの重要な戦略に基づいている。 1) LLMがより強い記憶性能を示す位置に重要なグラフ情報を配置すること、 2) 検索拡張生成(RAG)に着想を得て、記憶性能の弱い領域に対して軽量な外部知識ベースを検討すること、である。さらにGraphInsightは、多段階推論を必要とする複合グラフタスクのために、これら2つの戦略をLLMエージェントのプロセスに統合することを検討する。幅広い評価タスクを持つベンチマークでの広範な実証研究により、GraphInsightは、さまざまなサイズのグラフ構造の理解において、他のすべてのグラフ記述手法(例えば、プロンプト技法や並べ替え戦略)を大きく上回ることが示された。

Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as ''positional biases''. To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.

翻訳日:2024-11-07 23:23:02 公開日:2024-10-18

# 自然言語のプランニングによりコード生成のためのLLM検索が改善

Planning In Natural Language Improves LLM Search For Code Generation ( http://arxiv.org/abs/2409.03733v2 )

ライセンス: Link先を確認

Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang,

(参考訳) 学習時の計算量のスケーリングは大規模言語モデル(LLM)に顕著な改善をもたらしたが、推論時の計算量のスケーリングはまだ同様の向上をもたらしていない。我々は、中核的な欠落要素はLLM出力の多様性の欠如であり、モデルがよく似ているが誤った生成を繰り返しサンプリングするために探索が非効率になると仮定する。この多様性の欠如は、問題を解くための候補計画を自然言語で探索することで緩和できることを実証的に示す。この知見に基づいて、HumanEval+、MBPP+、LiveCodeBench(競技プログラミング向けの汚染のないベンチマーク)で強力な結果を示す新しい探索アルゴリズムであるPlanSearchを提案する。PlanSearchは、問題に関する多様な観察を生成し、それらの観察を用いて問題を解くための計画を構築する。コードの解を直接探索するのではなく自然言語の計画を探索することで、PlanSearchはベースラインの探索手法よりもはるかに多様な解の候補を探索する。Claude 3.5 Sonnet上でPlanSearchを使用すると、LiveCodeBenchで最先端のpass@200 = 77.0%を達成し、探索なしの最高スコア(pass@1 = 41.4%)と標準的な繰り返しサンプリング(pass@200 = 60.6%)の両方を上回る。最後に、分析したすべてのモデル、探索アルゴリズム、ベンチマークにおいて、探索による性能向上を、生成されたアイデアの多様性の直接的な関数として正確に予測できることを示す。コードは https://github.com/scaleapi/plansearch にある。

While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PlanSearch, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PlanSearch generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PlanSearch explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PlanSearch on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas. Code can be found at https://github.com/scaleapi/plansearch.
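The pass@1 and pass@200 figures quoted above are instances of the pass@k metric. A commonly used way to compute it is the unbiased estimator introduced with HumanEval: draw `n` samples per problem, count the `c` correct ones, and estimate the chance that at least one of `k` randomly chosen samples is correct. Whether PlanSearch uses exactly this estimator is an assumption here, and the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    # 1 - P(all k chosen samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 correct samples out of 10, a single draw succeeds 30% of the time,
# while a budget of 5 draws succeeds far more often.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The estimator also shows why diversity matters: when samples are near-duplicates, `c` tends to be either close to 0 or close to `n` for a given problem, so raising `k` buys little. A more varied pool of candidates gives extra attempts a real chance of covering a correct solution.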

翻訳日:2024-11-07 23:11:54 公開日:2024-10-18

# 訓練されたエージェント探索によるインタラクティブな生成環境の学習

Learning Generative Interactive Environments By Trained Agent Exploration ( http://arxiv.org/abs/2409.06445v2 )

ライセンス: Link先を確認

Naser Kazemi, Nedko Savov, Danda Paudel, Luc Van Gool,

(参考訳) 世界モデルは、複雑な環境のルールと行動の解釈とシミュレートにおいて、ますます重要になっている。最近のモデルであるGenieは、視覚的に多様な環境からの学習に優れていますが、コストのかかる人為的なデータに依存しています。ランダムエージェントの代替手法が環境を探索するには限界すぎることを観察する。データ生成に強化学習に基づくエージェントを用いてモデルを改善することを提案する。このアプローチは、さまざまなシナリオや環境内の現実的なアクションに対して、モデルを適応し、適切に実行する能力を高める多様なデータセットを生成する。本稿では、Genieをベースにした実装であるGenieReduxモデルを最初にリリースする。また,GenieRedux-Gを導入し,エージェントの容易な動作を利用して,検証中の動作予測の不確実性を判断する。 Coinrun ケーススタディの再現を含む評価の結果,GenieRedux-G は訓練されたエージェント探索を用いて優れた視覚的忠実度と制御性が得られることが示された。提案されたアプローチは再現可能で、スケーラブルで、新しいタイプの環境に適応できる。私たちのコードベースはhttps://github.com/insait-institute/GenieRedux で公開されています。

World models are increasingly pivotal in interpreting and simulating the rules and actions of complex environments. Genie, a recent model, excels at learning from visually diverse environments but relies on costly human-collected data. We observe that their alternative method of using random agents is too limited to explore the environment. We propose to improve the model by employing reinforcement learning based agents for data generation. This approach produces diverse datasets that enhance the model's ability to adapt and perform well across various scenarios and realistic actions within the environment. In this paper, we first release the model GenieRedux, an implementation based on Genie. Additionally, we introduce GenieRedux-G, a variant that uses the agent's readily available actions to factor out action prediction uncertainty during validation. Our evaluation, including a replication of the Coinrun case study, shows that GenieRedux-G achieves superior visual fidelity and controllability using the trained agent exploration. The proposed approach is reproducible, scalable and adaptable to new types of environments. Our codebase is available at https://github.com/insait-institute/GenieRedux.

翻訳日:2024-11-07 22:16:23 公開日:2024-10-18

# LED:夜間における光強化深度推定

LED: Light Enhanced Depth Estimation at Night ( http://arxiv.org/abs/2409.08031v2 )

ライセンス: Link先を確認

Simon de Moreau, Yasser Almehio, Andrei Bursuc, Hafid El-Idrissi, Bogdan Stanciulescu, Fabien Moutarde,

(参考訳) 夜間カメラによる深度推定は、特に安全なナビゲーションを確保するために正確な深度認識が不可欠である自律運転アプリケーションにおいて、非常に困難な作業である。夜間における知覚システムの信頼性向上を目指しており、日中のデータで訓練されたモデルは、正確なLiDARセンサーがなければ、しばしば失敗する。本研究は,高精細ヘッドライトによって投影されるパターンを活用することで,低照度環境における奥行き推定を大幅に改善する,新しいコスト効率のアプローチであるLight Enhanced Depth(LED)を紹介する。 LEDは、複数の深度推定アーキテクチャ(エンコーダ-デコーダ、Adabins、DepthFormer)において、合成データセットと実際のデータセットの両方において、大幅なパフォーマンス向上をもたらします。さらに,照明領域を越えた性能向上は,シーン理解の全体的向上を示す。最後に、我々はNighttime Synthetic Drive Datasetをリリースした。Nighttime Synthetic Drive Datasetは、49,990の注釈付き画像からなる、新しい合成的で写真リアルなナイトタイムデータセットである。

Nighttime camera-based depth estimation is a highly challenging task, especially for autonomous driving applications, where accurate depth perception is essential for ensuring safe navigation. We aim to improve the reliability of perception systems at night time, where models trained on daytime data often fail in the absence of precise but costly LiDAR sensors. In this work, we introduce Light Enhanced Depth (LED), a novel cost-effective approach that significantly improves depth estimation in low-light environments by harnessing a pattern projected by high definition headlights available in modern vehicles. LED leads to significant performance boosts across multiple depth-estimation architectures (encoder-decoder, Adabins, DepthFormer) both on synthetic and real datasets. Furthermore, increased performances beyond illuminated areas reveal a holistic enhancement in scene understanding. Finally, we release the Nighttime Synthetic Drive Dataset, a new synthetic and photo-realistic nighttime dataset, which comprises 49,990 comprehensively annotated images.

翻訳日:2024-11-07 21:31:36 公開日:2024-10-18

# 大規模言語モデルに基づく生成的誤り訂正:音声認識、話者タグ付け、感情認識のチャレンジとベースライン

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition ( http://arxiv.org/abs/2409.09785v3 )

ライセンス: Link先を確認

Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke,

(参考訳) 生成AI技術の最近の進歩を踏まえると、大規模言語モデル(LLM)が、凍結した事前学習済みの自動音声認識(ASR)モデルのテキストデコード結果を用いて、音響モデリングタスクをどのように強化できるかが重要な問いとなる。音声処理における言語モデリングの新たな能力を探るため、生成的音声書き起こし誤り訂正(GenSEC)チャレンジを導入する。このチャレンジは、ASR後の3つの言語モデリングタスクで構成される。 (i) ASR後の書き起こし訂正、 (ii) 話者タグ付け、 (iii) 感情認識。これらのタスクは、音声ベースのインターフェースを扱う将来のLLMベースのエージェントを模擬することを目的としており、オープンな事前学習済み言語モデルやエージェントベースのAPIを利用することで、幅広い参加者が取り組めるようにしている。また、ベースライン評価から得られた知見や、今後の評価を設計するうえでの教訓についても論じる。

Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.

翻訳日:2024-11-07 20:46:36 公開日:2024-10-18

# CSKV:長文コンテキストシナリオにおけるKVキャッシュのための訓練効率の良いチャネル縮小

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios ( http://arxiv.org/abs/2409.10593v3 )

ライセンス: Link先を確認

Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang,

(参考訳) 大規模言語モデル(LLM)は、長文コンテキストのタスクを処理するために広く採用されている。しかし、キー・バリュー(KV)キャッシュの大きなメモリオーバーヘッドは、長文コンテキストのシナリオにおいて大きな課題となる。既存のトレーニング不要なKVキャッシュ圧縮手法は、主に量子化とトークンプルーニングに重点を置いているが、これらには圧縮の限界があり、過度なスパース性は深刻な性能低下を招きうる。他の手法はKVオーバーヘッドの少ない新しいアーキテクチャを設計するが、かなりのトレーニングオーバーヘッドを必要とする。この2つの欠点に対処するため、チャネル次元の冗長性をさらに検討し、少ないトレーニングコストでアーキテクチャレベルの設計を適用する。そこで我々は、KVキャッシュ圧縮のための訓練効率の高いチャネル縮小(Channel Shrinking)手法であるCSKVを導入する。 (1) まずKVキャッシュの特異値分布を解析し、チャネル次元に沿った大きな冗長性と圧縮の余地を明らかにする。この観察に基づき、キー層とバリュー層に低ランク分解を用い、低次元の特徴を保存することを提案する。 (2) モデル性能を維持するため、ウィンドウベースのフル精度KVキャッシュと低精度の圧縮KVキャッシュからなる2分岐KVキャッシュを導入する。 (3) トレーニングコストを削減するため、LLM全体を再学習する代わりに、圧縮KVキャッシュの層ごとの再構成損失を最小化する。広範な実験により、CSKVはモデルの長文コンテキスト能力を維持しつつ、KVキャッシュのメモリオーバーヘッドを80%削減できることが示された。さらに、本手法を量子化とシームレスに組み合わせることでメモリオーバーヘッドをさらに低減でき、最大95%の圧縮率を達成できることを示す。コードは https://github.com/wln20/CSKV で入手できる。

Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%. Code is available at https://github.com/wln20/CSKV.
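Step (1) above, a low-rank decomposition of the key/value layers with only the low-dimensional features cached, can be sketched with a truncated SVD. This is an illustration of the general idea only: the helper `low_rank_factors`, the toy dimensions, and the exactly low-rank weight matrix are assumptions made so that the demo reconstructs exactly. The sketch leaves out the bi-branch cache and the layer-wise reconstruction training the abstract describes.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Split weight (d_in x d_out) into A (d_in x rank) and B (rank x d_out) via truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # scale the kept left singular vectors
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
d_model, rank, seq_len = 64, 13, 10

# Hypothetical key projection built to have rank 13; real weights are only approximately low-rank.
w_k = rng.standard_normal((d_model, rank)) @ rng.standard_normal((rank, d_model))
a, b = low_rank_factors(w_k, rank)

hidden = rng.standard_normal((seq_len, d_model))
compressed_cache = hidden @ a            # what gets stored: (seq_len, rank)
keys = compressed_cache @ b              # restored to (seq_len, d_model) when attention needs it
full_keys = hidden @ w_k                 # what an uncompressed cache would have stored

saving = 1 - rank / d_model
print(compressed_cache.shape, np.allclose(keys, full_keys), round(saving, 2))
```

Storing 13 channels instead of 64 cuts the per-token cache by roughly 80%, the same order as the reduction reported in the abstract. With real weights the truncation is lossy, which is presumably why the method fine-tunes the factors against a reconstruction loss and keeps a full-precision window for recent tokens.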

翻訳日:2024-11-07 20:24:11 公開日:2024-10-18

# Fact, Fetch, Reason : Retrieval-Augmented Generation の統一評価

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation ( http://arxiv.org/abs/2409.12941v1 )

ライセンス: Link先を確認

Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui,

(参考訳) 大規模言語モデル(LLM)は、様々な認知タスクにおいて大幅な性能向上を示している。新たな応用として、LLMを用いた検索拡張生成(RAG)機能の強化がある。これらのシステムでは、LLMがユーザクエリを理解し、関連情報を検索し、一貫性のある正確な応答を合成する必要がある。このようなシステムの実世界での展開が増えるなか、包括的な評価が重要となる。そこで本研究では、LLMが事実に基づく応答を提供する能力をテストし、検索能力を評価し、最終的な回答の生成に必要な推論を評価するために設計された高品質な評価データセットであるFRAMES(Factuality, Retrieval, And reasoning MEasurement Set)を提案する。これまでの研究はこれらの能力を個別に評価するデータセットとベンチマークを提供してきたが、FRAMESは、エンドツーエンドのRAGシナリオにおけるLLMの性能をより明確に示す統一的なフレームワークを提供する。我々のデータセットは、複数の情報源からの情報の統合を必要とする、難易度の高いマルチホップ質問で構成される。ベースラインの結果として、最先端のLLMでもこの課題に苦戦し、検索なしでは0.40の精度にとどまることを示す。提案する多段階検索パイプラインでは精度が大幅に向上し、0.66(50%超の改善)を達成する。我々は、本研究が評価のギャップを埋め、より堅牢で有能なRAGシステムの開発に役立つことを願っている。

Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
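The ">50% improvement" figure is a relative gain rather than an absolute one. A quick check using only the two accuracies quoted in the abstract:

```latex
\begin{align*}
\text{absolute gain} &= 0.66 - 0.40 = 0.26,\\
\text{relative gain} &= \frac{0.66 - 0.40}{0.40} = 0.65 = 65\%.
\end{align*}
```

The multi-step retrieval pipeline therefore adds 26 accuracy points, which is a 65% increase over the no-retrieval baseline and consistent with the stated ">50%".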

翻訳日:2024-11-07 12:48:01 公開日:2024-10-18

# Video-XL:時間スケールのビデオ理解のための超長尺ビジョン言語モデル

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding ( http://arxiv.org/abs/2409.14485v2 )

ライセンス: Link先を確認

Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao,

(参考訳) 現在のMLLM(Multi-modal Large Language Models)は、ビデオ理解において有望な結果を示しているが、非常に長いビデオの処理は依然として課題である。通常、MLLMはLLMの最大コンテキスト長を超える数千のトークンを扱うことに苦労し、トークン集約による視覚的明瞭度の低下も生じる。もう一つの課題は、大量のビデオトークンから生じる高い計算コストである。これらの課題に対処するために、時間スケールのビデオを効率的に理解するための超長尺ビジョン言語モデルであるVideo-XLを提案する。具体的には、LLMは効果的な視覚コンデンサとして適応できると主張し、視覚コンテキストを高度にコンパクトな形式に凝縮するVisual Context Latent Summarizationを導入する。広範な実験により、限られた画像データで学習したにもかかわらず、我々のモデルは一般的な長尺ビデオ理解ベンチマークで有望な結果を達成することが示された。さらに、Video-XLは効率と有効性の間で有望なバランスを実現しており、単一の80GB GPU上で1024フレームを処理しつつ、Needle-in-a-Haystack評価においてほぼ100%の精度を達成する。我々は、ビデオ要約、監視映像の異常検出、広告配置の識別などの長尺ビデオアプリケーションにとって、Video-XLが貴重なツールになることを期待している。

Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of tokens that exceed the maximum context length of LLMs, and they experience reduced visual clarity due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and introduce Visual Context Latent Summarization, which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks, despite being trained on limited image data. Moreover, Video-XL strikes a promising balance between efficiency and effectiveness, processing 1024 frames on a single 80GB GPU while achieving nearly 100% accuracy in the Needle-in-a-Haystack evaluation. We envision Video-XL becoming a valuable tool for long video applications such as video summarization, surveillance anomaly detection, and ad placement identification.

翻訳日:2024-11-06 22:30:40 公開日:2024-10-18

# Video-XL:時間スケールのビデオ理解のための超長尺ビジョン言語モデル

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding ( http://arxiv.org/abs/2409.14485v3 )

ライセンス: Link先を確認

Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao,

(参考訳) 現在のMLLM(Multi-modal Large Language Models)は、ビデオ理解において有望な結果を示しているが、非常に長いビデオの処理は依然として課題である。一般的にMLLMは、最大コンテキスト長を超える数千の視覚トークンを扱うことに苦労し、トークン集約による情報減衰にも悩まされる。もう一つの課題は、大量のビデオトークンから生じる高い計算コストである。これらの課題に対処するために、時間スケールのビデオを効率的に理解するための超長尺ビジョン言語モデルであるVideo-XLを提案する。具体的には、LLMは効果的な視覚コンデンサとして適応できると主張し、視覚コンテキストを高度にコンパクトな形式に凝縮するVisual Context Latent Summarizationを提案する。広範な実験により、我々のモデルは一般的な長尺ビデオ理解ベンチマークで有望な結果を達成することが示された。例えば、Video-XLはVNBenchにおいて現在の最先端手法を精度で約10%上回る。さらに、Video-XLは効率と有効性の優れたバランスを示し、単一の80GB GPU上で2048フレームを処理しつつ、Needle-in-a-Haystack評価において95%近い精度を達成する。

Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of visual tokens that exceed the maximum context length, and they suffer from information decay due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and propose Visual Context Latent Summarization, which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks. For example, Video-XL outperforms the current state-of-the-art method on VNBench by nearly 10% in accuracy. Moreover, Video-XL presents an impressive balance between efficiency and effectiveness, processing 2048 frames on a single 80GB GPU while achieving nearly 95% accuracy in the Needle-in-a-Haystack evaluation.

翻訳日:2024-11-06 22:30:40 公開日:2024-10-18

# 大規模言語モデルにおける性・人種・年齢バイアスの評価:職業・犯罪シナリオの比較分析

Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios ( http://arxiv.org/abs/2409.14583v1 )

ライセンス: Link先を確認

Vishal Mirza, Rahul Kulkarni, Aakanksha Jadhav,

(参考訳) 大規模言語モデル(LLM)の最近の進歩は目覚ましいが、様々な制約のため、企業での広範な採用はまだ限られている。本稿では、LLMのユーザビリティ、信頼性、公平性に影響を与える重要な問題であるバイアスについて検討する。研究者はバイアスを軽減するための戦略を開発しており、例えば、デバイアス層、WinogenderやWinobiasのような専用の参照データセット、人間のフィードバックによる強化学習(RLHF)などがある。これらの技術は最新のLLMに統合されている。本研究は、2024年に公開された4つの主要なLLM(Gemini 1.5 Pro、Llama 3 70B、Claude 3 Opus、GPT-4o)について、職業シナリオにおける性別バイアスと、犯罪シナリオにおける性別・年齢・人種バイアスを評価する。その結果、LLMは様々な職業において男性よりも女性のキャラクターを頻繁に描く傾向があり、米国BLSのデータから37%の偏差を示した。犯罪シナリオでは、米国FBIのデータからの偏差は性別で54%、人種で28%、年齢で17%である。性別や人種のバイアスを減らす取り組みが、しばしば特定のサブクラスを過大に扱う結果につながり、問題を悪化させる可能性があることが観察された。これらの結果は、現在のバイアス緩和技術の限界を浮き彫りにし、より効果的なアプローチの必要性を強調している。

Recent advancements in Large Language Models (LLMs) have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs, a crucial issue affecting their usability, reliability, and fairness. Researchers are developing strategies to mitigate bias, including debiasing layers, specialized reference datasets like Winogender and Winobias, and reinforcement learning with human feedback (RLHF). These techniques have been integrated into the latest LLMs. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. We observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating the issue. These results highlight the limitations of current bias mitigation techniques and underscore the need for more effective approaches.

翻訳日:2024-11-06 22:08:18 公開日:2024-10-18

# 見るための耳と聞くための目:マルチモーダル大規模言語モデルを用いた音象徴実験

With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models ( http://arxiv.org/abs/2409.14917v2 )

ライセンス: Link先を確認

Tyler Loakman, Yucheng Li, Chenghua Lin,

(参考訳) 近年、大規模言語モデル(LLM)とビジョン言語モデル(VLM)は、心理言語学的現象を検証する実験において、人間の参加者の代替となりうる適性を示している。しかし、視覚とテキストのモダリティにしかアクセスできないモデルが、正書法と画像のみからの抽象的推論を通じて、音に基づく現象をどの程度暗黙的に理解できるのかという問いは、十分に研究されていない。これを調べるため、VLMとLLMが音象徴(すなわち音と概念の間の非恣意的な結びつきの認識)を示す能力と、オープンソースおよびクローズドソースのマルチモーダルモデルの言語モジュールと視覚モジュールの相互作用を通じて「聞く」能力を分析する。古典的なKiki-BoubaおよびMil-Malの形状・大きさの象徴課題の再現や、言語的類像性に関する人間の判断とLLMの判断の比較など、複数の実験を行った。その結果、VLMと人間のラベルとの一致度はさまざまであり、in silico実験では、人間と比べてVLMにはより多くのタスク情報が必要になる可能性が示された。さらに、最大一致度がより高いことから、大きさの象徴は形状の象徴よりもVLMにとって識別しやすいパターンであること、また言語的類像性の理解はモデルサイズに大きく依存することが確認された。

Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated aptitude as potential substitutes for human participants in experiments testing psycholinguistic phenomena. However, an understudied question is to what extent models that only have access to vision and text modalities are able to implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone. To investigate this, we analyse the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise a non-arbitrary link between sounds and concepts) as well as their ability to "hear" via the interplay of the language and vision modules of open and closed-source multimodal models. We perform multiple experiments, including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks, and comparing human judgements of linguistic iconicity with that of LLMs. Our results show that VLMs demonstrate varying levels of agreement with human labels, and more task information may be required for VLMs versus their human counterparts for in silico experimentation. We additionally see through higher maximum agreement levels that Magnitude Symbolism is an easier pattern for VLMs to identify than Shape Symbolism, and that an understanding of linguistic iconicity is highly dependent on model size.

翻訳日:2024-11-06 20:39:08 公開日:2024-10-18

# CNNに基づくBi-GRUモデルを用いた英語攻撃テキストの検出

English offensive text detection using CNN based Bi-GRU model ( http://arxiv.org/abs/2409.15652v2 )

ライセンス: Link先を確認

Tonmoy Roy, Md Robiul Islam, Asif Ahammad Miazee, Anika Antara, Al Amin, Sunjim Hossain,

(参考訳) ここ数年、ソーシャルメディアの利用者数は大幅に増加した。人々はよくソーシャルプラットフォームを通じて自分の考えを共有し、これはヘイトコンテンツの増加につながる。この仮想コミュニティでは、個人が自分の見解を共有し、感情を表現し、写真、ビデオ、ブログなどを投稿する。 FacebookやTwitterのようなソーシャルネットワークサイトは、ワンクリックで大量のコンテンツを共有できるプラットフォームを提供している。しかし、これらのプラットフォームはアップロードされたコンテンツに制限を課していない。この問題を解決するために、不適切なコンテンツを分割するためには、新しいアイデアが実装されなければならない。プロセスを自動化するために多くの研究がなされている。本稿では,テキストが攻撃的であるか否かを分類する新しいBi-GRU-CNNモデルを提案する。 Bi-GRUモデルとCNNモデルの組み合わせは、既存のモデルよりも優れている。

Over the years, the number of users of social media has increased drastically. People frequently share their thoughts through social platforms, and this leads to an increase in hate content. In this virtual community, individuals share their views, express their feelings, and post photos, videos, blogs, and more. Social networking sites like Facebook and Twitter provide platforms to share vast amounts of content with a single click. However, these platforms do not impose restrictions on the uploaded content, which may include abusive language and explicit images unsuitable for social media. To resolve this issue, a new idea must be implemented to divide the inappropriate content. Numerous studies have been done to automate the process. In this paper, we propose a new Bi-GRU-CNN model to classify whether the text is offensive or not. The combination of the Bi-GRU and CNN models outperforms the existing model.

翻訳日:2024-11-06 19:32:29 公開日:2024-10-18

# CNNに基づくBi-GRUモデルを用いた英語攻撃テキストの検出

English offensive text detection using CNN based Bi-GRU model ( http://arxiv.org/abs/2409.15652v3 )

ライセンス: Link先を確認

Tonmoy Roy, Md Robiul Islam, Asif Ahammad Miazee, Anika Antara, Al Amin, Sunjim Hossain,

翻訳日:2024-11-06 19:32:29 公開日:2024-10-18

# オープンソースソフトウェアにおけるProtestwareに対する開発者の反応: color.js と es5.ext のケース

Developer Reactions to Protestware in Open Source Software: The cases of color.js and es5.ext ( http://arxiv.org/abs/2409.15674v2 )

ライセンス: Link先を確認

Youmei Fan, Dong Wang, Supatsara Wattanakriengkrai, Hathaichanok Damrongsiri, Christoph Treude, Hideaki Hata, Raula Gaikovina Kula,

(参考訳) メンテナが政治的・経済的な立場を表明するために自らの成果物を意図的に破壊する「プロテストウェア(protestware)」と呼ばれる行為への懸念が高まっている。我々の目的は、このような攻撃をめぐる議論、コミュニティによる受け止め方、そして開発者が攻撃に適時に対応しているかどうかを理解することである。そこで、2つの著名なプロテストウェアの事例である colors.js と es5-ext を調査した。その結果、プロテストウェアに関する議論は GitHub 上でより速く広まる一方、セキュリティ脆弱性に関する議論はソーシャルメディア上でより速く広まることが示された。プロテストウェアに関する議論の分類体系を確立することで、立場を表明する投稿や技術的な緩和手順を提供する投稿を特定した。さらに、プロテストウェアに関連する684件の投稿にテーマ分析を適用し、議論における5つの主要なテーマを特定した。 i. 拡散と対応、ii. 立場、iii. 評判、iv. コミュニケーションのスタイル、v. 権利と倫理である。本研究は、プロテストウェアをめぐる議論の微妙な様相を明らかにし、開発者の政治的・社会的行動とオープンソースコミュニティ全体の健全性との間で適切なバランスを保つための洞察を、研究者と開発者の双方に提供する。

There is growing concern about maintainers self-sabotaging their work in order to take political or economic stances, a practice referred to as "protestware". Our objective is to understand the discourse around discussions on such an attack, how it is received by the community, and whether developers respond to the attack in a timely manner. We study two notable protestware cases i.e., colors.js and es5-ext. Results indicate that protestware discussions are spread more quickly on the GitHub platform, while security vulnerabilities are faster on social media. By establishing a taxonomy of protestware discussions, we identify posts that express stances and provide technical mitigation instructions. We applied a thematic analysis to 684 protestware related posts to identify five major themes during the discussions: i. disseminate and response, ii. stance, iii. reputation, iv. communicative styles, v. rights and ethics. This work sheds light on the nuanced landscape of protestware discussions, offering insights for both researchers and developers into maintaining a healthy balance between the political or social actions of developers and the collective well-being of the open-source community.

翻訳日:2024-11-06 19:32:29 公開日:2024-10-18

# 教師あり微調整はアテンションヘッド活性化パターンの変更により高速なタスク適応を達成する

Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns ( http://arxiv.org/abs/2409.15820v2 )

ライセンス: Link先を確認

Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Ting Liu, Bing Qin,

(参考訳) 複雑なタスクにおけるLLMの性能は依然として不十分である。主な問題は、LLMがデータ駆動型の方式で学習する一方で、こうした複雑なタスクに関する指示データが乏しく、収集や構築も困難なことにある。これに対し、LLMは事前学習段階で十分な事前知識を獲得している単純なタスクであれば、かなり速く学習できるという顕著な現象がある。したがって、このような迅速な汎化の前提条件とメカニズムを解明できれば、LLMが複雑なタスクを学習する際の効率と有効性を高められる可能性がある。そこで本稿では、勾配に基づく手法を用いて、SFTがLLMを下流タスクに適応させる過程をアテンションパターンの観点から分析する。その結果、(1) LLMはSFTの過程でタスク固有のアテンションヘッドを選択的に活性化すること、(2) 複雑なタスクの活性化パターンは基本的なタスクのパターンの組み合わせであること、(3) 少数のパラメータの変化が、少数のサンプルによるSFT後の活性化パターンに大きな影響を与えうることがわかった。これらの知見に基づき、SFTの効率と有効性を実際に高めるための実験を行う。

LLMs' performance on complex tasks is still unsatisfactory. A key issue is that LLMs presently learn in a data-driven schema, while the instructions about these complex tasks are both scarce and hard to collect or construct. In contrast, a prominent phenomenon is that LLMs can learn rather fast on simpler tasks with adequate prior knowledge captured during the pretraining stage. Thus, if the prerequisite and mechanism of such rapid generalization could be elucidated, it could enhance the efficiency and effectiveness of the LLM's ability to learn complex tasks. In this paper, we therefore employ a gradient-based method to dissect how the SFT process adapts LLMs to downstream tasks from the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples. Based on these insights, experiments are conducted to actually enhance the efficiency and effectiveness of SFT.

翻訳日:2024-11-06 19:21:13 公開日:2024-10-18

# 同時音声翻訳におけるグラディエント・コンフリクトの緩和のためのモジュラー・ベース・ストラテジー

A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation ( http://arxiv.org/abs/2409.15911v2 )

ライセンス: Link先を確認

Xiaoqian Liu, Yangfan Du, Jianjin Wang, Yuan Ge, Chen Xu, Tong Xiao, Guocheng Chen, Jingbo Zhu,

(参考訳) 同時音声翻訳(SimulST)は、ストリーミング音声入力を継続的に処理しながらターゲット言語テキストを生成し、重要なリアルタイム課題を提示する。マルチタスク学習は、SimulSTのパフォーマンスを向上させるためにしばしば使用されるが、一次タスクと補助タスクの最適化競合を導入し、全体的な効率を損なう可能性がある。既存のモデルレベルのコンフリクト解決方法は、非効率を悪化させ、高いGPUメモリ消費をもたらすこのタスクには適していない。これらの課題に対処するため,よりきめ細かいモジュラレベルでの衝突を検知し,勾配予測を用いて解決するMGCM(Modular Gradient Conflict Mitigation)戦略を提案する。実験の結果,MGCMは特に中・高遅延条件下でのSimulST性能を著しく改善し,オフラインタスクにおいて0.68BLEUのスコアアップを達成した。さらにMGCMは、他の競合緩和手法と比較して、GPUメモリ消費を95%以上削減し、SimulSTタスクの堅牢なソリューションとして確立している。

Simultaneous Speech Translation (SimulST) involves generating target language text while continuously processing streaming speech input, presenting significant real-time challenges. Multi-task learning is often employed to enhance SimulST performance but introduces optimization conflicts between primary and auxiliary tasks, potentially compromising overall efficiency. The existing model-level conflict resolution methods are not well-suited for this task, which exacerbates inefficiencies and leads to high GPU memory consumption. To address these challenges, we propose a Modular Gradient Conflict Mitigation (MGCM) strategy that detects conflicts at a finer-grained modular level and resolves them utilizing gradient projection. Experimental results demonstrate that MGCM significantly improves SimulST performance, particularly under medium and high latency conditions, achieving a 0.68 BLEU score gain in offline tasks. Additionally, MGCM reduces GPU memory consumption by over 95% compared to other conflict mitigation methods, establishing it as a robust solution for SimulST tasks.
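
The idea of detecting conflicts per module and resolving them by gradient projection can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which gradient is projected or how modules are defined, so the choice below (project the auxiliary gradient onto the plane orthogonal to the primary gradient whenever their dot product is negative, in the style of PCGrad) and the names `project_conflict` and `mitigate_per_module` are assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(primary: Vector, auxiliary: Vector) -> Vector:
    """Return the auxiliary gradient with any conflicting component removed.

    Assumption: a conflict is a negative dot product, and it is resolved by
    subtracting from the auxiliary gradient its component along the primary one.
    """
    d = dot(auxiliary, primary)
    norm_sq = dot(primary, primary)
    if d >= 0 or norm_sq == 0:
        return list(auxiliary)  # no conflict (or nothing to project onto)
    scale = d / norm_sq
    return [a - scale * p for a, p in zip(auxiliary, primary)]


def mitigate_per_module(
    primary_grads: Dict[str, Vector], aux_grads: Dict[str, Vector]
) -> Dict[str, Vector]:
    """Check and resolve conflicts module by module, then sum the gradients."""
    combined = {}
    for name, g_main in primary_grads.items():
        g_aux = project_conflict(g_main, aux_grads[name])
        combined[name] = [m + a for m, a in zip(g_main, g_aux)]
    return combined


# Hypothetical two-module model: only the encoder gradients conflict.
primary = {"encoder": [1.0, 0.0], "decoder": [1.0, 1.0]}
auxiliary = {"encoder": [-1.0, 1.0], "decoder": [1.0, 0.0]}
result = mitigate_per_module(primary, auxiliary)
```

Working per module means a conflict in one part of the network does not force a projection everywhere else, which is the finer granularity the abstract contrasts with model-level methods.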

翻訳日:2024-11-06 19:21:13 公開日:2024-10-18

# ダイヤモンド中の窒素空孔中心を用いた広視野マイクロ波磁界イメージング

Wide-field microwave magnetic field imaging with nitrogen-vacancy centers in diamond ( http://arxiv.org/abs/2409.16528v2 )

ライセンス: Link先を確認

Luca Basso, Pauli Kehayias, Jacob Henshaw, Gajadhar Joshi, Michael P. Lilly, Matthew B. Jordan, Andrew M. Mounce,

(参考訳) マイクロ波(MW)磁場をマイクロスケールの横方向分解能で非侵襲的にイメージングすることは、MW技術や集積回路の故障解析などの様々な応用において重要である。ダイヤモンド窒素空孔(NV)中心磁力計は理想的なツールとして登場し、$\mu$mスケールの分解能、ミリメートルスケールの視野、高感度、様々なサンプルと互換性のある非侵襲イメージングを提供する。しかし、これまでは主に、静磁場や低周波磁場のイメージング、あるいはMW磁場のイメージングに関しては、NVスピン遷移を駆動するのと同じマイクロ波デバイスを直接特徴付けるために用いられてきた。本研究では、ダイヤモンド中のNV中心アンサンブルを用い、差分測定プロトコルによって、試験対象デバイスが生成するMW磁場の広視野イメージングを行う。顕微鏡は、NVスピン状態間のRabi振動を誘導するMWループを備え、被試験デバイス(DUT)からのMW磁場は、Rabi周波数の局所的な偏差によって測定される。この差分プロトコルにより、2.57 GHzのMW磁場の磁場マップが、感度 $\sim$ 9 $\mu$T Hz$^{-1/2}$、合計測定時間 $T = 357$ s で得られ、$340\times340$ $\mu$m$^2$ の視野を $\mu$m スケールの空間分解能でカバーし、DUT入力パワーのダイナミックレンジは30 dBである。この研究は、差分Rabi周波数測定に基づく新しいNV磁気測定プロトコルを実証し、標準的なNV Rabi磁気測定では直接測定することが難しい弱いMW磁場のイメージングにまで、NV広視野イメージング能力を拡張した。

Non-invasive imaging of microwave (MW) magnetic fields with microscale lateral resolution is pivotal for various applications, such as MW technologies and integrated circuit failure analysis. Diamond nitrogen-vacancy (NV) center magnetometry has emerged as an ideal tool, offering $\mu$m-scale resolution, millimeter-scale field of view, high sensitivity, and non-invasive imaging compatible with diverse samples. However, up until now, it has been predominantly used for imaging of static or low-frequency magnetic fields or, concerning MW field imaging, to directly characterize the same microwave device used to drive the NV spin transitions. In this work we leverage an NV center ensemble in diamond for wide-field imaging of MW magnetic fields generated by a test device employing a differential measurement protocol. The microscope is equipped with a MW loop to induce Rabi oscillations between NV spin states, and the MW field from the device-under-test is measured through local deviations in the Rabi frequency. This differential protocol yields magnetic field maps of a 2.57 GHz MW field with a sensitivity of $\sim$ 9 $\mu$T Hz$^{-1/2}$ for a total measurement duration of $T = 357$ s, covering a $340\times340$ $\mu$m$^2$ field of view with a $\mu$m-scale spatial resolution and a DUT input power dynamic range of 30 dB. This work demonstrates a novel NV magnetometry protocol, based on differential Rabi frequency measurement, that extends NV wide-field imaging capabilities to imaging of weak MW magnetic fields that would be difficult to measure directly through standard NV Rabi magnetometry.

翻訳日:2024-11-06 17:30:16 公開日:2024-10-18

# マルチモーダル分類のためのマルチモーダル混合コントラスト学習による共有関係の調和

Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification ( http://arxiv.org/abs/2409.17777v2 )

ライセンス: Link先を確認

Raja Kumar, Raghav Singhal, Pranamya Kulkarni, Deval Mehta, Kshitij Jadhav,

(参考訳) 深いマルチモーダル学習は、対照的な学習を活用して、モダリティをまたいだ明示的な1対1の関係を捉えることで、顕著な成功を収めた。しかし、実世界のデータは単純な対関係を超えて共有関係を示すことが多い。マルチモーダルデータに固有のニュアンス付き共有関係を抽出するマルチモーダル混合コントラスト学習手法であるM3CoLを提案する。我々の重要な貢献はミックスアップに基づくコントラッシブ・ロスであり、あるモダリティから混合サンプルを他のモダリティから対応するサンプルと整列させ、それら間の共有関係を捉えることによって、ロバストな表現を学ぶ。マルチモーダル分類タスクでは,Mixupに基づくコントラスト損失を補足して,統合モジュールと単調予測モジュールを統合してトレーニング中の補助的監視を行うフレームワークを導入する。多様なデータセット(N24News、ROSMAP、BRCA、Food-101)の広範な実験を通じて、M3CoLが共有マルチモーダル関係を効果的に捉え、ドメイン間の一般化を実証する。 N24News、ROSMAP、BRCAでは最先端の手法より優れており、Food-101では同等のパフォーマンスを達成している。我々の研究は、堅牢なマルチモーダル学習のための共有関係の学習の重要性を強調し、将来の研究に有望な道を開く。

Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.
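
A Mixup-based contrastive loss of the kind described above can be sketched as follows. The exact loss used by M3CoL is not given in the abstract, so this is only one plausible reading: an embedding from one modality is mixed from samples `i` and `j` with weight `lam`, and a softmax cross-entropy over similarities to the other modality uses the soft targets `lam` and `1 - lam`. The function names and the temperature value are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mix(x_i: Vector, x_j: Vector, lam: float) -> Vector:
    """Mixup: convex combination of two embeddings from the same modality."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]


def mixup_contrastive_loss(
    mixed: Vector, others: List[Vector], i: int, j: int, lam: float,
    temperature: float = 0.1,
) -> float:
    """Cross-entropy between softmax similarities and soft targets (lam, 1 - lam).

    `others` holds the embeddings of the second modality; indices i and j mark
    the samples that correspond to the two mixed inputs.
    """
    logits = [dot(mixed, o) / temperature for o in others]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_z for v in logits]
    return -(lam * log_probs[i] + (1.0 - lam) * log_probs[j])


# Toy example: modality A embeddings are mixed, modality B embeddings are targets.
modality_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
mixed_a = mix([1.0, 0.0], [0.0, 1.0], lam=0.7)
loss_matched = mixup_contrastive_loss(mixed_a, modality_b, i=0, j=1, lam=0.7)
loss_mismatched = mixup_contrastive_loss(mixed_a, modality_b, i=2, j=1, lam=0.7)
```

With soft targets the mixed sample is pulled toward both of its source counterparts in the other modality, rather than toward a single positive as in a one-to-one contrastive loss.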

翻訳日:2024-11-06 16:00:56 公開日:2024-10-18

# GPUテンソルコア上の大規模言語モデルの効率的な任意精度高速化

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores ( http://arxiv.org/abs/2409.17870v2 )

ライセンス: Link先を確認

Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang,

(参考訳) 大規模言語モデル(LLM)は広く応用されているが、効率的な推論では課題に直面している。量子化法は計算要求を減らすが、任意精度の超低ビット量子化は、GPU Tensor Coreの限られたサポートと非効率的なメモリ管理によって妨げられ、最適以下の高速化にとどまっている。これらの課題に対処するために、任意精度のLLMに対する包括的な高速化手法を提案する。その中心となるのは、並列計算を容易にし、対称量子化をサポートし、データの冗長性を効果的に低減する新しいbipolar-INTデータフォーマットである。これに基づいて、ビットレベルで行列を分解・復元する任意精度の行列乗算方式を実装し、GPU Tensor Coreの利用を最大化しながら柔軟な精度を実現する。さらに、後続の計算に向けてデータレイアウトを最適化する効率的な行列前処理手法を開発した。最後に、高速な共有メモリを戦略的に活用し、カーネル実行速度を大幅に向上させ、メモリアクセスレイテンシを最小化するデータ復元指向のメモリ管理システムを設計する。実験の結果、NVIDIAのCUTLASSと比較して、行列乗算で最大2.4倍の高速化が得られた。LLMに組み込むと、最大6.7倍の推論高速化が達成される。これらの改良によりLLMの推論効率が大幅に向上し、LLMのより広範かつ応答性の高い応用が可能となる。

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate our approach's effectiveness, with up to 2.4× speedup in matrix multiplication compared to NVIDIA's CUTLASS. When integrated into LLMs, we achieve up to 6.7× inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.
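
The phrase "decomposes and recovers matrices at the bit level" can be made concrete with a small sketch. It deliberately uses plain unsigned integers rather than the paper's bipolar-INT format, whose definition is not given in the abstract, and it runs on the CPU in pure Python; the function names are hypothetical. The identity illustrated is that if A = Σ_i 2^i A_i and B = Σ_j 2^j B_j with binary matrices A_i and B_j, then AB = Σ_{i,j} 2^{i+j} A_i B_j, so a product at any bit width reduces to 1-bit matrix products plus shifts.

```python
from typing import List

Matrix = List[List[int]]


def bit_planes(matrix: Matrix, bits: int) -> List[Matrix]:
    """Split a matrix of unsigned integers into `bits` binary matrices (LSB first)."""
    return [[[(v >> k) & 1 for v in row] for row in matrix] for k in range(bits)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference integer matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def bitwise_matmul(a: Matrix, b: Matrix, a_bits: int, b_bits: int) -> Matrix:
    """Multiply via 1-bit partial products, recovering the result with shifts."""
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i, plane_a in enumerate(bit_planes(a, a_bits)):
        for j, plane_b in enumerate(bit_planes(b, b_bits)):
            partial = matmul(plane_a, plane_b)  # product of two binary matrices
            for r in range(rows):
                for c in range(cols):
                    result[r][c] += partial[r][c] << (i + j)
    return result


# A uses 3-bit values and B uses 2-bit values; any pair of widths works.
a = [[3, 5], [6, 1]]
b = [[2, 1], [0, 3]]
product = bitwise_matmul(a, b, a_bits=3, b_bits=2)
```

Because each partial product involves only binary operands, hardware that supports just a few fixed low-bit formats can in principle serve any precision by varying the number of planes.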

翻訳日:2024-11-06 16:00:56 公開日:2024-10-18

# Show and Guide: Instructional-Plan Grounded Vision and Language Model

Show and Guide: Instructional-Plan Grounded Vision and Language Model ( http://arxiv.org/abs/2409.19074v1 )

ライセンス: Link先を確認

Diogo Glória-Silva, David Semedo, João Magalhães,

(参考訳) 複雑な手続き計画を通じてユーザを誘導することは、視覚的に図示された計画手順を持つことが、効果的な計画ガイダンスを提供するために不可欠である、本質的にマルチモーダルなタスクである。しかしながら、計画追従言語モデル(LM)に関する既存の研究は、しばしばマルチモーダルな入力と出力ができない。本研究では,MM-PlanLLMについて述べる。MM-PlanLLMは,テキスト計画と視覚情報の両方を活用することで,ユーザによる指導作業の実行を支援するための,最初のマルチモーダルLLMである。具体的には、ユーザクエリに基づいて関連するステップビデオセグメントを検索するConversational Video Moment Retrievalと、計画の次のステップを生成するVisually-Informed Step Generationである。 MM-PlanLLMは,マルチタスク・マルチステージ・アプローチを用いて訓練され,マルチモーダル・インストラクショナル・プラン・セマンティック・レイヤにモデルを徐々に公開し,マルチモーダル・テキスト・対話をプラン・グラウンドで実現する。さらに,本モデルでは,テキスト・プラン・ステップとインストラクショナル・ビデオ・モーメントの相互時間的および計画的構造的表現を提供する。

Guiding users through complex procedural plans is an inherently multimodal task in which having visually illustrated plan steps is crucial to deliver an effective plan guidance. However, existing works on plan-following language models (LMs) often are not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to multimodal instructional-plans semantic layers, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.

翻訳日:2024-11-06 04:30:57 公開日:2024-10-18

# 2D-TPE:大規模言語モデルのための2次元位置符号化によるテーブル理解

2D-TPE: Two-Dimensional Positional Encoding Enhances Table Understanding for Large Language Models ( http://arxiv.org/abs/2409.19700v1 )

ライセンス: Link先を確認

Jia-Nan Li, Jian Guan, Wei Wu, Zhengtao Yu, Rui Yan,

(参考訳) テーブルは、構造化された情報を簡潔に表現する手段として、様々な領域で広く使われている。大規模言語モデル(LLM)に表データ上の推論能力を持たせることは、活発に研究されている方向性である。しかし、典型的なLLMは1次元(1D)の入力しかサポートしていないため、既存の手法では2次元(2D)のテーブル構造をトークン列に平坦化することが多く、空間的関係が著しく損なわれ、重要な文脈情報が必然的に失われてしまう。本稿ではまず、2つの精巧なプロキシタスクを通じて、このような平坦化操作がテーブルの空間情報を捉えるLLMの性能に悪影響を及ぼすことを実証的に示す。次に、この課題に対処するため、単純かつ効果的な位置符号化手法である「2D-TPE」(Two-Dimensional Table Positional Encoding)を導入する。 2D-TPEにより、各アテンションヘッドは、コンテキスト内のトークンに注意を向ける際の置換順序を動的に選択できる。各置換は、列方向や行方向の走査といった、テーブルの異なる走査モードを表す。 2D-TPEは、計算効率を保ちながら重要な空間情報を失うリスクを効果的に軽減し、テーブル構造をよりよく保存する。 5つのベンチマークによる大規模な実験により、2D-TPEは強力なベースラインを上回ることが示され、正確なテーブル理解のためにテーブル構造を保存することの重要性が強調された。さらに包括的な解析により、大きなテーブルに対する2D-TPEのスケーラビリティがベースラインよりも大幅に優れていることが明らかになった。

Tables are ubiquitous across various domains for concisely representing structured information. Empowering large language models (LLMs) to reason over tabular data represents an actively explored direction. However, since typical LLMs only support one-dimensional (1D) inputs, existing methods often flatten the two-dimensional (2D) table structure into a sequence of tokens, which can severely disrupt the spatial relationships and result in an inevitable loss of vital contextual information. In this paper, we first empirically demonstrate the detrimental impact of such flattening operations on the performance of LLMs in capturing the spatial information of tables through two elaborate proxy tasks. Subsequently, we introduce a simple yet effective positional encoding method, termed "2D-TPE" (Two-Dimensional Table Positional Encoding), to address this challenge. 2D-TPE enables each attention head to dynamically select a permutation order of tokens within the context for attending to them, where each permutation represents a distinct traversal mode for the table, such as column-wise or row-wise traversal. 2D-TPE effectively mitigates the risk of losing essential spatial information while preserving computational efficiency, thus better preserving the table structure. Extensive experiments across five benchmarks demonstrate that 2D-TPE outperforms strong baselines, underscoring the importance of preserving the table structure for accurate table comprehension. Comprehensive analysis further reveals the substantially better scalability of 2D-TPE to large tables than baselines.
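
The two traversal modes named above can be illustrated with a short sketch that assigns each table cell a position under row-wise and column-wise orderings. It assumes one token per cell and leaves out how an attention head would choose between the orderings, which the abstract does not detail; `traversal_positions` is a hypothetical helper, not part of 2D-TPE's code.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row index, column index)


def traversal_positions(
    n_rows: int, n_cols: int
) -> Tuple[Dict[Cell, int], Dict[Cell, int]]:
    """Return the 1D position of every cell under row-wise and column-wise traversal."""
    row_wise: Dict[Cell, int] = {}
    col_wise: Dict[Cell, int] = {}
    for r in range(n_rows):
        for c in range(n_cols):
            row_wise[(r, c)] = r * n_cols + c  # read left to right, then down
            col_wise[(r, c)] = c * n_rows + r  # read top to bottom, then right
    return row_wise, col_wise


# In a 2x3 table, vertically adjacent cells are 3 positions apart when the
# table is flattened row by row, but only 1 apart under column-wise traversal.
row_pos, col_pos = traversal_positions(2, 3)
row_gap = row_pos[(1, 0)] - row_pos[(0, 0)]
col_gap = col_pos[(1, 0)] - col_pos[(0, 0)]
```

A single flattened order keeps only one kind of neighbour close together; offering both orderings lets cells in the same column stay adjacent for some heads while cells in the same row stay adjacent for others.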

翻訳日:2024-11-05 21:29:26 公開日:2024-10-18

# フィッシャー情報に基づく大規模言語モデルを用いた効率的なカリキュラムフェデレーション学習

Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models ( http://arxiv.org/abs/2410.00131v1 )

ライセンス: Link先を確認

Ji Liu, Jiaxiang Ren, Ruoming Jin, Zijie Zhang, Yang Zhou, Patrick Valduriez, Dejing Dou,

(参考訳) 分散データでモデルを協調的にトレーニングするための有望なパラダイムとして、フェデレートラーニング(FL)は大規模言語モデル(LLM)の微調整に活用できる。 LLMはサイズが巨大であるうえ、トレーニングデータの規模も大幅に増加しており、膨大な計算コストと通信コストが発生する。トレーニングデータは一般に非独立同分布(non-IID)であり、各デバイス内で適応的なデータ処理が必要となる。低ランク適応(LoRA)は微調整プロセスで更新するパラメータの規模を大幅に削減できるが、LLMのすべての層の低ランクパラメータを転送するには、依然として許容できないほどの時間を要する。本稿では、適応型フェデレーションカリキュラム学習と効率的なスパースパラメータ更新という2つの新しい手法を備えた、フィッシャー情報に基づく効率的なカリキュラムフェデレーション学習フレームワーク(FibecFed)を提案する。まず、各デバイス内のデータを適応的にサンプリングし、FL微調整プロセスの有効性を向上させる、フィッシャー情報に基づく手法を提案する。第2に、グローバル集約に適した層と、LoRAによる局所更新のためのスパースパラメータを動的に選択し、FL微調整プロセスの効率化を図る。 10のデータセットに基づく大規模な実験結果によると、FibecFedは17のベースライン手法と比較して、優れた性能(精度で最大45.35%)と微調整速度(最大98.61%高速)を達成している。

As a promising paradigm to collaboratively train models with decentralized data, Federated Learning (FL) can be exploited to fine-tune Large Language Models (LLMs). While LLMs correspond to huge size, the scale of the training data significantly increases, which leads to tremendous amounts of computation and communication costs. The training data is generally non-Independent and Identically Distributed (non-IID), which requires adaptive data processing within each device. Although Low Rank Adaptation (LoRA) can significantly reduce the scale of parameters to update in the fine-tuning process, it still takes unaffordable time to transfer the low-rank parameters of all the layers in LLMs. In this paper, we propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods, i.e., adaptive federated curriculum learning and efficient sparse parameter update. First, we propose a Fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process. Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA so as to improve the efficiency of the FL fine-tuning process. Extensive experimental results based on 10 datasets demonstrate that FibecFed yields excellent performance (up to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61% faster) compared with 17 baseline approaches.

翻訳日:2024-11-05 14:40:28 公開日:2024-10-18

# 反射木探索と自己学習による自律型AIエージェントの改善

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning ( http://arxiv.org/abs/2410.02052v1 )

ライセンス: Link先を確認

Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu,

(参考訳) 自律エージェントは、複雑な多段階意思決定タスクを自動化する大きな可能性を証明している。しかし、GPT-4oのような最先端のビジョン言語モデル(VLM)でさえ、特に複雑なWeb環境や長期計画タスクにおいて、人間レベルの性能に欠ける。これらの制限に対処するために、GPT-4oを動力とするAIエージェントの能力を高めるために設計された新しいテストタイムアルゴリズムであるReflective Monte Carlo Tree Search (R-MCTS)を導入する。 R-MCTSは従来のMCTSを拡張します 1) 比較反射を取り入れることで、エージェントは過去の相互作用から学び、探索効率を動的に改善することができる。 2) 信頼性のある状態評価を行うためにマルチエージェントの議論を用いる。さらに, R-MCTS 生成木トラバーサルを用いた自己学習により GPT-4o を微調整することにより, エージェントの性能を向上させる。挑戦的な VisualWebArena ベンチマークでは,GPT-4o ベースの R-MCTS エージェントが,従来の最先端技術と比較して,さまざまなタスクに対して 6% から 30% の相対的な改善を実現している。さらに,テストタイム検索から得られる知識を,微調整によりGPT-4oに効果的に戻すことができることを示す。微調整の GPT-4o は R-MCTS の性能の 97% と一致し、テスト時に 4 倍の計算量を削減した。さらに, 微調整GPT-4oモデルでは, 現状が成功に繋がらないことを検知した場合に, 環境探索, 状態評価, 実行可能な状態へのバックトラックを行うことができることを示した。さらに,本研究は,R-MCTSを用いたデータ収集とテスト時間の両方のトレーニングにおける計算スケーリング特性を実証する。これらの結果は,試験時間探索と自己学習によるエージェントアプリケーションに対するVLMの推論と計画能力を高めるための有望な研究方向を示唆している。

Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.

翻訳日:2024-11-04 09:15:24 公開日:2024-10-18

# 反射型MCTSと探索学習を用いたAIエージェントの探索

Teaching AI Agents to Search with Reflective-MCTS and Exploratory Learning ( http://arxiv.org/abs/2410.02052v2 )

ライセンス: Link先を確認

Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu,

(参考訳) 自律エージェントは、複雑な多段階意思決定タスクを自動化する大きな可能性を証明している。しかし、GPT-4oのような最先端のビジョン言語モデル(VLM)でさえ、特に複雑なWeb環境や長期計画タスクにおいて、人間レベルの性能に欠ける。これらの制約に対処するため,リフレクティブモンテカルロ木探索 (R-MCTS) と探索学習 (Exploratory Learning) を提案し,エージェントアプリケーションのためのo1ライクなモデルを構築する。 R-MCTSはAIエージェントがその場で意思決定空間を探索する能力を高めるために設計された新しいテストタイムアルゴリズムである。 R-MCTSは従来のMCTSを拡張します 1) 比較反射を取り入れることで、エージェントは過去の相互作用から学び、探索効率を動的に改善することができる。 2) 信頼性のある状態評価を行うためにマルチエージェントの議論を用いる。次に,探索学習(Exploratory Learning)という,外部探索アルゴリズムに頼らずに,エージェントに推論時間での探索を教える新しい学習戦略を紹介する。挑戦的な VisualWebArena ベンチマークでは,GPT-4o ベースの R-MCTS エージェントが,従来の最先端技術と比較して,さまざまなタスクに対して 6% から 30% の相対的な改善を実現している。さらに,テストタイム検索から得られる経験を,微調整によりGPT-4oに効果的に戻すことができることを示す。 GPT-4oの探索学習 1)現在の状態が成功に繋がらないことを検出すると、環境を探索し、状態を評価し、実行可能なものにバックトラックする能力を示す。 2) R-MCTSの性能は87%に相当し, 計算能力は大幅に低下した。特に、我々の研究は、R-MCTSによるデータ収集とテスト時間の両方のトレーニングにおける計算スケーリング特性を実証しています。これらの結果は,試験時間探索と自己学習によるエージェントアプリケーションに対するVLMの推論と計画能力を高めるための有望な研究方向を示唆している。

Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we present Reflective Monte Carlo Tree Search (R-MCTS) and Exploratory Learning to build o1-like models for agentic applications. We first introduce R-MCTS, a novel test-time algorithm designed to enhance the ability of AI agents to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the experience gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. After Exploratory Learning, GPT-4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R-MCTS's performance while using significantly less compute. Notably, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.

翻訳日:2024-11-04 09:15:24 公開日:2024-10-18

# ExACT:AIエージェントにリフレクティブMCTSと探索学習を指導する

ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning ( http://arxiv.org/abs/2410.02052v3 )

ライセンス: Link先を確認

Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu,

(参考訳) 自律エージェントは、複雑な多段階意思決定タスクを自動化する大きな可能性を証明している。しかし、GPT-4oのような最先端のビジョン言語モデル(VLM)でさえ、特に複雑なWeb環境や長期のタスクにおいて、人間レベルのパフォーマンスに欠ける。これらの制約に対処するため,エージェントアプリケーション用のo1ライクなモデルを構築するために,テスト時検索と自己学習を組み合わせたExACTを提案する。リフレクティブモンテカルロ木探索(Reflective Monte Carlo Tree Search, R-MCTS)は、AIエージェントがその場で意思決定空間を探索する能力を高めるために設計された新しいテストタイムアルゴリズムである。 R-MCTSは従来のMCTSを拡張します 1) 比較反射を取り入れることで、エージェントは過去の相互作用から学び、探索効率を動的に改善することができる。 2) 信頼性のある状態評価にマルチエージェントの議論を用いる。次に,探索学習(Exploratory Learning)という,外部探索アルゴリズムに頼らずに,エージェントに推論時間での探索を教える新しい学習戦略を紹介する。挑戦的なVisualWebArenaベンチマークでは、GPT-4oベースのR-MCTSエージェントが、以前の最先端と比較して、さまざまなタスクに対して6%から30%の相対的な改善を実現しています。さらに,テストタイム検索から得られる知識と経験を,微調整により効率的に GPT-4o に戻すことができることを示す。 GPT-4oの探索学習 1)現在の状態が成功に繋がらないことを検出すると、環境を探索し、状態を評価し、実行可能なものにバックトラックする能力を示す。 2) R-MCTSの性能は87%に相当し, 計算能力は大幅に低下した。特に、我々の研究は、R-MCTSによるデータ収集とテスト時間の両方のトレーニングにおける計算スケーリング特性を実証しています。これらの結果は,試験時間探索と自己学習を通じて,エージェントアプリケーションに対するVLMの能力を高めるための有望な研究方向を示唆している。

Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon tasks. To address these limitations, we present ExACT, an approach to combine test-time search and self-learning to build o1-like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test time algorithm designed to enhance AI agents' ability to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate for reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms. On the challenging VisualWebArena benchmark, our GPT-4o based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge and experience gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. After Exploratory Learning, GPT-4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R-MCTS's performance while using significantly less compute. Notably, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' capabilities for agentic applications via test-time search and self-learning.

翻訳日:2024-11-04 09:15:24 公開日:2024-10-18

# DomainLynx: 拡張されたドメインスクワット検出のための大規模言語モデルを活用する

DomainLynx: Leveraging Large Language Models for Enhanced Domain Squatting Detection ( http://arxiv.org/abs/2410.02095v1 )

ライセンス: Link先を確認