Fugu-MT Paper Translation (Abstract): Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

Paper overview: Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

arxiv url: http://arxiv.org/abs/2604.10695v2
Date: Tue, 14 Apr 2026 03:39:04 GMT
Status: Translation complete
Last updated in system: 2026-04-15 14:01:13.408014
Title: Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification
Title (reference translation): Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-Consistent Purification
Authors: Jiayu Zhang, Shuo Ye, Qilang Ye, Zihan Song, Jiajian Huang, Zitong Yu
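Abstract summary: R$^{2}$ScP is a novel framework that shifts the paradigm of missing-modality handling from traditional generative imputation to retrieval-based recovery. Specifically, it leverages cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge.
Reference score (independently computed attention score): 30.75902088237621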
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities, suffering from severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities predominantly rely on generative imputation to synthesize missing features. While partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire unique, modality-specific knowledge within the missing data, leading to hallucinations and compromised reasoning accuracy. To tackle these challenges, we propose R$^{2}$ScP, a novel framework that shifts the paradigm of missing modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that eliminates latent semantic noise within the retrieved data. Additionally, we employ a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments demonstrate that R$^{2}$ScP significantly improves AVQA and enhances robustness in modal-incomplete scenarios.
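Abstract (reference translation): In recent years, Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities and suffer severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities rely mainly on generative imputation to synthesize the missing features. Although partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire the unique, modality-specific knowledge within the missing data, which leads to hallucinations and compromised reasoning accuracy. To address these challenges, we propose R$^{2}$ScP, a novel framework that shifts the paradigm of missing-modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that removes latent semantic noise from the retrieved data. In addition, we adopt a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments show that R$^{2}$ScP significantly improves AVQA and enhances robustness in modal-incomplete scenarios.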
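The abstract describes retrieval-based recovery followed by context-aware purification only at a high level. The following is a minimal sketch of that general idea, not the R$^{2}$ScP implementation: it assumes a small bank of entries that pair a shared-space key with a stored feature of the missing modality, uses cosine similarity for retrieval, and uses a simple similarity threshold plus softmax weighting as a stand-in for purification. All names (`retrieve`, `purify`, `bank`, `min_sim`) and all values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, bank, k=2):
    """Return the k bank entries whose shared-space key is most similar to the query."""
    ranked = sorted(bank, key=lambda e: cosine(query, e["key"]), reverse=True)
    return ranked[:k]


def purify(candidates, context, min_sim=0.0):
    """Assumed stand-in for purification: drop candidates that disagree with the
    context, then return the softmax-weighted average of the remaining features."""
    scored = [(cosine(context, c["key"]), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s > min_sim]
    if not kept:
        return None  # nothing trustworthy was retrieved
    weights = [math.exp(s) for s, _ in kept]
    total = sum(weights)
    dim = len(kept[0][1]["feature"])
    return [
        sum(w * c["feature"][i] for w, (_, c) in zip(weights, kept)) / total
        for i in range(dim)
    ]


# Hypothetical toy bank: each entry pairs a shared-space key with a stored
# feature of the modality that may be missing at inference time.
bank = [
    {"key": [1.0, 0.0], "feature": [1.0, 1.0]},
    {"key": [0.9, 0.1], "feature": [3.0, 1.0]},
    {"key": [-1.0, 0.0], "feature": [9.0, 9.0]},
]
visual_query = [1.0, 0.0]      # embedding of the modality that is still available
question_context = [1.0, 0.0]  # embedding of the question, used as context

recovered = purify(retrieve(visual_query, bank, k=2), question_context)
```

In this sketch the recovered feature is always a convex combination of stored features, so it cannot contain values that were never observed; that is one plausible reading of why retrieval might hallucinate less than generation, and it is an interpretation rather than a claim from the paper.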

Related papers

Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration [40.720288165545476]
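This work introduces an extended diffusion model as a pluggable mid-stage training module to effectively restore missing features. It comprises (I) dynamic modality gating, which adaptively leverages conditional features to steer the generation of semantically consistent features, and (II) a cross-modal mutual learning mechanism that bridges the semantic spaces of the dual encoders to achieve bi-directional alignment.
Paper reference translation (metadata) (2026-02-03T06:06:35Z)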
Buffer replay enhances the robustness of multimodal learning under missing-modality [9.512378886218395]
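This paper introduces Replay Prompting (REP), which replays information in deeper layers to mitigate the information loss that grows with network depth. In experiments on vision-language and temporal multimodal benchmarks, REP consistently outperforms prior methods in both single-modality and multi-modality missing scenarios. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
Paper reference translation (metadata) (2025-11-28T10:55:31Z)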
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
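ReAG (Reasoning-Augmented Multimodal RAG) is an approach that combines coarse-grained and fine-grained retrieval with a critic model that filters out irrelevant passages. ReAG outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in the retrieved evidence.
Paper reference translation (metadata) (2025-11-27T19:01:02Z)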
Synergistic Prompting for Robust Visual Recognition with Missing Modalities [13.821274074204082]
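Large-scale multimodal models have shown remarkable performance on a variety of visual recognition tasks. The presence of missing or incomplete modality inputs, however, often causes significant performance degradation. We propose a novel Synergistic Prompting framework for robust visual recognition with missing modalities.
Paper reference translation (metadata) (2025-07-10T14:28:12Z)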
ADMC: Attention-based Diffusion Model for Missing Modalities Feature Completion [25.1725138364452]
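We propose an attention-based diffusion model for missing-modality feature completion (ADMC). The framework trains a feature extraction network for each modality independently, preserving each modality's characteristics and avoiding over-coupling. The proposed method achieves state-of-the-art results on the IEMOCAP and MIntRec benchmarks, demonstrating its effectiveness in both missing-modality and complete-modality scenarios.
Paper reference translation (metadata) (2025-07-08T03:08:52Z)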
How Far Are We from Generating Missing Modalities with Foundation Models? [49.425856207329524]
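We propose an agentic framework tailored to reconstructing missing modalities. The method reduces FID for image reconstruction by at least 14% and MER by at least 10%.
Paper reference translation (metadata) (2025-06-04T03:22:44Z)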
DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery [71.6345505427213]
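DPMesh is an innovative framework for occluded human mesh recovery. It exploits the diffusion prior on object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model.
Paper reference translation (metadata) (2024-04-01T18:59:13Z)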
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
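Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames. Applying dropout to the video modality improves robustness to missing frames but at the same time incurs a performance loss on complete data inputs. This paper proposes the Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework.
Paper reference translation (metadata) (2024-03-07T06:06:55Z)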
Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
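We propose using invariant features for a missing modality imagination network (IF-MMIN). The proposed model outperforms all baselines and invariantly improves overall emotion recognition performance under uncertain missing-modality conditions.
Paper reference translation (metadata) (2022-10-27T12:16:25Z)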
Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
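We consider the problem of multimodal data analysis with an application to audiovisual emotion recognition. We propose an architecture that can learn from raw data and describe three variants of it with different modality fusion mechanisms.
Paper reference translation (metadata) (2022-01-26T18:04:29Z)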

The related papers list is generated automatically from the titles and abstracts of papers on this site.

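Information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from use of this site (including all information and translations).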