Fugu-MT 論文翻訳(概要): D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

論文の概要: D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

arxiv url: http://arxiv.org/abs/2305.12767v2
Date: Mon, 16 Oct 2023 10:04:37 GMT
ステータス: 翻訳完了
システム内更新日: 2023-10-17 23:17:41.326749
Title: D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
Title（参考訳）: D$^2$TV:多対多マルチモーダル要約のための二重知識蒸留とターゲット指向ビジョンモデリング
Authors: Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, Jie Zhou
Abstract要約: many-to-many multimodal summarization (M$3$S) タスクは、どんな言語でも文書入力と対応する画像シーケンスで要約を生成することを目的としている。本稿では,M$3$Sタスクのための二重知識蒸留と目標指向視覚モデリングフレームワークを提案する。
参考スコア（独自算出の注目度）: 113.72253589338472
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many-to-many multimodal summarization (M$^3$S) task aims to generate summaries in any language with document inputs in any language and the corresponding image sequence, which essentially comprises multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS and has obtained increasing attention in recent years, little research pays attention to the M$^3$S task. Besides, existing studies mainly focus on 1) utilizing MMS to enhance MXLS via knowledge distillation without considering the performance of MMS or 2) improving MMS models by filtering summary-unrelated visual features with implicit learning or explicitly complex training objectives. In this paper, we first introduce a general and practical task, i.e., M$^3$S. Further, we propose a dual knowledge distillation and target-oriented vision modeling framework for the M$^3$S task. Specifically, the dual knowledge distillation method guarantees that the knowledge of MMS and MXLS can be transferred to each other and thus mutually prompt both of them. To offer target-oriented visual features, a simple yet effective target-oriented contrastive objective is designed and responsible for discarding needless visual information. Extensive experiments on the many-to-many setting show the effectiveness of the proposed approach. Additionally, we will contribute a many-to-many multimodal summarization (M$^3$Sum) dataset.
Abstract（参考訳）: many-to-many multimodal summarization (M$^3$S) タスクは、任意の言語における文書入力と、MMS(Multimodal monolingual summarization)タスクとMXLS(Multimodal cross-lingual summarization)タスクからなる対応する画像シーケンスを持つ任意の言語における要約を生成することを目的としている。 MMS や MXLS に多くの研究が注がれており、近年注目されているが、M$3$S の課題にはほとんど注目されていない。それに既存の研究は主に 1)MMSを利用した知識蒸留によるMXLSの高度化,又はMMSの性能を考慮せずに 2) 要約非関連視覚特徴を暗黙的な学習, 明示的な複雑な訓練目的でフィルタリングすることにより, MMSモデルを改善する。本稿では,まず,m$^3$sという汎用的かつ実用的な課題について述べる。さらに, m$^3$sタスクのための二重知識蒸留と目標指向視覚モデリングフレームワークを提案する。具体的には、二重知識蒸留法は、MMSとMXLSの知識を相互に伝達できることを保証し、両者を相互に促進する。目標指向の視覚機能を提供するため、単純で効果的な目標指向の対比目的が設計され、不要な視覚情報を破棄する責任がある。多対多設定に関する広範囲な実験により,提案手法の有効性が示された。さらに、多対多のマルチモーダル要約(m$^3$sum)データセットも提供します。

関連論文リスト

True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
MLLM(Multimodal Large Language Models)は、新しいタスクに適応したMICL(Multimodal In-Context Learning)を実現する。現在のMLLMは、視覚的手がかりを無視し、テキストパターンを過度に無視する傾向にあり、真のマルチモーダル適応よりも単なるテキスト模倣に繋がる。視覚的コンテキストへのモデルへの参加を促す,効率的な微調整戦略であるDynamic Attention Reallocation (DARA)を紹介した。
論文参考訳（メタデータ） (2025-07-21T17:08:18Z)
Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales [7.119479942471737]
既存の方法は、画像とテキストの両方からアスペクトや感情に関連する情報を集めるために、事前訓練された小さな言語モデル(SLM)に依存している。我々は,SLMの意思決定能力とMABSAのためのLLMが提供する付加情報を組み合わせた新しいフレームワークLRSAを提案する。
論文参考訳（メタデータ） (2025-05-20T15:28:26Z)
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs [28.20725794099928]
下流の多様なタスクに対する差別表現を学習する新しいフレームワークであるUniMEを紹介する。最初の段階では、強力なLLMベースの教師モデルからテキスト識別的知識蒸留を行う。第2段階では、識別表現学習をさらに進めるために、強陰性強化命令チューニングを導入する。
論文参考訳（メタデータ） (2025-04-24T10:51:52Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
マルチモーダルなアイテム・ツー・イテムレコメンデーションにおけるマルチモーダル表現を強化するための大規模言語モデルの可能性について検討する。 1つの実現可能な方法は、表現タスクのためにMLLM(Multimodal Large Language Models)を転送することである。マルチモーダル表現に特化して設計された新しいトレーニングフレームワークNoteLLM-2を提案する。
論文参考訳（メタデータ） (2024-05-27T03:24:01Z)
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI [71.53579367538725]
MMT-Benchは、大規模なマルチモーダルタスクにわたるLVLM(Large Vision-Language Models)を評価するために設計されたベンチマークである。 MMT-Benchは、様々なマルチモーダルシナリオから、巧妙にキュレートされたマルチチョイスの視覚的質問を31,325ドルで提供する。
論文参考訳（メタデータ） (2024-04-24T17:37:05Z)
M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment [0.0]
本稿では,認知負荷評価のためのAVCAffeデータセットに適用した,新しいマルチモーダルマルチタスク学習フレームワークであるM&Mモデルを提案する。 M&Mは、オーディオとビデオの入力のための特別なストリームを特徴とする、デュアル・パスウェイ・アーキテクチャを通じてオーディオヴィジュアル・キューを独自に統合する。重要な革新は多面的マルチヘッドアテンション機構であり、同期マルチタスクの異なるモダリティを融合させる。
論文参考訳（メタデータ） (2024-03-14T14:49:40Z)
Generative Multimodal Models are In-Context Learners [60.50927925426832]
我々は37億のパラメータを持つ生成的マルチモーダルモデルであるEmu2を紹介し、大規模マルチモーダルシーケンスで訓練する。 Emu2は、マルチモーダルなインコンテキスト学習能力を示し、オンザフライ推論を必要とするタスクを解決しようとさえしている。
論文参考訳（メタデータ） (2023-12-20T18:59:58Z)
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
モデル重み、チューニングタスク、視覚埋め込みを併用した多目的多モード大言語モデル(MLLM)を提案する。提案したジョイントミキシングに基づいて,高解像度画像のきめ細かい外観をより正確に捉えるための効率的な手法を提案する。今後のMLLM研究におけるジョイントミキシングの探求に光を当てることを願っている。
論文参考訳（メタデータ） (2023-11-13T18:59:47Z)
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [42.68425777473114]
大規模言語モデル(LLM)によって強化された視覚言語モデル(VLM)は、急速に人気が高まっている。マルチモーダル・インコンテキスト・ラーニング(MMICL)を用いた視覚言語モデルを導入し,VLMがマルチモーダル入力を効率的に処理できるようにする。実験により,MMICLは多種多様な視覚言語タスクにおいて,最先端のゼロショット性能を実現することを確認した。
論文参考訳（メタデータ） (2023-09-14T17:59:17Z)
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
視覚と言語(V-L)タスクは、視覚内容と自然言語の両方を理解する必要がある。視覚と言語に対する注意空間を分離したDiMBERT(Disentangled Multimodal-Attention BERT)を提案する。 DiMBERTは3つのタスクに対して最新のパフォーマンスを新たに設定する。
論文参考訳（メタデータ） (2022-10-28T23:00:40Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。