Fugu-MT Paper Translation (Abstract): Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Paper overview: Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

arxiv url: http://arxiv.org/abs/2605.05831v2
Date: Mon, 11 May 2026 06:58:11 GMT
Status: Translation complete
Last updated in system: 2026-05-12 19:24:01.328315
Title: Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media
Authors: Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar
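Abstract summary: The communication of scientific knowledge is becoming increasingly multimodal. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond. This work introduces the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works, and evaluates embedding-based and vision-language models on their ability to discover fine-grained cross-format correspondences.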
Reference score (independently computed attention score): 40.40617019402065
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD


Related papers