Fugu-MT Paper Translation (Abstract): Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos

Paper overview: Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos

arxiv url: http://arxiv.org/abs/2509.20961v1
Date: Thu, 25 Sep 2025 09:54:19 GMT
Status: Translation complete
Last updated in system: 2025-09-26 20:58:12.828434
Title: Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos
Title (reference translation): Unlocking Financial Insights: An Advanced Multimodal Summarization Framework with Multimodal Output for Financial Advisory Videos
Authors: Sarmistha Das, R E Zera Marveen Lyngkhoi, Sriparna Saha, Alka Maurya,
Abstract summary: FASTER (Financial Advisory Summariser with Textual Embedded Relevant images) is a framework that generates optimized, concise summaries. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency.
Reference score (independently computed attention score): 11.550322270589952
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address the scarcity of data resources, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER
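The abstract mentions a modified DPO-based loss but does not spell out the modification. As a point of reference only, the following is a minimal sketch of the standard (unmodified) DPO loss for a single preference pair. The function name, argument names, and the default `beta` are illustrative assumptions, and the BOS-specific fact-checking term from the paper is not represented here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) summary pair.

    Inputs are sequence log-probabilities under the trainable policy and
    a frozen reference model. This is NOT the paper's modified loss; it
    is the baseline objective that such a modification would start from.
    """
    # How much more the policy favors each summary than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference agree on both summaries the loss equals log 2, and it falls as the policy shifts probability toward the preferred summary relative to the reference.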
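Abstract (reference translation): The dynamic spread of social media has broadened the reach of financial advisory content through podcast videos, but extracting insights from long multimodal segments (30-40 minutes) remains difficult. FASTER (Financial Advisory Summariser with Textual Embedded Relevant images) is a modular framework that addresses three challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with the associated textual points. FASTER uses BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism aligns keyframes with the summarized content, improving interpretability and cross-modal coherence. To address the scarcity of data resources, we introduce Fin-APT, a dataset of 470 publicly accessible financial advisory pep-talk videos, to support robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability compared with Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable and opens new avenues for research. The dataset and code are available at https://github.com/sarmistha-D/FASTER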
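The abstract does not describe how the ranker-based retrieval mechanism scores keyframes. One plausible reading, sketched below under that assumption, is to embed each summary point and each keyframe description and rank keyframes by cosine similarity. The function names, frame identifiers, and toy two-dimensional vectors are all hypothetical; a real system would use learned embeddings and possibly a trained ranker.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_keyframes(point_vec, frame_vecs, top_k=1):
    """Return the ids of the top_k keyframes most similar to one summary point.

    frame_vecs maps a keyframe id to its embedding (for example, an
    embedding of a caption or of OCR text extracted from that frame).
    """
    scored = sorted(
        ((cosine(point_vec, vec), frame_id) for frame_id, vec in frame_vecs.items()),
        reverse=True,
    )
    return [frame_id for _, frame_id in scored[:top_k]]


if __name__ == "__main__":
    # Toy embeddings, for illustration only.
    frames = {
        "frame_010": [1.0, 0.0],
        "frame_250": [0.0, 1.0],
        "frame_400": [0.7, 0.7],
    }
    print(rank_keyframes([0.9, 0.1], frames, top_k=2))
```

Running the ranking once per summary point yields a point-to-keyframe alignment, which is the kind of cross-modal pairing the abstract attributes to the retrieval step.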

Related papers