Fugu-MT Paper Translation (Abstract): SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models

Paper summary: SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models

arxiv url: http://arxiv.org/abs/2601.01062v1
Date: Sat, 03 Jan 2026 04:11:58 GMT
Status: Translation complete
Last updated in system: 2026-01-06 16:25:21.985328
Title: SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models
Authors: Yunlin Zeng
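Abstract summary: The paper presents a novel pipeline for end-to-end visual podcast generation. A Qwen3-VL-32B model is fine-tuned on a curated dataset of 4,000 image-dialogue pairs. Experiments show that the fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness.
Reference score (attention measure computed independently by the site): 0.0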
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives -- specifically multi-speaker podcast dialogues -- remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model's ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).
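The abstract names two style metrics, average turn length and speaker switch rate, but does not define them. The following is a minimal sketch of one plausible reading, not the paper's implementation. It assumes a dialogue is a list of (speaker, text) pairs, that turn length is counted in whitespace-separated words, and that the switch rate is the fraction of adjacent turn pairs with different speakers. The function names and the sample dialogue are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: one (speaker label, utterance text) pair per turn.
Turn = Tuple[str, str]


def average_turn_length(turns: List[Turn]) -> float:
    """Mean number of whitespace-separated words per turn (assumed definition)."""
    if not turns:
        return 0.0
    return sum(len(text.split()) for _, text in turns) / len(turns)


def speaker_switch_rate(turns: List[Turn]) -> float:
    """Fraction of adjacent turn pairs whose speakers differ (assumed definition)."""
    if len(turns) < 2:
        return 0.0
    switches = sum(1 for (a, _), (b, _) in zip(turns, turns[1:]) if a != b)
    return switches / (len(turns) - 1)


# Hypothetical four-turn dialogue, for illustration only.
dialogue: List[Turn] = [
    ("A", "So this photo looks like a beach at sunset."),
    ("B", "Right, and check out the kids."),
    ("B", "They are building something."),
    ("A", "A sandcastle, I think."),
]

if __name__ == "__main__":
    print(average_turn_length(dialogue))  # 5.75
    print(speaker_switch_rate(dialogue))  # 2 switches out of 3 adjacent pairs
```

Under these assumptions, a reported "+50% turn length" would correspond to the ratio of average_turn_length between the fine-tuned and base model outputs being about 1.5. The paper itself may count tokens or characters instead of words.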
