Fugu-MT Paper Translation (Abstract): Fine-grained Audible Video Description

Paper overview: Fine-grained Audible Video Description

arxiv url: http://arxiv.org/abs/2303.15616v1
Date: Mon, 27 Mar 2023 22:03:48 GMT
Status: Translation complete
Last updated in system: 2023-03-29 17:07:54.294743
Title: Fine-grained Audible Video Description
Title (reference translation): Fine-grained Audible Video Description
Authors: Xuyang Shen and Dong Li and Jinxing Zhou and Zhen Qin and Bowen He and Xiaodong Han and Aixuan Li and Yuchao Dai and Lingpeng Kong and Meng Wang and Yu Qiao and Yiran Zhong
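Abstract summary: The authors construct FAVDBench, a fine-grained audible video description benchmark. For each video clip, they first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. They demonstrate that using fine-grained video descriptions can create more intricate videos than using captions.
Reference score (attention score computed independently by this site): 61.81122862375985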
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for the given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. On the other hand, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the efficiency of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions.
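Abstract (reference translation): We study a new task in audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions of given audible videos, covering the appearance and spatial location of each object, the actions of moving objects, and the sounds in the video. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. FAVD, by contrast, requires not only audio-visual-language modeling skills but also paragraph-level language generation. To facilitate this research, we construct FAVDBench, the first fine-grained audible video description benchmark. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences of visual detail and 1-2 audio-related descriptions. The descriptions are written in both English and Chinese. We create two new metrics for this task: EntityScore, which evaluates the completeness of entities in the visual descriptions, and AudioScore, which evaluates the audio descriptions. As a preliminary approach, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We optimize the model by combining masked language modeling and auto-regressive language modeling losses so that it generates paragraph-level descriptions. We show the model's effectiveness in audio-visual-language modeling by evaluating it on the proposed benchmark with both conventional captioning metrics and the proposed metrics. We further apply the benchmark to video generation models and demonstrate that fine-grained video descriptions can produce more intricate videos than captions.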
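
The annotation layout described in the abstract (a one-sentence caption, then 4-6 visual sentences, then 1-2 audio-related descriptions) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class name `FavdAnnotation`, its field names, and the example sentences are hypothetical and are not taken from the FAVDBench release, whose actual file format is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FavdAnnotation:
    """Hypothetical container for one annotated clip (names are assumptions)."""

    caption: str                 # one-sentence summary of the video
    visual_sentences: List[str]  # 4-6 sentences describing visual details
    audio_sentences: List[str]   # 1-2 audio-related descriptions

    def problems(self) -> List[str]:
        """List the ways this record deviates from the layout in the abstract."""
        issues = []
        if not self.caption.strip():
            issues.append("caption is empty")
        if not 4 <= len(self.visual_sentences) <= 6:
            issues.append("expected 4-6 visual sentences")
        if not 1 <= len(self.audio_sentences) <= 2:
            issues.append("expected 1-2 audio sentences")
        return issues

    def paragraph(self) -> str:
        """Join caption, visual details, then audio descriptions, in that order."""
        parts = [self.caption, *self.visual_sentences, *self.audio_sentences]
        return " ".join(parts)


# Invented example text, used only to exercise the structure.
example = FavdAnnotation(
    caption="A man plays guitar on a stage.",
    visual_sentences=[
        "The man wears a black shirt.",
        "He stands at the center of the stage.",
        "A microphone stand is in front of him.",
        "Red lights glow behind him.",
    ],
    audio_sentences=["The guitar sounds bright and rhythmic."],
)

# A record with too few visual sentences fails the layout check.
too_short = FavdAnnotation(
    caption="A dog runs.",
    visual_sentences=["The dog is brown.", "It runs on grass."],
    audio_sentences=["The dog barks."],
)
```

Calling `example.problems()` returns an empty list because the record matches the described layout, while `too_short.problems()` reports the visual-sentence count. Since the abstract says descriptions are provided in both English and Chinese, a real loader would presumably hold one such record per language, but that detail is likewise an assumption.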

Related papers