Fugu-MT Paper Translation (Abstract): Self-Contained Entity Discovery from Captioned Videos

Paper overview: Self-Contained Entity Discovery from Captioned Videos

arxiv url: http://arxiv.org/abs/2208.06662v1
Date: Sat, 13 Aug 2022 14:39:01 GMT
Status: Translation complete
Last updated in system: 2022-08-16 13:44:18.509406
Title: Self-Contained Entity Discovery from Captioned Videos
Authors: Melika Ayoughi, Pascal Mettes, Paul Groth
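Abstract summary: This paper introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. The SC-Friends and SC-BBT benchmarks are based on the Friends and Big Bang Theory TV series.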
Reference score (independently computed attention score): 15.641523986669457
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: This paper introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g. faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating faces with entity labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame-caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks SC-Friends and SC-BBT based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.
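The first stage named in the abstract, building a bipartite entity-name graph from frame-caption pairs, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that visual entities have already been grouped into hypothetical cluster IDs (`face_a`, `face_b`) and that names have already been extracted from the captions, and it swaps in a plain co-occurrence count with a most-frequent-name assignment in place of the paper's agreement and prototype-refinement stages.

```python
from collections import Counter, defaultdict


def build_bipartite_counts(pairs):
    """Count entity-name co-occurrences over frame-caption pairs.

    `pairs` is an iterable of (entity_ids, names) tuples, one per
    frame-caption pair. Both fields are hypothetical pre-processed inputs.
    """
    edges = defaultdict(Counter)
    for entity_ids, names in pairs:
        # Use sets so a repeated ID or name within one pair counts once.
        for entity in set(entity_ids):
            for name in set(names):
                edges[entity][name] += 1
    return edges


def assign_names(edges):
    """Assign each entity the name it co-occurs with most often (toy rule)."""
    return {
        entity: counts.most_common(1)[0][0]
        for entity, counts in edges.items()
    }


# Toy data: cluster IDs and names are illustrative assumptions.
pairs = [
    (["face_a"], ["Ross"]),
    (["face_a", "face_b"], ["Ross", "Rachel"]),
    (["face_b"], ["Rachel"]),
]

edges = build_bipartite_counts(pairs)
assignment = assign_names(edges)
print(assignment)
```

In this toy data each ambiguous pair (two faces, two names) is resolved by the unambiguous pairs around it, which is the intuition behind using only the information contained in the video itself.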
