ABSTRACT The majority of traditional text-to-video retrieval systems operate in static
environments, i.e., there is no interaction between the user and the agent
beyond the initial textual query provided by the user. This can be suboptimal
if the initial query has ambiguities, which can lead to many falsely
retrieved videos. To overcome this limitation, we propose a novel framework for
Video Retrieval using Dialog (ViReD), which enables the user to interact with
an AI agent via multiple rounds of dialog. The key contribution of our
framework is a novel multimodal question generator that learns to ask questions
that maximize the subsequent video retrieval performance. Our multimodal
question generator uses (i) the video candidates retrieved during the last
round of interaction with the user and (ii) the text-based dialog history
documenting all previous interactions, to generate questions that incorporate
both visual and linguistic cues relevant to video retrieval. Furthermore, to
generate maximally informative questions, we propose an Information-Guided
Supervision (IGS), which guides the question generator to ask questions that
would boost subsequent video retrieval accuracy. We validate the effectiveness
of our interactive ViReD framework on the AVSD dataset, showing that our
interactive method performs significantly better than traditional
non-interactive video retrieval systems. Furthermore, we demonstrate that
our proposed approach generalizes to real-world settings that involve
interactions with real humans, thus showing the robustness and
generality of our framework.
Avinash Madasu, Department of Computer Science, UNC Chapel Hill, USA (avinashm@cs.unc.edu)
Junier Oliva, Department of Computer Science, UNC Chapel Hill, USA (joliva@cs.unc.edu)
Gedas Bertasius, Department of Computer Science, UNC Chapel Hill, USA (gedas@cs.unc.edu)
arXiv:2205.05739v2 [cs.CV] 13 May 2022
KEYWORDS interactive video retrieval, dialog generation, multi-modal learning
1 INTRODUCTION The typical (static) video retrieval framework fetches a limited list of candidate videos from a large collection of videos according to a user query (e.g., ‘cooking videos’).
However, the specificity of this query will likely be limited, and the uncertainty among candidate videos based on the user query is typically opaque (i.e., the user might not know what additional information will yield better results).
For example, consider the scenario where you are deciding what dish to make for dinner on a Friday night.
Now also suppose that you have access to an interactive AI agent who can help you with this task by retrieving the videos of relevant dishes and detailed instructions on how to make those dishes.
A user-friendly video retrieval framework should not display all such videos and expect the user to sift through hundreds of them to find the most relevant ones. Instead, the agent should ask the user a few clarifying questions.
This would then allow the user to provide additional information about his/her preferences (e.g., plant or meat diet, etc.) so that the AI agent can narrow down its search.
Additionally, the recent work of Cai et al. [7] proposed Ask-and-Confirm, a framework that allows the user to confirm whether a proposed object is present or absent in the image.
One downside of these prior approaches is that they typically require many interaction rounds (e.g., > 5), which increases user effort and degrades user experience.
Furthermore, these approaches significantly limit the form of the user-agent interaction, i.e., the users can only verify the presence or absence of a particular object/attribute in an image but nothing more.
In contrast, our ViReD framework enables the user to interact with an agent using free-form questions, which is a natural form of interaction for most humans.
Our key technical contribution is a multimodal question generator optimized with a novel Information-Guided Supervision (IGS).
Unlike text-only question generators, our question generator operates on (i) the entire textual dialog history (if any) and (ii) the previously retrieved top video candidates, which allows it to incorporate relevant visual and linguistic cues into the question generation process. Our Information-Guided Supervision further encourages the model to generate maximally informative questions, thus leading to higher text-to-video retrieval accuracy.
Figure 1: An illustration of interactive video retrieval with ViReD. Given the initial query, the agent searches for relevant videos in the database and returns eight candidate videos. Due to high uncertainty in the initial query, the agent then asks a follow-up question, “Which cuisine do you prefer?”, to which the user responds: “Mediterranean.” As the number of retrieved video candidates is reduced to four, the agent asks one final question: “Do you like plant or meat diet?” The user’s response (i.e., “plant diet”) then helps the agent to reduce the search space to the final candidate video, which is then displayed to the user.
We validate our entire interactive framework ViReD on the Audio-Visual Scene Aware Dialog dataset (AVSD) [3] demonstrating that it outperforms all non-interactive methods by a substantial margin.
We also demonstrate that our approach generalizes to real-world scenarios involving interactions with real humans, thus indicating its effectiveness and generality.
Lastly, we thoroughly ablate different design choices of our interactive video retrieval framework to inspire future work in this area.
2 RELATED WORK 2.1 Multimodal Conversational Agents There has been significant progress in designing multimodal conversational agents, especially in the context of image-based visual dialog [4, 10, 11, 34, 35].
2.2 Video Question Answering Following standard visual question answering (VQA) methods in images [1, 2, 30, 44], video-based question answering (video QA) aims to answer questions about videos [21, 22, 46, 48].
Compared to visual question answering in images, video question answering is more challenging because it requires complex temporal reasoning.
Le et al. [19] introduced a multi-modal transformer model for video QA to incorporate representations from different modalities.
Additionally, Le et al. [20] proposed a bi-directional spatial-temporal reasoning model to capture interdependencies along the spatial and temporal dimensions of videos.
Recently, Lin et al. [27] introduced Vx2Text, a multi-modal transformer-based generative network for video QA.
Compared to these prior methods, we aim to develop a framework for the interactive dialog-based video retrieval setting.
2.3 Multimodal Video Retrieval Most of the recent multimodal video retrieval systems are based on deep neural networks [5, 8, 9, 12, 13, 15, 32]. However, they operate in a static, non-interactive setting.
Instead, we propose an interactive dialog-based framework for video retrieval: after the initial user query, the first round of retrieved videos is used to generate a question 𝑞𝑡, which the user then answers with an answer 𝑎𝑡. The generated dialog is added to the dialog history 𝐻𝑡 = {𝐻𝑡−1, (𝑞𝑡 , 𝑎𝑡)}, which is then used as additional input in the subsequent rounds of interaction.
3 VIDEO RETRIEVAL USING DIALOG In this section, we introduce ViReD, our proposed video retrieval framework using dialog.
Formally, given an initial text query 𝑇 specified by the user, and the previously generated dialog history 𝐻𝑡−1, our goal is to retrieve the 𝑘 most relevant videos 𝑉1, 𝑉2, ..., 𝑉𝑘. Our framework consists of three components: (i) a question generator, which asks informative questions about the video the user wants to retrieve; (ii) an answer generation oracle, which answers these questions and thus simulates human interaction; and (iii) a video retrieval module, which takes as inputs the initial textual query and any generated dialog history and retrieves relevant videos from a large video database.
We now describe each of these components in more detail.
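For concreteness, the overall interaction loop can be summarized with the following minimal sketch; the retrieval_model, question_generator, and answer_oracle objects are hypothetical stand-ins for the three components described below, and the default values reflect the 3 dialog rounds and k = 4 candidates used in our experiments.

```python
# A minimal sketch of the ViReD interaction loop. The three component objects
# are hypothetical stand-ins for the modules described in Sections 3.1-3.3.
def interactive_retrieval(initial_query, retrieval_model, question_generator,
                          answer_oracle, num_rounds=3, top_k=4):
    dialog_history = []                                    # H_0 is empty
    for _ in range(num_rounds):
        # Retrieve the current top-k candidate videos given T and H_{t-1}.
        candidates = retrieval_model.retrieve(initial_query, dialog_history, k=top_k)
        # Ask a question conditioned on the query, the candidates, and the history.
        question = question_generator.generate(initial_query, candidates, dialog_history)
        # A human user (or the automatic answer oracle) provides the answer.
        answer = answer_oracle.answer(question)
        # H_t = H_{t-1} U {(q_t, a_t)}
        dialog_history.append((question, answer))
    # Final retrieval conditioned on the full dialog history.
    return retrieval_model.retrieve(initial_query, dialog_history, k=top_k)
```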
3.1 Question Generator As illustrated in Figure 3, at time 𝑡, our question generator takes as inputs
3.1 質問生成装置 図3に示すように、時刻 t では、質問生成器を入力として取ります。 訳抜け防止モード: 3.1 図3に示すように、時刻t, 質問生成装置は入力として
0.74
(i) the initial text query 𝑇 ,
(i)最初のテキストクエリ t ,
0.70
(ii) top 𝑘 retrieved videos at time 𝑡−1, and
(ii)t−1時のトップk検索ビデオ、
0.70
(iii) previously generated dialog history 𝐻𝑡−1.
(iii) 予め生成したダイアログ履歴ht−1。
0.66
To eliminate the need for ad-hoc video-and-text fusion modules [24, 27], we use Vid2Sum video caption model [43] trained on the AVSD dataset to predict textual descriptions for each of the top-𝑘 previously retrieved videos.
Specifically, given a video 𝑉𝑖, the Vid2Sum model provides a detailed textual summary of the video content, which we denote as 𝑆𝑖.
Figure 3: Illustration of the proposed question generator. It receives (i) an initial user-specified textual query, (ii) the top-𝑘 retrieved candidate videos (from the previous interaction rounds), and (iii) the entire dialog history as its inputs. We then use a pretrained caption generator (Vid2Sum [43]) to map the videos into text. Afterward, all of the text-based inputs (including the predicted video captions) are fed into an autoregressive BART model for new question generation.
Afterward, the predicted summaries for all 𝑘 videos retrieved at timestep 𝑡 − 1, denoted as 𝑆1, 𝑆2, ..., 𝑆𝑘, are fed into the question generator along with the initial textual query 𝑇 and the previous dialog history 𝐻𝑡−1. More precisely, we concatenate the initial textual query 𝑇, the predicted video summaries 𝑆1, ..., 𝑆𝑘, and the dialog history 𝐻𝑡−1 into a single text sequence, which is then fed into an autoregressive BART model that generates the next question 𝑞𝑡.
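Since the exact input serialization is implementation-specific, the following sketch shows one plausible way of flattening 𝑇, the predicted summaries 𝑆1, ..., 𝑆𝑘, and the dialog history 𝐻𝑡−1 into a single BART input using the Hugging Face transformers API; the separator tokens and the facebook/bart-large checkpoint name are illustrative assumptions rather than our exact configuration.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
question_bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_question(query, summaries, dialog_history, max_len=120, beams=10):
    # Flatten all text-based inputs into a single sequence; the separator and
    # ordering are illustrative assumptions, not the exact training format.
    history_text = " ".join(f"Q: {q} A: {a}" for q, a in dialog_history)
    source = " </s> ".join([query, " ".join(summaries), history_text])
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = question_bart.generate(
        inputs["input_ids"], num_beams=beams, max_length=max_len, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```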
3.2 Answer Generator To simulate human presence in our interactive dialog setting, we use an automatic answer generation oracle that can answer free-form questions about a video. In contrast, the majority of prior methods [7] are typically constrained to a small set of closed-set question/answer pairs, which makes it difficult to generalize them to diverse real-world dialog scenarios.
In our experimental Section 6.4, we also conduct a user-study evaluation demonstrating that our answer generation oracle effectively replaces a human answering the questions.
Specifically, given a video 𝑉𝑖, we first use the Vid2Sum model to predict a textual summary 𝑆𝑖. Afterward, the generated summary 𝑆𝑖 and the question 𝑞𝑡 are concatenated and passed to a separate BART answer generation model to generate an answer 𝑎𝑡 about the video 𝑉𝑖:

𝑆𝑖 = Vid2Sum(𝑉𝑖),   (3)
𝑎𝑡 = BART𝑎([𝑆𝑖, 𝑞𝑡]).   (4)

Note that the BART models used for question and answer generation have the same architecture but their weights are different (i.e., they are trained for two different tasks).
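The answer oracle follows the same pattern as the question generator; the sketch below is again only an illustration, with the separator and checkpoint choices being assumptions.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
answer_bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_answer(summary, question, max_len=120, beams=10):
    # a_t = BART_a([S_i, q_t]); the separator is an illustrative assumption.
    source = f"{summary} </s> {question}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = answer_bart.generate(
        inputs["input_ids"], num_beams=beams, max_length=max_len, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```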
3.3 Text-to-Video Retrieval Model Our video retrieval model (VRM) takes an initial textual query 𝑇 and previous dialog history 𝐻𝑡 and returns a probability distribution 𝑝 ∈ R𝑁 that encodes the (normalized) similarity between each video 𝑉 (𝑖) in the database of 𝑁 videos and the concatenated text query [𝑇 , 𝐻𝑡].
Formally, we can write this operation as:
𝑝 = VRM(𝑇 , 𝐻𝑡),   (6)

where each 𝑝𝑖 value encodes the probability that the 𝑖th video 𝑉 (𝑖) is the correct video associated with the concatenated textual query [𝑇 , 𝐻𝑡].
Our video retrieval model consists of two main components: (i) a visual encoder 𝐹(𝑉 ; 𝜃𝑣) with learnable parameters 𝜃𝑣, and (ii) a textual encoder 𝐺(𝑇 , 𝐻𝑡; 𝜃𝑡) with learnable parameters 𝜃𝑡.
During training, we assume access to a manually labeled video retrieval dataset X = {(𝑉 (1), 𝑇 (1), 𝐻 (1)), . . . , (𝑉 (𝑁), 𝑇 (𝑁), 𝐻 (𝑁))}, where 𝑇 (𝑖) and 𝐻 (𝑖) depict the textual query and dialog history associated with a video 𝑉 (𝑖), respectively.
As our visual encoder, we use a video transformer encoder [6] that computes a visual representation 𝑓 (𝑖) = 𝐹(𝑉 (𝑖); 𝜃𝑣) where 𝑓 (𝑖) ∈ R𝑑.
We can jointly train the visual and textual encoders end-to-end by minimizing the sum of the video-to-text and text-to-video matching losses, as is done in [5]:

\mathcal{L}_{v2t} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\big(f^{(i)} \cdot g^{(i)}\big)}{\sum_{j=1}^{B} \exp\big(f^{(i)} \cdot g^{(j)}\big)}, \qquad (7)

\mathcal{L}_{t2v} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\big(g^{(i)} \cdot f^{(i)}\big)}{\sum_{j=1}^{B} \exp\big(g^{(i)} \cdot f^{(j)}\big)}. \qquad (8)

Here, 𝐵 is the batch size, and 𝑓 (𝑖) and 𝑔(𝑗) are the embeddings of the 𝑖th video and of the 𝑗th text query (corresponding to the 𝑗th video), respectively.
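For reference, this symmetric objective can be sketched in a few lines of PyTorch; the L2 normalization and temperature value below are common practice for such contrastive losses (including in [5]) but are illustrative assumptions rather than values taken from our exact setup.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(f, g, temperature=0.05):
    """Symmetric video-to-text / text-to-video matching loss (Eqs. 7-8).

    f: (B, d) video embeddings, g: (B, d) text embeddings.
    The temperature value is an illustrative assumption.
    """
    f = F.normalize(f, dim=-1)
    g = F.normalize(g, dim=-1)
    sim = f @ g.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(f.size(0), device=f.device)
    loss_v2t = F.cross_entropy(sim, targets)       # video-to-text
    loss_t2v = F.cross_entropy(sim.t(), targets)   # text-to-video
    return loss_v2t + loss_t2v
```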
During inference, given an initial user query 𝑇 and the previous dialog history 𝐻𝑡, we extract a textual embedding 𝑔 = 𝐺(𝑇 , 𝐻𝑡; 𝜃𝑡) using our trained textual encoder where 𝑔 ∈ R1×𝑑.
Additionally, we also extract visual embeddings 𝑓 (𝑖) = 𝐹(𝑉 (𝑖); 𝜃𝑣) for every video 𝑉 (𝑖) where 𝑖 = 1 . . . 𝑁 .
We then stack the resulting visual embeddings [𝑓 (1); . . . ; 𝑓 (𝑁)] into a single feature matrix 𝑌 ∈ R𝑁×𝑑.
Afterward, the video retrieval probability distribution 𝑝 ∈ R1×𝑁 is computed as a normalized dot product between a single textual embedding 𝑔 and all the visual embeddings 𝑌.
For simplicity, throughout the remainder of the draft, we denote this whole operation as 𝑝 = VRM(𝑇 , 𝐻𝑡).
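The inference step can be summarized with the following sketch; text_encoder and the precomputed matrix 𝑌 of visual embeddings are hypothetical handles, and the softmax is one plausible reading of the normalized dot product described above.

```python
import torch

@torch.no_grad()
def retrieve(text_encoder, video_embeddings, query, dialog_history, k=4):
    """Score all N database videos against the query plus dialog history (Sec. 3.3).

    `text_encoder` and `video_embeddings` (an N x d matrix Y of precomputed
    visual embeddings) are hypothetical handles, not an exact API.
    """
    # Concatenate the initial query T with the flattened dialog history H_t.
    text = " ".join([query] + [f"{q} {a}" for q, a in dialog_history])
    g = text_encoder(text)                       # assumed to return a (1, d) tensor
    scores = g @ video_embeddings.t()            # (1, N) dot products with Y
    p = torch.softmax(scores, dim=-1)            # normalized distribution p = VRM(T, H_t)
    return torch.topk(p, k, dim=-1).indices      # indices of the top-k candidate videos
```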
4 INFORMATION-GUIDED SUPERVISION FOR QUESTION GENERATION Our goal in the above-described question generation step is to generate questions that will maximize the subsequent video retrieval performance.
Although providing currently known information and belief over videos is straightforward via the dialogue history and top-𝑘 candidate videos, respectively, comprehending (and planning for) future informative questions is difficult.
A major challenge stems from the free-form nature of questions that may be posed.
There is a large space of valid next questions to pose.
Explicitly labeling the potential information gain of all valid next questions does not scale.
One may define the task of posing informative queries as a Markov decision process (MDP), where the current state contains known information, actions include possible queries to make, and rewards are based on the number of queries that were made versus the accuracy of the resulting predictions [25, 42].
Previous interactive image retrieval [7, 33] approaches have used similar MDPs optimized through reinforcement learning to train policies that may select next questions from a limited finite list.
However, these reinforcement learning (RL) approaches suffer when the action space is large (as is the case with open-ended question generation) and when rewards are sparse (as is the case with accuracy after final prediction) [25].
Thus, we propose an alternative approach, information-guided supervision for question generation (IGS), that bypasses a difficult RL problem by explicitly defining informative targets for the generated questions based on a post-hoc search.
Suppose that for each video 𝑉 (𝑖), 𝑖 ∈ {1, . . . , 𝑁}, we also have 𝑚 distinct human-generated question/answer pairs relevant to the video, 𝐷(𝑖) = {𝐷(𝑖)_1, . . . , 𝐷(𝑖)_𝑚}.
Typically, such data is collected independently of any particular video retrieval system; e.g., in the AVSD [3] dataset, users ask (and answer) multiple questions about the content of a given video (without any particular goal in mind).
However, these human-generated questions can serve as potential targets for our question generator.
With IGS, we propose to filter through 𝐷(𝑖) according to the retrospective performance as follows.
During training, we collect targets for the question generator at each round of dialogue separately.
Let 𝑇 (𝑖) be the initial textual query corresponding to the ground truth video 𝑉 (𝑖).
Then, also let 𝑆 (𝑖)_{𝑡,1}, . . . , 𝑆 (𝑖)_{𝑡,𝑘} be our predicted text summaries of the top-𝑘 retrieved candidate videos after the 𝑡th round of dialogue 𝐻 (𝑖)_𝑡 (note that 𝐻 (𝑖)_0 = ∅).
We try appending each question/answer pair (𝑞, 𝑎) in 𝐷(𝑖) that is not already in 𝐻 (𝑖)_𝑡 and check which remaining question would most improve retrieval performance. That is, we collect

(q^{*(i)}_{t+1}, a^{*(i)}_{t+1}) = \arg\max_{(q,a)\in D^{(i)} \setminus H^{(i)}_{t}} \Big[ \mathrm{VRM}\big(T^{(i)}, H^{(i)}_{t} \cup \{(q,a)\}\big) \Big]_{i}, \qquad (9)

where VRM is our previously described video retrieval model (see Sec. 3.3). Note that here [VRM(·)]_𝑖 = 𝑝𝑖, which depicts our previously defined retrieval probability between the ground truth video 𝑉 (𝑖) and the concatenated text query 𝑇 (𝑖), 𝐻 (𝑖)_𝑡 ∪ {(𝑞, 𝑎)}.
Each of the retrospective best questions is then set up as a target for the question generator at the (𝑡 + 1)th round:

\mathcal{D}_{t+1} = \Big\{ \big( T^{(i)}, S^{(i)}_{t,1}, \dots, S^{(i)}_{t,k}, H^{(i)}_{t};\; q^{*(i)}_{t+1}, a^{*(i)}_{t+1} \big) \Big\}_{i=1}^{N}, \qquad (10)

where 𝑇 (𝑖), 𝑆 (𝑖)_{𝑡,1}, . . . , 𝑆 (𝑖)_{𝑡,𝑘}, and 𝐻 (𝑖)_𝑡 are the respective initial query, our predicted text summaries of the top-𝑘 previous retrievals, and the dialogue history that are the inputs to the question generator, BART𝑞.
Please note that D𝑡+1 depends on D𝑡 since we consider appending questions to previous histories.
That is, at each round we look for informative questions based on the histories seen at that round.
Jointly, the datasets D1 ∪ D2 ∪ . . . ∪ D𝑀 serve as a supervised dataset to directly train the question generator, BART𝑞, to generate informative questions.
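The post-hoc search of Eq. (9) amounts to a simple loop over the remaining human-annotated question/answer pairs, as sketched below; retrieval_model.probabilities is a hypothetical handle that returns the distribution 𝑝 from Eq. (6).

```python
def mine_igs_target(retrieval_model, query, history, candidate_qa_pairs, gt_video_idx):
    """Select the question/answer pair that most boosts retrieval of the
    ground-truth video (a sketch of the IGS post-hoc search in Eq. 9)."""
    best_pair, best_prob = None, float("-inf")
    for q, a in candidate_qa_pairs:           # pairs in D^(i) not yet in H_t^(i)
        if (q, a) in history:
            continue
        p = retrieval_model.probabilities(query, history + [(q, a)])  # p in R^N
        if p[gt_video_idx] > best_prob:       # probability of the ground-truth video
            best_prob = p[gt_video_idx]
            best_pair = (q, a)
    return best_pair                          # becomes the BART_q training target
```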
5 EXPERIMENTS 5.1 Dataset We test our model on the audio-visual scene aware dialog dataset (AVSD) [3], which contains ground truth dialog data for every video in the dataset.
Specifically, each video in the AVSD dataset has 10 rounds of human-generated questions and answers describing various details related to the video content (e.g., objects, actions, scenes, people, etc.).
5.2.1 Question Generator. We train our question generator using the BART-large architecture.
We set the maximum sentence length to 120.
During generation, we use beam search with a beam size of 10.
The question generator is trained for 5 epochs with a batch size of 32.
5.2.2 Answer Generator.
We also use the BART large architecture to train our answer generator.
Note that the question and answer generators use the same architecture but are trained with two different objectives, thus, resulting in two distinct models.
Additionally, the MeanR and MedianR metrics depict the mean and the median rank of the retrieved ground truth videos respectively (the lower the better).
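For reference, these ranking metrics can be computed from the (1-indexed) rank assigned to each ground truth video, as in the following generic sketch (not our exact evaluation code).

```python
import numpy as np

def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute R@k (in %), MeanR, and MedianR from ground-truth video ranks."""
    ranks = np.asarray(gt_ranks, dtype=float)
    metrics = {f"R@{k}": float((ranks <= k).mean() * 100) for k in ks}
    metrics["MeanR"] = float(ranks.mean())
    metrics["MedianR"] = float(np.median(ranks))
    return metrics

# Example: ranks of the correct video for five test queries.
print(retrieval_metrics([1, 3, 7, 2, 15]))
```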
We fine-tune the Frozen-in-Time (FiT) model to retrieve the correct video using the initial textual query 𝑇 as its input (without using dialog).
Frozen-in-Time w/ Ground Truth Human Dialog.
We finetune the Frozen-in-Time model using the textual query and the full 10 rounds of human-generated ground truth dialog history.
Unlike our ViReD approach, which uses our previously introduced question and answer generators to generate dialog, this Frozen-in-Time w/ Dialog baseline uses 10 rounds of manually annotated human dialog history during inference.
In this setting, we concatenate 10 rounds of ground truth dialog with the initial text query, and use the concatenated text for video retrieval.
6 RESULTS AND DISCUSSION 6.1 Quantitative Video Retrieval Results In Table 1, we compare our method with the previously described video retrieval baselines.
Based on these results, we observe that our ViReD approach outperforms all baselines, including a strong Frozenin-Time baseline augmented with 10 rounds of human-generated ground truth dialog.
Figure 4: We study the video retrieval performance (R@1) as a function of the number of dialog rounds.
Based on these results, we observe that the video retrieval accuracy consistently improves as we consider additional rounds of dialog.
We also note that the performance of our interactive framework reaches its peak after 3 rounds of dialog.
Specifically, we note that the original Frozen-in-Time (FiT) baseline pretrained on large-scale WebVid2M [5] outperforms the previous state-of-the-art LSTM approach [29] by 1.4% according to R@1, even without using any dialog data.
Next, we demonstrate that dialog is a highly effective cue for the video retrieval task.
Specifically, we first show that the FiT baseline augmented with 10 rounds of human-generated ground truth dialog performs 5.2% better in R@1 than the same FiT baseline that does not use dialog (Table 1).
This is a significant improvement that highlights the importance of additional information provided by dialog.
6.1.3 The Number of Dialog Rounds.
Next, we observe that despite using only 3 rounds of dialog our ViReD approach outperforms the strong FiT w/ Human Dialog baseline, which uses 10 rounds of human-generated ground truth dialog.
It is worth noting that these 10 rounds of dialog were generated in a retrieval-agnostic manner (i.e., without any particular goal in mind), which may explain this result.
Nevertheless, this result indicates that a few questions (e.g., 3) generated by our model are as informative as 10 task-agnostic human-generated questions.
Furthermore, we note that the performance reaches its peak with 3 rounds of interactions.
6.2 Video Question Answering Results 6.2.1 Comparison to the State-of-the-Art.
As discussed above, we use our answer generator to simulate human presence in an interactive dialog setting.
To validate the effectiveness of our answer generator, we evaluate its performance on the video question answering task on AVSD using the same setup as in Simple [41], and Vx2Text [27].
We present these results in Table 2 where we compare our answer generation method with the existing video question answering baselines.
Our results indicate that our answer generation model significantly outperforms many previous methods, including MA-VDS [16], QUALIFIER [47], Simple [41] and RLM [26].
Table 3: To validate the effectiveness of our interactive framework in the real-world setting, we replace our automatic answer generator oracle with several human subjects.
As our baseline, we train the question generator to generate questions in a video retrieval-agnostic fashion, i.e., using the same order as the human annotators did when they asked those questions.
Figure 6: We study the video retrieval performance in two settings: (i) when the question generator uses the top-k retrieved videos as part of its inputs, and (ii) when it does not. In this case, k is set to 4.
Based on the results, we observe that including the top-k retrieved video candidates as part of the question generator inputs improves video retrieval accuracy for all numbers of dialog rounds.
6.2.2 Replacing Our Answer Generator with a Human Subject.
To validate whether our interactive framework generalizes to the real-world setting, we conduct a human study where we replace our proposed answer generator with several human subjects.
We then use the answers of each subject along with the generated questions as input to the video retrieval model (similar to our previously described setup).
In Table 3, we report these results for each of 3 human subjects.
These results suggest that our interactive framework works reliably even with real human subjects.
Furthermore, we note that compared to the variant that uses an automatic answer generator, the variant with a human in the loop performs only slightly better, thus, indicating the robustness of our automatic answer generation framework.
Note that in this case, the video retrieval is performed only on the subset of 50 selected videos.
6.3 Ablation Studies Next, we ablate various design choices of our model.
Specifically, we validate (i) the effectiveness of our proposed Information-Guided Supervision, (ii) the importance of using retrieved candidate videos for question generation, and (iii) how video retrieval performance changes as we vary the number of candidate video inputs to the question generator.
Figure 7: We investigate the video retrieval performance as a function of the number of retrieved candidate video inputs that are fed into the question generator.
Figure 8: Qualitative results of our interactive video retrieval framework.
On the left we illustrate the keyframe of the ground truth video 𝑉𝑔𝑡 (i.e., the video that the user wants to retrieve) and the initial textual query for that video.
Furthermore, under each dialog box, we also illustrate the rank of the ground truth video 𝑉𝑔𝑡 among all videos in the database (i.e., the lower the better).
Based on these results, we observe that each dialog round significantly improves video retrieval results (as indicated by the lower rank of the ground truth video).
These results indicate the usefulness of dialog cues.
6.3.1 Effectiveness of IGS.
To show the effectiveness of IGS, we compare the performance of our interactive video retrieval framework when using (i) IGS as a training objective for the question generator vs. (ii) a video retrieval-agnostic objective.
Specifically, we note that the AVSD dataset has 10 pairs of questions and answers associated with each video.
For the retrieval-agnostic baseline, we use the original order of the questions (i.e., as they appear in the dataset) to construct a supervisory signal for the question generator.
In contrast, for our IGS-based objective, we order the questions such that they would maximize the subsequent video retrieval accuracy at each round of questions/answers.
These results suggest that IGS significantly outperforms the retrieval-agnostic baseline, thus, validating the effectiveness of our proposed IGS technique.
These results indicate that the performance gradually increases with every additional video candidate input and reaches the peak when using 𝑘 = 4 retrieved videos.
We also observe that the performance slightly drops if we set 𝑘 larger than 4.
We hypothesize that this happens because the input sequence length to the BART question generator becomes too long, potentially causing overfitting or other optimization-related issues.
6.4 Qualitative Results In Figure 8, we also illustrate some of our qualitative interactive video retrieval results.
On the left we show the keyframe of the ground truth video 𝑉𝑔𝑡 (i.e., the video that the user wants to retrieve) and the initial textual query for that video.
From left to right, we illustrate the three rounds of questions and answers produced by our question and answer generators.
Additionally, under each question/answer box, we also visualize the rank of the video 𝑉𝑔𝑡 among all videos in the database (i.e., the lower the better, where rank of 1 implies that the correct video was retrieved).
Figure 8 (example): the figure shows the text "We see a dog enter and leave." and the initial query "A man coming home from school and getting ready to do homework", followed by three generated dialog rounds: Q: "Is he wearing glasses?" A: "No, he is not wearing glasses." Q: "What color is he wearing?" A: "He is wearing black color." Q: "Is there anything in his hand?" A: "Yes, he is holding an item." The rank of the ground truth video improves from 385 to 48 to 7 as this dialog is added to the initial textual query.
Lastly, we observe that the questions asked by our model focus on diverse concepts including gender, presence of certain objects, human actions, clothes colors, etc.
7 CONCLUSION We demonstrated that (i) dialog provides valuable cues for video retrieval, thus leading to significantly better performance compared to the non-interactive baselines, and (ii) our information-guided supervision provides significant improvements to our model's performance.
In summary, our method (i) is conceptually simple, (ii) achieves state-of-the-art results on the interactive video retrieval task on the AVSD dataset, and (iii) generalizes to real-world settings involving human subjects.
In the future, we will extend our framework to other video-and-language tasks such as interactive video question answering and interactive temporal moment localization.
REFERENCES
2018. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4971–4980.
[3] Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K Marks, Chiori Hori, Peter Anderson, et al. 2019.
In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1835–1844.
[8] Huizhong Chen, Matthew Cooper, Dhiraj Joshi, and Bernd Girod. 2014. Multimodal language models for lecture video retrieval. In Proceedings of the 22nd ACM International Conference on Multimedia. 1081–1084.
[9] Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Samuel Albanie, and Yang Liu. 2021. TeachText: Crossmodal generalized distillation for text-video retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11583–11593.
[10] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision. 2951–2960.
[12] Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, and Aleksandr Petiushko. 2021. MDMMT: Multidomain multimodal transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3354–3363.
[13] Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. 2021. Clip2Video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097 (2021).
[14] Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, et al. 1995.
[16] Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, et al. 2019. End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2352–2356.
[17] Adriana Kovashka and Kristen Grauman. 2013. Attribute pivots for guiding relevance feedback in image search. In Proceedings of the IEEE International Conference on Computer Vision. 297–304.
[18] Adriana Kovashka, Devi Parikh, and Kristen Grauman.
2019. OCR-VQA: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 947–952.
[31] Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, and Laurens Van Der Maaten. 2018. Learning by asking questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11–20.
[32] Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K RoyChowdhury. 2018. Learning joint embedding with multimodal cues for crossmodal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. 19–27.
[33] Nils Murrugarra-Llerena and Adriana Kovashka. 2021. Image retrieval with mixed initiative and multimodal feedback. Computer Vision and Image Understanding 207 (2021), 103204.
IEEE Transactions on Circuits and Systems for Video Technology 8, 5 (1998), 644–655.
[39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[40] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 248–256.
[48] Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, and Min Sun. 2017. Leveraging video descriptions to learn video question answering. In Thirty-First AAAI Conference on Artificial Intelligence.