Identity-Aware Multi-Sentence Video Description
- URL: http://arxiv.org/abs/2008.09791v1
- Date: Sat, 22 Aug 2020 09:50:43 GMT
- Title: Identity-Aware Multi-Sentence Video Description
- Authors: Jae Sung Park, Trevor Darrell, Anna Rohrbach
- Abstract summary: We introduce an auxiliary task of Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
- Score: 105.13845996039277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard video and movie description tasks abstract away from person
identities, thus failing to link identities across sentences. We propose a
multi-sentence Identity-Aware Video Description task, which overcomes this
limitation and requires re-identifying persons locally within a set of
consecutive clips. We introduce an auxiliary task of Fill-in the Identity, which
aims to predict persons' IDs consistently within a set of clips, when the video
descriptions are given. Our proposed approach to this task leverages a
Transformer architecture allowing for coherent joint prediction of multiple
IDs. One of the key components is a gender-aware textual representation, as well
as an additional gender prediction objective in the main model. This auxiliary
task allows us to propose a two-stage approach to Identity-Aware Video
Description. We first generate multi-sentence video descriptions, and then
apply our Fill-in the Identity model to establish links between the predicted
person entities. To be able to tackle both tasks, we augment the Large Scale
Movie Description Challenge (LSMDC) benchmark with new annotations suited for
our problem statement. Experiments show that our proposed Fill-in the Identity
model is superior to several baselines and recent works, and allows us to
generate descriptions with locally re-identified people.
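The Fill-in the Identity setup described above can be pictured with a toy sketch: person mentions in a set of consecutive clip descriptions are blanked out, and each blank must be filled with an ID label that stays consistent across the whole set. This is only an illustration of the task format, not the authors' Transformer model; the captions and names ("Alice", "Bob") are invented for the example.

```python
import re

def make_fitb_example(captions, person_names):
    """Blank out person mentions; assign one ID label per unique identity."""
    pattern = re.compile("|".join(map(re.escape, person_names)))
    slot_of = {}              # name -> consistent local ID label
    blanked, answers = [], []
    for cap in captions:
        def blank(match):
            # First time a name is seen, mint a new local ID; reuse it afterwards.
            slot = slot_of.setdefault(match.group(), f"[PERSON_{len(slot_of) + 1}]")
            answers.append(slot)
            return "[BLANK]"
        blanked.append(pattern.sub(blank, cap))
    return blanked, answers

# Invented example captions standing in for clip descriptions.
captions = [
    "Alice opens the door and waves to Bob.",
    "Bob smiles, then Alice walks outside.",
]
blanked, answers = make_fitb_example(captions, ["Alice", "Bob"])
print(blanked)  # ['[BLANK] opens the door and waves to [BLANK].', '[BLANK] smiles, then [BLANK] walks outside.']
print(answers)  # ['[PERSON_1]', '[PERSON_2]', '[PERSON_2]', '[PERSON_1]']
```

The model's job in the paper is exactly the inverse of `make_fitb_example`: given the blanked captions and the video, predict the `answers` jointly so that the same person receives the same ID in every clip.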
Related papers
- MICap: A Unified Model for Identity-aware Movie Descriptions [16.287294191608893]
We present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks.
Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives.
arXiv Detail & Related papers (2024-05-19T08:54:12Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video content, such as movies.
With video features, text, a character bank, and context information as inputs, the generated ADs are able to correspond to the characters by name.
We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
- Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification [18.01407937934588]
We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
arXiv Detail & Related papers (2023-12-28T03:00:19Z)
- Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection [17.74528571088335]
We introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes.
It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people.
arXiv Detail & Related papers (2022-11-20T15:17:24Z)
- End-to-end Dense Video Captioning as Sequence Generation [83.90502354328679]
We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks integrated into large-scale pre-trained models.
arXiv Detail & Related papers (2022-04-18T01:30:54Z)
- Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [51.110453988705395]
Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL).
To achieve a complete model of video-based person Re-ID, a multi-task framework with Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
arXiv Detail & Related papers (2020-06-13T09:15:38Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on challenging tasks, including (i) inferring the temporal ordering of a set of videos and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
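The temporal-ordering proxy task in the last entry can be sketched in a few lines: a set of clips is shuffled, and the supervision signal is the permutation needed to restore their original order. This is a schematic illustration only; the placeholder clip strings stand in for the multimodal clip features the paper actually encodes.

```python
import random

def make_ordering_example(clip_features, seed=0):
    """Shuffle clips; return the shuffled set and each position's true index."""
    rng = random.Random(seed)
    order = list(range(len(clip_features)))
    rng.shuffle(order)
    shuffled = [clip_features[i] for i in order]
    # `order[j]` is the true temporal index of `shuffled[j]` -- the target
    # the model must predict from the (multimodal) clip features alone.
    return shuffled, order

clips = ["clip_t0", "clip_t1", "clip_t2", "clip_t3"]
shuffled, order = make_ordering_example(clips)
# Sorting by the predicted indices recovers the original timeline.
restored = [clip for _, clip in sorted(zip(order, shuffled))]
assert restored == clips
```

Because the ordering labels come for free from the videos themselves, this proxy task needs no manual annotation, which is what makes it attractive for learning the joint feature encoding.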
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.