Identity-Aware Multi-Sentence Video Description
- URL: http://arxiv.org/abs/2008.09791v1
- Date: Sat, 22 Aug 2020 09:50:43 GMT
- Title: Identity-Aware Multi-Sentence Video Description
- Authors: Jae Sung Park, Trevor Darrell, Anna Rohrbach
- Abstract summary: We introduce an auxiliary task of Fill-in the Identity, that aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation as well as an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
- Score: 105.13845996039277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Standard video and movie description tasks abstract away from person
identities, thus failing to link identities across sentences. We propose a
multi-sentence Identity-Aware Video Description task, which overcomes this
limitation and requires re-identifying persons locally within a set of
consecutive clips. We introduce an auxiliary task of Fill-in the Identity, that
aims to predict persons' IDs consistently within a set of clips, when the video
descriptions are given. Our proposed approach to this task leverages a
Transformer architecture allowing for coherent joint prediction of multiple
IDs. One of the key components is a gender-aware textual representation as well
as an additional gender prediction objective in the main model. This auxiliary
task allows us to propose a two-stage approach to Identity-Aware Video
Description. We first generate multi-sentence video descriptions, and then
apply our Fill-in the Identity model to establish links between the predicted
person entities. To be able to tackle both tasks, we augment the Large Scale
Movie Description Challenge (LSMDC) benchmark with new annotations suited for
our problem statement. Experiments show that our proposed Fill-in the Identity
model is superior to several baselines and recent works, and allows us to
generate descriptions with locally re-identified people.
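
A minimal sketch of what "local" person IDs could look like in the Fill-in the Identity setting described above. This is an illustration under stated assumptions, not the authors' model or the LSMDC data format: the blank marker `[...]`, the `[PERSON<n>]` output tag, and both function names are hypothetical. The sketch only shows that IDs matter up to relabeling within one set of clips, so predictions are renumbered by order of first appearance before being written into the descriptions.

```python
from typing import Hashable, List, Sequence

BLANK = "[...]"  # hypothetical blank marker, not taken from the paper


def canonicalize_ids(ids: Sequence[Hashable]) -> List[int]:
    """Relabel arbitrary IDs as 1, 2, ... in order of first appearance."""
    mapping = {}
    local_ids = []
    for raw_id in ids:
        if raw_id not in mapping:
            mapping[raw_id] = len(mapping) + 1
        local_ids.append(mapping[raw_id])
    return local_ids


def fill_in_identities(sentences: Sequence[str],
                       ids: Sequence[Hashable]) -> List[str]:
    """Write one local ID into each blank, reading blanks left to right."""
    n_blanks = sum(sentence.count(BLANK) for sentence in sentences)
    if n_blanks != len(ids):
        raise ValueError(f"{n_blanks} blanks but {len(ids)} IDs")
    next_id = iter(canonicalize_ids(ids))
    filled = []
    for sentence in sentences:
        parts = sentence.split(BLANK)
        text = parts[0]
        for part in parts[1:]:
            # Hypothetical output tag for a locally re-identified person.
            text += f"[PERSON{next(next_id)}]" + part
        filled.append(text)
    return filled


if __name__ == "__main__":
    clips = ["[...] opens the door.", "[...] waves at [...]."]
    # Any labels work; only the links between blanks are used.
    print(fill_in_identities(clips, ["b", "a", "b"]))
```

Because only the grouping of blanks is used, the label sequences `["b", "a", "b"]` and `[7, 3, 7]` produce the same filled-in descriptions.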
 
      
Related papers
        - Proteus-ID: ID-Consistent and Motion-Coherent Video Customization [17.792780924370103]
 Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt.
This task presents two core challenges: maintaining identity consistency while aligning with the described appearance and actions, and generating natural, fluid motion without unrealistic stiffness.
We introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization.
 arXiv  Detail & Related papers  (2025-06-30T11:05:32Z)
- IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes [20.662082715151886]
 We introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios.
We then contribute a new model, IPFormer-VideoLLM, which injects instance-level features as instance prompts through an efficient attention-based connector.
 arXiv  Detail & Related papers  (2025-06-26T09:30:57Z)
- UNIC: Unified In-Context Video Editing [76.76077875564526]
 UNified In-Context Video Editing (UNIC) is a framework that unifies diverse video editing tasks within a single model in an in-context manner.
We introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks.
Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
 arXiv  Detail & Related papers  (2025-06-04T17:57:43Z)
- Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness [6.634133253472436]
 This paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning.
The dataset comprises 5,033 high-quality video clips annotated manually, containing over 700,000 tokens.
We also present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task.
 arXiv  Detail & Related papers  (2025-01-14T09:52:56Z)
- MICap: A Unified Model for Identity-aware Movie Descriptions [16.287294191608893]
 We present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks.
Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives.
 arXiv  Detail & Related papers  (2024-05-19T08:54:12Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
 The task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video content, such as movies.
With video feature, text, character bank and context information as inputs, the generated ADs are able to correspond to the characters by name.
We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
 arXiv  Detail & Related papers  (2024-03-19T17:27:55Z)
- Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification [18.01407937934588]
 We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
 arXiv  Detail & Related papers  (2023-12-28T03:00:19Z)
- Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
 Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
 arXiv  Detail & Related papers  (2023-05-26T06:50:21Z)
- Edit As You Wish: Video Caption Editing with Multi-grained User Control [61.76233268900959]
 We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests.
Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained.
 arXiv  Detail & Related papers  (2023-05-15T07:12:19Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
 We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
 arXiv  Detail & Related papers  (2023-01-02T05:17:31Z)
- MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection [17.74528571088335]
 We introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes.
It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people.
 arXiv  Detail & Related papers  (2022-11-20T15:17:24Z)
- End-to-end Dense Video Captioning as Sequence Generation [83.90502354328679]
 We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
 Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks integrated into large-scale pre-trained models.
 arXiv  Detail & Related papers  (2022-04-18T01:30:54Z)
- Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [51.110453988705395]
 Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL).
To achieve a complete model of video-based person Re-ID, a multi-task framework with Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
 arXiv  Detail & Related papers  (2020-06-13T09:15:38Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
 We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on three challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition.
 arXiv  Detail & Related papers  (2020-04-05T14:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     